Symptom
Statisticsserver alerts are triggered on the HANA side:
HANA alert 1: Host physical memory usage
HANA alert 43: Memory usage of services
HANA alert 44: Licensed memory usage
Environment
HANA
Resolution
1. Operating System vs. Database
Before analysing the different HANA-specific memory areas, it has to be verified that the database itself, and not processes running outside the SAP software, is responsible for the high memory usage. This involves calculating how much resident memory is used by the HANA database. On the HANA appliance, resident memory used outside HANA (OS, 3rd-party processes) is typically quite small and does not exceed 2 GB. HANA Studio and DBACockpit provide two different views on the resident memory:

HANA Studio:
The following SQL statements are behind this output:
'Database Resident':

SELECT SUM(PHYSICAL_MEMORY_SIZE) FROM M_SERVICE_MEMORY

'Total Resident':

SELECT T1.HOST, T1.USED_PHYSICAL_MEMORY + T2.SHARED_MEMORY_ALLOCATED_SIZE FROM M_HOST_RESOURCE_UTILIZATION AS T1 JOIN (SELECT M_SERVICE_MEMORY.HOST, SUM(M_SERVICE_MEMORY.SHARED_MEMORY_ALLOCATED_SIZE) AS SHARED_MEMORY_ALLOCATED_SIZE FROM SYS.M_SERVICE_MEMORY GROUP BY M_SERVICE_MEMORY.HOST) AS T2 ON T2.HOST = T1.HOST;

Since the difference between 'Total Resident' and 'Database Resident' is well below 2 GB, there is no indication that processes outside the database contribute significantly to memory consumption.
DBACockpit:
This is effectively the output of the following statement:
SELECT HOST, ROUND(USED_PHYSICAL_MEMORY/1024/1024/1024, 2) AS "Resident GB", ROUND((USED_PHYSICAL_MEMORY + FREE_PHYSICAL_MEMORY)/1024/1024/1024, 2) AS "Physical Memory GB" FROM PUBLIC.M_HOST_RESOURCE_UTILIZATION

DBACockpit currently does not consider shared memory as part of the resident memory, hence the difference to what HANA Studio reports.
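To quantify directly how much resident memory is attributable to processes outside the database, the two views shown above can be combined into a single statement. This is a minimal sketch that only reuses the views and columns from the statements above; the column aliases are chosen for illustration:

SELECT H.HOST,
       ROUND((H.USED_PHYSICAL_MEMORY + S.SHM_ALLOCATED)/1024/1024/1024, 2) AS "Total Resident GB",
       ROUND(S.DB_RESIDENT/1024/1024/1024, 2) AS "Database Resident GB",
       ROUND((H.USED_PHYSICAL_MEMORY + S.SHM_ALLOCATED - S.DB_RESIDENT)/1024/1024/1024, 2) AS "Resident outside HANA GB"
FROM M_HOST_RESOURCE_UTILIZATION AS H
JOIN (SELECT HOST,
             SUM(SHARED_MEMORY_ALLOCATED_SIZE) AS SHM_ALLOCATED,
             SUM(PHYSICAL_MEMORY_SIZE) AS DB_RESIDENT
      FROM SYS.M_SERVICE_MEMORY
      GROUP BY HOST) AS S
ON S.HOST = H.HOST

If the last column stays well below 2 GB per host, the focus of the analysis can remain on the database itself.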
If processes outside the database are not contributing significantly to memory usage, two cases have to be distinguished in general:
- An 'out of memory' error has already occurred.
- An 'out of memory' error was not yet triggered, but further investigation on the current memory usage of the database might be required.
In case (1), the database writes an oom trace file that can be used for further analysis. In case (2), such a trace file is not yet available and an rte dump has to be generated, which has a structure similar to the oom trace; SAP note 1813020 contains the instructions for generating the rte dump. For case (2), further analysis can also be done 'top-down', making use of the information provided by the different administrative frontends (HANA Studio / DBACockpit) and the monitoring views.
2. Top down approach
2.1 Which HANA process on which host is using most of the memory?
SQL Statement:

SELECT TOP 3 HOST, PORT, SERVICE_NAME, TOTAL_MEMORY_USED_SIZE FROM M_SERVICE_MEMORY ORDER BY TOTAL_MEMORY_USED_SIZE DESC
DBACockpit:
HANA Studio:
2.1.1 On 'Used Memory'
HANA SP5
The values in the 'Used Memory' column are calculated in the following way:

SELECT ( (CODE_SIZE + SHARED_MEMORY_ALLOCATED_SIZE + HEAP_MEMORY_USED_SIZE ) / 1024 / 1024) AS "USED MEMORY" FROM M_SERVICE_MEMORY

It is important to note that even though the column is called 'Used Memory', the overall allocated shared memory is used for the calculation and not the value of SHARED_MEMORY_USED_SIZE. The reason for this is that a shared memory segment (see point 4) can only be attached to the address space of the process completely, not just its used part.
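The same formula can be broken down by host and service to see which process contributes what to 'Used Memory'. The following is a sketch that only reuses the columns from the statement above, with a conversion to GB added for readability:

SELECT HOST, PORT, SERVICE_NAME,
       ROUND(CODE_SIZE/1024/1024/1024, 2) AS "Code GB",
       ROUND(SHARED_MEMORY_ALLOCATED_SIZE/1024/1024/1024, 2) AS "Shared Allocated GB",
       ROUND(HEAP_MEMORY_USED_SIZE/1024/1024/1024, 2) AS "Heap Used GB",
       ROUND((CODE_SIZE + SHARED_MEMORY_ALLOCATED_SIZE + HEAP_MEMORY_USED_SIZE)/1024/1024/1024, 2) AS "Used Memory GB"
FROM M_SERVICE_MEMORY
ORDER BY 7 DESC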
Due to changes in the calculation of CODE_SIZE, there might be cases where the used memory increased after a revision upgrade. This can be seen in the following example:
SELECT SERVICE_NAME, CODE_SIZE FROM M_SERVICE_MEMORY

rev49:

SERVICE_NAME | CODE_SIZE
nameserver | 627.855.360
preprocessor | 687.079.424
indexserver | 733.585.408
statisticsserver | 727.781.376
xsengine | 725.540.864

rev54 and later:

SERVICE_NAME | CODE_SIZE
nameserver | 4.498.731.008
preprocessor | 4.526.575.616
indexserver | 5.256.380.416
statisticsserver | 4.956.565.504
xsengine | 5.230.637.056
HANA SP6
Due to changes in the calculation of the code size (see above), the calculation of 'Used Memory' was changed. It can now be retrieved with the following statement:

SELECT INSTANCE_TOTAL_MEMORY_USED_SIZE FROM SYS.M_HOST_RESOURCE_UTILIZATION

Starting with SP6, INSTANCE_TOTAL_MEMORY_USED_SIZE is exposed via the view M_HOST_RESOURCE_UTILIZATION.
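To put this value into relation with the global allocation limit of a host, the following sketch can be used. It assumes that the ALLOCATION_LIMIT column is exposed by M_HOST_RESOURCE_UTILIZATION in the revision at hand:

SELECT HOST,
       ROUND(INSTANCE_TOTAL_MEMORY_USED_SIZE/1024/1024/1024, 2) AS "Instance Used GB",
       ROUND(ALLOCATION_LIMIT/1024/1024/1024, 2) AS "Allocation Limit GB",          -- assumption: column available in this revision
       ROUND(100.0 * INSTANCE_TOTAL_MEMORY_USED_SIZE / ALLOCATION_LIMIT, 1) AS "Used %"
FROM SYS.M_HOST_RESOURCE_UTILIZATION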
2.2 Is shared or heap memory using the largest share?
So far, the sum of used heap and shared memory has been determined. Now, this total has to be split:

SQL Statement:
SELECT TOP 3 HOST, PORT, SERVICE_NAME, HEAP_MEMORY_USED_SIZE, SHARED_MEMORY_USED_SIZE, TOTAL_MEMORY_USED_SIZE FROM M_SERVICE_MEMORY ORDER BY SHARED_MEMORY_USED_SIZE DESC
If it is shared memory, proceed as outlined in section 'Shared Memory Usage', otherwise, go to section 'Heap memory usage'.
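To make the decision easier, the split can also be expressed as percentages. A sketch, reusing only the columns from the statement above and restricted to services that report any heap or shared memory usage:

SELECT HOST, PORT, SERVICE_NAME,
       ROUND(100.0 * HEAP_MEMORY_USED_SIZE / (HEAP_MEMORY_USED_SIZE + SHARED_MEMORY_USED_SIZE), 1) AS "Heap Share %",
       ROUND(100.0 * SHARED_MEMORY_USED_SIZE / (HEAP_MEMORY_USED_SIZE + SHARED_MEMORY_USED_SIZE), 1) AS "Shared Share %"
FROM M_SERVICE_MEMORY
WHERE HEAP_MEMORY_USED_SIZE + SHARED_MEMORY_USED_SIZE > 0
ORDER BY HEAP_MEMORY_USED_SIZE + SHARED_MEMORY_USED_SIZE DESC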
3. Trace file approach
The following sections of those trace files are usually relevant:

[IPMM_MEMORY]
The first part of this section lists the local (heap) memory currently used by the processes that make up the HANA database:
[0] PID=34884, SId=42309944, compactors active, alive, process name: hdbnameserver AB=2220466176b (2.06gb), UA=0b, U=2015851859b (1.87gb), FSL=0b, PAL=487738426982b (454.24gb), TPAL=2220466176b (2.06gb), W=100
[1] PID=35049, SId=42310545, compactors active, alive, process name: hdbpreprocessor AB=365772800b (348.82mb), UA=0b, U=362430594b (345.64mb), FSL=0b, PAL=487738426982b (454.24gb), TPAL=365772800b (348.82mb), W=100
[3] PID=35094, SId=42310846, compactors active, alive, process name: hdbstatisticsse AB=17623138051b (16.41gb), UA=0b, U=14624613181b (13.62gb), FSL=268435456b (256mb), PAL=27096579276b (25.23gb), TPAL=17757355779b (16.53gb), W=100
[4] PID=35114, SId=42310947, compactors active, alive, process name: hdbxsengine AB=2270855168b (2.11gb), UA=0b, U=2136436039b (1.98gb), FSL=0b, PAL=487738426982b (454.24gb), TPAL=2270855168b (2.11gb), W=100
[5] PID=33976, SId=171197412, compactors active, alive, process name: hdbindexserver AB=240495694077b (223.97gb), UA=0b, U=260528715346b (242.63gb), FSL=0b, PAL=487738426982b (454.24gb), TPAL=270062153728b (251.51gb), W=100

For each process, the process allocation limit (PAL), the amount of memory currently allocated (AB) and the amount of memory currently used (U) are displayed. An AB value that is significantly larger than the U value is not necessarily a problem: if necessary, allocated but unused memory will be released. Of particular interest are lines where AB, U and PAL have approximately the same value; that process has then reached its allocation limit. In this case, further analysis has to be done based on the output of M_HEAP_MEMORY in section [PROCESS_INFO].
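While the system is still running, a comparable check can be approximated via SQL instead of a trace file. The following sketch assumes that M_SERVICE_MEMORY exposes HEAP_MEMORY_ALLOCATED_SIZE, HEAP_MEMORY_USED_SIZE and ALLOCATION_LIMIT in the revision at hand; these columns roughly correspond to AB, U and PAL:

SELECT HOST, PORT, SERVICE_NAME,
       ROUND(HEAP_MEMORY_ALLOCATED_SIZE/1024/1024/1024, 2) AS "Heap Allocated GB",   -- ~ AB
       ROUND(HEAP_MEMORY_USED_SIZE/1024/1024/1024, 2) AS "Heap Used GB",             -- ~ U
       ROUND(ALLOCATION_LIMIT/1024/1024/1024, 2) AS "Allocation Limit GB"            -- ~ PAL (assumption: column available)
FROM M_SERVICE_MEMORY
ORDER BY HEAP_MEMORY_ALLOCATED_SIZE DESC

Services where the three values are close to each other are the candidates for the heap analysis in section 5.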
The second part of section [IPMM_MEMORY] contains information regarding the shared memory:
GLOBAL_MAX_ALLOCATION_LIMIT=487738426983b (454.24gb), cached sum of allocation limits=487604209255b (454.11gb)
#checks for cached allocation limit sum=838859, #terminated processes since last reset=7
#processes=5, sum of temp process allocation limits=292676603651b (272.57gb), cached sum=487604209255b (454.11gb), weight sum=500, cached weight sum=500
SHARED_MEMORY=194927605604b (181.54gb), temp shared memory allocation limit=194927605604b (181.54gb)
IPMM reservation=0b, emergency reservation=134217728b (128mb)
112 early exits since creation of IPMM, provide-memory-counter=50279
Provide memory 50267 failed, only 109174531b were free.
Users respected in shm calculation: [302]

Shared memory is used for various purposes. For practical purposes, it is sufficient to assume that mainly row store tables occupy shared memory. The 'early exits since' output in the oom trace indicates that the error was triggered by the inability to further increase the TPAL (temporary process allocation limit) of the failing process. The allocation limit of a process is not set to the maximum right at process start, but is increased (and decreased) over time. In this specific case, the 'sum of temp process allocation limits' plus SHARED_MEMORY exceeds the GLOBAL_MAX_ALLOCATION_LIMIT, which consequently causes the error.
[MEMORY_OOM]
Before any detailed analysis is done, the information from section [MEMORY_OOM] has to be reviewed as well. It might be the case that the request was so large that it was impossible to fulfill it anyway:
[MEMORY_OOM] Information about current out of memory situation: OUT OF MEMORY occurred. Failed to allocate 2405843009873693952 byte.

A known issue regarding such large requests is solved with revision 50. In case this occurs in newer revisions, please contact SAP Support.
The following checks can then be done based on the information available so far:
4. Shared Memory Usage
Contrary to heap memory, which is allocated using 'malloc', shared memory is provided using the 'shmget' system call. The resulting shared memory segments can then be viewed on OS level using the ipcs command. The following approach can be used to display the shared memory of one particular HANA process on the operating system side:
- Get the process pid: ps -ef | grep <HANA process>
- ipcs -p | grep <pid> then displays all segments that were created by this particular process:
ipcs -p | grep 4221
86999065 hanadm 4221 4221
87064602 hanadm 4221 4221
87130139 hanadm 4221 4221
87195676 hanadm 4221 4221
87261213 hanadm 4221 4221
87359519 hanadm 4221 4221
88309819 hanadm 4221 4221
88375356 hanadm 4221 4221
88440894 hanadm 4221 4221
- The size of a particular segment can then be further examined using the command ipcs -m -i <id>
ipcs -m -i 86999065
Shared memory Segment shmid=86999065
uid=60000 gid=100 cuid=60000 cgid=100
mode=01600 access_perms=0600
bytes=8929752 lpid=4221 cpid=4221 nattch=1
att_time=Tue May 14 14:09:13 2013
det_time=Tue May 14 14:09:13 2013
change_time=Tue May 14 14:09:13 2013

The sum of all those shared memory segments is then equivalent to the output of the statement:
SELECT SHARED_MEMORY_ALLOCATED_SIZE FROM M_SERVICE_MEMORY WHERE PROCESS_ID = '4221'
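Instead of summing individual segments on OS level, the shared memory usage can also be broken down from within the database. This is a sketch only; it assumes that the monitoring view M_SHARED_MEMORY with the columns CATEGORY, ALLOCATED_SIZE and USED_SIZE is available in the revision at hand:

SELECT HOST, PORT, CATEGORY,
       ROUND(ALLOCATED_SIZE/1024/1024/1024, 2) AS "Allocated GB",
       ROUND(USED_SIZE/1024/1024/1024, 2) AS "Used GB"
FROM SYS.M_SHARED_MEMORY      -- assumption: view and columns available in this revision
ORDER BY ALLOCATED_SIZE DESC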
4.1 Memory usage of the row store
- The row store is organized in 64 MB segments. Deleting a large number of rows can lead to sparse segments and unnecessary use of memory. SAP note 1813245 outlines the problem and strategies to solve it.
- Indirectly, high memory usage of the row store can cause problems when the parameter client_distribution_mode is set to 'off' in distributed environments. With the default value 'statement', a statement is sent to the node where the table is located. With the setting 'off', the statement might also be directed to the master node. Since all row store tables are usually located on the master node, and the row store consequently already uses a lot of memory there, materializing a large amount of data on the master node (from a table that is actually located on another node) can simply be too much. Therefore, in case of problems with resolving internal and external hostnames, the parameter client_distribution_mode should not be turned off; instead, the recommendations from SAP note 1780950 are to be followed.
- Too many large tables were created as row store tables. The largest tables currently contained in the row store can be retrieved by the following SQL Statement:
SELECT TOP 50 * FROM M_RS_TABLES ORDER BY (ALLOCATED_FIXED_PART_SIZE + ALLOCATED_VARIABLE_PART_SIZE) DESC
It is, however, important to keep in mind that in many scenarios a table must not simply be switched from row store to column store and vice versa. This applies to Suite on HANA / BW systems, which are delivered with particular settings for the different table types. In any case, it makes sense to check whether the system currently adheres to the recommended configuration. SAP note 1659383 contains a list of tables that are supposed to be in the row store, and a check based on this list is part of report RSDU_TABLE_CONSISTENCY. This report also offers a repair option when a deviation is detected.
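To get an overview of how much memory the row store occupies in total per host, the sizes from M_RS_TABLES can be aggregated. This sketch reuses only the columns from the statement above; the result can be compared with the shared memory figures from section 4:

SELECT HOST,
       ROUND(SUM(ALLOCATED_FIXED_PART_SIZE + ALLOCATED_VARIABLE_PART_SIZE)/1024/1024/1024, 2) AS "Row Store Allocated GB"
FROM M_RS_TABLES
GROUP BY HOST
ORDER BY 2 DESC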
5. Heap memory usage
If it can be ruled out that the row store is responsible for the high memory consumption, the heap memory allocation of the individual processes has to be analyzed next. In most cases, the process of interest is the indexserver, since it usually allocates the greatest share of the available memory. The following possibilities exist to do this:
- If an oom error has already occurred or an rte dump was explicitly triggered, the content of the system view M_HEAP_MEMORY (section [PROCESS_INFO]) that was written into the trace file can be used.
- If an ad-hoc analysis of the current heap memory consumption is required, the view M_HEAP_MEMORY can be queried directly, for example using HANA Studio.
SELECT TOP 15 MS.HOST, MS.SERVICE_NAME, MH.CATEGORY, MH.INCLUSIVE_SIZE_IN_USE, MH.EXCLUSIVE_SIZE_IN_USE FROM M_HEAP_MEMORY MH, M_SERVICES MS WHERE MH.PORT = MS.PORT AND MH.HOST = MS.HOST AND MS.SERVICE_NAME = 'indexserver' ORDER BY 4 DESC

Of interest are those pools that are at the top of the list and have an inclusive size close to the exclusive size. Depending on the pool names, the following known issues exist:
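The criterion 'inclusive size close to exclusive size' can also be put into the query itself. The following sketch uses the same join as above; the 90% threshold is an arbitrary example value, not an official recommendation:

SELECT TOP 15 MS.HOST, MS.SERVICE_NAME, MH.CATEGORY, MH.INCLUSIVE_SIZE_IN_USE, MH.EXCLUSIVE_SIZE_IN_USE
FROM M_HEAP_MEMORY MH, M_SERVICES MS
WHERE MH.PORT = MS.PORT
  AND MH.HOST = MS.HOST
  AND MS.SERVICE_NAME = 'indexserver'
  AND MH.EXCLUSIVE_SIZE_IN_USE >= 0.9 * MH.INCLUSIVE_SIZE_IN_USE   -- 90% threshold: example value
ORDER BY MH.INCLUSIVE_SIZE_IN_USE DESC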
Pool/parallel/ihm
There are cases where this allocator used 90% of the available memory. If such a problem is observed, the system should be upgraded to revision 54 when available.
Pool/itab
This allocator holds the intermediate results of join operations and the translation tables used for joins. High memory usage here can indicate the use of a suboptimal model.
Pool/FemsCompression/CompositeFemsCompression
A known issue with high memory usage of this allocator is solved with revision 53. FEMS compression is used in BW systems to keep the size of the dataset that is transferred to the application server as small as possible.
Pool/ValueArray
This is the allocator which keeps the complete result set in uncompressed form. A known issue, solved with revision 55, caused memory used by this allocator not to be released after oom situations, so that it steadily increased over time. The details are documented in SAP note 1852425.
Pool/JoinEvaluator/TranslationTable
Translation tables are created for caching join column entries of join engine queries. The number of translation tables that are kept in memory is determined by the following parameter in indexserver.ini:

[joins]
translator_cache_size = n

The default value is 2000. In order to verify how many translation tables currently exist in the system, the following statement can be used:
select * from sys.M_DEV_JOIN_TRANSLATION_TABLES_ order by TRANSLATION_TABLE_MEMORY_SIZE desc
This statement sums up the memory usage of those tables:
select host, port, transtable_memory from sys.m_dev_joinengine_
In case of excessive memory usage for this allocator, the parameter can be set to 500 as a starting point. Lowering the value can have an impact on query performance; however, in live systems where this was done, no negative side effects have been observed so far.
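The parameter change itself can be done in HANA Studio or via SQL. A sketch, assuming the change should be made on the 'SYSTEM' layer and that 500 is the intended starting value:

ALTER SYSTEM ALTER CONFIGURATION ('indexserver.ini', 'SYSTEM')
SET ('joins', 'translator_cache_size') = '500'   -- 500 = starting point mentioned above
WITH RECONFIGURE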
Pool/malloc/libhdbcsapi.so
A memory leak was introduced with revision 54 that can lead to huge growth of this allocator when BW queries containing lots of OR terms instead of IN lists are executed. A possible workaround on the application side is the implementation of SAP note 1786777. On the HANA side, the problem is solved starting with revision 60 (HANA SP6). See also SAP note 1880274.
6. Memory Usage of the Statisticsserver
Even though the statisticsserver can also run into oom errors, it should not consume a large share of the available memory; it is supposed to have a much smaller allocation limit than the indexserver. The following two configuration parameters are relevant:

statisticsserver.ini -> [memorymanager]
allocationlimit
minallocationlimit

The allocation limit of the statisticsserver is either a percentage of the available RAM or the value of minallocationlimit, whichever is bigger. In case of oom errors indicating that the statisticsserver has reached its process allocation limit, it can make sense to increase the allocationlimit to 10% or 15% to have at least a workaround for the time being. However, further analysis regarding the heap memory usage has to be conducted (see above). In the oom trace file, such a situation would look like this:
[3] PID=23840, SId=320543005, compactors active, alive, process name: hdbstatisticsse AB=6755876864b (6.29gb), U=7135415505b (6.64gb), FSL=0b, PAL=6772669235b (6.30gb), TPAL=6772669235b (6.30gb), W=100

AB, PAL and TPAL show approximately the same values. Also in this case, further investigation of the involved pools is necessary, either using the information contained in the trace files or by using the following SQL statement:
SELECT TOP 15 MS.HOST, MS.SERVICE_NAME,MH.CATEGORY, MH.INCLUSIVE_SIZE_IN_USE, MH.EXCLUSIVE_SIZE_IN_USE FROM M_HEAP_MEMORY MH, M_SERVICES MS WHERE MH.PORT = MS.PORT AND MH.HOST = MS.HOST AND MS.SERVICE_NAME = 'statisticsserver' ORDER BY 4 DESC
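If the allocationlimit is to be raised as a temporary workaround, this can likewise be done via a configuration change. A sketch, assuming the value is maintained as a percentage of physical RAM (as described above) and that 15 is an acceptable interim value for the system at hand:

ALTER SYSTEM ALTER CONFIGURATION ('statisticsserver.ini', 'SYSTEM')
SET ('memorymanager', 'allocationlimit') = '15'   -- assumption: value interpreted as percentage of RAM
WITH RECONFIGURE

The heap analysis of the statisticsserver pools should still be carried out afterwards, as described above.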