Tuesday, February 2, 2016

1858357 - HANA DB: Analysis of wait situations

 Symptom
Lock situations of a HANA database system are characterized by the following symptoms:
  • The HANA Studio Alert monitor shows alerts 49 ("Check blocked transactions") or 59 ("Percentage of transactions blocked").
  • The indexserver trace contains one or more of the following errors "Lock timeout occurs while waiting TABLE_LOCK/RECORD_LOCK of mode EXCLUSIVE", "long running uncommitted write transaction detected", or "There are too many un-collected versions. The transaction blocks the garbage collection of HANA database.", "Deadlock detected while executing transaction".
  • Execution of specific SQL statements or system task seem to be stuck and do not process as expected.

Other Terms
deadlock lock write transaction resource long running uncommitted monitor alert blocked garbage collection mvcc_anti_ager

Reason and Prerequisites
  • Transactional Database Locks block write transactions because they cannot acquire the transactional database locks that are held by other write transactions (typically locks of type "exclusive" or "intentionally exclusive"). The HANA database system has automatic mechanisms to detect long running write transactions and deadlock situations and take action to cancel user sessions holding a transactional lock for too long (see the indexserver parameter lock_wait_timeout which defaults to 1800000ms) and resolve the deadlock situations after a predetermined length of time has elapsed (mvcc_anti_ager).
  • Internal locks are used by the database engine and the operating system to control access to resources via concurrent threads. The implementation of transactional database locks and other database functions utilizes internal locks. When threads try to acquire internal locks being held by other threads, lock contention occurs. Increasing lock contention may deteriorate the performance of the database system.

Solution
For the purpose of this note, we will cover 2 locking situations; the first being the case where a lock is presently occurring (a "Present Lock Situation"), and the second being the case where a lock has occurred historically (a "Past Lock Situation").
Present Lock Situations can be analyzed using HANA Studio monitors. The monitors (described later in this document) will help the administrator to find the root cause of the lock and take the appropriate action.
Past Lock Situations are best analyzed when a time frame for performance degradation is known. For deeper analysis of either case, diagnosis information can be sent to SAP Customer Support.
    1. Present lock situations
  • Check the Alerts Monitor in HANA Studio for alerts indicating lock situations, which may serve as a starting point for the following investigations:
  • Check for blocked transactions ("Performance -> Load", "Performance -> Blocked Transactions", "Performance -> Threads") to determine the number of blocked transactions.
                       When transactions are blocked, identify the session that blocks other transactions from ("Performance -> Sessions")or ("Performance -> Blocked Transactions").

                       To resolve the blocking situation, the following actions can be taken.
                       1. Check the application logic of the currently blocking session to see if the application logic can be changed to avoid the blocking situation.
                       2. Consider changing the schedule or parameterization of the application modifying those database objects and thereby blocking those transactions.
                       3. You can terminate the blocking session manually if you need to resolve the situation immediately.
                       4. When further analysis is needed to find out the root cause of the blocking transaction, follow the instructions in later part of this SAP Note and collect diagnosis information and send the information to SAP Support.
  • The following "System Information" views provide additional information on blocked transactions, locks and sessions: "Open Transactions", "Blocked Transactions", "Table locks", "Record Locks", "Sessions".
  • A possible root cause for a temporary increase of internal locks are contentions on system resources. background processes.
  • A possible root cause for a temporary increase of internal locks are contentions on those system resources.
                       To determine temporal coincidences of deteriorated performance with increasing lock contention caused by background jobs, use the HANA Studio monitors "Performance -> Load", and "Performance -> Job Progress" and the "System Information" views to check the status of "(Delta) Merge Statistics", "Failed Backups", "Backup catalog" "MVCC blocker transaction", "MVCC blocker connection".
                       To find the times when savepoints have been written, submit a query on the system view "SYS"."M_SAVEPOINTS". The view SYS"."M_CS_UNLOADS" logs phases of intensive column data unloads from memory.
  • To find wait situations at a thread level, open the HANA Studio monitor "Performance -> Threads", select "Create call stacks", and repeatedly refresh the current page every few minutes. Check which threads do not change their call stacks even after many repetitions.
  • Unchanged call stacks can have many other root causes than lock contention, e.g., long running queries, open sessions waiting for a user action, or background processes.
                       To find out the root cause of a long running query, the following checks can be performed.
  • Number of row store version is over 1,000,000:
                       1. Check number of row store versions using the following query:
                       select * from m_mvcc_tables
                       If the number of versions is over 1 million, it might affect overall system performance. Therefore, we need to find out which transaction blocks garbage collection, possibly by long-running or unclosed cursors, long-running serializable/repeatable read isolation mode transactions, or hanging threads.
                       2. From "System Information" views, identify the transaction and connection information that is blocking the garbage collection by checking "MVCC blocker transaction" and "MVCC blocker connection".
                       3. Identify Query String of the problemaic cusror or query using the following query:
                       select * from m_prepared_statements where statement_id = (select current_statement_id from m_connections where connection_id = (select connection_id from m_transactions where min_mvcc_snapshot_timestamp = (select min(Value) from m_mvcc_tables where name = 'MIN_SNAPSHOT_TS') and connection_id > 0))
                       4. Get more information on the session context:
                       select * from m_session_context where connection_id = (select connection_id from m_transactions where min_mvcc_snapshot_timestamp = (select min(Value) from m_mvcc_tables where name = 'MIN_SNAPSHOT_TS') and connection_id > 0)
                       You can get more detailed information on the session such as application program, application user from this query.
                       5. Analyze the query why it takes long
                       Check application and solve the problem. For example, application program has to be changed if there is any unclosed cursor or uncommitted transaction.
                       6. If a persistent sequence is used with "NO CACHE" option, then please check SAP Note 1977214 SAP HANA: Growing number of rowstore table versions when a persistent sequence is used.
                       7. If the problem is not resolved, then follow the instructions in later part of this SAP Note and collect diagnosis information and send the information to SAP Support.
  • Number of row store version is less than 1000000, but still a performance is not good and need further investigation, then follow the instructions in later part of this SAP Note and collect diagnosis information and send the information to SAP Support.
  • When further analysis is needed to find out the root cause of the wait situation at a thread level, follow the instructions in later part of this SAP Note and collect diagnosis information and send the information to SAP Support.
    2. Past lock situations
  • Identify the exact time frame when the performance temporarily decreased.
  • For example, by using the "Performance -> Load" monitor, status of resource consumption such as CPU or memory usage or SQL workloads such as number of active connections, statements, blocked transactions, versions, active read / write requests, column unload status can be checked. HANA trace files also can be used to identify the time frame of problematic situation by checking alerts or warning / error messages.
                       The time frame is a valuable filter for further analysis.
                       When HANA database is not restarted after the performance downgrade, the "System Information" views can be checked further. If the performance issue is resolved without a database restart, then it indicates that the database was not in a hang situation, but the database performance was temporarily decreased.
                       If HANA database has been restarted after a hang situation, the history of database status can be found in "_SYS_STATISTICS" schema:
                       "HOST_LONG_RUNNING_STATEMENTS","HOST_LONG_RUNNING_SERIALIZABLE_TRANSACTION","HOST_LONG_IDLE_CURSOR","HOST_LONG_RUNNING_UNCOM MITTED_WRITE_TRANSACTION","HOST_MEMORY_STATISTICS","HOST_RESOURCE_UTILIZATION_STATISTICS","HOST_SAVEPOINTS", "HOST_VOLUME_IO_PERFORMANCE_STATISTICS"
  • To check if a savepoint was written during a pre-defined time period, submit a query on the view "SYS"."M_SAVEPOINTS" and specify a suitable time window for the corresponding attribute(s). A longer history can be found in "_SYS_STATISTICS"."HOST_SAVEPOINTS". Similarly, the view "SYS"."M_CS_UNLOADS" saves time information when column data is unloaded from memory.
  • To search for lock messages in trace files, open the HANA DB Studio "Diagnosis Files" view and primarily examine the indexserver alert trace files. In a scale-out environment, there is one such file for each host. Search for the strings "lock", "wait", and "transaction". Check if the messages refer to lock situations and the timestamps match the period of performance deterioration.
    3. Collect Diagnosis Information for SAP Support
                       To collect Diagnosis Information ("runtimedump") including configurations and trace files, follow the SAP Notes 1837439 "Activating the Emergency Support Package for DB support" and 1732157 "SAP HANA: Howto get system dump". Send the system dump archive to SAP Customer Support who will provide you an SAPMATS container link for the upload. Small dump archives can be directly attached to OSS messages.
                       This collection of Diagnosis information is helpful for both present and past lock situations.
                       If the HANA DB System currently shows a serious hang situation, repeat the collection of Diagnosis files again after a few minutes and additionally send the second dump archive to SAP Customer Support.


Header Data
Released On 13.02.2014 17:54:06
Release Status Released for Customer
Component HAN-DB SAP HANA Database
Priority Recommendations / Additional Info
Category Consulting

No comments:

Post a Comment