Monday, February 1, 2016

1977220 - How to handle HANA Alert 54: Savepoint Duration

Symptom
You receive alerts on Savepoint Duration (Alert ID 54).
There may be overall performance issues in the HANA system with the response time of dialog queries and runtime of jobs, long running backups, temporarily a high percentage of blocked transactions (Alert ID 39) or general performance issues.



Cause
The duration of savepoints is exceeding the configured thresholds due to performance issues in your SAP HANA System. The default configuration for this alert is as follows:
Severity Low Medium High
Threshold 5 min 10 min 15 min


Resolution

Procedure

  1. Check the Alert Status with regard to performance alerts indicating CPU or IO related issues.
  2. Get details on the savepoint duration using the SQL statement HANA_IO_Savepoints_Rev71+ (HANA_IO_Savepoints) from KBA 1969700 - SQL Statement Collection for SAP HANA.
  3. Analyze IO Issues.
  4. Analyze CPU usage related issues.
  5. Perform further technical performance analysis.

Path

  1. SAP HANA Studio - SAP HANA Administration Console - Alerts.
  2. SAP Solution Manager - Transaction SM_WORKCENTER - Technical Monitoring Workcenter - System Monitoring.
  3. SAP HANA Studio  - SAP HANA Administration Console - SQL Console.

How-to

1. Check Alert Status with regard to performance alerts indicating CPU or IO related issues

Check in particular for the following performance alerts:
Layer Alert ID Alert Name Issue
SAP HANA 5 Host CPU Usage high CPU usage
60 Sync/async read ratio IO issue (storage, network)
61 Sync/async write ratio IO issue (storage, network)
If you operate your system with SAP Solution Manager, then check for the following alerts:
Layer Alert Name Issue
SAP HANA Savepoint critical phase duration If red then the reported alert on savepoint duration is critical
Host Disk I/O - Average Wait time per disk IO issue (storage)
CPU - CPU Utilization high CPU usage
CPU - I/O Wait CPU I/O performance

2.  Get details on savepoint duration using the SQL statement HANA_IO_Savepoints_Rev71+ (HANA_IO_Savepoints)
     from KBA 1969700 - SQL Statement Collection for SAP HANA

Execute the SQL Statement HANA_IO_Savepoints_Rev71+ (HANA_IO_Savepoints when your HANA revision is lower than 71) in the SQL Console of SAP HANA Studio.
SELECT
  START_TIME,
  CRIT_PHASE_START_TIME,
  HOST,
  PORT,
  SAVEPOINTS,
  SUM_DURATION_S,
  MAX_DURATION_S,
  AVG_DURATION_S,
  MAX_CRITICAL_PHASE_S,
  AVG_CRITICAL_PHASE_S,
  SUM_SIZE_MB,
  AVG_SIZE_MB,
  RS_SIZE_PCT
FROM
( SELECT
    CASE WHEN NUM_SAVEPOINTS = 1 THEN CRIT_PHASE_START_TIME ELSE 'various' END CRIT_PHASE_START_TIME,
    START_TIME,
    HOST,
    LPAD(PORT, 5) PORT,
    LPAD(NUM_SAVEPOINTS, 10) SAVEPOINTS,
    LPAD(TO_DECIMAL(SUM_DURATION_S, 12, 2), 14) SUM_DURATION_S,
    LPAD(TO_DECIMAL(MAX_DURATION_S, 12, 2), 14) MAX_DURATION_S,
    LPAD(TO_DECIMAL(AVG_DURATION_S, 12, 2), 14) AVG_DURATION_S,
    LPAD(TO_DECIMAL(MAX_CRITICAL_PHASE_S, 12, 2), 20) MAX_CRITICAL_PHASE_S,
    LPAD(TO_DECIMAL(AVG_CRITICAL_PHASE_S, 12, 2), 20) AVG_CRITICAL_PHASE_S,
    LPAD(ROUND(SUM_SIZE_MB), 11) SUM_SIZE_MB,
    LPAD(ROUND(AVG_SIZE_MB), 11) AVG_SIZE_MB,
    LPAD(TO_DECIMAL(RS_SIZE_PCT, 12, 2), 11) RS_SIZE_PCT
  FROM
  ( SELECT
      CASE WHEN BI.AGGREGATE_BY = 'NONE' OR INSTR(BI.AGGREGATE_BY, 'TIME') != 0 THEN TO_CHAR(SP.START_TIME, BI.TIME_AGGREGATE_BY) ELSE 'any'                             END START_TIME,
      CASE WHEN BI.AGGREGATE_BY = 'NONE' OR INSTR(BI.AGGREGATE_BY, 'HOST') != 0 THEN SP.HOST                                      ELSE MAP(BI.HOST, '%', 'any', BI.HOST) END HOST,
      CASE WHEN BI.AGGREGATE_BY = 'NONE' OR INSTR(BI.AGGREGATE_BY, 'PORT') != 0 THEN TO_CHAR(SP.PORT)                             ELSE MAP(BI.PORT, '%', 'any', BI.PORT) END PORT,
      TO_CHAR(MIN(SP.CRITICAL_PHASE_START_TIME), 'YYYY/MM/DD HH24:MI:SS') CRIT_PHASE_START_TIME,
      COUNT(*) NUM_SAVEPOINTS,
      SUM(SP.DURATION) / 1000000 SUM_DURATION_S,
      MAX(SP.DURATION) / 1000000 MAX_DURATION_S,
      AVG(SP.DURATION) / 1000000 AVG_DURATION_S,
      MAX(SP.CRITICAL_PHASE_DURATION + SP.CRITICAL_PHASE_WAIT_TIME) / 1000000 MAX_CRITICAL_PHASE_S,
      AVG(SP.CRITICAL_PHASE_DURATION + SP.CRITICAL_PHASE_WAIT_TIME) / 1000000 AVG_CRITICAL_PHASE_S,
      SUM(SP.TOTAL_SIZE) / 1024 / 1024 SUM_SIZE_MB,
      AVG(SP.TOTAL_SIZE) / 1024 / 1024 AVG_SIZE_MB,
      AVG(MAP(SP.TOTAL_SIZE, 0, 0, SP.FLUSHED_ROWSTORE_SIZE / SP.TOTAL_SIZE * 100)) RS_SIZE_PCT
    FROM
    ( SELECT
        HOST,
        PORT,
        BEGIN_TIME,
        END_TIME,
        MIN_CRITICAL_PHASE_DURATION_S,
        MIN_SAVEPOINT_DURATION_S,
        DATA_SOURCE,
        AGGREGATE_BY,
        MAP(TIME_AGGREGATE_BY,
          'NONE',        'YYYY/MM/DD HH24:MI:SS:FF7',
          'HOUR',        'YYYY/MM/DD HH24',
          'DAY',         'YYYY/MM/DD (DY)',
          'HOUR_OF_DAY', 'HH24',
          TIME_AGGREGATE_BY ) TIME_AGGREGATE_BY
      FROM
      ( SELECT                                                         /* Modification section */
          '%' HOST,
          '%' PORT,
          TO_TIMESTAMP('1000/01/12 12:00:00', 'YYYY/MM/DD HH24:MI:SS') BEGIN_TIME,
          TO_TIMESTAMP('9999/01/13 12:00:00', 'YYYY/MM/DD HH24:MI:SS') END_TIME,
          -1 MIN_CRITICAL_PHASE_DURATION_S,
          -1 MIN_SAVEPOINT_DURATION_S,
          'HISTORY' DATA_SOURCE,                        /* CURRENT, HISTORY */
          'NONE' AGGREGATE_BY,         /* HOST, PORT, TIME or comma separated combinations, NONE for no aggregation */
          'NONE' TIME_AGGREGATE_BY     /* HOUR, DAY, HOUR_OF_DAY or database time pattern, NONE for no aggregation */
        FROM
          DUMMY
      )
    ) BI,
    ( SELECT
        'CURRENT' DATA_SOURCE,
        HOST,
        PORT,
        START_TIME,
        CRITICAL_PHASE_START_TIME,
        DURATION,
        CRITICAL_PHASE_DURATION,
        CRITICAL_PHASE_WAIT_TIME,
        TOTAL_SIZE,
        FLUSHED_ROWSTORE_SIZE
      FROM
        M_SAVEPOINTS
      UNION ALL
      SELECT
        'HISTORY' DATA_SOURCE,
        HOST,
        PORT,
        START_TIME,
        START_TIME CRITICAL_PHASE_START_TIME,          /* not available in history */
        DURATION,
        CRITICAL_PHASE_DURATION,
        0 CRITICAL_PHASE_WAIT_TIME,                    /* not available in history */
        TOTAL_SIZE,
        FLUSHED_ROWSTORE_SIZE
      FROM
        _SYS_STATISTICS.HOST_SAVEPOINTS
    ) SP
    WHERE
      SP.HOST LIKE BI.HOST AND 
      TO_CHAR(SP.PORT) LIKE BI.PORT AND  
      SP.START_TIME BETWEEN BI.BEGIN_TIME AND BI.END_TIME AND
      ( BI.MIN_CRITICAL_PHASE_DURATION_S = -1 OR
        (SP.CRITICAL_PHASE_DURATION) / 1000000 >= BI.MIN_CRITICAL_PHASE_DURATION_S ) AND
      ( BI.MIN_SAVEPOINT_DURATION_S = -1 OR
        (SP.DURATION) / 1000000 >= BI.MIN_SAVEPOINT_DURATION_S ) AND
      SP.DATA_SOURCE LIKE BI.DATA_SOURCE
    GROUP BY
      CASE WHEN BI.AGGREGATE_BY = 'NONE' OR INSTR(BI.AGGREGATE_BY, 'TIME') != 0 THEN TO_CHAR(SP.START_TIME, BI.TIME_AGGREGATE_BY) ELSE 'any'                             END,
      CASE WHEN BI.AGGREGATE_BY = 'NONE' OR INSTR(BI.AGGREGATE_BY, 'HOST') != 0 THEN SP.HOST                                      ELSE MAP(BI.HOST, '%', 'any', BI.HOST) END,
      CASE WHEN BI.AGGREGATE_BY = 'NONE' OR INSTR(BI.AGGREGATE_BY, 'PORT') != 0 THEN TO_CHAR(SP.PORT)                             ELSE MAP(BI.PORT, '%', 'any', BI.PORT) END
  )
)
ORDER BY
  START_TIME DESC,
  HOST,
  PORT
[OUTPUT PARAMETERS]
- START_TIME:            Start time of considered time interval
- CRIT_PHASE_START_TIME: Start time of (first) critical phase
- HOST:                  Host name
- PORT:                  Port
- SAVEPOINTS:            Number of savepoints
- SUM_DURATION_S:        Total savepoint duration (in s)
- MAX_DURATION_S:        Maximum savepoint duration (in s)
- AVG_DURATION_S:        Average savepoint duration (in s)
- MAX_CRITICAL_PHASE_S:  Maximum duration of critical path (in s)
- AVG_CRITICAL_PHASE_S:  Average duration of critical path (in s)
- SUM_SIZE_MB:           Total savepoint write size (in MB)
- AVG_SIZE_MB:           Average savepoint write size (in MB)
- RS_SIZE_PCT:           Flushed data related to rowstore compared to overall flushed data (in %)
[EXAMPLE OUTPUT]
IO_Savepoint.jpg

If critical savepoint phases repeatedly take more than 2 seconds than the issue should be handled with high priority.

3. Analyze IO Issues

If the alert status suggests IO related issues, then follow the analysis recommendations outlined in KBA 1999930 - FAQ: SAP HANA I/O Analysis.

4. Analyze CPU usage related issues

If the alert status suggests CPU usage related issues, then follow the analysis recommendations outlined in KBA 1909670 - How to Handle Alert ‘Check host CPU utilization’.

5. Perform further technical performance analysis

If further technical performance analysis from HANA side seems to be required, then create and proceed a Technical Performance Analysis Session from SAP Engagemenmt and Service Delivery workcenter in your SAP Solution Manager.
If further assistance is required create an SAP incident on component SV-BO-DB-HAN and attach the documentation of the performed analysis results.


See Also
The attachment of this KBA provides a more detailed document for analyzing savepoint duration. While the document follows the approach outlined in the KBA text, there are some minor deviations on detail level (e.g. use of alternative SQL-Statements for retrieving information from system views and statistics tables).
Further information is available in the SAP HANA Troubleshoouting and Performance Analysis Guide.


Keywords

HANA Alert
Savepoint
Operations Recommendation
#OpsRec-HANA

No comments:

Post a Comment