Tuesday, January 19, 2016

1999930 - FAQ: SAP HANA I/O Analysis

Symptom
You want to understand the impact of I/O operations on your SAP HANA database and determine potential bottlenecks.


Environment
SAP HANA


Cause
1. What is the definition of I/O in the context of this SAP Note?
2. Where do I find information about storage requirements in SAP HANA environments?
3. Why do I need to take care for I/O although SAP HANA is supposed to be an in-memory database?
4. Which indications exist that the SAP HANA database suffers from I/O accesses?
5. What can I check in order to understand if SAP HANA suffers from I/O?
6. What can I do if the SAP HANA related figures indicate bottlenecks in the I/O area?


Resolution

1. What is the definition of I/O in the context of this SAP Note?

I/O are input / output operations in various constellations, e.g. logical I/O when accessing memory, physical I/O when requesting blocks from the operating system respectively disk and network I/O when transferring data via network.
This SAP Note covers the area of disk I/O. This includes all read and write requests sent to the operating system - regardless if the actual data is physically read from disk or found in a low-level cache.

2. Where do I find information about storage requirements in SAP HANA environments?

Attached to SAP Note 1900823 you can find a storage whitepaper describing SAP HANA storage requirements in detail.

3. Why do I need to take care for I/O although SAP HANA is supposed to be an in-memory database?

It is correct that the majority of SAP HANA activities is purely performed in memory, but the following scenarios require I/O:
Scenario Details Impact of slow I/O
Row store load As part of the SAP HANA startup the row store is loaded from disk into memory. This involves a significant amount of read I/O. Increased SAP HANA startup times
Column store table load Also the tables located in the column store need to be loaded into memory before they are accessed. Unlike row store tables the loading doesn't take place during the startup. Instead it happens later on (e.g. during preloading or if the table is accessed the first time). As columns can be unloaded it is possible that the same columns are loaded into memory various times. Partially increased database request times
Hybrid LOBs Large values of hybrid LOB columns are not loaded into the column store. Instead they are always retrieved from disk (or from the SAP HANA internal page cache). See SAP Note 1994962 for more information. Reduced memory footprint, performance overhead when accessing the same data several times
Savepoints, snapshots During savepoints the information in memory is synchronized with the disks, so all modified information is flushed to disk involving write I/O to data files. Snapshots are savepoints that are kept for future use. INSERT / UPDATE / DELETE statements are blocked during critical savepoint phase
Commits When a commit is executed, all changes performed by the transaction are written to the logs involving write I/O to the log area. Increased commit times
Backups Backups copy data and log information in order to be able to restore the database in case of problems. Long running backups
If I/O is overloaded due to backup activities the overall system performance can suffer
Persistence garbage collection Persistence garbage collection cleans up no longer required relics on persistence level after a transactions are finished. See SAP Note 2169283 for more information related to garbage collection. SAP HANA internal locking and performance regressions if persistence garbage collection isn't able to keep of with the transaction load.
If it takes unnecessarily long to perform read or write I/O, the SAP HANA performance can be impacted.

4. Which indications exist that the SAP HANA database suffers from I/O accesses?

The following SAP HANA alerts indicate problems in the I/O area:
Alert Name SAP Note  Description
28 Most recent savepoint operation 1977291 Determines how long ago the last savepoint was defined, that is, how long ago a complete, consistent image of the database was persisted to disk.
34 Unavailable volumes 1900682 Determines whether or not all volumes are available
53 Pagedump files 1977242 Identifies new pagedump files that have been generated in the trace directory of the system.
54 Savepoint duration 1977220 Identifies long-running savepoint operations
60 Sync/Async read ratio Identifies a bad trigger asynchronous read ratio. This means that asynchronous reads are blocking and behave almost like synchronous reads. This might have negative impact on SAP HANA I/O performance in certain scenarios.
61 Sync/Async write ratio Identifies a bad trigger asynchronous write ratio. This means that asynchronous writes are blocking and behave almost like synchronous writes. This might have negative impact on SAP HANA I/O performance in certain scenarios.
SQL: "HANA_Configuration_MiniChecks" (SAP Notes 1969700, 1999993) returns a potentially critical issue (C = 'X') for one of the following individual checks:
Check ID Details
310 I/O read throughput data min. (MB/s, last day)
311 I/O read throughput data avg. (MB/s, last day)
315 I/O write throughput data min. (MB/s, last day)
316 I/O write throughput data avg. (MB/s, last day)
320 I/O read throughput log min. (MB/s, last day)
321 I/O read throughput log avg. (MB/s, last day)
325 I/O write throughput log min. (MB/s, last day)
326 I/O write throughput log avg. (MB/s, last day)
330 Max. trigger read ratio (data)
331 Max. trigger write ratio (data, log)
340 Log switch wait count ratio (%)
341 Log switch race count ratio (%)
350 Blocking savepoint phases > 10 s (last day)
351 Blocking savepoint phase avg. (s, last day)
352 Blocking savepoint phase max. (s, last day)
355 Time since last savepoint (s)
357 Savepoint write throughput (MB/s)
358 Savepoints taking longer than 900 s (last day)
360 Number of failed I/O reads
361 Number of failed I/O writes
370 Unused space in data files (%)
652 Number of page dumps (last day)
915 Min. data backup throughput (GB/h, last week)
916 Avg. data backup throughput (GB/h, last week)
1160 Average COMMIT time (ms)
More threads than usual have call stacks including:
__close_nocancel
DataRecovery::LoggerImpl::writeLog
disk_space
__endmntent_internal+0xa
FileAccess::DirectoryEntry::DirectoryEntry
FileAccess::DirectoryEntry::findFirst
FileAccess::DirectoryEntry::findNext
FileAccess::DirectoryIterator::operator++
FileAccess::fileSize
FileAccess::FileSystem::init
FileAccess::getFileSystem
__fopen_internal
__getdents
__GI__IO_file_open
__GI___lxstat64
__GI___statfs
_IO_new_fclose
_IO_new_file_close_it
_IO_new_file_fopen
_IO_setb_internal
__opendir
__open_nocancel
__readdir_r
__realpath
System::UX::endmntent
System::UX::opendir
System::UX::readdir_r
System::UX::stat
An increased number of SAP HANA threads with one of the following details exist (SAP Note 2114710):
Thread State Lock Name
Wait IO
Semaphore Wait LogBufferFreeWait
Semaphore Wait LoggerBufferSwitch
Semaphore Wait POSTCOMMIT_FINISH_SMP
Semaphore Wait PrefetchCallback
Semaphore Wait PrefetchIteratorCallback

5. What can I check in order to understand if SAP HANA suffers from I/O?

You can perform the following detailed checks to understand if and when there are indications for problems in the I/O area:
Action  SQL statement (SAP Note 1969700) Details
Check duration of critical savepoint phases SQL: "HANA_IO_Savepoints" The critical savepoint phase is particularly critical for the application because during that time all change operations (like INSERT, UPDATE, DELETE) are blocked. Long durations of the critical savepoint phase are often linked to I/O writes. If critical savepoint phases repeatedly take more than 2 seconds, and I/O issue might be the root cause.
See SAP Note 2100009 for more information regarding savepoints.
Check for long running savepoints SQL: "HANA_IO_Savepoints" Savepoints may run for a long time for several reasons, e.g. high write volume, I/O bottlenecks or a high prepare flush retry count. See SAP Note 2100009 for more information.
Check trigger ratios SQL: "HANA_IO_KeyFigures_Total" Increased trigger ratios can indicate synchronous rather than asynchronous I/O requests. This constellation can increase the runtime of critical I/O activities - e.g. the critical savepoint phases - and should therefore be avoided. See SAP Note 1930979 for more information.
Check current total I/O information  SQL: "HANA_IO_KeyFigures_Total" The total I/O information (collected since last SAP HANA restart) provides a first global perspective on I/O activities including average times and read and write throughputs. You can check if these values meet your expectations and if you see differences between different SAP HANA systems.
Be aware that the values also depend on the usual request sizes. If the average request size is higher, the average request times are typically higher, but at the same time the throughputs are better.
Check historic total I/O information SQL: "HANA_IO_KeyFigures_Total_History" The historic total I/O information provides further insight in the I/O activities over time. It can help you identifying times with degraded I/O times or with a particularly high I/O volume.
Check current detailed I/O information SQL: "HANA_IO_KeyFigures_Details" The detailed I/O information (collected since last SAP HANA restart) breaks the total values up into different I/O buffer size classes. In this way you can understand better how small, medium and large I/O requests behave. Additionally it is possible to distinguish between synchronous and asynchronous I/O.
Check historic detailed I/O information  SQL: "HANA_IO_KeyFigures_Details_History" The historic detailed I/O information provides the most granular I/O information, because it allows to differentiate by historic times, I/O buffer size classes and synchronous / asynchronous I/O. It can help to pinpoint temporary I/O performance regressions, e.g. due to increased load or external factors.
Check for tables with most I/O  SQL: "HANA_Tables_IOStatistics" This command can be used if a particularly high amount of I/O is caused by specific tables. If certain tables are identified, you can check from application side if the amount of changes can be reduced (e.g. delta load rather than full load, reduced tracing or logging).
Check for backups with a low throughput  SQL: "HANA_Backups_BackupRuns" The column MB_PER_S in the output of this command provides information about the backup throughput. Values below 200 GB / h can indicate bottlenecks in the I/O area.
Check active threads  SQL: "HANA_Threads_CurrentThreads" Increased activities of the following special threads can point to I/O problems in certain areas (see SAP Note 2114710).
Thread type / method I/O area
Generic (Search for trace files to compress)
LoadDataWriter
Trace volume
JobWorker (Backup::BackupExe_Job) Backup volumes
Request (remotediskinfo) All volumes
Savepoint Data volume

Check call stacks  SQL: "HANA_Threads_Callstacks" If an unusual high amount of threads is active in call stacks mentioned in "Which indications exist that the SAP HANA database suffers from I/O accesses?" above, I/O problems can be responsible for it.

6. What can I do if the SAP HANA related figures indicate bottlenecks in the I/O area?

If you found indications on SAP HANA side that disk I/O is slower than expected, you have to perform a more detailed analysis on lower layers of the I/O stack:
  • Operating system
  • File system
  • I/O sub system
  • Network between SAP HANA instances and I/O sub system
As a starting point you should make sure that the general SAP HANA recommendations for configuring these lower layers are in place. See SAP Note 2000003 ("How can the configuration and performance of the SAP HANA hardware, firmware and operating system be checked?") for more information.
SAP Note 1943937 provides a hardware check tool (hwcct) that can be used to check if the measured performance is in line with the expectations. SAP Note 2212741 contains some hwcct troubleshooting tips. 1661146 provides a tool to check SAP HANA appliances on Lenovo / IBM.
Be aware that the I/O performance rules of thumb don’t apply for X5 / Westmere based hardware (SAP Note 2187426) and worse values can be accepted if the business needs are met.
If the problem is not resolved with all recommended settings in place, you should get in touch with your teams and partners from the operating system, network, hardware and I/O area so that they can perform more detailed analysis on these lower layers of the I/O stack.

No comments:

Post a Comment