Raghu On TechPro tips on managing data at scale
Slow Hash Joins In pureScale
I was contacted by fellow team members saying replication from their non-Db2 RDBMS to DB2 has been running with a lot of latency and they are noticing slowness specifically with deletes.
- Third party Replication software logs are also logging that latency is on the target Db2 side.
- This is not a recurring problem as per the other DBA’s.
Before I come to any conclusions, I would like establish the facts for any problem. Some of the facts that I gathered are to check and see if the deletes are really taking long on the target Db2 side. There are a number of ways to do it for example I could use below SQL to check currently executing SQL statements on the cluster.
SELECT t.application_handle, t.UOW_ID, t.ACTIVITY_ID, LOCAL_START_TIME, varchar(SESSION_AUTH_ID,20) as SESSION_AUTH_ID, substr(ACTIVITY_STATE,1,10) as ACTIVITY_STATE, substr(ACTIVITY_TYPE,1,10) as ACTIVITY_TYPE, substr(STMT_TEXT,1,10000) as STMT_TEXT FROM table(MON_GET_ACTIVITY(NULL, -2)) as a, table(MON_GET_AGENT('','',cast(NULL as bigint), -2)) as t where a.application_handle=t.application_handle;
In my case though I have WLM configured and I just enabled a trace on the workload that I am interested in. I am going to write a detailed blog post on taking advantage of WLM in a micro-services world. I used below command to capture the SQL statements from REPLICATION workload.
alter workload WLOAD_REPLICATION collect activity data on all database partitions with details and values;
I let the activity collection run for about 10 minutes while the issue is happening. Then I used below SQL to identify if the DELETE’s are really taking long.
select a.TIME_STARTED, a.TIME_COMPLETED, a.appl_ID, a.ACTIVITY_ID, a.UOW_ID, TIMESTAMPDIFF(1,CHAR(TIME_COMPLETED-TIME_STARTED))/1000000 as exec_in_seconds, STMT_TEXT from wlm.ACTIVITY a, wlm.ACTIVITYSTMT s where a.APPL_ID=s.appl_id and a.UOW_ID=s.UOW_ID and a.ACTIVITY_ID=s.activity_ID and SESSION_AUTH_ID='REPLICATION' and TIME_CREATED>'2019-08-26-20.00.00.556969' and STMT_TEXT LIKE 'DELETE%' order by TIMESTAMPDIFF(1,CHAR(TIME_COMPLETED-TIME_STARTED))/1000000 DESC WITH UR;
As complained by the fellow DBA’s below output confirmed that a simple 2 row delete is taking over 10 minutes. Another interesting observation is that the delete slowness is proportional to the size of the table.
So how to make sense of it ? I was not able to reproduce the slowness from the Db2 server it self, I checked the db2exfmt plan and it looks very simple and no indications of trouble. Good thing is I used WLM/Activity event monitors to trace the SQL statements while the issue was occurring. These event monitors provide a ton of information to troubleshoot these kind of tricky issues. These event monitors record various portions of wait times if any for every activity. Below are the wait times for a given activity that are available to you via ACTIVITYMETRICS event monitor table.
That is an abundant amount of information to troubleshoot almost any kind of performance problem related to SQL statement executing slow with in Db2. I basically selected all the wait-times from this table and below was the first clue of where the slowness is occurring. I basically used the APPL_ID, ACTIVITY_ID and UOW_ID from the first output screenshot showing 1205 seconds of execution time and used it as the predicate to below query.
SELECT TOTAL_ACT_TIME, TOTAL_ACT_WAIT_TIME, RECLAIM_WAIT_TIME, CF_WAIT_TIME, CF_WAITS FROM WLM.ACTIVITYMETRICS WHERE APPL_ID='18.104.22.168.19082706344' and ACTIVITY_ID=3 and UOW_ID=1 WITH UR;
So, in the above output the time(s) are in milliseconds. As you can see the total wait time is 1014 seconds and all of those 1014 seconds have been spent on the CF ? Another interesting and more striking number from the above output is close to 17 million CF waits (or calls) occurred for a simple 2 row delete SQL statement ? This does not seem right to me…
When I ran it on the server in a controlled environment I saw about 15 CF calls for the same delete statement, I was scratching my head wondering how is this possible ? At this point I started to wonder if the access plan that is being used by the deletes from replication software is drastically different from the access plan I am seeing on the server.
However, the DELETE statement generated by replication uses a join to the change table (Replications internal table) to perform the DELETE. Below is the complete DELETE statement that is being executed against the Db2.
DELETE FROM "MYSCHEMA1"."MYTABLE1" WHERE EXISTS ( SELECT "MYSCHEMA1"."attrep_changes5A5ADB4EE9F46570"."seg1", "MYSCHEMA1"."attrep_changes5A5ADB4EE9F46570"."seg2" FROM "MYSCHEMA1"."attrep_changes5A5ADB4EE9F46570" WHERE "MYSCHEMA1"."MYTABLE1"."SRC_SCHEMA" = CAST( "MYSCHEMA1"."attrep_changes5A5ADB4EE9F46570"."seg1" as VARCHAR(4)) and "MYSCHEMA1"."MYTABLE1"."MYTABLE1_PK" = CAST( "MYSCHEMA1"."attrep_changes5A5ADB4EE9F46570"."seg2" as INTEGER) and "MYSCHEMA1"."attrep_changes5A5ADB4EE9F46570"."seq" >= ? and "MYSCHEMA1"."attrep_changes5A5ADB4EE9F46570"."seq" <= ? );
Once again ACTIVITY event monitor to the rescue, you can collect activities with SECTION information and using this SECTION information you could populate the explain tables with the SQL execution environment information and then format the output using db2exfmt command. Since I already have WLM enabled and I have a WLM THRESHOLD in place that captures any SQL statements (along with its activity and SECTION information) that are utilizing more than 30 seconds of CPU, I already have the SECTION information for this SQL captured. Now all I need to do is extract the captured section information to get the run-time plan used by the slow running DELETE’s from replication.
I found below link on IBM Knowledge center and used the EXPLAIN_FROM_DATA stored procedure to populate the explain tables, as I had hard time getting EXPLAIN_FROM_SECTION stored proc working for some reason.
Below I am going to show the formatted explains for the poorly performing delete vs optimally performing delete.
Optimally Performing Delete:
Above section is the critical part of the explain plan, showing the good plan is simply using a nested loop join and cost of this plan is just 65 timerons with just about 15 CF calls.
Poorly Performing Delete:
Interesting thing is for the above plan that was used by the poorly performing DELETE, instead of a NESTED LOOP a HASH JOIN is taking place. But why would a HSJOIN increase the run time of a DELETE from 5 milliseconds to 9 minutes 🙁 ??????
Two observations bothered me here, first one is CF calls close to the number of records in the table on which DELETE is being performed (remember only 2 records are being deleted) and second one is the plan with HSJOIN is performing at an unacceptable levels. Then my next step is reproducing the problem and confirm that HSJOIN is in fact causing these CF calls.
To reproduce this, one of my fellow DBA’s copied the tables involved in the DELETE statement and forced the DELETE to use a HASH JOIN by using OPTIMIZATION guidelines. Finally I found a good use case for OPTIMIZATION GUIDELINES 🙂 More on OPT guidelines on below IBM knowledge center article.
DELETE FROM tmp.t02 b WHERE EXISTS (SELECT a."seg1", a."seg2" FROM tmp.t01 a WHERE b.SRC_SCHEMA = CAST(a."seg1" as VARCHAR(4)) and b.RRN = CAST(a."seg2" as BIGINT) and a."seq" >= 0 and a."seq" <= 99 ) /* <OPTGUIDELINES> <HSJOIN> <IXSCAN TABID='Q3' /> <IXSCAN TABID='Q2' /> </HSJOIN> </OPTGUIDELINES> */;
I was able to reproduce the slowness and CF calls by running the above SQL statement. While the above statement was running, I started monitoring the CF calls using below SQL statement.
while : do db2 -x "SELECT current timestamp, (TOTAL_ACT_WAIT_TIME/1000) AS TOTAL_ACT_WAIT_TIME_IN_SECONDS, CF_WAITS, (CF_WAIT_TIME/1000) AS CF_WAIT_TIME_IN_SECONDS FROM TABLE(MON_GET_CONNECTION(159638, -2)) AS t WITH UR" sleep 60 done
Above command uses the MON_GET_CONNECTION table function and it uses the application handle of the connection as one of the INPUT parameters. 159638 is the application handle for the session that I am running the delete from. You may use below SQL to display the current session application handle.
db2 "select sysproc.mon_get_application_handle() from sysibm.sysdummy1 with ur"
Below is the output from the while loop command above,
As you can see from the above screen shot, it made a total of “10181562” CF calls and about 8 minutes and 40 seconds to execute a simple delete of 2 rows. Next thing that’s striking is that these CF calls are close to the number of records in the table.
So whats up with the hash join for a DELETE involving a Correlated sub-query ? Well the answer lies in how hash join works as opposed to a nested loop join and when you take CF into account in a pureScale environment it may be working as designed but definitely NOT what we want.
Look at below Knowledge center page on how different joins operate in Db2.
Here are the facts I know so far…
- The REPLICATION change table is usually very small and optimizer finds using NLJOIN very attractive and no problems with performance with 5 ms response time.
- When the Change table that’s holding the keys to delete the target table grows large Db2 may find HSJOIN attractive, and this issue will only happen if, for some reason the execution plan gets flushed out of the package cache and Db2 recompiles using the latest cardinalities while the change table is large.
- This issue is only observed when you try to DELETE and potentially UPDATE based on a correlated sub-query that uses a HSJOIN with OUTER table being the table that the records are being deleted from.
- SELECT‘s with UR and CS isolation level do not have this problem.
- When Db2 is using a HSJOIN a simple 2 row DELETE is taking about 9 minutes JEEEEZ…
So, the big question why so many CF calls when using a HSJOIN ? HASH JOIN operates by hashing the inner table, which in our plan is MYSCHEMA1.attrep_changes5A5ADB4EE9F46570 table. Then outer table/index, in our case “index” will be scanned key by key and compared against the hash values that was computed for the columns of the join predicates. However, to do this comparision for every key Db2 needs to know that this specific key/row has not been updated on any other member on the cluster in a pureScale environment. As a result it needs a trip to CF and that’s the only logical reason that I can think of.
Where as for NLJOIN, Db2 is just traversing through the B-Tree and when it finds a match, then only for those keys it checks with the CF (to see if the rows have been dirtied by other members). If you look at this its like NLJOIN is doing an index lookup based on a key where as HSJOIN is doing an Index scan.
- Unfortunately, there is no resolution to this problem yet from the query slowness stand point and I am afraid this is working as designed in a pureScale environment. We opened a PMR with IBM to see what they have to say. It mostly requires a code change in the optimizer.
- If a small locking outage can be tolerated, when we lock the table in EXCLUSIVE MODE and perform this delete it just runs fine in 5 ms as Db2 is aware that table is locked exclusively and no other member can dirty the pages on this table.
- Since the last occurrence, it had not happened ever since, if it occurs frequently we may need to pursue options such as creating OPTIMIZATION PROFILES for the problematic DELETE’s coming through replication or try flushing the package cache during the issue and hope the future deletes use NLJOINs.
- IBM Db2 development team may need to look at this case and make an enhancement to the optimizer such that for the correlated DELETES and UPDATES, HSJOIN will never be picked when the OUTER table in the HSJOIN is the table that we are trying to DELETE from.
- Last but not least, options such as Explicit Hierarchical Locking (EHL) along with pinning workloads to a specific member for this table eliminate the CF traffic. However when you do this, you are defeating the purpose of pureScale cluster.
I am hoping that this post will help someone else that’s puzzled by this kind of performance problem. This can never happen in a non-pureScale environment and I am hoping IBM can fix this issue in one of their future releases as pureScale is being adopted more and more around the world and I personally think this issue could be hit fairly often in a large complex environment.