Raghu On TechPro tips on managing data at scale
DB2 Finding Duplicate Rows
As a DBA, you may get asked by development/other teams to get the list of duplicate rows based on a certain column or a key (non-primary). In this short blog post I am going to discuss a couple of ways to get the list of duplicates.
For e.g. lets assume you have a below data set
Now as you can see there are two sets of duplicates. One way to display the duplicates is to use the age old method of using “GROUP BY WITH HAVING”. In the past decade or so WINDOW functions aka OLAP functions became mainstream and lot of folks are using WINDOW functions as well.
db2 "select ID FROM DBA.TEST1 GROUP BY ID HAVING COUNT(ID)>1 WITH UR"
db2 "select ID FROM (select ID ,ROW_NUMBER() OVER(PARTITION BY ID) AS ROWNUM FROM DBA.TEST1) WHERE ROWNUM>1 WITH UR"
As you can see both methods produced the desired output and I am sure there are other ways to do it using self joins etc. but these are the two basic ways to pull out the duplicates.
Now lets say you want to delete one of these two duplicate records and you only want to keep the latest of the two based on a timestamp (see dataset below). You could easily accomplish this by using WINDOW/OLAP functions.
Before you delete anything always run a SELECT on the data just to make sure you are deleting the right data set.
db2 "SELECT ID, NAME, ROWTS FROM (select ID, NAME, ROWTS, ROW_NUMBER() OVER(PARTITION BY ID ORDER BY ROWTS DESC) AS ROWNUM from DBA.TEST1) WHERE ROWNUM>1"
As you can see from the above output, the older of the two records was selected. Now all you need to do is run a DELETE! Another thing that I always do on my DELETE/UPDATE’s is to run it with “-m” command line option.
db2 -m "DELETE FROM (select ID, NAME, ROWTS, ROW_NUMBER() OVER(PARTITION BY ID ORDER BY ROWTS DESC) AS ROWNUM from DBA.TEST1) WHERE ROWNUM>1"
It would be a great investment of your time to learn OLAP functions, if you use SQL in your day to day work. There are a lot of nice online resources or you can start with below IBM knowledge center link.
I am going to post another blog post about WLM and how I took advantage of OLAP functions to perform some really cool reporting.