Duplicate rows in tables can be very annoying for DBA’s and programmers, as it raises many uncomfortable questions about the authenticity of data in a database. The matter gets worse when company auditors complain about irregularities in the balance sheet etc.
Technically, a table must not, in anyway, accept duplicate set of records in the first place. It is the job of a DBA to ensure that the database and its objects are designed with utmost care. The best way to design a database is to draw diagrams that will give you a clear picture of the entire database structure.
Anyways, we have found duplicate rows in our tables and we need to eliminate them. The first step is to trace the duplicate records or rows in a table. There are many ways to achieve this, but I found a simple method called the ROW_NUMBER().
SQL server ROW_NUMBER() returns the total count (in numbers) of a row. The sequence starts with the number 1. If a table has unique set of rows, then ROW_NUMBER() will assign each row a value of 1.
Create a table and add a few rows, particularly a set of duplicate rows.
CREATE TABLE dbo.Employee (EmpID INT NOT NULL , EmpName VARCHAR(50) NOT NULL, Designation VARCHAR(50) NULL )
Now add rows to the newly created table Employee.
INSERT INTO Employee (EmpID, EmpName, Designation) SELECT 1, 'GORDEN', 'LAB ASSISTANT' INSERT INTO Employee (EmpID, EmpName, Designation) SELECT 2, 'MIKE PEARL', 'SENIOR ACCOUNTANT' INSERT INTO Employee (EmpID, EmpName, Designation) SELECT 3, 'ARUN', 'SR. PROGRAMMER' INSERT INTO Employee (EmpID, EmpName, Designation) SELECT 4, 'DAVID', 'ACCOUNTANT' INSERT INTO Employee (EmpID, EmpName, Designation) SELECT 5, 'MIKE PEARL', 'SENIOR ACCOUNTANT'
Note: You can use this sample database that I have designed to test various functions at the front end as well as at the database level. This will save you some time.
Now let us assume, I have accidently inserted two similar set of records for the employee name MIKE PEARL. (Second and Fifth rows)
Find Duplicate Rows using ROW_NUMBER()
ROW_NUMBER () OVER ( [ <partition_by_clause> ] <order_by_clause> )
SELECT EmpName, ROW_NUMBER() OVER ( PARTITION BY EmpName ORDER BY EmpID ) RowNumber FROM dbo.Employee
The PARTITION BY option is optional, you may or may not use it, but in this scenario, we need it.
Fair enough. We found the duplicate rows. The next step is to eliminate the duplicate that has a value greater one.
Remove Duplicate Rows using CTE
With SQL Server 2005, Microsoft has introduced a tool called the Common Table Expression (CTE).
I have read a beautiful article on Code Project about CTE and I would recommend you to read this in full to get a better understanding about CTE and its usage.
The CTE in our example query will hold the record set (rows) retuned by the query using ROW_NUMBER() method.
;WITH CTE AS ( SELECT EmpName, ROW_NUMBER() OVER ( PARTITION BY EmpName ORDER BY EmpID ) RowNumber FROM dbo.Employee ) SELECT *FROM CTE WHERE RowNumber > 1
;WITH CTE AS ( SELECT EmpName, ROW_NUMBER() OVER ( PARTITION BY EmpName ORDER BY EmpID ) RowNumber FROM dbo.Employee ) DELETE FROM CTE WHERE RowNumber > 1
This procedure is capable of removing thousands of rows in a flash (now, that's called efficiency). You can check this by adding the five rows multiple times, until satisfied that you have thousands of rows in the table.
As we have reached the end of this article, let's see a recap. The sheer size of this article can undermine the simple query or solution that is to find and remove duplicate rows in a table. I have sliced the article into three sections. One - trace or find the duplicate row using ROW_NUMBER, two – view the record set using CTE with duplicate row, and three – remove the duplicate row using both the functions.
I hope you find this article useful and interesting. Thanks for reading. ☺