Technically, a table must not, in anyway, accept duplicate set of records in the first place. It is the job of a DBA to ensure that the database and its objects are designed with utmost care. The best way to design a database is to draw diagrams that will give you a clear picture of the entire database structure.
Anyways, we have found duplicate rows in our tables and we need to eliminate them. The first step is to trace the duplicate records or rows in a table. There are many ways to achieve this, but I found a simple method called the ROW_NUMBER().
SQL server ROW_NUMBER() returns the total count (in numbers) of a row. The sequence starts with the number 1. If a table has unique set of rows, then ROW_NUMBER() will assign each row a value of 1.
Create a table and add a few rows, particularly a set of duplicate rows.
CREATE TABLE dbo.Employee (EmpID INT NOT NULL , EmpName VARCHAR(50) NOT NULL, Designation VARCHAR(50) NULL )
Now add rows to the newly created table Employee.
INSERT INTO Employee (EmpID, EmpName, Designation) SELECT 1, 'GORDEN', 'LAB ASSISTANT' INSERT INTO Employee (EmpID, EmpName, Designation) SELECT 2, 'MIKE PEARL', 'SENIOR ACCOUNTANT' INSERT INTO Employee (EmpID, EmpName, Designation) SELECT 3, 'ARUN', 'SR. PROGRAMMER' INSERT INTO Employee (EmpID, EmpName, Designation) SELECT 4, 'DAVID', 'ACCOUNTANT' INSERT INTO Employee (EmpID, EmpName, Designation) SELECT 5, 'MIKE PEARL', 'SENIOR ACCOUNTANT'
Note: You can use my sample database that I have designed to test various functions at the front end as well as at the database level. This will save you some time.
Now lets assume, I have accidently inserted two similar set of records for the employee name MIKE PEARL. (Second and Fifth rows)
Find Duplicate Rows using ROW_NUMBER()
ROW_NUMBER () OVER ( [ <partition_by_clause> ] <order_by_clause> )
SELECT EmpName, ROW_NUMBER() OVER ( PARTITION BY EmpName ORDER BY EmpID ) RowNumber FROM dbo.Employee
The PARTITION BY option is optional, you may or may not use it, but in this scenario, we need it.
As you can see, the method has returned two values for Employee MIKE PEARL i.e. integers 1 and 2. The second value with the result 2 is clearly the duplicate value, which we have inserted accidently.
Fair enough. We found the duplicate row. The next step is to eliminate the duplicate that has a value greater one.
Remove Duplicate Rows using CTE
With SQL Server 2005, Microsoft has introduced a tool called the Common Table Expression (CTE).
I have read a beautiful article on Code Project about CTE and I would recommend you to read this in full to get a better understanding about CTE and its usage.
The CTE in our example query will hold the record set (rows) retuned by the query using ROW_NUMBER() method.
;WITH CTE AS ( SELECT EmpName, ROW_NUMBER() OVER ( PARTITION BY EmpName ORDER BY EmpID ) RowNumber FROM dbo.Employee ) SELECT *FROM CTE WHERE RowNumber > 1
The image above shows how the row, which we will eliminate, looks like. Now let us remove it. Instead of SELECT we will use the DELETE command to finish is off.
;WITH CTE AS ( SELECT EmpName, ROW_NUMBER() OVER ( PARTITION BY EmpName ORDER BY EmpID ) RowNumber FROM dbo.Employee ) DELETE FROM CTE WHERE RowNumber > 1
This procedure is capable of removing thousands of rows in a flash (now, that is efficiency). You can check this by adding the five rows multiple times, until satisfied that you have thousands of rows in the table.
As we have reached the end of this article, let us see a recap. The sheer size of this article can undermine the simple query or solution that is to find and remove duplicate rows in a table. I have sliced the article into three sections. One - trace or find the duplicate row using ROW_NUMBER, two – view the record set using CTE with duplicate row, and three – remove the duplicate row using both the functions.
I hope you find this article useful and interesting. Thanks for reading. ☺