Find and Remove Duplicate Rows in a Table using SQL Server ROW_NUMBER() and CTE

Duplicate rows in tables can be very annoying for DBA’s and programmers, as it raises many uncomfortable questions about the authenticity of data in a database. The matter gets worse when company auditors complain about irregularities in the balance sheet etc.

Duplicate rows in SQL Server

Technically, a table must not, in anyway, accept duplicate set of records, in the first place. It is the job of a DBA to ensure that the database and its objects are designed with utmost care. The best way to design a database is to draw diagrams that will give you a clear picture of the entire database structure.

Anyways, we have found duplicate rows in our tables and we need to eliminate them. The first step is to trace the duplicate records or rows in a table. There are many ways to achieve this, but I found the ROW_NUMBER() method to be really simple.

Also Read: Learn about SQL Server LEFT() and RIGHT() functions

ROW_NUMBER()

SQL server ROW_NUMBER() returns the total count (in numbers) of a row. The sequence starts with the number 1. If a table has unique set of rows, then ROW_NUMBER() will assign each row a value of 1.

Create a table and add a few rows, particularly a set of duplicate rows.

CREATE TABLE dbo.Employee 
    (EmpID INT NOT NULL , 
        EmpName VARCHAR(50) NOT NULL, 
        Designation VARCHAR(50) NULL
    )

Now add rows to the newly created table Employee.

INSERT INTO Employee (EmpID, EmpName, Designation)
	SELECT 1, 'GORDEN', 'LAB ASSISTANT'
INSERT INTO Employee (EmpID, EmpName, Designation)
	SELECT 2, 'MIKE PEARL', 'SENIOR ACCOUNTANT'
INSERT INTO Employee (EmpID, EmpName, Designation)
	SELECT 3, 'ARUN', 'SR. PROGRAMMER'
INSERT INTO Employee (EmpID, EmpName, Designation)
	SELECT 4, 'DAVID', 'ACCOUNTANT'
INSERT INTO Employee (EmpID, EmpName, Designation)
	SELECT 5, 'MIKE PEARL', 'SENIOR ACCOUNTANT'

Note: You can use my sample database that I have designed to test various functions at the front end as well as at the database level. This will save you some time.

Now lets assume, I have accidently inserted two similar set of records for the employee name MIKE PEARL. (Second and Fifth rows)

Find Duplicate Rows using ROW_NUMBER()

Syntax

ROW_NUMBER () OVER ( [ <partition_by_clause> ] <order_by_clause> )

SELECT EmpName, ROW_NUMBER() OVER 
(
    PARTITION BY EmpName ORDER BY EmpID
) RowNumber
FROM  dbo.Employee

The PARTITION BY option is optional, you may or may not use it, but in this scenario, we need it.

The Output

SQL Server ROW_NUMBER

As you can see, the method has returned two values for Employee MIKE PEARL i.e. integers 1 and 2. The second value with the result 2 is clearly the duplicate value, which we have inserted accidently.

Fair enough. We found the duplicate row. The next step is to eliminate the duplicate that has a value greater one.

Related: Insert Multiple rows with a Single INSERT Statement using SQL Server TVC



Remove Duplicate Rows using CTE

With SQL Server 2005, Microsoft has introduced a tool called the Common Table Expression (CTE).

I have read a beautiful article on Code Project about CTE and I would recommend you to read this in full to get a better understanding about CTE and its usage.

The CTE in our example query will hold the record set (rows) retuned by the query using ROW_NUMBER() method.

;WITH CTE AS 
(
    SELECT EmpName, ROW_NUMBER() OVER 
    (
        PARTITION BY EmpName ORDER BY EmpID
    ) RowNumber
    FROM  dbo.Employee
)
SELECT *FROM CTE WHERE RowNumber > 1

The Output

SQL Server CTE

The image above shows how the row, which we will eliminate, looks like. Now let us remove it. Instead of SELECT we will use the DELETE command to finish is off.

;WITH CTE AS 
(
SELECT EmpName, ROW_NUMBER() OVER 
(
    PARTITION BY EmpName ORDER BY EmpID
) RowNumber
FROM  dbo.Employee
)
DELETE FROM CTE WHERE RowNumber > 1

This procedure is capable of removing thousands of rows in a flash (now, that is efficiency). You can check this by adding the five rows multiple times, until satisfied that you have thousands of rows in the table.

Conclusion

As we have reached the end of this article, let us see a recap. The sheer size of this article can undermine the simple query or solution that is to find and remove duplicate rows in a table. I have sliced the article into three sections. One - trace or find the duplicate row using ROW_NUMBER, two – view the record set using CTE with duplicate row, and three – remove the duplicate row using both the functions.

I hope you find this article useful and interesting. Thanks for reading.

Previous - What Is a Stored Procedure and Why Use it? - SQL ServerNext - Insert Multiple rows with a Single INSERT Statement using SQL Server Table Value Constructor



Like this Article? Subscribe now, and get all the latest articles and tips, right in your inbox.

Enter your email id

Delivered by FeedBurner
Tweet this article Google+

Related Posts: