Tag Archive for: T-SQL

Cleaning Data with Recursive CTE

SQL Server 2005 added a great new feature: Common Table Expressions (CTEs) and, even better, recursive CTEs. They provide a powerful new tool for solving many SQL problems. One area where recursive CTEs shine is hierarchical data management.

Here is another side of recursive CTEs: using them for common tasks like cleaning data. The problem: a table has a column whose values contain invalid characters. The task is to replace every invalid character with a space. Unfortunately, the REPLACE function does not support pattern matching, so each character has to be checked individually and replaced if it falls in the invalid range. The solution below uses a recursive CTE to walk through the ASCII table of characters (codes 255 down to 1) and replace the invalid characters in the column values. In this example only letters are treated as valid; under a case-insensitive collation the pattern [A-Z] matches lowercase letters as well.

-- Create test table.

CREATE TABLE Foobar (
  key_col INT PRIMARY KEY,
  text_col NVARCHAR(100));

-- Populate sample data.

INSERT INTO Foobar VALUES (1, N'ABC!@#%DEFgh');
INSERT INTO Foobar VALUES (2, N'~!102WXY&*()_Z');

-- Perform the cleanup with recursive CTE.

WITH Clean (key_col, text_col, ch)
AS
(SELECT key_col,
        REPLACE(text_col, CHAR(255), ' '),
        255
 FROM Foobar
 UNION ALL
 SELECT key_col,
        CASE WHEN
            CHAR(ch - 1) NOT LIKE '[A-Z]'
            THEN REPLACE(text_col, CHAR(ch - 1), ' ')
            ELSE text_col END,
        ch - 1
 FROM Clean
 WHERE ch > 1)
SELECT key_col, text_col
FROM Clean
WHERE ch = 1
OPTION (MAXRECURSION 255);

On a side note, recursive CTEs are not the best performers. Also, by default a CTE allows only 100 levels of recursion. The MAXRECURSION hint can be used to set a higher limit (a value between 0 and 32767; 0 removes the limit). Be aware that setting MAXRECURSION to 0 lets a badly written recursive query run as an infinite loop.
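To illustrate the recursion limit, here is a minimal sketch. The CTE name Numbers and the upper bound of 500 are arbitrary choices for this example. Without the hint, the query stops with an error once it passes the default 100 levels; with the hint it runs to completion.

```sql
-- Generate the numbers 1 through 500 with a recursive CTE.
-- The anchor member returns 1 and each recursive step adds the next number,
-- so 499 levels of recursion are needed (more than the default of 100).
WITH Numbers (n)
AS
(SELECT 1
 UNION ALL
 SELECT n + 1
 FROM Numbers
 WHERE n < 500)
SELECT COUNT(*) AS number_count
FROM Numbers
OPTION (MAXRECURSION 500);
```

The WHERE clause in the recursive member is what ends the recursion; the hint only raises the ceiling.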

Here is a different method that uses a utility table of numbers and FOR XML PATH, and is more efficient:

WITH Clean (key_col, text_col)
AS
(SELECT key_col, REPLACE(CAST(
        (SELECT CASE
                  WHEN SUBSTRING(text_col, n, 1) LIKE '[A-Z]'
                  THEN SUBSTRING(text_col, n, 1)
                  ELSE '.'
                END
        FROM (SELECT number
               FROM master..spt_values
               WHERE type = 'P'
                AND number BETWEEN 1 AND 100) AS Nums(n)
        WHERE n <= LEN(text_col)
        FOR XML PATH('')) AS NVARCHAR(100)), '.', ' ')
 FROM Foobar)
SELECT key_col, text_col
FROM Clean;

Dates and Date Ranges in SQL Server

One of the most common tasks when working with data is to select data for a specific date range or a date. There are two issues that arise: calculating the date range and trimming the time portion in order to select the full days. Below are a few techniques to show how this can be done.

First, a look at the internals of the DATETIME data type in SQL Server. Internally it is stored as two 4-byte values. The first value is the number of days since January 1, 1900. The second value is the number of clock ticks since midnight, where one tick is 1/300 of a second (about 3.33 milliseconds). Here is how those two values can be converted to VARBINARY and then to INT to extract the days and time portions.

-- The internal representation of a datetime is two 4-byte values
SELECT CURRENT_TIMESTAMP AS 'Today',
       CAST(CURRENT_TIMESTAMP AS VARBINARY(8)) AS 'Two 4-byte Internal',
       SUBSTRING(CAST(CURRENT_TIMESTAMP AS VARBINARY(8)), 1 , 4)
           AS 'Days Since 1900-01-01',
       SUBSTRING(CAST(CURRENT_TIMESTAMP AS VARBINARY(8)), 5 , 4)
           AS 'Ticks (1/300 s) Since Midnight',
       CAST(SUBSTRING(CAST(CURRENT_TIMESTAMP AS VARBINARY(8)), 1 , 4) AS INT)
           AS 'Days Represented as INT',
       CAST(SUBSTRING(CAST(CURRENT_TIMESTAMP AS VARBINARY(8)), 5 , 4) AS INT)
           AS 'Ticks Represented as INT';
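As a quick check of the second value, the sketch below uses a fixed sample value (the variable @noon is made up for this example). The time portion is stored in units of 1/300 of a second, so noon should come out as 12 * 60 * 60 * 300 = 12,960,000.

```sql
DECLARE @noon DATETIME;
SET @noon = '2007-03-10T12:00:00';

SELECT CAST(SUBSTRING(CAST(@noon AS VARBINARY(8)), 1, 4) AS INT)
           AS days_part,
       DATEDIFF(DAY, 0, @noon)
           AS days_since_base,   -- should match days_part
       CAST(SUBSTRING(CAST(@noon AS VARBINARY(8)), 5, 4) AS INT)
           AS time_part,
       12 * 60 * 60 * 300
           AS noon_in_ticks;     -- should match time_part
```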

Below are a few basic date calculations that show the power of the DATETIME functions and how to calculate key dates that can be used to extract data:

-- Basic dates and date calculations
SELECT CAST('19000101 00:00' AS SMALLDATETIME) AS 'Min SMALLDATETIME',
       CAST('20790606 23:59' AS SMALLDATETIME) AS 'Max SMALLDATETIME',
       CAST('1753-01-01T00:00:00.000' AS DATETIME) AS 'Min DATETIME',
       CAST('9999-12-31T23:59:59.997' AS DATETIME) AS 'Max DATETIME',
       CAST(0 AS DATETIME) AS 'Base SQL Server DATETIME',
       CURRENT_TIMESTAMP AS 'Current Date/Time',
       DATEDIFF(DAY, 0, CURRENT_TIMESTAMP)
           AS 'Current Date/Time as Number (days since 1900-01-01)',
       CAST(DATEDIFF(DAY, 0, CURRENT_TIMESTAMP) AS DATETIME)
           AS 'Today as Date',
       CAST(DATEDIFF(DAY, -1, CURRENT_TIMESTAMP) AS DATETIME)
           AS 'Tomorrow as Date',
       CAST(DATEDIFF(DAY, 1, CURRENT_TIMESTAMP) AS DATETIME)
           AS 'Yesterday as Date',
       DATEADD(YEAR, DATEDIFF(YEAR, 0, CURRENT_TIMESTAMP)-1, 0)
           AS 'First Day of Last Year',
       DATEADD(YEAR, DATEDIFF(YEAR, 0, CURRENT_TIMESTAMP), 0)
           AS 'First Day of This Year',
       DATEDIFF(DAY, 0, DATEADD(YEAR, -1, CURRENT_TIMESTAMP))
           AS 'Today One Year Ago as Number',
       CAST(DATEDIFF(DAY, 0, DATEADD(YEAR, -1, CURRENT_TIMESTAMP)) AS DATETIME)
           AS 'Today One Year Ago as Date',
       DATEADD(MONTH, DATEDIFF(MONTH, 0, CURRENT_TIMESTAMP), 0)
           AS 'First Day of Current Month',
       DATEADD(MONTH, DATEDIFF(MONTH, 0, CURRENT_TIMESTAMP) + 1, 0)
           AS 'First Day of Next Month',
       DATEADD(MONTH, DATEDIFF(MONTH, 0, CURRENT_TIMESTAMP) - 1, 0)
           AS 'First Day of Prior Month',
       DATEADD(MONTH, DATEDIFF(MONTH, 0, CURRENT_TIMESTAMP) + 1, 0) - 1
           AS 'Last Day of Current Month',
       DATEADD(DAY, DATEDIFF(DAY, 0, CURRENT_TIMESTAMP), '01:30:00')
           AS '1:30 am today',
       CONVERT(VARCHAR(10), CURRENT_TIMESTAMP, 101) + ' 11:30:00.000'
           AS '11:30 am today',
       CONVERT(VARCHAR(10), CURRENT_TIMESTAMP, 101) + ' 17:00:00.000'
           AS '5:00 pm today',
       -- The week calculations return Mondays (the base date 1900-01-01 is a Monday).
       -- DATEDIFF(WEEK, ...) counts Sunday boundaries, so on a Sunday
       -- 'First Day of This Week' is the following Monday.
       DATEADD(WEEK, DATEDIFF(WEEK, 0, CURRENT_TIMESTAMP), 0)
           AS 'First Day of This Week',
       DATEADD(WEEK, DATEDIFF(WEEK, 0, CURRENT_TIMESTAMP) - 1, 0)
           AS 'First Day of Last Week',
       DATEADD(WEEK, DATEDIFF(WEEK, 0, CURRENT_TIMESTAMP) + 1, 0)
           AS 'First Day of Next Week';

Let’s implement some date range searches to see this in action. Given a table Orders with a DATETIME column order_date, here are a few range selections:

-- Get Orders for Last Month
SELECT order_date
FROM Orders
WHERE order_date >= DATEADD(MONTH,
                      DATEDIFF(MONTH, 0, CURRENT_TIMESTAMP) - 1, 0)
  AND order_date < DATEADD(MONTH,
                      DATEDIFF(MONTH, 0, CURRENT_TIMESTAMP), 0);

-- Get Orders for Current Month
SELECT order_date
FROM Orders
WHERE order_date >= DATEADD(MONTH,
                      DATEDIFF(MONTH, 0, CURRENT_TIMESTAMP), 0)
  AND order_date < DATEADD(MONTH,
                      DATEDIFF(MONTH, 0, CURRENT_TIMESTAMP) + 1, 0);

-- Get Year to Date Orders
SELECT order_date
FROM Orders
WHERE order_date >= DATEADD(YEAR,
                      DATEDIFF(YEAR, 0, CURRENT_TIMESTAMP), 0)
  AND order_date < DATEDIFF(DAY, -1, CURRENT_TIMESTAMP);

-- Get Month to Date Orders
SELECT order_date
FROM Orders
WHERE order_date >= DATEADD(MONTH,
                      DATEDIFF(MONTH, 0, CURRENT_TIMESTAMP), 0)
  AND order_date < DATEDIFF(DAY, -1, CURRENT_TIMESTAMP);

-- Get Last Year's Orders
SELECT order_date
FROM Orders
WHERE order_date >= DATEADD(YEAR,
                      DATEDIFF(YEAR, 0, CURRENT_TIMESTAMP)-1, 0)
  AND order_date < DATEADD(YEAR,
                    DATEDIFF(YEAR, 0, CURRENT_TIMESTAMP), 0);

-- Get Today's Orders
SELECT order_date
FROM Orders
WHERE order_date >= DATEDIFF(DAY, 0, CURRENT_TIMESTAMP)
  AND order_date < DATEDIFF(DAY, -1, CURRENT_TIMESTAMP);

-- Get Yesterday's Orders
SELECT order_date
FROM Orders
WHERE order_date >= DATEDIFF(DAY, 1, CURRENT_TIMESTAMP)
  AND order_date < DATEDIFF(DAY, 0, CURRENT_TIMESTAMP);

 

-- Get Today's Orders Between 9:00 am And 11:00 am
-- Style 112 (yyyymmdd) is used because it converts back to DATETIME
-- the same way under any language or DATEFORMAT setting
SELECT order_date
FROM Orders
WHERE order_date BETWEEN
           CONVERT(CHAR(8), CURRENT_TIMESTAMP, 112) + ' 09:00:00.000'
       AND CONVERT(CHAR(8), CURRENT_TIMESTAMP, 112) + ' 11:00:00.000';

-- Or
SELECT order_date
FROM Orders
WHERE order_date BETWEEN
           DATEADD(DAY, DATEDIFF(DAY, 0, CURRENT_TIMESTAMP), '09:00:00')
       AND DATEADD(DAY, DATEDIFF(DAY, 0, CURRENT_TIMESTAMP), '11:00:00');

 

-- Get Orders for Last Week
SELECT order_date
FROM Orders
WHERE order_date >= DATEADD(WEEK,
                      DATEDIFF(WEEK, 0, CURRENT_TIMESTAMP) - 1, 0)
  AND order_date < DATEADD(WEEK,
                    DATEDIFF(WEEK, 0, CURRENT_TIMESTAMP), 0);

-- Get Orders for Current Week
SELECT order_date
FROM Orders
WHERE order_date >= DATEADD(WEEK,
                      DATEDIFF(WEEK, 0, CURRENT_TIMESTAMP), 0)
  AND order_date < DATEADD(WEEK,
                    DATEDIFF(WEEK, 0, CURRENT_TIMESTAMP) + 1, 0);

-- Get Orders for One Year Back From Current Date
SELECT order_date
FROM Orders
WHERE order_date >= DATEDIFF(DAY, 0,
                      DATEADD(YEAR, -1, CURRENT_TIMESTAMP))
  AND order_date < DATEDIFF(DAY, -1, CURRENT_TIMESTAMP);

There could be many more variations based on the same techniques. All of them use the datetime functions DATEADD and DATEDIFF to trim the time portion of the date in order to get accurate results. The key is that DATEDIFF counts the boundaries crossed between the two values (years, months, days, minutes, etc.), not the elapsed time. For example, the difference in days between May 3, 2006 at 11:59 PM and May 4, 2006 at 12:01 AM is one day, even though only a couple of minutes have passed. Similarly, the difference in years between Dec 31, 2006 and Jan 1, 2007 is one year, even though it is only a day or less. With this in mind it is very easy to accomplish the trimming effect: calculating the difference at one level trims everything below that level. For example, to trim seconds calculate the difference in minutes, to trim hours (or time in general) calculate the difference in days, the difference in months will trim days, and so on.
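The boundary-counting behavior and the trimming levels described above can be sketched as follows. The variable @sample and the literal dates are chosen only for this illustration.

```sql
-- DATEDIFF counts boundaries crossed, not elapsed time.
SELECT DATEDIFF(DAY,  '2006-05-03T23:59:00', '2006-05-04T00:01:00')
           AS day_boundaries,    -- 1, although only 2 minutes elapsed
       DATEDIFF(YEAR, '2006-12-31T00:00:00', '2007-01-01T00:00:00')
           AS year_boundaries;   -- 1, although only 1 day elapsed

-- Trimming at different levels.
DECLARE @sample DATETIME;
SET @sample = '2006-05-03T23:59:45';

SELECT DATEADD(MINUTE, DATEDIFF(MINUTE, 0, @sample), 0)
           AS trim_seconds,      -- 2006-05-03 23:59:00
       DATEADD(HOUR, DATEDIFF(HOUR, 0, @sample), 0)
           AS trim_minutes,      -- 2006-05-03 23:00:00
       DATEADD(DAY, DATEDIFF(DAY, 0, @sample), 0)
           AS trim_time,         -- 2006-05-03 00:00:00
       DATEADD(MONTH, DATEDIFF(MONTH, 0, @sample), 0)
           AS trim_days;         -- 2006-05-01 00:00:00
```

Trimming milliseconds the same way would need a more recent base date than 0, because the number of seconds since 1900 no longer fits in an INT.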

The same trimming can be done by converting to a string and then back to a date, but that is slower and more verbose. Here is a brief example of both methods:

-- Trim time portion
DECLARE @date DATETIME;
SET @date = CURRENT_TIMESTAMP;

-- Converting to string and back to date
-- (style 101 is given in both directions so the result
-- does not depend on the session's date format)
SELECT CONVERT(DATETIME, CONVERT(CHAR(10), @date, 101), 101);

-- Using datetime functions
SELECT DATEADD(DAY, DATEDIFF(DAY, 0, @date),0);

As a bonus, here is how to convert time stored as a decimal number (like 5.25, 5.50, or 5.75) to a time format (like 5:15, 5:30, or 5:45):

-- Format Time From 5.25 to 5:15
-- Decimal Format to Time Format
DECLARE @time DECIMAL(5, 2);
SET @time = 5.25;

-- 5.25 hours * 3600 = 18900 seconds; style 108 is hh:mi:ss, so this returns 05:15
SELECT CONVERT(CHAR(5), DATEADD(ss, @time * 3600, 0), 108);

Just for clarification: the CURRENT_TIMESTAMP function returns the current date and time. This function is the ANSI SQL equivalent to the GETDATE function.

Pivoting data in SQL Server

Very often there is a need to pivot (cross-tab) normalized data for some reporting purposes. While this is best done with reporting tools (Excel is one example with powerful pivoting capabilities), sometimes it needs to be done on the database side.

The discussion here is limited to static pivoting (that is, when the values to pivot are known in advance). Dynamic pivoting is accomplished with dynamic SQL and is based on the same techniques; the pivoting query is just built dynamically.

Here are a few methods and techniques to implement that in SQL Server.

Given the following data:

 OrderId    OrderDate  Amount
----------- ---------- ------
1           2007-01-01 10.50
2           2007-01-26 12.50
3           2007-01-30 12.00
4           2007-02-14 13.75
5           2007-02-20 10.00
6           2007-03-06 15.00
7           2007-03-10 17.50
8           2007-03-29 20.00

We would like to return this result set:

 OrderYear   Jan   Feb   Mar
----------- ----- ----- -----
2007        35.00 23.75 52.50

Creating the sample Orders table:

CREATE TABLE Orders (
 order_id INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
 order_date DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
 amount DECIMAL(8, 2) NOT NULL DEFAULT 0
        CHECK (amount >= 0));

INSERT INTO Orders
  (order_date, amount)
SELECT '20070101', 10.50
UNION ALL
SELECT '20070126', 12.50
UNION ALL
SELECT '20070130', 12.00
UNION ALL
SELECT '20070214', 13.75
UNION ALL
SELECT '20070220', 10.00
UNION ALL
SELECT '20070306', 15.00
UNION ALL
SELECT '20070310', 17.50
UNION ALL
SELECT '20070329', 20.00;

Using CASE

CASE can be very handy for pivoting. It allows writing flexible logic to transform the data as needed. It is also a very efficient method, as it requires only a single pass over the data.

SELECT DATEPART(yyyy, order_date) AS OrderYear,
       SUM(CASE WHEN DATEPART(m, order_date) = 1
                THEN amount ELSE 0 END) AS 'Jan',
       SUM(CASE WHEN DATEPART(m, order_date) = 2
                THEN amount ELSE 0 END) AS 'Feb',
       SUM(CASE WHEN DATEPART(m, order_date) = 3
                THEN amount ELSE 0 END) AS 'Mar'
FROM Orders
GROUP BY DATEPART(yyyy, order_date);

Using Matrix table

This method requires creating a matrix table with 1 for the significant columns and 0 for columns to be ignored. Then joining the matrix table to the data table based on the pivoting column will produce the results:

CREATE TABLE MonthMatrix (
  month_nbr INT NOT NULL PRIMARY KEY
            CHECK (month_nbr BETWEEN 1 AND 12),
  jan INT NOT NULL DEFAULT 0
      CHECK (jan IN (0, 1)),
  feb INT NOT NULL DEFAULT 0
      CHECK (feb IN (0, 1)),
  mar INT NOT NULL DEFAULT 0
      CHECK (mar IN (0, 1)),
  apr INT NOT NULL DEFAULT 0
      CHECK (apr IN (0, 1)),
  may INT NOT NULL DEFAULT 0
      CHECK (may IN (0, 1)),
  jun INT NOT NULL DEFAULT 0
      CHECK (jun IN (0, 1)),
  jul INT NOT NULL DEFAULT 0
      CHECK (jul IN (0, 1)),
  aug INT NOT NULL DEFAULT 0
      CHECK (aug IN (0, 1)),
  sep INT NOT NULL DEFAULT 0
      CHECK (sep IN (0, 1)),
  oct INT NOT NULL DEFAULT 0
      CHECK (oct IN (0, 1)),
  nov INT NOT NULL DEFAULT 0
      CHECK (nov IN (0, 1)),
  dec INT NOT NULL DEFAULT 0
      CHECK (dec IN (0, 1)));

 

-- Populate the matrix table
INSERT INTO MonthMatrix (month_nbr, jan) VALUES (1, 1);
INSERT INTO MonthMatrix (month_nbr, feb) VALUES (2, 1);
INSERT INTO MonthMatrix (month_nbr, mar) VALUES (3, 1);
INSERT INTO MonthMatrix (month_nbr, apr) VALUES (4, 1);
INSERT INTO MonthMatrix (month_nbr, may) VALUES (5, 1);
INSERT INTO MonthMatrix (month_nbr, jun) VALUES (6, 1);
INSERT INTO MonthMatrix (month_nbr, jul) VALUES (7, 1);
INSERT INTO MonthMatrix (month_nbr, aug) VALUES (8, 1);
INSERT INTO MonthMatrix (month_nbr, sep) VALUES (9, 1);
INSERT INTO MonthMatrix (month_nbr, oct) VALUES (10, 1);
INSERT INTO MonthMatrix (month_nbr, nov) VALUES (11, 1);
INSERT INTO MonthMatrix (month_nbr, dec) VALUES (12, 1);

 

-- Use the matrix table to pivot
SELECT DATEPART(yyyy, order_date) AS OrderYear,
       SUM(amount * jan) AS 'Jan',
       SUM(amount * feb) AS 'Feb',
       SUM(amount * mar) AS 'Mar'
FROM Orders AS O
JOIN MonthMatrix AS M
  ON DATEPART(m, O.order_date) = M.month_nbr
GROUP BY DATEPART(yyyy, order_date);

The David Rozenshtein method

This method is based on the formula 1 - ABS(SIGN(x - y)). SIGN returns -1, 0, or 1, so the formula returns 1 when x = y and 0 otherwise. In a way it mimics the matrix table approach.

SELECT DATEPART(yyyy, order_date) AS OrderYear,
       SUM(amount * (1 - ABS(SIGN(DATEPART(m, order_date) - 1)))) AS 'Jan',
       SUM(amount * (1 - ABS(SIGN(DATEPART(m, order_date) - 2)))) AS 'Feb',
       SUM(amount * (1 - ABS(SIGN(DATEPART(m, order_date) - 3)))) AS 'Mar'
FROM Orders
GROUP BY DATEPART(yyyy, order_date);
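To see the formula on its own, here is a small sketch over three hard-coded pairs. The derived table T and the columns x and y are made up for the illustration; x stands in for DATEPART(m, order_date) and y for the target month number.

```sql
SELECT x, y,
       SIGN(x - y) AS sign_value,             -- 0 when x = y, otherwise 1 or -1
       1 - ABS(SIGN(x - y)) AS equality_flag  -- 1 only when x = y
FROM (SELECT 1, 1
      UNION ALL
      SELECT 3, 1
      UNION ALL
      SELECT 1, 3) AS T (x, y);
```

Only the first pair produces an equality_flag of 1, which is why multiplying the amount by the flag keeps it in exactly one month column.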

The PIVOT operator in SQL Server 2005

SQL Server 2005 provides a built-in pivoting mechanism via the PIVOT operator.

SELECT OrderYear,
       [1] AS 'Jan',
       [2] AS 'Feb',
       [3] AS 'Mar'
FROM (SELECT DATEPART(yyyy, order_date),
            DATEPART(m, order_date),
            amount
      FROM Orders) AS O (OrderYear, month_nbr, amount)
PIVOT
(SUM(amount) FOR month_nbr IN ([1], [2], [3])) AS P;

There are some other methods to implement pivoting (like using subqueries, multiple joins, or the APPLY operator in SQL Server 2005), but I am not showing examples of those as I do not find them practical, especially with a large number of values to pivot.
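For completeness, the dynamic pivoting mentioned at the start of this section could look roughly like the sketch below. It is one possible approach rather than a definitive implementation: the variables @cols and @sql are names invented for this example, and the column list is built with FOR XML PATH from the month numbers actually present in Orders.

```sql
DECLARE @cols NVARCHAR(4000);
DECLARE @sql NVARCHAR(4000);

-- Build the column list, e.g. [1], [2], [3]
SET @cols = STUFF(
    (SELECT N', ' + QUOTENAME(CAST(month_nbr AS NVARCHAR(2)))
     FROM (SELECT DISTINCT DATEPART(m, order_date) AS month_nbr
           FROM Orders) AS M
     ORDER BY month_nbr
     FOR XML PATH('')), 1, 2, N'');

-- Plug the list into the same PIVOT query used for the static case
SET @sql = N'SELECT OrderYear, ' + @cols + N'
FROM (SELECT DATEPART(yyyy, order_date),
             DATEPART(m, order_date),
             amount
      FROM Orders) AS O (OrderYear, month_nbr, amount)
PIVOT
(SUM(amount) FOR month_nbr IN (' + @cols + N')) AS P;';

EXEC sp_executesql @sql;
```

The result columns are named by month number here; mapping them to month names would need an extra step when building the select list.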

Additional Resources:

Using PIVOT and UNPIVOT
http://msdn2.microsoft.com/en-us/library/ms177410.aspx

Passing a Variable to an IN List

Every once in a while there is a need to do something like this:

SELECT person_id, person_name
FROM MyUsers
WHERE person_id IN (@search_list);

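Here @search_list contains some form of a delimited list. However, this does not work as intended: the variable is treated as a single value, not as a list of values. With an INT column like person_id the query fails when it tries to convert the whole string to a number. Here is one solution to this problem: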

-- Create the test table
CREATE TABLE MyUsers (
 person_id INT PRIMARY KEY,
 person_name VARCHAR(35));

-- Insert sample data
INSERT INTO MyUsers VALUES (1327, 'Joe');
INSERT INTO MyUsers VALUES (1342, 'John F.');
INSERT INTO MyUsers VALUES (1411, 'Mary');
INSERT INTO MyUsers VALUES (1345, 'Nancy');
INSERT INTO MyUsers VALUES (1366, 'Greg');
INSERT INTO MyUsers VALUES (1367, 'Jeff');
INSERT INTO MyUsers VALUES (1368, 'Chris');
INSERT INTO MyUsers VALUES (1369, 'John M.');
INSERT INTO MyUsers VALUES (1370, 'Peggy');
INSERT INTO MyUsers VALUES (1371, 'Samuel');
INSERT INTO MyUsers VALUES (1372, 'Tony');
INSERT INTO MyUsers VALUES (1373, 'Lisa');
INSERT INTO MyUsers VALUES (1374, 'Tom');
INSERT INTO MyUsers VALUES (1375, 'Dave');
INSERT INTO MyUsers VALUES (1376, 'Peter');
INSERT INTO MyUsers VALUES (1377, 'Jason');
INSERT INTO MyUsers VALUES (1378, 'Justin');
INSERT INTO MyUsers VALUES (1379, 'Oscar');

 

DECLARE @search_list VARCHAR(100);
DECLARE @delimiter CHAR(1);

SELECT @search_list = '1327,1342,1411',
       @delimiter = ',';

-- Get the users based on the delimited variable list
SELECT person_id, person_name
FROM MyUsers
WHERE person_id IN
    (SELECT SUBSTRING(string, 2, CHARINDEX(@delimiter, string, 2) - 2)
      FROM (SELECT SUBSTRING(list, n, LEN(list))
            FROM (SELECT @delimiter + @search_list + @delimiter) AS L(list),
                (SELECT ROW_NUMBER() OVER (ORDER BY person_id)
                  FROM MyUsers) AS Nums(n)
            WHERE n <= LEN(list)) AS D(string)
      WHERE LEN(string) > 1
        AND SUBSTRING(string, 1, 1) = @delimiter)
ORDER BY person_id;


-- Results

person_id person_name
--------- ------------------------------
1327      Joe
1342      John F.
1411      Mary

Notes:

– This algorithm is based on walking the delimited string (could be in a table column or a variable) and parsing it into a table format that can feed the IN list.

– The use of ROW_NUMBER assumes SQL Server 2005. However, the same can be accomplished in SQL Server 2000. The subquery (SELECT ROW_NUMBER() OVER (ORDER BY person_id) FROM MyUsers) simply generates a table of numbers. In SQL Server 2000 you can create a temp table and generate sequential numbers. The only requirement is to have at least as many numbers as the length of the string to parse, including the delimiters added at both ends (in this case ‘1327,1342,1411’ plus two delimiters is 16 characters, and MyUsers has 18 rows). These numbers are used as an index in the walk process.
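As a rough sketch of the SQL Server 2000 variant described in the notes, the numbers can be generated into a temporary table with the IDENTITY function. The table name #Nums and the limit of 100 are assumptions for this example, and it relies on the cross join of sysobjects returning at least 100 rows. Run it as a separate batch from the query above, since it declares the same variables.

```sql
-- Build a temporary table of sequential numbers without ROW_NUMBER.
-- TOP 100 is an assumed upper bound; it must be at least the length
-- of the delimited string being parsed.
SELECT TOP 100 IDENTITY(INT, 1, 1) AS n
INTO #Nums
FROM sysobjects AS A
CROSS JOIN sysobjects AS B;

DECLARE @search_list VARCHAR(100);
DECLARE @delimiter CHAR(1);

SELECT @search_list = '1327,1342,1411',
       @delimiter = ',';

-- Same query as before, with #Nums replacing the ROW_NUMBER subquery
SELECT person_id, person_name
FROM MyUsers
WHERE person_id IN
    (SELECT SUBSTRING(string, 2, CHARINDEX(@delimiter, string, 2) - 2)
      FROM (SELECT SUBSTRING(list, n, LEN(list))
            FROM (SELECT @delimiter + @search_list + @delimiter) AS L(list),
                 #Nums
            WHERE n <= LEN(list)) AS D(string)
      WHERE LEN(string) > 1
        AND SUBSTRING(string, 1, 1) = @delimiter)
ORDER BY person_id;

DROP TABLE #Nums;
```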

Convert Tree Structure From Nested Set Into Adjacency List

Tree structures are often represented in the nested set model or the adjacency list model. In the nested set model each node has a left and a right value, where the root always has 1 in its left column and twice the number of nodes in its right column. The adjacency list model, on the other hand, uses a linking column (child/parent) to represent the hierarchy.

Sometimes there is a need to convert a nested set model into an adjacency list model. Here is one example of doing that:

CREATE TABLE NestedSet (
 node CHAR(1) NOT NULL PRIMARY KEY,
 lf INT NOT NULL,
 rg INT NOT NULL);

INSERT INTO NestedSet VALUES ('A', 1, 8);
INSERT INTO NestedSet VALUES ('B', 2, 3);
INSERT INTO NestedSet VALUES ('C', 4, 7);
INSERT INTO NestedSet VALUES ('D', 5, 6);

CREATE TABLE AdjacencyList (
 node CHAR(1) NOT NULL PRIMARY KEY,
 parent CHAR(1) NULL);

INSERT INTO AdjacencyList
SELECT A.node,
       B.node AS parent
FROM NestedSet AS A
LEFT OUTER JOIN NestedSet AS B
  ON B.lf = (SELECT MAX(C.lf)
            FROM NestedSet AS C
            WHERE A.lf > C.lf
               AND A.lf < C.rg);


-- Results

node   parent
------ --------
A      NULL
B      A
C      A
D      C
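One way to double-check the converted data is to walk the adjacency list from the root with a recursive CTE and look at the level of each node. This is only a sketch; the CTE name Tree and the column lvl are made up for the example.

```sql
WITH Tree (node, parent, lvl)
AS
(-- Anchor: the root is the node without a parent
 SELECT node, parent, 0
 FROM AdjacencyList
 WHERE parent IS NULL
 UNION ALL
 -- Recursive step: attach the children of the nodes found so far
 SELECT A.node, A.parent, T.lvl + 1
 FROM AdjacencyList AS A
 JOIN Tree AS T
   ON A.parent = T.node)
SELECT node, parent, lvl
FROM Tree
ORDER BY lvl, node;
```

With the sample data, A comes out at level 0, B and C at level 1, and D at level 2, which matches the nesting of the left and right values in NestedSet.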

Additional resources:

Book: “Trees and Hierarchies in SQL for Smarties” by Joe Celko

Adjacency List Model
http://www.sqlsummit.com/AdjacencyList.htm

Trees in SQL
http://www.intelligententerprise.com/001020/celko.jhtml?_requestid=1266295

Row Number in SQL Server

Every once in a while there is a need to pull the rows with a row number. Here are a few solutions that I have seen to work well.

Below is a sample table with employees that has employee name and employee address columns:

CREATE TABLE Employees (
 employee_name VARCHAR(50) PRIMARY KEY,
 employee_address VARCHAR(100));

INSERT INTO Employees (employee_name, employee_address)
VALUES ('Blake Anderson', '2048 River View Rd.');
INSERT INTO Employees (employee_name, employee_address)
VALUES ('Ana Williams', '9055 East Blvd.');
INSERT INTO Employees (employee_name, employee_address)
VALUES ('Robert Schmidt', '3400 Windsor Street');
INSERT INTO Employees (employee_name, employee_address)
VALUES ('Sarah Reese', '1045 Coral Rd.');

SQL Server 2000 and SQL Server 2005

Using an IDENTITY column and a temporary table

This solution is based on creating a temporary table with IDENTITY column used to provide a row number. This approach provides very good performance. Here are the steps:

-- Create the temp table
CREATE TABLE #EmployeeRowNumber (
 rn INT IDENTITY (1, 1),
 employee_name VARCHAR(50),
 employee_address VARCHAR(100));

-- Generate the row number
-- To achieve an ordered list the names are sorted
INSERT #EmployeeRowNumber (employee_name, employee_address)
SELECT employee_name, employee_address
FROM Employees
ORDER BY employee_name;

-- Select the row number
SELECT employee_name, employee_address, rn
FROM #EmployeeRowNumber
ORDER BY rn;



-- Results

employee_name    employee_address         rn
---------------- ------------------------ -----------
Ana Williams     9055 East Blvd.          1
Blake Anderson   2048 River View Rd.      2
Robert Schmidt   3400 Windsor Street      3
Sarah Reese      1045 Coral Rd.           4

Using a subquery to count the number of rows

This solution is based on using a subquery on a unique column (or combination of columns) to count the number of rows. Here is how it looks with the sample data provided above:

SELECT employee_name, employee_address,
      (SELECT COUNT(*)
       FROM Employees AS E2
       WHERE E2.employee_name <= E1.employee_name) AS rn
FROM Employees AS E1
ORDER BY employee_name;

If the values in the column are not unique, duplicate row numbers will be generated. That can be resolved by adding a tiebreaker column that guarantees uniqueness. This approach is slower than using an IDENTITY column and a temporary table: since it incurs (n + n²)/2 row scans, it may not be practical on a large table.
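A tiebreaker could be added along these lines. In this sketch the rows are numbered by employee_address, which is not guaranteed to be unique, and employee_name (the primary key) breaks the ties; ordering by address is only an assumption made for the example.

```sql
SELECT employee_name, employee_address,
      (SELECT COUNT(*)
       FROM Employees AS E2
       WHERE E2.employee_address < E1.employee_address
          OR (E2.employee_address = E1.employee_address
              AND E2.employee_name <= E1.employee_name)) AS rn
FROM Employees AS E1
ORDER BY employee_address, employee_name;
```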

SQL Server 2005

Using the ROW_NUMBER function

In SQL Server 2005 the new function ROW_NUMBER provides the fastest approach to solve the problem:

SELECT employee_name, employee_address,
       ROW_NUMBER() OVER(ORDER BY employee_name) AS rn
FROM Employees
ORDER BY employee_name;
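ROW_NUMBER cannot be referenced directly in the WHERE clause of the same query, so pulling only a range of row numbers takes a CTE or a derived table. A minimal sketch, assuming the CTE name Numbered and an arbitrary range of rows 2 through 3:

```sql
WITH Numbered (employee_name, employee_address, rn)
AS
(SELECT employee_name, employee_address,
        ROW_NUMBER() OVER(ORDER BY employee_name)
 FROM Employees)
SELECT employee_name, employee_address, rn
FROM Numbered
WHERE rn BETWEEN 2 AND 3
ORDER BY rn;
```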

Additional resources:

How to dynamically number rows in a SELECT Transact-SQL statement: http://support.microsoft.com/kb/186133

SQL Server 2005 Ranking Functions: http://msdn2.microsoft.com/en-us/library/ms189798.aspx

Book: “Inside Microsoft SQL Server 2005: T-SQL Querying” by Itzik Ben-Gan, Lubor Kollar and Dejan Sarka