Posts

T-SQL Enhancements in SQL Server 2011 (CTP1)

SQL Server 2011 (code named Denali) CTP1 was announced in November 2010 during the SQL PASS Summit in Seattle. While a bit disappointing not to see the much anticipated full implementation of the window functions (hope we will still see that in a future CTP version), it offers some interesting new programmability features. These new enhancements address specific problems that we see very often in business applications. Here is a quick look at the key new features in T-SQL.

Feature: OFFSET
Application use: paging
Comments: provides simplified syntax and efficient method for data paging solutions

Listing 1: OFFSET example
CREATE TABLE Customers (
customer_nbr INT NOT NULL PRIMARY KEY,
customer_name VARCHAR(35) NOT NULL);

INSERT INTO Customers
VALUES
(1, 'Joe'),
(
2, 'John'),
(
3, 'Jane'),
(
4, 'Peter'),
(
5, 'Mary'),
(
6, 'Jose'),
(
7, 'Daniel'),
(
8, 'Adam'),
(
9, 'Chris'),
(
10, 'Tom'),
(
11, 'Evan'),
(
12, 'Lora');

SELECT customer_name
FROM Customers
ORDER BY customer_name;

/*

customer_name
---------------
Adam
Chris
Daniel
Evan
Jane
Joe
John
Jose
Lora
Mary
Peter
Tom

*/

DECLARE @page_nbr INT = 1, @page_size INT = 5;

-- first page
SELECT customer_name
FROM Customers
ORDER BY customer_name
OFFSET
(@page_nbr - 1) * @page_size ROWS FETCH NEXT @page_size ROWS ONLY;

/*

customer_name
---------------
Adam
Chris
Daniel
Evan
Jane

*/

-- second page
SET @page_nbr = 2;

SELECT customer_name
FROM Customers
ORDER BY customer_name
OFFSET
(@page_nbr - 1) * @page_size ROWS FETCH NEXT @page_size ROWS ONLY;

/*

customer_name
---------------
Joe
John
Jose
Lora
Mary

*/

-- last page
SELECT customer_name
FROM Customers
ORDER BY customer_name
OFFSET
((SELECT COUNT(*) FROM Customers) / @page_size) * @page_size
ROWS FETCH NEXT @page_size ROWS ONLY;

/*

customer_name
---------------
Peter
Tom

*/

-- any 3 customers
SELECT customer_name
FROM Customers
ORDER BY (SELECT NULL)
OFFSET 0 ROWS FETCH NEXT 3 ROWS ONLY;

/*

customer_name
---------------
Joe
John
Jane

*/

DROP TABLE Customers;

Feature: THROW
Application use: error handling
Comments: allow to re-throw the original error

Listing 2: THROW example
BEGIN TRY    

  -- inside code THROW is similar to RAISERROR with limitations
  
THROW 51000, 'User error.', 1;

END TRY

BEGIN CATCH    

   -- inside CATCH rethrow the error
  
THROW;
  
END CATCH

Feature: SEQUENCE
Application use:  replacement for IDENTITY
Comments: ANSI standard method for sequences, improves on shortcomings of IDENTITY

Listing 3: SEQUENCE example
CREATE TABLE Customers (
customer_nbr INT NOT NULL PRIMARY KEY,
customer_name VARCHAR(35) NOT NULL);

-- create sequence starting with value 1, minimum value 1,
-- no maximum value, and increment by 1
CREATE SEQUENCE CustomerNbr AS INT    
MINVALUE 1    
NO MAXVALUE    
START
WITH 1
INCREMENT
BY 1;

-- generate customer numbers based on the sequence
-- using the NEXT VALUE FOR function
INSERT INTO Customers (customer_nbr, customer_name)
SELECT NEXT VALUE FOR CustomerNbr, 'Joe' UNION ALL
SELECT NEXT VALUE FOR CustomerNbr, 'John' UNION ALL
SELECT NEXT VALUE FOR CustomerNbr, 'Jane' UNION ALL
SELECT NEXT VALUE FOR CustomerNbr, 'Peter' UNION ALL
SELECT NEXT VALUE FOR CustomerNbr, 'Mary';

SELECT customer_nbr, customer_name
FROM Customers
ORDER BY customer_nbr;

/*

customer_nbr customer_name
------------ ---------------
1            Joe
2            John
3            Jane
4            Peter
5            Mary

*/

-- use OVER to generate next set of sequence numbers
-- based on ordering by customer name
INSERT INTO Customers (customer_nbr, customer_name)
SELECT NEXT VALUE FOR CustomerNbr OVER(ORDER BY customer_name), customer_name
FROM (
VALUES ('Jose'), ('Daniel'), ('Adam'), ('Chris'), ('Tom')) AS T(customer_name);

SELECT customer_nbr, customer_name
FROM Customers
ORDER BY customer_nbr;

/*

customer_nbr customer_name
------------ ---------------
1            Joe
2            John
3            Jane
4            Peter
5            Mary
6            Adam
7            Chris
8            Daniel
9            Jose
10           Tom

*/

-- alter sequence to set next number to 20
-- and increment by 10
ALTER SEQUENCE CustomerNbr    
RESTART
WITH 20    
INCREMENT
BY 10;

INSERT INTO Customers (customer_nbr, customer_name)
SELECT NEXT VALUE FOR CustomerNbr, 'Evan' UNION ALL
SELECT NEXT VALUE FOR CustomerNbr, 'Lora';

SELECT customer_nbr, customer_name
FROM Customers
ORDER BY customer_nbr;

/*

customer_nbr customer_name
------------ ---------------
1            Joe
2            John
3            Jane
4            Peter
5            Mary
6            Adam
7            Chris
8            Daniel
9            Jose
10           Tom
20           Evan
30           Lora

*/

-- reset sequence to 1
ALTER SEQUENCE CustomerNbr    
RESTART
WITH 1    
INCREMENT
BY 1;

DECLARE @first_value SQL_VARIANT, @last_value SQL_VARIANT;

-- get a range of 5 sequence values: 1, 2, 3, 4, 5
-- next available value is 6
EXEC sp_sequence_get_range
  
@sequence_name = N'CustomerNbr',
  
@range_size = 5,
  
@range_first_value = @first_value OUTPUT,
  
@range_last_value = @last_value OUTPUT;

SELECT @first_value AS range_first_value,
      
@last_value AS range_last_value;

/*

range_first_value   range_last_value
------------------- ------------------
1                   5

*/

SELECT NEXT VALUE FOR CustomerNbr AS next_value;

/*

next_value
-----------
6

*/

DROP TABLE Customers;
DROP SEQUENCE CustomerNbr;

Feature: EXECUTE WITH RESULT SETS
Application use: manipulate stored procedure output result set
Comments: capabilities to rename output result set columns without changing the original stored procedure; no options to remove/add columns or remove a result set when multiple result sets are returned

Listing 4: EXECUTE WITH RESULT SETS example
CREATE PROCEDURE CalculateSales
AS

SELECT sale_month, sale_amount
FROM (VALUES('2010-01', 120.50),
             (
'2010-02', 214.00),
             (
'2010-03', 109.10)) AS T(sale_month, sale_amount);

SELECT SUM(sale_amount) AS total_sales
FROM (VALUES('2010-01', 120.50),
             (
'2010-02', 214.00),
             (
'2010-03', 109.10)) AS T(sale_month, sale_amount);

GO

EXECUTE CalculateSales;

/*

sale_month sale_amount
---------- -------------
2010-01    120.50
2010-02    214.00
2010-03    109.10

(3 row(s) affected)

total_sales
-------------
443.60

(1 row(s) affected)

*/

EXECUTE CalculateSales
WITH RESULT SETS
(
    (
month CHAR(7), amount DECIMAL(10, 2)),

    (total DECIMAL(10, 2))
);

/*

month   amount
------- --------
2010-01 120.50
2010-02 214.00
2010-03 109.10

(3 row(s) affected)

total
--------
443.60

(1 row(s) affected)

*/

DROP PROCEDURE CalculateSales; </br/>

Feature: describe result sets
Application use: determining the format of a response without actually running the query
Comments: replaces SET FMTONLY

Listing 5: Describe result sets example
CREATE PROCEDURE CalculateSales
AS

SELECT sale_month, sale_amount
FROM (VALUES('2010-01', 120.50),
             (
'2010-02', 214.00),
             (
'2010-03', 109.10)) AS T(sale_month, sale_amount);

SELECT SUM(sale_amount) AS total_sales
FROM (VALUES('2010-01', 120.50),
             (
'2010-02', 214.00),
             (
'2010-03', 109.10)) AS T(sale_month, sale_amount);

GO

-- replacement for SET FMTONLY
EXECUTE sp_describe_first_result_set N'CalculateSales';

/*

abbreviated results

is_hidden column_ordinal name         system_type_name  
--------- -------------- ------------ -----------------
0         1              sale_month   varchar(7)            
0         2              sale_amount  numeric(5,2)          

*/

-- use DMV
SELECT name, system_type_name
FROM sys.dm_exec_describe_first_result_set(N'CalculateSales', 1, 1);

CREATE TABLE Foo (
keycol INT NOT NULL PRIMARY KEY,
datacol CHAR(10));

EXEC sp_describe_undeclared_parameters N'SELECT datacol FROM Foo WHERE keycol = @x';

/*

abbreviated results

parameter_ordinal name  suggested_system_type_name  
----------------- ----- ----------------------------
1                 @x    int                          

*/

GO

DROP TABLE Foo;

DROP PROCEDURE CalculateSales; </br/>

Bonus feature (maybe): FORMATMESSAGE
Application use: format messages (C/C++ sprint style)
Comments: undocumented feature allows to format message that is not in sys.messages

Listing 6: FORMATMESSAGE example
SELECT FORMATMESSAGE('There are %d products in department %s.', 10, 'remodeling');

/*

There are 10 products in department remodeling.

*/ </br/>

The Power of Simplicity

Solving a problem very often results in unnecessary complex solutions. One of the first lessons I learned from my math teacher was to scrap any solution that exceeds a page. She would urge me to start all over and look for simpler way to resolve it. In her world there was always a short and simple solution, it was only a matter of seeing it.
     I find this rule applicable to any type of problem. Only the dimension of the page size changes according to the subject matter. Many believe that finding the simpler and better solution is to “think outside the box”. But in my opinion it is exactly the opposite – to think inside the box. Know the fundamentals of your area of expertise, systematically apply them, and you will find a simple and elegant solution! Isaac Newton did not just discover the gravity when an apple fell on his head (if at all it did). It took him 20 years of hard work to explain gravity!
     In the world of SQL it drills down to deep understanding of the set based nature of SQL and coming up with solution based on that. Thinking like a procedural programmer will not help.

What are the rules to simplicity? There are no rules! I like to use the following quotes as guidelines:

The simplest way to achieve simplicity is through thoughtful reduction.
   John Maeda, The Laws of Simplicity

Make everything as simple as possible, but not simpler.
   Albert Einstein

Let’s illustrate this with one example in SQL. In out sample scenario the request is to retrieve a list of customers who always order the exact same product (regardless of what the product is).  This is a very valid business problem because you may want to send targeted coupons to customers who always buy the same products.
     There are different ways to solve this problem and Listing 1 shows one method. It is very close to describing the solution in plain English: select all customers where the customer has no other orders with different product SKU.

Listing 1
SELECT DISTINCT customer_nbr
FROM Orders AS O
WHERE NOT EXISTS(SELECT *
                
FROM Orders AS O1
                
WHERE O1.customer_nbr = O.customer_nbr
                  
AND O1.sku <> O.sku);

Is this the simplest way to solve the problem? This query is set based but still in a way mimics procedural thinking – examine all other customer orders and check that there is no other order with different product SKU.
     If you think about the set of all customer orders, you will notice that these that we need have repeating attribute values, that is the same product SKU. Applying the MIN and MAX aggregate functions on that attribute will return the same value. Then here is our simplified solution: retrieve all customers that have equal MIN and MAX product SKU on all orders. Listing 2 shows the query.

Listing 2
SELECT customer_nbr
FROM Orders
GROUP BY customer_nbr
HAVING MIN(sku) = MAX(sku);

This is more elegant and simpler solution!  Thinking more about the set of all customer orders you will notice that the distinct count of product SKUs is 1 for the customers in the needed result set. That brings us to another elegant solution:

Listing 3
SELECT customer_nbr
FROM Orders
GROUP BY customer_nbr
HAVING COUNT(DISTINCT sku) = 1;

Pretty and simple, right? Try it next time when you see that two page query!

It’s a Matter of Style

W riting SQL can be very enjoyable activity. Reading SQL can be also enjoyable (maybe like reading poetry to some), or very unpleasant… How do you write SQL with style that results in eye pleasing and easy to read/understand code? And does it matter?
    Sometimes code writing drills down to concentrating on the task at hand and producing a brilliant piece of code, which looks like this:

Listing 1
select c.customer_name, o.order_amt,
d.qty from customers c left outer join
orders o on c.customer_nbr = o.customer_nbr
left outer join orderdetails d on d.order_nbr =
o.order_nbr and d.sku = 101

Or maybe like this:

Listing 2
SELECT C.CUSTOMER_NAME, O.ORDER_AMT,
D.QTY FROM CUSTOMERS C LEFT OUTER JOIN
ORDERS O ON C.CUSTOMER_NBR = O.CUSTOMER_NBR
LEFT OUTER JOIN ORDERDETAILS D ON D.ORDER_NBR =
O.ORDER_NBR AND D.SKU = 101

While this code performs exceptionally and solves the problem in a very clever way, is it really that good? What happens when the code review/test team gets their turn? Or when you/someone else has to modify it two years from now? To my opinion this code is a very long way from what a real production code should be. And yes, this is very real and it happens every day, even as I type this. Just pay attention on the next code review, or take a look at any online SQL forum (and no, it is not only the people that ask questions, unfortunately  many SQL gurus that know it all would post an answer with similar “example” style).
    How do you make this code look better? The answer is in the four basic principles of design: contrast, repetition, alignment, and proximity. Let’s look how applying these principles of design (which many think are applicable only to graphic design) can lead to stylish and enjoyable code.

Contrast
The idea is to use contrast for elements that a very different. One example is columns and reserved keyword. They are not the same and the code should make that distinction very clear. Let’s apply that:

Listing 3
SELECT C.customer_name...

Here the SELECT keyword is capitalized to differentiate from the lower case column name. Also, the table alias is capitalized to indicate clearly the table source.

Repetition
Repeating the same element styles for all similar items adds consistency and organization throughout code. For example, repeat and maintain capitalization for all keyword, do not mix style in different context of the code. Like the style of the SELECT and FROM reserved keywords in Listing 4.

Listing 4
SELECT C.customer_name... FROM Customers AS C...

This allows to “visualize” the shape of the query code. Now the eye can easily flow from one section of code to the next one and concentrate on each element.

Alignment
Code elements should not be placed arbitrarily on the lines of code. Every code item should have some visual connection with another item in the code.  One example is aligning the start of each clause of the query (SELECT, FROM, WHERE, etc.) on a new line:

Listing 5
SELECT C.customer_name...
FROM Customers AS C...

Alignment creates a clean and pleasing look of the code structure.

Proximity
Code items that relate to each other should be grouped close together. When several items are in close proximity they become one visual unit. Like placing SELECT and column names together on the line, similar for FROM and table names or WHERE and predicates. Listing 6 demonstrates this.

Listing 6
SELECT C.customer_name, O.order_amt...
FROM Customers AS C
LEFT OUTER JOIN Orders AS O...

This makes the code structure very clear and eliminates clutter.

Let’s apply all four principles to the initial query. Here is one way it may look:

Listing 7
SELECT C.customer_name, O.order_amt, D.qty
FROM Customers AS C
LEFT OUTER JOIN Orders AS O
 
ON C.customer_nbr = O.customer_nbr
LEFT OUTER JOIN OrderDetails AS D
 
ON D.order_nbr = O.order_nbr
 AND D.sku = 101;

I added a couple extra styling elements (compared to the original query), can you catch them?

Another form of alignment is this:

Listing 8
         SELECT C.customer_name, O.order_amt, D.qty
          
FROM Customers AS C
LEFT OUTER JOIN Orders AS O
            
ON C.customer_nbr = O.customer_nbr
LEFT OUTER JOIN OrderDetails AS D
            
ON D.order_nbr = O.order_nbr
           
AND D.sku = 101;

There are many different ways to style your SQL. You may agree or disagree with some elements, but the bottom line is this: style matters!

Refactoring Entity-Attribute-Value Design

Entity-Attribute-Value (often referenced as EAV) table design is one of the “wonderful” things that newcomers to the SQL land discover very quickly! Traditional procedural programmers have very hard time understanding why tables have fixed number of columns, and adding a new column requires table and possibly code change. Their great problem solving skills from the procedural world help them find a “natural” solution – the EAV design – where a pair of key (describing the attribute) and value allow to store data with flexibility.
    Example is storing all employee information in 3 columns: key column to identify employee (entity), attribute key column (attribute) to identify the attribute stored (name, birth date, SSN, date of hire, etc.), and value column (value) to store the actual value.

More on EAV…
Tony Andrews has an excellent article explaining the problems with EAV design: OTLT and EAV: the two big design mistakes all beginners make. It is worth to note that in some cases where it is required to store many attributes that change very often (new attributes added, or existing attributes deleted) and there is no need for referential or domain integrity and query logic based on these attributes then it may be a valid scenario to utilize EAV design.

Normally the value column has to be of some large length character data type, like VARCHAR(200), and data has to be casted to text when saved and then casted back to the original data type when retrieved. 
    Once all coding is done then comes the realization of how bad this approach is… And sometimes you just walk into a project where EAV has been already deployed and now it is your job to fix it. The typical signs of EAV implementation is the existence of very complex (and slow) queries to satisfy relatively trivial user requests for data and the lack of data integrity.

How do you fix EAV design?
Here is one example of EAV design and the refactoring process to correct it. In our case the information for bank loans is stored in a table representing EAV design. Listing 1 below shows the scripts to create the table and insert sample data.

Listing 1
-- EAV table
CREATE TABLE EAV_Loans (
 
loan_nbr INT NOT NULL,
 
customer_nbr INT NOT NULL,
 
code VARCHAR(30) NOT NULL,
 
value VARCHAR(200),
 
CONSTRAINT pk_eav_loans
 
PRIMARY KEY (loan_nbr, customer_nbr, code));

INSERT INTO EAV_Loans
(loan_nbr, customer_nbr, code, value)
VALUES(1, 1, 'date', '20080110');
INSERT INTO EAV_Loans
(loan_nbr, customer_nbr, code, value)
VALUES(1, 1, 'amount', '1500.00');
INSERT INTO EAV_Loans
(loan_nbr, customer_nbr, code, value)
VALUES(1, 1, 'type', 'personal');
INSERT INTO EAV_Loans
(loan_nbr, customer_nbr, code, value)
VALUES(2, 2, 'date', '20080215');
INSERT INTO EAV_Loans
(loan_nbr, customer_nbr, code, value)
VALUES(2, 2, 'amount', '3500.00');
INSERT INTO EAV_Loans
(loan_nbr, customer_nbr, code, value)
VALUES(2, 2, 'type', 'personal');

SELECT loan_nbr, customer_nbr, code, value
FROM EAV_Loans;

/*

loan_nbr    customer_nbr code      value
----------- ------------ --------- ----------
1           1            amount    1500.00
1           1            date      20080110
1           1            type      personal
2           2            amount    3500.00
2           2            date      20080215
2           2            type      personal

*/

Here the pair of loan number and customer number identifies the entity. The code column stores the attribute (amount, date, and type of loan) and the value column represents that actual value.
    A quick demonstration of the problems caused by this design. Let’s write a query to retrieve customers with personal loans over 1,000 for the period Jan 1, 2008 through Jan 31, 2008. Here is a first attempt to satisfy the request:

Listing 2
SELECT A.loan_nbr,
      
A.customer_nbr,
      
CAST(A.value AS DATETIME) AS loan_date,
      
CAST(B.value AS DECIMAL(15, 2)) AS loan_amount
FROM EAV_Loans AS A
INNER JOIN EAV_Loans AS B
  
ON A.loan_nbr = B.loan_nbr
 
AND A.customer_nbr = B.customer_nbr
INNER JOIN EAV_Loans AS C
  
ON A.loan_nbr = C.loan_nbr
 
AND A.customer_nbr = C.customer_nbr
WHERE A.code = 'date'
 
AND CAST(A.value AS DATETIME) >= '20080101'
 
AND CAST(A.value AS DATETIME) <  '20080201'
 
AND B.code = 'amount'
 
AND CAST(B.value AS DECIMAL(15, 2)) > 1000.00
 
AND C.code = 'type'
 
AND C.value = 'personal';

One of the first issues to notice is the multiple joins to retrieve each attribute. This is because each attribute is stored in separate row and we need to put back together the customer loan info. Besides that seems the code should be OK… except one little problem! The nature of the SQL is such that nothing guarantees that the predicates in the WHERE clause will be executed in the order listed. They will be expanded in the query plan and the cast to DATETIME or DECIMAL can be applied to the loan type value which will result in conversion error. I described this in more detail in my post Predicates in SQL.
    To fix this you may be tempted to try a different approach: using table expressions to retrieve each attribute. Listing 3 shows a version of such query.

Listing 3
SELECT A.loan_nbr,
      
A.customer_nbr,
      
loan_date,
      
loan_amount
FROM (SELECT loan_nbr, customer_nbr,
            
CAST(value AS DATETIME) AS loan_date
     
FROM EAV_Loans
     
WHERE code = 'date') AS A
INNER JOIN (SELECT loan_nbr, customer_nbr,
                  
CAST(value AS DECIMAL(15, 2)) AS loan_amount
           
FROM EAV_Loans
           
WHERE code = 'amount') AS B
  
ON A.loan_nbr = B.loan_nbr
 
AND A.customer_nbr = B.customer_nbr
INNER JOIN (SELECT loan_nbr, customer_nbr,
                  
value AS loan_type
           
FROM EAV_Loans
           
WHERE code = 'type') AS C
  
ON A.loan_nbr = C.loan_nbr
 
AND A.customer_nbr = C.customer_nbr
WHERE loan_date >= '20080101'
 
AND loan_date '20080201'
 
AND loan_amount > 1000.00
 
AND loan_type = 'personal';

However, this query has the same problem because the table expressions are expanded in the query plan and nothing guarantees the order of executing the predicates. One approach to retrieve the requested data successfully is to use a pivoting technique. The query below illustrates that.

Listing 4
SELECT loan_nbr,
      
customer_nbr,
      
loan_date,
      
loan_amount
FROM (
SELECT loan_nbr,
      
customer_nbr,
      
MAX(CASE WHEN code = 'date' THEN CAST(value AS DATETIME) END),
      
MAX(CASE WHEN code = 'amount' THEN CAST(value AS DECIMAL(15, 2)) END),
      
MAX(CASE WHEN code = 'type' THEN value END)
FROM EAV_Loans
GROUP BY loan_nbr, customer_nbr
     
) AS L(loan_nbr, customer_nbr, loan_date, loan_amount, loan_type)
WHERE loan_date >= '20080101'
 
AND loan_date '20080201'
 
AND loan_amount > 1000.00
 
AND loan_type = 'personal';

Here the CASE expressions guarantee that only the correct values will be casted to the specific data type. But then how efficient is this query for such a trivial request?

The refactoring process
Usually the first step of correcting an EAV design is to create a normalized table for storing the data. In our case the table structure can look like Listing 5 (note that here for clarify the loan type is spelled out as entire word; in real business scenario it may be sufficient to indicate only P or B; alternatively create table to store loan types and reference via a foreign key constraint).

Listing 5
CREATE TABLE Loans (
 
loan_nbr INT NOT NULL,
 
customer_nbr INT NOT NULL,
 
loan_date DATETIME NOT NULL,
 
loan_amount DECIMAL(15, 2) NOT NULL,
 
loan_type VARCHAR(10) NOT NULL,
 
CONSTRAINT ck_loan_type
 
CHECK (loan_type IN ('personal', 'business')),
 
CONSTRAINT pk_loans
 
PRIMARY KEY (loan_nbr));

Now each attribute is in separate column with appropriate data type. This guarantees the integrity of the data as well as now we can define constraints to restrict data values (like the CHECK constraint for loan type).
    The next step is to convert and transfer data from the old format to the new table. We already know that a pivoting technique works and can be used here. Here is the query to transfer from EAV to normalized format.

Listing 6
INSERT INTO Loans
(loan_nbr, customer_nbr, loan_date, loan_amount, loan_type)
SELECT loan_nbr,
      
customer_nbr,
      
MAX(CASE WHEN code = 'date' THEN CAST(value AS DATETIME) END),
      
MAX(CASE WHEN code = 'amount' THEN CAST(value AS DECIMAL(15, 2)) END),
      
MAX(CASE WHEN code = 'type' THEN value END)
FROM EAV_Loans
GROUP BY loan_nbr, customer_nbr;

The last step in the refactoring process is to replace the old EAV table with view for backward compatibility (if there is code referencing that original table that cannot be changed at this time).  There are different methods to accomplish this. For simplicity here we can use unpivoting using UNION and query for each attribute (note that you have to drop the original EAV table first because the view has the same name).

Listing 7
CREATE VIEW EAV_Loans
(loan_nbr, customer_nbr, code, value)
AS
SELECT
loan_nbr, customer_nbr,
      
CAST('date' AS VARCHAR(30)),
      
CONVERT(VARCHAR(200), loan_date, 112)
FROM Loans
UNION
SELECT
loan_nbr, customer_nbr,
      
CAST('amount' AS VARCHAR(30)),
      
CAST(loan_amount AS VARCHAR(200))
FROM Loans
UNION
SELECT
loan_nbr, customer_nbr,
      
CAST('type' AS VARCHAR(30)),
      
CAST(loan_type AS VARCHAR(200))
FROM Loans;

If there is any old code that references the original EAV table for data modifications you will have to add instead of triggers to the view to handle it.

Now, let’s answer the same data request to retrieve customers with personal loans over 1,000 for the period Jan 1, 2008 through Jan 31, 2008, this time using the normalized table.

Listing 8
SELECT loan_nbr, customer_nbr, loan_date, loan_amount
FROM Loans
WHERE loan_date >= '20080101'
 
AND loan_date '20080201'
 
AND loan_amount > 1000.00
 
AND loan_type = 'personal';

Simple and elegant, just as it should be!

Refactoring Ranges

Refactoring is misunderstood so many times, mostly by developers living by the motto “If it’s not broken don’t fix it”. But in most cases this is invalid argument and leads to keeping inefficient and difficult to maintain solutions around. With tools and languages evolving, as well as knowledge about specific technology, there are so many opportunities to improve existing code in a way that will make a huge difference in many aspects. As Sir Winston Churchill said “To improve is to change; to be perfect is to change often”.

One example of refactoring that can lead to better code is the problem of finding ranges of existing values in SQL. A range (or also known as “island”) refers to finding and grouping consecutive values (dates, numbers, etc.) to a row representing the start point and the end point of the range. There are many variations of this problem but in essence the logic always drills down to finding a grouping factor to collapse the values to a range.

Let’s look at one example to illustrate the refactoring process. Given Sales table containing sale transactions our goal is to find:

1). Consecutive ranges of transaction numbers (transaction numbers that do not have gaps)

2). Consecutive ranges of sale dates (sale dates with no missing dates in the range)

These results can be used for variety of reporting and data analysis purposes. Here is how the Sales table looks:

CREATE TABLE Sales (
 
transaction_nbr INT NOT NULL PRIMARY KEY,
 
sale_date DATE NOT NULL,
 
amount DECIMAL(182) NOT NULL);

INSERT INTO Sales VALUES 
(1'20091201'100.00),
(
3'20091202'15.00),
(
4'20091203'102.50),
(
5'20091204'110.00),
(
6'20091207'98.25),
(
9'20091208'20.00),
(
11'20091209'160.00),
(
12'20091211'250.00);

SELECT transaction_nbrsale_dateamount
FROM Sales;

/*

transaction_nbr sale_date  amount
--------------- ---------- ----------
1               2009-12-01 100.00
3               2009-12-02 15.00
4               2009-12-03 102.50
5               2009-12-04 110.00
6               2009-12-07 98.25
9               2009-12-08 20.00
11              2009-12-09 160.00
12              2009-12-11 250.00

*/

The requirement is to return the following two result sets:

/*

Ranges by transaction_nbr:

range_start range_end
----------- -----------
1           1
3           6
9           9
11          12

Ranges by sale_date:

range_start range_end
----------- ----------
2009-12-01  2009-12-04
2009-12-07  2009-12-09
2009-12-11  2009-12-11

*/

On versions prior to SQL Server 2005, a very common solution is to find the grouping factor for the ranges by using a subquery to find the minimum value in the range and then group all consecutive values in the range based on that. For transaction numbers the query to find grouping factor looks like this:

SELECT transaction_nbr,
      (
SELECT MIN(B.transaction_nbr)
       
FROM Sales AS B
       
WHERE B.transaction_nbr >= A.transaction_nbr
         
AND NOT EXISTS
                (
SELECT *
                 
FROM Sales AS C
                 
WHERE C.transaction_nbr B.transaction_nbr 1)) AS grp
FROM Sales AS A;

Here is the result set:

/*

transaction_nbr grp
--------------- -----------
1               1
3               6
4               6
5               6
6               6
9               9
11              12
12              12

*/

It is easy to see the groups created by the subquery. Now the task to finalize ranges is very trivial, simply grouping by the grouping factor and retrieving the MIN and MAX values in the range:

SELECT MIN(transaction_nbrAS range_start
       
MAX(transaction_nbrAS range_end
FROM (
SELECT transaction_nbr,
      (
SELECT MIN(B.transaction_nbr)
       
FROM Sales AS B
       
WHERE B.transaction_nbr >= A.transaction_nbr
         
AND NOT EXISTS
                (
SELECT *
                 
FROM Sales AS C
                 
WHERE C.transaction_nbr B.transaction_nbr 1)) AS grp
FROM Sales AS AAS T
GROUP BY grp;

This query satisfies the first task to find ranges by transaction number:

/*

range_start range_end
----------- -----------
1           1
3           6
9           9
11          12

*/

The logic for finding ranges by sale date is very similar. The only difference is that the subquery to find the grouping factor uses the date/time functions in SQL Server to check if the dates are consecutive. Here is the query solving the second task and the corresponding result set:

SELECT MIN(sale_dateAS range_start
       
MAX(sale_dateAS range_end
FROM (
SELECT sale_date,
      (
SELECT MIN(B.sale_date)
       
FROM Sales AS B
       
WHERE B.sale_date >= A.sale_date
         
AND NOT EXISTS
                (
SELECT *
                 
FROM Sales AS C
                 
WHERE C.sale_date DATEADD(DAY1B.sale_date))) AS grp
FROM Sales AS AAS T
GROUP BY grp;


/*

range_start range_end
----------- ----------
2009-12-01  2009-12-04
2009-12-07  2009-12-09
2009-12-11  2009-12-11

*/

While in this example the query does not look so difficult, with more complex scenarios it can become very difficult to understand, and performance will not be great.

SQL Server 2005 introduced the ranking functions (ROW_NUMBER, RANK, DENSE_RANK, and NTILE). This provides a new tool to solve out problem in more simplified and efficient manner. We can use a very simple math to find the grouping factor. Take a look at the following query and the results:

SELECT transaction_nbr,
       
ROW_NUMBER() OVER(ORDER BY transaction_nbrAS rk,
       
transaction_nbr ROW_NUMBER() OVER(ORDER BY transaction_nbrAS grp
FROM Sales;

/*

transaction_nbr rk                   grp
--------------- -------------------- --------------------
1               1                    0
3               2                    1
4               3                    1
5               4                    1
6               5                    1
9               6                    3
11              7                    4
12              8                    4

*/

The query simply generates rank by transaction number, and defines expression subtracting the rank from the transaction number. Observing the transaction number and the rank columns it is easy to see that transaction numbers increase with 1 when there are no gaps, while ranks always increase with 1. Subtracting sequentially increasing numbers from set of numbers without gaps results in constant number (as both sequences increase with 1). When the set of numbers has gaps the subtraction results in different number. This is the base to define the grouping factor for our ranges.

Here is the final query to solve the first task to find ranges by transaction number:

SELECT MIN(transaction_nbrAS range_start
       
MAX(transaction_nbrAS range_end
FROM (
SELECT transaction_nbr,
       
transaction_nbr ROW_NUMBER() OVER(ORDER BY transaction_nbrAS grp
FROM Sales AS AAS T
GROUP BY grp;


/*

range_start range_end
----------- -----------
1           1
3           6
9           9
11          12

*/

Utilizing the ranking functions allows using new algorithm which results in simplified and better performing solution.

The solution for finding date ranges is very similar. Here are two versions with minor differences. The first version uses the difference between a fixed date (January 1, 2000) and the sale date value. This difference will generate sequential numeric values when there are no gaps between dates and will skip numbers when gaps exist.

-- Using number as grouping factor
SELECT MIN(sale_dateAS range_start
       
MAX(sale_dateAS range_end
FROM (
SELECT sale_date,
       
DATEDIFF(DAY'20000101'sale_date) - 
       
ROW_NUMBER() OVER(ORDER BY sale_dateAS grp
FROM Sales AS AAS T
GROUP BY grp;

/*

range_start range_end
----------- ----------
2009-12-01  2009-12-04
2009-12-07  2009-12-09
2009-12-11  2009-12-11

*/

The second version subtracts days (represented by rank based on sale date) from the sale date. This in effect generates constant date when sale dates are in sequence.

-- Using date as grouping factor
SELECT MIN(sale_dateAS range_start
       
MAX(sale_dateAS range_end
FROM (
SELECT sale_date,
       
DATEADD(DAY, -ROW_NUMBER() OVER(ORDER BY sale_date), sale_dateAS grp
FROM Sales AS AAS T
GROUP BY grp;

The same technique can be used in many different scenarios. In more complex cases (like when partitioning by a column is required) there may be a need for multiple ranking functions. Here is one example:

Grouping with ROW_NUMBER
http://www.tangrainc.com/blog/2008/03/grouping-with-row_number/

Column Aggregates

We all know how to use the aggregate functions MIN, MAX, COUNT, etc. to calculate aggregates across rows. For example, using MAX we can determine the maximum value for group of rows. But what if we want to see the maximum value for each row across columns? There is no aggregate function that accepts list of columns…

Let’s look at example to illustrate this. Here is sample table with four data columns:

CREATE TABLE Foo (

 
keycol INT NOT NULL PRIMARY KEY,

 
col1 INT NOT NULL,

 
col2 INT NOT NULL,

 
col3 INT NOT NULL,

 
col4 INT NOT NULL);

 

INSERT INTO Foo VALUES(150110); 

INSERT INTO Foo VALUES(20031);

INSERT INTO Foo VALUES(30000);

INSERT INTO Foo VALUES(491228);

INSERT INTO Foo VALUES(58888);

Our goal is to calculate the maximum value for the four data columns for each row. With two or three columns it may be an easy task using a CASE expression. But adding more columns will make a very long and complex CASE expression. Is there an easier way?

One solution that provides a shortcut is utilizing the SQL Server capabilities to generate an XML result and then using XQuery to find the max value. Here is the query:

-- max across columns with XQuery

SELECT c.query('keycol').value('.''INT'AS keycol,

       
c.value('local-name(./*[data(.)=

                max(../*[not(local-name(.)="keycol") ])][1])'


                
'VARCHAR(20)'AS max_col,

       
c.value('max(*[not(local-name(.)="keycol")])''FLOAT'AS max_val

FROM (SELECT keycolcol1col2col3col4

      
FROM Foo

      
FOR XML PATHTYPEAS T(x)

CROSS 
APPLY x.nodes('row'AS R(c);

The tricky part here is the filtering of the key column from the evaluation for max. Here are the results:

keycol  max_col  max_val

------- -------- --------

1       col4     10

2       col3     3

3       col1     0

4       col3     22

5       col1     8

How about counting the number of none zero values across columns? Here is a solution for that using very similar query:

-- count non-zero columns with XQuery

SELECT c.query('keycol').value('.''INT'AS keycol,

       
c.value('count(*[not(local-name(.)="keycol") 

                and not(.=0)])'


               
'INT')  AS cnt_non_zero

FROM (SELECT keycolcol1col2col3col4

      
FROM Foo

      
FOR XML PATHTYPEAS T(x)

CROSS 
APPLY x.nodes('row'AS R(c);

And the results:

keycol  cnt_non_zero

------- ------------

1       3

2       2

3       0

4       4

5       4

This shows the power of XQuery to solve this problem in untraditional way. The catch? Look at the execution plan of the queries…

More efficient method to solve the first problem is to use unpivoting. Here is solution using the UNPIVOT operator (the same can be done with ANSI query using cross join with numbers table):

-- max across columns with UNPIVOT

SELECT keycolcol AS max_colval AS max_val

FROM (

SELECT keycolvalcol

       
ROW_NUMBER() OVER(PARTITION BY keycol ORDER BY val DESCcolAS rk

FROM Foo

UNPIVOT

(val FOR col IN (col1col2col3col4)) AS UAS T

WHERE rk 1;

And very similar solution using unpivoting to solve the second problem:

-- count non-zero columns with UNPIVOT

SELECT keycolCOUNT(NULLIF(val0)) AS cnt_non_zero

FROM Foo

UNPIVOT

(val FOR col IN (col1col2col3col4)) AS U

GROUP BY keycol;

Outer Joins

Outer joins are one of the most widely misunderstood table operators. As useful as they are, it is very easy to incorrectly apply predicates in outer join queries and introduce logical bugs that are very difficult to troubleshoot (or spend hours trying to figure out why a particular query does not produce the desired results). This article demonstrates common misunderstanding about outer joins and how to properly use them.

First, let’s start with explanation of the logical processing of a query (listing here only the steps related to outer join operators, for full explanation of logical query processing read Anatomy of a Query).

1). FROM/CROSS JOIN: The FROM clause is processed to identify source tables; a cross join (Cartesian product) is formed between the two tables involved in the table operator.

2). ON: The ON clause predicates are applied and only rows that satisfy the predicate(s) (for which the predicates evaluate to TRUE) are included in the temporary result set.

3). ADD OUTER ROWS: The left side table in LEFT OUTER JOIN and right side table in RIGHT OUTER JOIN (or both tables in FULL OUTER JOIN) are the preserved tables. That means all rows (and selected attribute values) from this table(s) are present in the result set after the outer join operator is applied. At this phase the non-matching rows from the preserved table(s) are added back (non-matched based on the predicates in the ON clause). The attributes for the non-matched rows from the non-preserved table are added as NULLs.

4). WHERE: The predicates in the WHERE clause are applied and only rows for which the predicates evaluate to TRUE are included in the temporary result set.

The process repeats for any other table operators in the FROM clause taking the temporary result set from the prior table operator as left input.

The key concept here is to understand well phases 2, 3, and 4, and how filtering affects the result set. To illustrate this with example, let’s create two tables with loans and customers. Each customer can have personal or business loans (or both), or no loans at all.

CREATE TABLE Loans (

loan_nbr INT NOT NULL,

customer_nbr INT NOT NULL,

loan_date DATETIME NOT NULL,

loan_amount DECIMAL(15, 2) NOT NULL,

loan_type CHAR(1) NOT NULL,

CONSTRAINT ck_loan_type

CHECK (loan_type IN ('P', 'B')), -- P=Personal; B=Business

CONSTRAINT pk_loans

PRIMARY KEY (loan_nbr));






INSERT INTO Loans

(loan_nbr, customer_nbr, loan_date, loan_amount, loan_type)

VALUES(1, 1, '20080101', 1500.00, 'P');

INSERT INTO Loans

(loan_nbr, customer_nbr, loan_date, loan_amount, loan_type)

VALUES(2, 2, '20080215', 1000.00, 'P');

INSERT INTO Loans

(loan_nbr, customer_nbr, loan_date, loan_amount, loan_type)

VALUES(3, 1, '20080311', 5000.00, 'B');

INSERT INTO Loans

(loan_nbr, customer_nbr, loan_date, loan_amount, loan_type)

VALUES(4, 3, '20080312', 2000.00, 'P');

INSERT INTO Loans

(loan_nbr, customer_nbr, loan_date, loan_amount, loan_type)

VALUES(5, 4, '20080325', 1200.00, 'P');

INSERT INTO Loans

(loan_nbr, customer_nbr, loan_date, loan_amount, loan_type)

VALUES(6, 3, '20080327', 4000.00, 'B');

INSERT INTO Loans

(loan_nbr, customer_nbr, loan_date, loan_amount, loan_type)

VALUES(7, 5, '20080410', 3500.00, 'B');

INSERT INTO Loans

(loan_nbr, customer_nbr, loan_date, loan_amount, loan_type)

VALUES(8, 2, '20080412', 2000.00, 'P');




CREATE TABLE Customers (

customer_nbr INT NOT NULL,

customer_name VARCHAR(35),

PRIMARY KEY (customer_nbr));




INSERT INTO Customers (customer_nbr, customer_name)

VALUES(1, 'Peter Brown');

INSERT INTO Customers (customer_nbr, customer_name)

VALUES(2, 'Jeff Davis');

INSERT INTO Customers (customer_nbr, customer_name)

VALUES(3, 'Kevin Fox');

INSERT INTO Customers (customer_nbr, customer_name)

VALUES(4, 'Donna Louis');

INSERT INTO Customers (customer_nbr, customer_name)

VALUES(5, 'Greg Barrow');

INSERT INTO Customers (customer_nbr, customer_name)

VALUES(6, 'Richard Douglas');


-- Add foreign key for Loans

ALTER TABLE Loans

ADD CONSTRAINT fk_loans_customers

FOREIGN KEY (customer_nbr)

REFERENCES Customers(customer_nbr);

CASE #1: Predicates on the non-preserved table attributes

The first request is to retrieve list of all customers and only the total personal loan amount (if any) they have. The following query seems to describe the request properly:

SELECT C.customer_name,

SUM(COALESCE(L.loan_amount, 0)) AS total

FROM Customers AS C

LEFT OUTER JOIN Loans AS L

ON C.customer_nbr = L.customer_nbr

WHERE L.loan_type = 'P'

GROUP BY C.customer_name

ORDER BY customer_name;

However, the results do not seem correct:

customer_name    total

---------------- --------

Donna Louis 1200.00

Jeff Davis 3000.00

Kevin Fox 2000.00

Peter Brown 1500.00

Customers Greg and Richard are missing from the output. To understand the problems let’s run this step by step:

1). Cross join:

SELECT C.customer_name, L.loan_amount, L.loan_type

FROM Customers AS C

CROSS JOIN Loans AS L;

This returns Cartesian product (every possible combination of a row from the Customers table and a row from the Loans table).

2). ON predicates:

SELECT C.customer_name, L.loan_amount, L.loan_type

FROM Customers AS C

INNER JOIN Loans AS L

ON C.customer_nbr = L.customer_nbr;

This results in the following output:

customer_name   loan_amount  loan_type

--------------- ------------ ---------

Peter Brown 1500.00 P

Jeff Davis 1000.00 P

Peter Brown 5000.00 B

Kevin Fox 2000.00 P

Donna Louis 1200.00 P

Kevin Fox 4000.00 B

Greg Barrow 3500.00 B

Jeff Davis 2000.00 P

At this stage only rows that match based on the predicate for customer match are included.

3). Add outer rows:

SELECT C.customer_name, L.loan_amount, L.loan_type

FROM Customers AS C

LEFT OUTER JOIN Loans AS L

ON C.customer_nbr = L.customer_nbr;

The query returns this result set:

customer_name   loan_amount   loan_type

--------------- ------------- ---------

Peter Brown 1500.00 P

Peter Brown 5000.00 B

Jeff Davis 1000.00 P

Jeff Davis 2000.00 P

Kevin Fox 2000.00 P

Kevin Fox 4000.00 B

Donna Louis 1200.00 P

Greg Barrow 3500.00 B

Richard Douglas
NULL NULL

Here the outer rows are added, resulting in adding back a row for customer Richard, who has no loans at all and was excluded in the prior phase when the ON clause predicates were applied.

4). WHERE predicates:

SELECT C.customer_name, L.loan_amount, L.loan_type

FROM Customers AS C

LEFT OUTER JOIN Loans AS L

ON C.customer_nbr = L.customer_nbr

WHERE L.loan_type = 'P';

This result set is the base for the results in the first query attempt:

customer_name   loan_amount  loan_type

--------------- ------------ ---------

Peter Brown 1500.00 P

Jeff Davis 1000.00 P

Kevin Fox 2000.00 P

Donna Louis 1200.00 P

Jeff Davis 2000.00 P

Now it is very clear that the predicate in the WHERE clause filters the NULL for loan type for customer Richard, and customer Greg is excluded because he has only business loans. Incorrectly placing the predicate in the WHERE clause turns the outer join to inner join.

To correct this query it only requires to move the predicate on loan type from the WHERE clause to the ON clause:

SELECT C.customer_name,

SUM(COALESCE(L.loan_amount, 0)) AS total

FROM Customers AS C

LEFT OUTER JOIN Loans AS L

ON C.customer_nbr = L.customer_nbr

AND L.loan_type = 'P'

GROUP BY C.customer_name

ORDER BY customer_name;

This results in correct result set listing all customers and the total personal loan amount.

customer_name    total

---------------- --------

Donna Louis 1200.00

Greg Barrow 0.00

Jeff Davis 3000.00

Kevin Fox 2000.00

Peter Brown 1500.00

Richard Douglas 0.00

CASE #2: Predicates on the preserved table attributes

Let’s look at another example demonstrating how filtering affect the preserved table attributes. The request is to retrieve the total loan amount for customer Richard Douglas, even if Richard does not have any loans at all. Since it is required to return always Richard’s account info, the following query seems to satisfy the request:

SELECT C.customer_name,

SUM(COALESCE(L.loan_amount, 0)) AS total

FROM Customers AS C

LEFT OUTER JOIN Loans AS L

ON C.customer_nbr = L.customer_nbr

AND C.customer_name = 'Richard Douglas'

GROUP BY C.customer_name

ORDER BY customer_name;

However, the result set returns information for all customers:

customer_name   total

--------------- -----

Donna Louis 0.00

Greg Barrow 0.00

Jeff Davis 0.00

Kevin Fox 0.00

Peter Brown 0.00

Richard Douglas 0.00

As explained earlier, the reason for that is that outer rows are added back after the outer join predicates are applied, resulting in adding back all other customers. To get only Richard’s loan information requires moving the predicate filtering on customer name to the WHERE clause:

SELECT C.customer_name,

SUM(COALESCE(L.loan_amount, 0)) AS total

FROM Customers AS C

LEFT OUTER JOIN Loans AS L

ON C.customer_nbr = L.customer_nbr

WHERE C.customer_name = 'Richard Douglas'

GROUP BY C.customer_name

ORDER BY customer_name;

This results in correctly returning only Richard’s loan information:

customer_name    total

---------------- -----

Richard Douglas 0.00

Plan Freezing

Plan freezing refers to forcing a plan for specific query. There are different reasons to freeze a plan, one is to provide plan stability for solving parameter sniffing issues.

This functionality was first introduced in SQL Server 2005 via using the USE PLAN query hint or using a plan guide to force a plan. However, using this feature was rather difficult with the requirement to capture the exact SQL statement. That was normally done using SQL Profiler trace or via querying the Dynamic Management Objects. Another issue was that changing the metadata (like dropping an index that affects the plan) resulted in error when trying to use the plan guide.

SQL Server 2008 added an array of enhancements. First, it simplified the mechanism to create a plan guide from plan handle via the sp_create_plan_guide_from_handle stored procedure. This eliminates the need to extract and match the SQL statement. Another enhancement is that if the metadata changes then the plan guide is silently skipped and the query execution continues with whatever execution plan is best.

Below is example of utilizing the new capabilities in SQL Server 2008 to freeze a plan.

-- Clear procedure cache

DBCC FREEPROCCACHE;





GO



-- First query run to get plan for freezing

SET STATISTICS XML ON;





EXEC sp_executesql   

N'SELECT orderid, customerid, orderdate, shippeddate

FROM Orders

WHERE customerid = N''CACYK'''
;

   

SET STATISTICS XML OFF;



GO



-- Create plan quide based on plan 

DECLARE @plan_handle VARBINARY(64);





SELECT @plan_handle plan_handle 

FROM sys.dm_exec_query_stats AS 

CROSS APPLY sys.dm_exec_sql_text(S.sql_handleAS T

WHERE text LIKE '%Order%';





EXEC sp_create_plan_guide_from_handle 'PlanGuide2008'@plan_handle @plan_handle;





GO





-- Check that plan guide was created

SELECT FROM sys.plan_guides;





GO



-- Execute and verify UsePlan="1" and PlanGuideName="PlanGuide2008" 

-- in the XML plan output 



SET STATISTICS XML ON;





EXEC sp_executesql   

N'SELECT orderid, customerid, orderdate, shippeddate

FROM Orders

WHERE customerid = N''CACYK'''
;        

   

SET STATISTICS XML OFF;





GO





EXEC sp_control_plan_guide N'DROP'N'PlanGuide2008'

Unpivoting Multiple Columns

Unpivoting is the process of normalizing data. In earlier post I discussed unpivoting a single column. This demonstration is to show how to unpivot multiple columns.

The task is to normalize the following denormalized data, which represents product sales volumes by quarter:

product_nbr qtr1   sales_qtr1  qtr2   sales_qtr2  qtr3   sales_qtr3  qtr4   sales_qtr4

----------- ------ ----------- ------ ----------- ------ ----------- ------ -----------

1           2008Q1 100         2008Q2 20          2008Q3 15          2008Q4 10

2           2008Q1 80          2008Q2 15          2008Q3 20          2008Q4 10

3           2008Q1 70          2008Q2 5           2008Q3 10          2008Q4 15

Normalized data set should look like this:

product_nbr qtr    sales

----------- ------ -----------

1           2008Q1 100

1           2008Q2 20

1           2008Q3 15

1           2008Q4 10

2           2008Q1 80

2           2008Q2 15

2           2008Q3 20

2           2008Q4 10

3           2008Q1 70

3           2008Q2 5

3           2008Q3 10

3           2008Q4 15

The first method uses CROSS JOIN with table with numbers (needs one number for each quarter) and CASE expressions to select the appropriate value (quarter or sales volume) for each quarter.

SELECT product_nbr

       
CASE n

            
WHEN THEN qtr1

            
WHEN THEN qtr2

            
WHEN THEN qtr3

            
WHEN THEN qtr4

       
END AS qtr,

       
CASE n

            
WHEN THEN sales_qtr1

            
WHEN THEN sales_qtr2

            
WHEN THEN sales_qtr3

            
WHEN THEN sales_qtr4

       
END AS sales

FROM QuarterlySales AS S

CROSS JOIN 

(SELECT UNION 

 SELECT 
UNION

 SELECT 
UNION

 SELECT 
4AS Nums(n);

Alsmost seems natural that we should be able to do the same much easier using the UNPIVOT operator (introduced in SQL Server 2005). However, one of the limitations of the UNPIVOT operator is that it works only with a single column. But because SQL Server allows multiple table operators in the FROM clause, we can use two UNPIVOT operators. The catch is that the second UNPIVOT operator applies to the virtual table results from the first unpivot operator. That requires using a little trick to extract and match the quarter from the results of each UNPIVOT operator in order to produce the final result. Here is the query to unpivot using the UNPIVOT operator.

SELECT product_nbrqtrsales

FROM 

 
(SELECT product_nbr

         
qtr1sales_qtr1

         
qtr2sales_qtr2

         
qtr3sales_qtr3,

         
qtr4sales_qtr4

  
FROM QuarterlySalesAS S

UNPIVOT

 (qtr FOR qtrx IN (qtr1qtr2qtr3qtr4)) AS U1

UNPIVOT

 (sales FOR sales_qtrx IN (sales_qtr1sales_qtr2

                           
sales_qtr3sales_qtr4)) AS U2

WHERE RIGHT(sales_qtrx1RIGHT(qtrx1);

Pivoting on Multiple Columns

Pivoting data on more than one column is not a common request. But there are times when it is a very useful technique for reporting purposes. Here is one example to demonstrate different methods to handle that.

The goal is to report product data by pivoting columns with value and quantity for each product based on year. Here is the script to create the table and insert sample data.

CREATE TABLE Products (

product VARCHAR(30),

market_year INT,

value INT,

quantity INT,

CONSTRAINT pk_products

PRIMARY KEY (product, market_year));
INSERT INTO Products VALUES('Corn', 2003, 100, 20);
INSERT INTO Products VALUES('Corn', 2004, 200, 25);

INSERT INTO Products VALUES('Corn', 2005, 150, 30);

INSERT INTO Products VALUES('Corn', 2006, 150, 10);

The request is to produce the following output:

product v2003 q2003 v2004 q2004 v2005 q2005 v2006 q2006

------- ----- ----- ----- ----- ----- ----- ----- -----

Corn 100 20 200 25 150 30 150 10

As usual, the first suspect for pivoting solution is the CASE function. Using CASE expressions is the most flexible method to manipulate data for pivoting. There is not much difference when pivoting a single or multiple columns. Here is the solution with CASE:

SELECT product,

SUM(CASE WHEN market_year = 2003 THEN value ELSE 0 END) AS v2003,

SUM(CASE WHEN market_year = 2003 THEN quantity ELSE 0 END) AS q2003,

SUM(CASE WHEN market_year = 2004 THEN value ELSE 0 END) AS v2004,

SUM(CASE WHEN market_year = 2004 THEN quantity ELSE 0 END) AS q2004,

SUM(CASE WHEN market_year = 2005 THEN value ELSE 0 END) AS v2005,

SUM(CASE WHEN market_year = 2005 THEN quantity ELSE 0 END) AS q2005,

SUM(CASE WHEN market_year = 2006 THEN value ELSE 0 END) AS v2006,

SUM(CASE WHEN market_year = 2006 THEN quantity ELSE 0 END) AS q2006

FROM Products

GROUP BY product;

Next, let's look at the PIVOT operator that was added in SQL Server 2005. The PIVOT operator has a few limitations, and one of them is that it supports pivoting only on a single column. However, T-SQL allows to specify multiple PIVOT operators in the FROM clause, that way providing a solution for pivoting on multiple columns. There is one catch: the second PIVOT consumes the temporary result set output of the first PIVOT operator. Because of that the year column is not available (since it is used as pivoting column in the first PIVOT) and there is a need to define a virtual column that replicates the values for the year. Also, since the first PIVOT operator uses the year values as column names, the new virtual column subtracts 2000 from the year value to generate different column names for the second PIVOT operator. Here is the query using two PIVOT operators:

SELECT product,

MAX([2003]) AS v2003,

MAX([3]) AS q2003,

MAX([2004]) AS v2004,

MAX([4]) AS q2004,

MAX([2005]) AS v2005,

MAX([5]) AS q2005,

MAX([2006]) AS v2006,

MAX([6]) AS q2006

FROM (SELECT product, market_year,

market_year - 2000 AS market_year2,

SUM(value) AS value,

SUM(quantity) AS quantity

FROM Products

GROUP BY product, market_year) AS T

PIVOT

(SUM(value) FOR market_year IN

([2003], [2004], [2005], [2006])) AS P1

PIVOT

(SUM(quantity) FOR market_year2 IN

([3], [4], [5], [6])) AS P2

GROUP BY product;

In this particular case there is more elegant solution using a single PIVOT operator. With little math the value and the quantity can be combined into single column and then after the pivoting split back to two separate columns. Here is the solution using a single PIVOT operator:

SELECT product,

[2003] / 1000 AS v2003,

[2003] % 1000 AS q2003,

[2004] / 1000 AS v2004,

[2004] % 1000 AS q2004,

[2005] / 1000 AS v2005,

[2005] % 1000 AS q2005,

[2006] / 1000 AS v2006,

[2006] % 1000 AS q2006

FROM (SELECT product, market_year,

value * 1000 + quantity AS value

FROM Products) AS T

PIVOT

(SUM(value) FOR market_year IN

([2003], [2004], [2005], [2006])) AS P;