Anatomy of a Query

To write good and correct queries, it is very important to understand the logical processing of a query. Sadly, very few SQL programming manuals start with that. Here is a look at the internals of logical query processing.

First, a note: the logical processing of a query does not guarantee that the query will actually be executed that way. Since the cost-based optimizer of SQL Server goes through several steps to find the optimal execution plan, it may choose to process the query in a different way than the sequence of steps below. Examples are when a WHERE predicate is applied as part of a JOIN, or when a HAVING predicate is pushed down to the WHERE clause.

Logical query processing determines what the final result set will look like, regardless of how the database engine physically generates that result.

Here are the phases of logical query processing, in order:

FROM: This is where it all starts! The FROM clause defines the source table(s) for the query. If more than one table is included, then a Cartesian product (cross join) is performed between the first two tables. The result of the Cartesian product is every possible combination of a row from the first table and a row from the second table. That is, if one table has M rows and the other has N rows, then the result set will have M x N rows.

ON: Next, the ON filter is applied to the result set. Only rows that match the join condition are selected.

OUTER: If an OUTER join is used, then rows from the preserved table for which a match was not found are added back. In INNER joins this phase is skipped. If more than two tables are in the FROM clause, then FROM/ON/OUTER are applied between the current result set and the next table, until all tables are processed.

WHERE: The next step is to apply the WHERE filter to the result set. Only rows for which the WHERE conditions evaluate to TRUE (not UNKNOWN or FALSE) are selected.
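
For example, here is a minimal sketch (the Demo table and its data are hypothetical) showing how a comparison with NULL evaluates to UNKNOWN, so the row does not pass the filter:

CREATE TABLE Demo (id INT PRIMARY KEY, val INT NULL);
INSERT INTO Demo VALUES (1, 10);
INSERT INTO Demo VALUES (2, NULL);

-- val = NULL evaluates to UNKNOWN, so no rows pass the filter.
SELECT id FROM Demo WHERE val = NULL;
-- IS NULL is the correct test and returns row 2.
SELECT id FROM Demo WHERE val IS NULL;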

GROUP BY: The result set rows are divided into groups based on the column list specified in GROUP BY. Once grouping is performed, the output may contain only grouping columns, aggregate functions (e.g. COUNT, SUM, AVG, MIN, MAX), constants, or expressions made up of those. Important to note here is that NULL values are considered equal and are grouped into one group.
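
A minimal sketch (the Sales table and its data are hypothetical) showing how NULLs fall into a single group:

CREATE TABLE Sales (sale_id INT PRIMARY KEY, region VARCHAR(10) NULL);
INSERT INTO Sales VALUES (1, 'East');
INSERT INTO Sales VALUES (2, NULL);
INSERT INTO Sales VALUES (3, NULL);

-- Both NULL regions end up in the same group, so this
-- returns two rows: ('East', 1) and (NULL, 2).
SELECT region, COUNT(*) AS sale_cnt
FROM Sales
GROUP BY region;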

HAVING: Next, the HAVING filter is applied. Only groups for which the HAVING conditions evaluate to TRUE are selected. If a group does not satisfy the conditions, then the whole group is removed from the result set. If there is no GROUP BY, the entire result set is treated as one group.

SELECT: Here it is, the SELECT list! The SELECT list is processed only at this point, even though it is the first line in the query. At this time column aliases are added. Steps prior to this one cannot use the column aliases; only the following steps see them. This is the step that defines the columns of the result set.

If there is a SELECT DISTINCT, then the duplicate rows are removed. Just like in GROUP BY, NULLs are treated as matching.

ORDER BY: The rows of the result set are sorted according to the column list specified in the ORDER BY clause. Only ORDER BY can guarantee a sort order for the returned rows; otherwise tables are unordered sets. At this step the result set is transformed into a cursor. Also, NULLs are considered equal for sorting. The ANSI standard does not allow ordering by columns that are not included in the SELECT list, but SQL Server allows doing that (even by expressions based on those columns). Columns in the ORDER BY can be referred to by alias or by their ordinal number.

A couple more notes. CUBE and ROLLUP are processed after GROUP BY; at that point the super-groups are added to the result set. Also, TOP (which is T-SQL specific) is processed after ORDER BY, and there the specified number or percentage of rows is selected into the final result set.

For nested queries, the innermost queries can reference columns and tables in the queries in which they are contained.
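
To put it all together, here is a minimal sketch of a query annotated with the logical processing order described above (the Customers and Orders tables are hypothetical; DISTINCT, phase 8, is not used here):

CREATE TABLE Customers (
  customer_nbr INT PRIMARY KEY,
  region VARCHAR(10));

CREATE TABLE Orders (
  order_nbr INT PRIMARY KEY,
  customer_nbr INT);

SELECT TOP (5)                          -- 7. SELECT: aliases added; 10. TOP last
       C.customer_nbr,
       COUNT(*) AS order_cnt
FROM Customers AS C                     -- 1. FROM: Cartesian product of the tables
JOIN Orders AS O                        -- 3. OUTER: skipped (this is an inner join)
  ON O.customer_nbr = C.customer_nbr    -- 2. ON: join filter applied
WHERE C.region = 'East'                 -- 4. WHERE: row filter (TRUE rows only)
GROUP BY C.customer_nbr                 -- 5. GROUP BY: rows divided into groups
HAVING COUNT(*) > 1                     -- 6. HAVING: group filter
ORDER BY order_cnt DESC;                -- 9. ORDER BY: sort; the alias is visible here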

Bulk Loading Images in SQL Server

Loading images or any binary files (like Adobe PDF documents) into a database is not always the best option, but sometimes it is needed. In SQL Server 2005 this process is simplified a lot with the BULK option of OPENROWSET. Here is an example of inserting image data into a VARBINARY(MAX) column (the same applies to loading an Adobe PDF file or any other binary file):

CREATE TABLE Foobar (
  image_data VARBINARY(MAX));

INSERT INTO Foobar (image_data)
SELECT image_data
FROM OPENROWSET(
       BULK N'C:\image.jpg',
       SINGLE_BLOB) AS ImageSource(image_data);

Note the use of the SINGLE_BLOB option, which reads the entire file and returns it as a single-row, single-column rowset of type VARBINARY(MAX).
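
The same technique can be used to refresh an existing row. Here is a minimal sketch against the Foobar table above (the file path is the same assumption as in the insert example):

-- With no WHERE clause this replaces the image data in every row;
-- add a key filter to target a specific row.
UPDATE Foobar
SET image_data = (SELECT image_data
                  FROM OPENROWSET(
                         BULK N'C:\image.jpg',
                         SINGLE_BLOB) AS ImageSource(image_data));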

Keeping History Data in SQL Server

Very often there is a need to archive data on a daily basis. I just had that question today, and here is a trimmed-down solution based on a recent project.

The task is to keep daily history of changed data only (that is, only rows that have changed during the last day). The scenario is a Loan table with a loan number and a loan amount. If there is a change in any of the columns in the table, that change needs to be inserted into a history table. Further, the history table should provide a convenient way to query the latest data (the most recently updated row) for reporting purposes.

Here is the code to that solution:

-- Create sample tables.

CREATE TABLE Loans (
  loan_nbr INTEGER NOT NULL PRIMARY KEY,
  loan_amount DECIMAL(12, 2) DEFAULT 0.0 NOT NULL);

CREATE TABLE LoansHistory (
  loan_nbr INTEGER NOT NULL,
  loan_amount DECIMAL(12, 2) NOT NULL,
  change_date DATETIME
              DEFAULT CAST(DATEDIFF(day, 0, CURRENT_TIMESTAMP) AS DATETIME)
              NOT NULL,
  PRIMARY KEY (loan_nbr, change_date),
  archive_date DATETIME NULL,
  CHECK (change_date < archive_date));

GO

-- Sample data.

INSERT INTO Loans VALUES (1, 100.00);
INSERT INTO Loans VALUES (2, 150.00);
INSERT INTO Loans VALUES (3, 120.00);
INSERT INTO Loans VALUES (4, 160.00);

INSERT INTO LoansHistory VALUES (1, 100.00, '20070501', NULL);
INSERT INTO LoansHistory VALUES (2, 95.00, '20070503', NULL);
INSERT INTO LoansHistory VALUES (3, 80.00, '20070506', '20070508');
INSERT INTO LoansHistory VALUES (3, 85.00, '20070508', NULL);

GO

-- View to use for latest loans.

CREATE VIEW CurrentLoans (loan_nbr, loan_amount)
AS
SELECT loan_nbr, loan_amount
FROM LoansHistory
WHERE archive_date IS NULL;

GO

-- Insert changed rows.

-- SQL Server 2005.

INSERT INTO LoansHistory (loan_nbr, loan_amount)
SELECT loan_nbr, loan_amount
FROM Loans
EXCEPT
SELECT loan_nbr, loan_amount
FROM CurrentLoans;

-- SQL Server 2000.

INSERT INTO LoansHistory (loan_nbr, loan_amount)
SELECT loan_nbr, loan_amount
FROM ( SELECT 'Live' AS Source,
              loan_nbr,
              loan_amount
       FROM Loans
       UNION ALL
       SELECT 'History' AS Source,
              loan_nbr,
              loan_amount
       FROM CurrentLoans ) AS L
GROUP BY loan_nbr, loan_amount
HAVING COUNT(*) = 1
   AND MIN(Source) = 'Live';

-- Update archive date for old history.
-- Runs in the same transaction as the insert of new history.

UPDATE LoansHistory
SET archive_date = CAST(DATEDIFF(day, 0, CURRENT_TIMESTAMP) AS DATETIME)
WHERE archive_date IS NULL
  AND EXISTS (SELECT *
              FROM LoansHistory AS H
              WHERE H.loan_nbr = LoansHistory.loan_nbr
                AND H.change_date > LoansHistory.change_date);

GO

-- Use the view for all reporting purposes.

SELECT loan_nbr, loan_amount
FROM CurrentLoans
ORDER BY loan_nbr;

-- Loans as of 2007-05-08.

SELECT loan_nbr, loan_amount
FROM LoansHistory
WHERE change_date <= '20070508'
  AND COALESCE(archive_date, CURRENT_TIMESTAMP) > '20070508';

The process is fairly simple and relies on keeping two dates: a change date for when the last change occurred (that is the date when the row was inserted into history), and an archive date which indicates that the row is archived and there is another, more recent change. If the archive date is NULL, the row represents the most recent change. To simplify querying, both dates have the time portion set to midnight.
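
As the comment in the script notes, the insert of new history rows and the archiving update must run in the same transaction, so readers never see an inconsistent state. A minimal sketch (error handling elided, using the SQL Server 2005 insert):

BEGIN TRAN;

INSERT INTO LoansHistory (loan_nbr, loan_amount)
SELECT loan_nbr, loan_amount
FROM Loans
EXCEPT
SELECT loan_nbr, loan_amount
FROM CurrentLoans;

UPDATE LoansHistory
SET archive_date = CAST(DATEDIFF(day, 0, CURRENT_TIMESTAMP) AS DATETIME)
WHERE archive_date IS NULL
  AND EXISTS (SELECT *
              FROM LoansHistory AS H
              WHERE H.loan_nbr = LoansHistory.loan_nbr
                AND H.change_date > LoansHistory.change_date);

COMMIT TRAN;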

Cleaning Data with Recursive CTE

SQL Server 2005 added a great new feature: Common Table Expressions (CTEs). And even better than that – recursive CTEs. They provide a powerful new tool to solve many SQL problems. One of the areas where recursive CTEs shine is managing hierarchical data.

Here is another side of recursive CTEs – utilizing them for common tasks like cleaning data. The problem: a table has a column with values that contain invalid characters. The task is to replace all those invalid characters with a space. Unfortunately, the REPLACE function does not support pattern matching, so each character in the column has to be verified individually and replaced if it falls in the invalid range. The solution below utilizes a recursive CTE to walk through the ASCII table of characters and replace the invalid characters in the column values.

-- Create test table.

CREATE TABLE Foobar (
  key_col INT PRIMARY KEY,
  text_col NVARCHAR(100));

-- Populate sample data.

INSERT INTO Foobar VALUES (1, N'ABC!@#%DEFgh');
INSERT INTO Foobar VALUES (2, N'~!102WXY&*()_Z');

-- Perform the cleanup with recursive CTE.

WITH Clean (key_col, text_col, ch)
AS
(SELECT key_col,
        REPLACE(text_col, CHAR(255), ' '),
        255
 FROM Foobar
 UNION ALL
 SELECT key_col,
        CASE WHEN CHAR(ch - 1) NOT LIKE '[A-Z]'
             THEN REPLACE(text_col, CHAR(ch - 1), ' ')
             ELSE text_col END,
        ch - 1
 FROM Clean
 WHERE ch > 1)
SELECT key_col, text_col
FROM Clean
WHERE ch = 1
OPTION (MAXRECURSION 255);

On a side note – recursive CTEs are not the best performers. Also, by default a CTE allows only 100 levels of recursion. The MAXRECURSION hint can be used to set a higher limit (a value between 0 and 32767; setting it to 0 removes the limit). Be aware that setting MAXRECURSION to 0 may create an infinite loop.

Here is a different method, using a utility table of numbers and FOR XML PATH, which is more efficient:

WITH Clean (key_col, text_col)
AS
(SELECT key_col, REPLACE(CAST(
        (SELECT CASE
                  WHEN SUBSTRING(text_col, n, 1) LIKE '[A-Z]'
                  THEN SUBSTRING(text_col, n, 1)
                  ELSE '.'
                END
         FROM (SELECT number
               FROM master..spt_values
               WHERE type = 'P'
                 AND number BETWEEN 1 AND 100) AS Nums(n)
         WHERE n <= LEN(text_col)
         FOR XML PATH('')) AS NVARCHAR(100)), '.', ' ')
 FROM Foobar)
SELECT key_col, text_col
FROM Clean;
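
Both queries above only return the cleaned values. To persist them, the CTE can feed an UPDATE joined back on the key column; a minimal sketch based on the numbers-table version:

-- The CTE body is the same as in the numbers-table query above.
WITH Clean (key_col, text_col)
AS
(SELECT key_col, REPLACE(CAST(
        (SELECT CASE
                  WHEN SUBSTRING(text_col, n, 1) LIKE '[A-Z]'
                  THEN SUBSTRING(text_col, n, 1)
                  ELSE '.'
                END
         FROM (SELECT number
               FROM master..spt_values
               WHERE type = 'P'
                 AND number BETWEEN 1 AND 100) AS Nums(n)
         WHERE n <= LEN(text_col)
         FOR XML PATH('')) AS NVARCHAR(100)), '.', ' ')
 FROM Foobar)
UPDATE F
SET text_col = C.text_col
FROM Foobar AS F
JOIN Clean AS C
  ON C.key_col = F.key_col;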