Posts

Shortest Path for Friend Connections

An interesting problem to solve is finding relation paths in contact management systems. This is a limited case of the Dijkstra’s algorithm for finding the shortest path in a graph. Here we always have cost of 1 for each path and it is a two-way relation between the nodes. To put this in human readable format, the problem is to find the distance between friends, based on relationships defined. If A is friend with B and then B is friend with C, the path between A and C is A->B->C with distance 2.

Here is one solution using recursive CTEs in SQL Server 2005. The method is based on building relationship paths between all connected nodes and then searching the path for friend connections. If this searching if frequent, then the path can be materialized in a column.

-- Sample table with data

CREATE TABLE Contacts (

 c_from CHAR(1),

 c_to CHAR(1),

 PRIMARY KEY (c_from, c_to));

 

INSERT    INTO Contacts

SELECT    'A', 'B' UNION ALL

SELECT    'B', 'D' UNION ALL

SELECT    'C', 'A' UNION ALL

SELECT    'C', 'E' UNION ALL

SELECT    'G', 'C' UNION ALL

SELECT    'B', 'G' UNION ALL

SELECT    'F', 'D' UNION ALL

SELECT    'E', 'F';

 

-- Table to store paths

CREATE TABLE Paths (c_path VARCHAR(200) PRIMARY KEY);

 

-- Recursive CTE to populate the paths

WITH PathCTE

AS

(SELECT c_from, c_to,

        CAST('.' + CAST(c_from AS VARCHAR(8)) + '.' +

        CAST(c_to AS VARCHAR(8)) + '.' AS VARCHAR(200)) AS c_path

 FROM Contacts AS C1

 UNION ALL

 SELECT C.c_from, C.c_to,

        CAST(P.c_path + C.c_to + '.' AS VARCHAR(200))

 FROM PathCTE AS P

 JOIN Contacts AS C

   ON P.c_to = C.c_from

 WHERE P.c_path NOT LIKE '%.' +

                    CAST(C.c_from AS VARCHAR(10)) +

                    '.' +

                    CAST(C.c_to AS VARCHAR(10)) +

                    '.%')

INSERT INTO Paths

SELECT c_path FROM PathCTE;

 

-- Show all paths between B and D

SELECT c_path

FROM Paths

WHERE c_path LIKE '.B.%'

  AND c_path LIKE '%.D.';

 

-- Shortest path distance, longest path distance, and number of paths

SELECT MIN(LEN(c_path) - LEN(REPLACE(c_path, '.', '')) - 2) AS shortest_distance,

       MAX(LEN(c_path) - LEN(REPLACE(c_path, '.', '')) - 2) AS longest_distance,

       COUNT(*) AS paths_cnt

FROM Paths

WHERE c_path LIKE '.B.%'

  AND c_path LIKE '%.D.';

 

-- Results

c_path

--------------

.B.D.

.B.G.C.A.B.D.

.B.G.C.E.F.D.

 

 

shortest_distance  longest_distance  paths_cnt

----------------- ---------------- -----------

1                  5                3

It is good to note that this method does not make effort to avoid reusing paths to reach a destination. If needed this can be handled by additional condition in the recursive CTE.

Cleaning Data with Recursive CTE

SQL Server 2005 added a great new feature: Common Table Expressions (CTE). And even better than that – recursive CTEs. That provides a new powerful tool to solve many SQL problems. One of the areas where recursive CTEs shine is the hierarchical data management.

Here is another side of the recursive CTEs – utilizing them for some common tasks like cleaning data. The problem: a table has a column with values that have invalid characters. The task is to replace all those invalid characters with a space. Unfortunately the REPLACE function does not support pattern matching and each character in the column has to be verified individually and replaced if it falls in the invalid range. The solution below utilizes a recursive CTE to walk though the ACSII table of characters and to replace the invalid characters in the column values.

-- Create test table.

 

CREATE TABLE Foobar (

  key_col INT PRIMARY KEY,

  text_col NVARCHAR(100));

 

-- Populate sample data.

 

INSERT INTO Foobar VALUES (1, N'ABC!@#%DEFgh');

INSERT INTO Foobar VALUES (2, N'~!102WXY&*()_Z');

 

-- Perform the cleanup with recursive CTE.

 

WITH Clean (key_col, text_col, ch)

AS

(SELECT key_col,

        REPLACE(text_col, CHAR(255), ' '),

        255

 FROM Foobar

 UNION ALL

 SELECT key_col,

        CASE WHEN

            CHAR(ch - 1) NOT LIKE '[A-Z]'

            THEN REPLACE(text_col, CHAR(ch - 1), ' ')

            ELSE text_col END,

        ch - 1

 FROM Clean

 WHERE ch > 1)

SELECT key_col, text_col

FROM Clean

WHERE ch = 1

OPTION (MAXRECURSION 255);

On a side note – the recursive CTEs are not the best performers. Also, by default a CTE allows only 100 levels of recursion. The MAXRECURSION hint can be used to set higher level (a value between 0 and 32767; setting to 0 will remove the limit). Be aware that settings MAXRECURSION to 0 may create an infinite loop.

Here is a different method using utility table with numbers and FOR XML PATH, which is more effective:

WITH Clean (key_col, text_col)

AS

(SELECT key_col, REPLACE(CAST(

        (SELECT CASE

                  WHEN SUBSTRING(text_col, n, 1) LIKE '[A-Z]'

                  THEN SUBSTRING(text_col, n, 1)

                  ELSE '.'

                END

        FROM (SELECT number

               FROM master..spt_values

               WHERE type = 'P'

                AND number BETWEEN 1 AND 100) AS Nums(n)

        WHERE n <= LEN(text_col)

        FOR XML PATH('')) AS NVARCHAR(100)), '.', ' ')

 FROM Foobar)

SELECT key_col, text_col

FROM Clean;