Recursive SQL Queries In Practice

Posted on December 20, 2022

⚠️ Trigger Warning ⚠️

The article contains fairly convoluted SQL queries. If you experienced SQL-related trauma you might want to skip this one.

Ctx

In a business context you will regularly have scenarios in which you need to track relationships between entities.

In our specific case we had a system whose only purpose was tracking relationships between entities. This service would regularly receive support requests where users were surprised by the data returned by this service.

It was standard practice for an engineer to take the input data from the support ticket and manually inspect the database. They would try to “figure out” what was the overall state relevant for the user queries and report back to the users. This was obviously very time consuming and distracting.

I’ll take all the context you have!

On the first opportunity of a lower-load week I decided to test that postgres feature that “you’ll for sure never use in a real-world application”.

💥 RECURSIVE SQL QUERIES 💥 (link to documentation)

The basic idea was to let the user specify “starting points” and then to recursively explore the relationship graph for any data that is in any way reachable (see: transitive closure).

Execution

For illustrative purposes we’ll use the following domain:

Table 1 `patient_met_doctor`

patient	doctor	…other fields
Sarah	dr. John	…
Mike	dr. Dre	…
Susan	dr. Frege	…

Table 2 `doctor_used_workstation`

doctor	workstation	…other fields
dr. John	station_1	…
dr. Dre	station_1	…
dr. Frege	station_3	…

The two tables together represent the following graph:

our goal is to be able to input any one of the values {"Sarah", "dr. John", "station_1", "Mike", "dr. Dre"} and always get back a list of all the four edges that connect all of them together ({A,B,D,E}).

THE Query

-- tipe - can be either 'p2d' or 'd2w'
-- lft - left identifier
-- rgt - right identifier
WITH RECURSIVE all_edges(tipe, lft, rgt) AS (
  -- INITIAL TERMS
  (
    SELECT 'p2d', patient, doctor FROM patient_met_doctor WHERE (patient = ANY($1) OR doctor = ANY($2))
    UNION
    SELECT 'd2w', doctor, workstation FROM doctor_used_workstation WHERE (doctor = ANY($2) or workstation = ANY($3))
  )
  UNION
  -- RECURSIVE TERMS
  (
    WITH
      edges AS (SELECT * FROM all_edges), -- https://dba.stackexchange.com/a/240309
      edges_p2d AS (SELECT * FROM edges WHERE tipe = 'p2d'),
      edges_d2w AS (SELECT * FROM edges WHERE tipe = 'd2w'),
      active_patients(pat_id) AS (SELECT lft FROM edges_p2d),
      active_doctors(dr_id) AS (SELECT rgt FROM edges_p2d UNION SELECT lft FROM edges_d2w),
      active_workstations(ws_id) AS (SELECT rgt FROM edges_d2w)
    -- PATIENT ADJACENT EDGES
    (SELECT 'p2d', patient, doctor FROM patient_met_doctor pmd JOIN active_patients ap ON pmd.patient = ap.pat_id)
    UNION
    -- WORKSTATION ADJACENT EDGES
    (SELECT 'd2w', doctor, workstation FROM doctor_used_workstation duw JOIN active_workstations aw ON duw.workstation = aw.ws_id)
    -- DOCTOR LEFT-ADJACENT EDGES
    UNION
    (SELECT 'p2d', patient, doctor FROM patient_met_doctor pmd JOIN active_doctors ad ON pmd.doctor = ad.dr_id)
    -- DOCTOR RIGHT-ADJACENT EDGES
    UNION
    (SELECT 'd2w', doctor, workstation FROM doctor_used_workstation duw JOIN active_doctors ad ON duw.doctor = ad.dr_id)
  )
)
SELECT * FROM all_edges;

The query accepts 3 string arrays as input. The output is an array of edges tagged with either p2d or d2w.

All code + Example invocation

create table patient_met_doctor
( patient text not null,
  doctor text not null
);

create table doctor_used_workstation
( doctor text not null,
 workstation text not null
 );

insert into patient_met_doctor(patient,doctor) values
( 'Sarah', 'dr. John'),
( 'Mike', 'dr. Dre'),
( 'Susan', 'dr. Frege');

insert into doctor_used_workstation values
('dr. John', 'station_1'),
('dr. Dre', 'station_1'),
('dr. Frege', 'station_2');

PREPARE exploreGraph(text[], text[], text[]) AS
WITH RECURSIVE all_edges(tipe, lft, rgt) AS (
  -- INITIAL TERMS
  (
    SELECT 'p2d', patient, doctor FROM patient_met_doctor WHERE (patient = ANY($1) OR doctor = ANY($2))
    UNION
    SELECT 'd2w', doctor, workstation FROM doctor_used_workstation WHERE (doctor = ANY($2) or workstation = ANY($3))
  )
  UNION
  -- RECURSIVE TERMS
  (
    WITH
      edges AS (SELECT * FROM all_edges), -- https://dba.stackexchange.com/a/240309
      edges_p2d AS (SELECT * FROM edges WHERE tipe = 'p2d'),
      edges_d2w AS (SELECT * FROM edges WHERE tipe = 'd2w'),
      active_patients(pat_id) AS (SELECT lft FROM edges_p2d),
      active_doctors(dr_id) AS (SELECT rgt FROM edges_p2d UNION SELECT lft FROM edges_d2w),
      active_workstations(ws_id) AS (SELECT rgt FROM edges_d2w)
    -- PATIENT ADJACENT EDGES
    (SELECT 'p2d', patient, doctor FROM patient_met_doctor pmd JOIN active_patients ap ON pmd.patient = ap.pat_id)
    UNION
    -- WORKSTATION ADJACENT EDGES
    (SELECT 'd2w', doctor, workstation FROM doctor_used_workstation duw JOIN active_workstations aw ON duw.workstation = aw.ws_id)
    -- DOCTOR LEFT-ADJACENT EDGES
    UNION
    (SELECT 'p2d', patient, doctor FROM patient_met_doctor pmd JOIN active_doctors ad ON pmd.doctor = ad.dr_id)
    -- DOCTOR RIGHT-ADJACENT EDGES
    UNION
    (SELECT 'd2w', doctor, workstation FROM doctor_used_workstation duw JOIN active_doctors ad ON duw.doctor = ad.dr_id)
  )
)
SELECT * FROM all_edges;

-- find any entity transitively related to the patient 'Mike'
EXECUTE exploreGraph('{"Mike"}', '{}', '{}');

Running this yields:

postgres=# \i ./query.sql 
CREATE TABLE
CREATE TABLE
INSERT 0 3
INSERT 0 3
PREPARE
 tipe |   lft    |    rgt    
------+----------+-----------
 p2d  | Mike     | dr. Dre
 d2w  | dr. Dre  | station_1
 d2w  | dr. John | station_1
 p2d  | Sarah    | dr. John
(4 rows)

Back to the real world

In our application the identifiers tend to have a disjunct format so we simply show the user a single <textarea>, treat each line as a separate input and we replicate it into each input array (even though the user looses a bit on expressibility I found the simplified UI is well worth the trade off).

In the real world things can get weird:

Results

Initially our goal was to build a tool for ourselves as we didn’t want to continously have to ask for production cluster access, but soon the users found out about it and the support requests started to regularly look like this:

(which obviously made me very happy 😊)

Screenshots from this UI are now regularly being used in chat threads to communicate even the most basic scenarios.

Did it ever go wrong?

We obviously forgot to limit the input data and one day we were notified about a query running for 6 hours.

In short it’s just a matter of time before some Björn discovers your nice little debugging tool and copies a 2k line long Excel file into the input <textarea> (in this case it happened ~2 years after the feature was released).

Recursive SQL Queries In Practice

⚠️ Trigger Warning ⚠️

Ctx