Using recursive CTEs for calculating taxonomic lineages
Fri 07 July 2017 — Filed under sql; tags: postgresql, taxonomyThis topic falls into the category of "things that I didn't know that I didn't know." Many of my projects involve the manipulation of taxonomies, particularly the NCBI taxonomy, but I didn't understand the efficiency with which it was possible to calculate a lineage in SQL. I recently revisited the issue, and here's what I learned.
The NCBI taxonomy is provided as a collection of tables that can be
downloaded in its entirety by FTP. There are two tables in the
taxonomy database that we care about for the purposes of this
post. The first is nodes
. The table from the NCBI taxonomy contains
some additional fields that aren't important in this context, but at
its core, it looks something like this:
create table nodes ( tax_id integer, parent integer, rank text );
The second table that we will use later is called names
, with a
schema that (in its simplified form) looks like this:
create table names ( tax_id integer, tax_name text, is_primary boolean );
Multiple synonyms are allowed for each tax_id
; the column
names.is_primary
indicates which is the definitive name for the
organism (this isn't present in the table downloaded from NCBI but is
derived from other attributes).
The taxonomic hierarchy is defined in nodes
using an adjacency list
model. It turns out that relational databases supporting recursive
queries can easily traverse the graph defined by the model for a given
node. This is well-described elsewhere (the linked post compares the
adjacency model with other representations of trees in sql).
The query that we will use is called a recursive CTE (common table expression):
WITH RECURSIVE a AS ( SELECT tax_id, parent_id, rank FROM nodes WHERE tax_id = '1280' UNION ALL SELECT p.tax_id, p.parent_id, p.rank FROM a JOIN nodes p ON a.parent_id = p.tax_id ) SELECT * FROM a;
tax_id | parent_id | rank |
---|---|---|
1280 | 1279 | species |
1279 | 90964 | genus |
90964 | 1385 | family |
1385 | 91061 | order |
91061 | 1239 | class |
1239 | 1783272 | phylum |
1783272 | 2 | below_superkingdom |
2 | 131567 | superkingdom |
131567 | 1 | below_root |
1 | root |
And voila: we have traversed the tree from leaf to root, and in doing so, the lineage is defined. I'm using postgresql for this example, but sqlite supports recursive CTEs as well, and an identical query can be used with both databases.
At this point, we can easily add taxonomic names as well:
WITH RECURSIVE a AS ( SELECT tax_id, parent_id, rank FROM nodes WHERE tax_id = '1280' UNION ALL SELECT p.tax_id, p.parent_id, p.rank FROM a JOIN nodes p ON a.parent_id = p.tax_id ) SELECT a.tax_id, a.rank, names.tax_name FROM a JOIN names USING(tax_id) WHERE is_primary;
tax_id | rank | tax_name |
---|---|---|
1280 | species | Staphylococcus aureus |
1279 | genus | Staphylococcus |
90964 | family | Staphylococcaceae |
1385 | order | Bacillales |
91061 | class | Bacilli |
1239 | phylum | Firmicutes |
1783272 | below_superkingdom | Terrabacteria group |
2 | superkingdom | Bacteria |
131567 | below_root | cellular organisms |
1 | root | root |
It's worth pausing to point out an important feature of the
representation of the adjacency list model of the nodes
table
downloaded from NCBI: the root is identified as the node for which
tax_id = 1
, but the value of parent_id
for the root node is also
1! The recursive CTE as written expects a value of NULL in the
parent_id
field of the root node; with a value of 1, it will never
terminate, and your laptop will get quite hot! So do yourself a favor
and execute the following before trying this as home:
UPDATE nodes set parent_id = NULL where tax_id = 1;
As I mentioned, all of this is pretty well documented (as I learned once I bothered to look). But what about a query for multiple tax_ids at once? Is it as easy as including more than one in the query?
WITH RECURSIVE a AS ( SELECT tax_id, parent_id, rank FROM nodes WHERE tax_id in ('562', '1280') UNION ALL SELECT p.tax_id, p.parent_id, p.rank FROM a JOIN nodes p ON a.parent_id = p.tax_id ) SELECT a.tax_id, a.rank, names.tax_name FROM a JOIN names USING(tax_id) WHERE is_primary;
tax_id | rank | tax_name |
---|---|---|
1280 | species | Staphylococcus aureus |
562 | species | Escherichia coli |
1279 | genus | Staphylococcus |
561 | genus | Escherichia |
90964 | family | Staphylococcaceae |
543 | family | Enterobacteriaceae |
1385 | order | Bacillales |
91347 | order | Enterobacterales |
91061 | class | Bacilli |
1236 | class | Gammaproteobacteria |
1239 | phylum | Firmicutes |
1224 | phylum | Proteobacteria |
1783272 | below_superkingdom | Terrabacteria group |
2 | superkingdom | Bacteria |
2 | superkingdom | Bacteria |
131567 | below_root | cellular organisms |
131567 | below_root | cellular organisms |
1 | root | root |
1 | root | root |
Hmm, no, that doesn't work so well. Even if the ordering was right, we would find it difficult to keep track of the individual lineages. So maybe we can keep track?
WITH RECURSIVE a AS ( SELECT tax_id as tid, 1 as ord, tax_id, parent_id, rank FROM nodes WHERE tax_id in ('562', '1280') UNION ALL SELECT a.tid, a.ord + 1, p.tax_id, p.parent_id, p.rank FROM a JOIN nodes p ON a.parent_id = p.tax_id ) SELECT a.tid, a.ord, a.tax_id, a.rank, names.tax_name FROM a JOIN names USING(tax_id) WHERE is_primary ORDER BY tid, ord;
tid | ord | tax_id | rank | tax_name |
---|---|---|---|---|
1280 | 1 | 1280 | species | Staphylococcus aureus |
1280 | 2 | 1279 | genus | Staphylococcus |
1280 | 3 | 90964 | family | Staphylococcaceae |
1280 | 4 | 1385 | order | Bacillales |
1280 | 5 | 91061 | class | Bacilli |
1280 | 6 | 1239 | phylum | Firmicutes |
1280 | 7 | 1783272 | below_superkingdom | Terrabacteria group |
1280 | 8 | 2 | superkingdom | Bacteria |
1280 | 9 | 131567 | below_root | cellular organisms |
1280 | 10 | 1 | root | root |
562 | 1 | 562 | species | Escherichia coli |
562 | 2 | 561 | genus | Escherichia |
562 | 3 | 543 | family | Enterobacteriaceae |
562 | 4 | 91347 | order | Enterobacterales |
562 | 5 | 1236 | class | Gammaproteobacteria |
562 | 6 | 1224 | phylum | Proteobacteria |
562 | 7 | 2 | superkingdom | Bacteria |
562 | 8 | 131567 | below_root | cellular organisms |
562 | 9 | 1 | root | root |
Ok, that's better! If we like, we can group and aggregate the lineages:
(The results are shown using the equivalent of psql -x
; the first
line just provides a header row.)
SELECT 'value' as key; WITH RECURSIVE a AS ( SELECT tax_id as tid, 1 as ord, tax_id, parent_id, rank FROM nodes WHERE tax_id in ('562', '1280') UNION ALL SELECT a.tid, a.ord + 1, p.tax_id, p.parent_id, p.rank FROM a JOIN nodes p ON a.parent_id = p.tax_id ) SELECT a.tid, array_agg(a.tax_id), array_agg(a.rank), array_agg(names.tax_name) FROM a JOIN names USING(tax_id) WHERE is_primary GROUP BY tid;
key | value |
---|---|
tid | 1280 |
array_agg | {1280,1279,90964,1385,91061,1239,1783272,2,131567,1} |
array_agg | {species,genus,family,order,class,phylum,below_superkingdom,superkingdom,below_root,root} |
array_agg | {"Staphylococcus aureus",Staphylococcus,Staphylococcaceae,Bacillales,Bacilli,Firmicutes,"Terrabacteria group",Bacteria,"cellular organisms",root} |
tid | 562 |
array_agg | {562,1,561,1224,543,131567,91347,2,1236} |
array_agg | {species,root,genus,phylum,family,below_root,order,superkingdom,class} |
array_agg | {"Escherichia coli",root,Escherichia,Proteobacteria,Enterobacteriaceae,"cellular organisms",Enterobacterales,Bacteria,Gammaproteobacteria} |
Here's another approach: accumulate values in an array in the recursive expression, and then use the first value in the array to identify the first tax_id in each lineage.
WITH RECURSIVE a AS ( SELECT tax_id as tid, parent_id as pid, rank, tax_id, ARRAY[tax_id] as lineage FROM nodes WHERE tax_id = '1280' UNION ALL SELECT p.tax_id, p.parent_id, p.rank, lineage[1], lineage || ARRAY[p.tax_id] FROM a JOIN nodes p ON a.pid = p.tax_id ) SELECT * FROM a;
tid | pid | rank | tax_id | lineage |
---|---|---|---|---|
1280 | 1279 | species | 1280 | {1280} |
1279 | 90964 | genus | 1280 | {1280,1279} |
90964 | 1385 | family | 1280 | {1280,1279,90964} |
1385 | 91061 | order | 1280 | {1280,1279,90964,1385} |
91061 | 1239 | class | 1280 | {1280,1279,90964,1385,91061} |
1239 | 1783272 | phylum | 1280 | {1280,1279,90964,1385,91061,1239} |
1783272 | 2 | below_superkingdom | 1280 | {1280,1279,90964,1385,91061,1239,1783272} |
2 | 131567 | superkingdom | 1280 | {1280,1279,90964,1385,91061,1239,1783272,2} |
131567 | 1 | below_root | 1280 | {1280,1279,90964,1385,91061,1239,1783272,2,131567} |
1 | root | 1280 | {1280,1279,90964,1385,91061,1239,1783272,2,131567,1} |
We only actually care about the final lineage when it terminates at the root node.
WITH RECURSIVE a AS ( SELECT tax_id as tid, parent_id as pid, rank, tax_id, ARRAY[tax_id] as lineage FROM nodes WHERE tax_id = '1280' UNION ALL SELECT p.tax_id, p.parent_id, p.rank, lineage[1], lineage || ARRAY[p.tax_id] FROM a JOIN nodes p ON a.pid = p.tax_id ) SELECT tax_id, lineage FROM a WHERE a.rank = 'root';
tax_id | lineage |
---|---|
1280 | {1280,1279,90964,1385,91061,1239,1783272,2,131567,1} |
With some embellishment, we can see how this strategy can be used to retrieve lineages plus additional annotation for multiple tax_ids (at arbitrary ranks) as once.
SELECT 'value' as key; WITH RECURSIVE a AS ( SELECT tax_id as tid, parent_id as pid, rank, tax_id, ARRAY[tax_id] as lineage, ARRAY[rank] as ranks FROM nodes WHERE tax_id in ('1279', '1280', '562') UNION ALL SELECT p.tax_id, p.parent_id, p.rank, lineage[1], lineage || ARRAY[p.tax_id], ranks || ARRAY[p.rank] FROM a JOIN nodes p ON a.pid = p.tax_id ) SELECT tax_id, tax_name, nodes.rank, lineage, ranks FROM a JOIN nodes USING(tax_id) JOIN names USING(tax_id) WHERE a.rank = 'root' AND names.is_primary;
key | value |
---|---|
tax_id | 1279 |
tax_name | Staphylococcus |
rank | genus |
lineage | {1279,90964,1385,91061,1239,1783272,2,131567,1} |
ranks | {genus,family,order,class,phylum,below_superkingdom,superkingdom,below_root,root} |
tax_id | 562 |
tax_name | Escherichia coli |
rank | species |
lineage | {562,561,543,91347,1236,1224,2,131567,1} |
ranks | {species,genus,family,order,class,phylum,superkingdom,below_root,root} |
tax_id | 1280 |
tax_name | Staphylococcus aureus |
rank | species |
lineage | {1280,1279,90964,1385,91061,1239,1783272,2,131567,1} |
ranks | {species,genus,family,order,class,phylum,below_superkingdom,superkingdom,below_root,root} |
How well does this scale? Both of the approaches executed in less than one second for 1000 tax_id's, but took more like 40s for 2000, so not fantastically well. But I expect the some optimization is possible.
Unfortunately, in the absence of an ARRAY data type, this approach does not work for sqlite, but you could do something similar by concatenating strings (though I doubt that this would be very efficient).