• Edit 2015-07-26: added a section on date formatting
  • Edit 2016-07-11: add reference to Tidy Data

This post contains the content of a lecture prepared for Laboratory Medicine residents as part of a series on medical informatics.

Objectives

  • be aware that relational databases are a thing and can be described using some terms.
  • get to know the different types of relations.
  • see some examples of SQL in action.
  • make a connection between the above and some useful habits (and harmful anti-patterns) for organizing laboratory and experimental data.

Relational database terminology

relational database
a tabular database with defined "relations" (which can be described in terms of relational algebra) among elements within and between tables.
row
contains a tuple or record representing a single item.
column
represents an attribute or field and corresponds to a labeled element of a record; all elements within a column are of the same data type.
table
a set of rows and columns.

984px-Relational_database_terms.svg.png

Figure 1: Components of a table from a relational database. Source: wikipedia

primary key
a field (or multiple fields in combination) that uniquely identify a row. Can either be natural (i.e., consisting of attributes that already exist in the table), or surrogate (i.e., a new attribute with no "real world" meaning created for the purpose). An example of a combination of attributes serving as a primary key: name + DOB
pkey mrn name dob sex
1 U123456 Roger Rabbit 1939-04-01 M
2 U654321 Jessica Krupnick 1924-10-27 F
index
a data structure that improves the efficiency of data manipulation by reorganizing or clustering elements, typically as a tree.
normalization
when data is arranged to eliminate redundancy by dividing fields into separate tables, a database is said to be normalized. There is a tradeoff between normalization and complexity: a database may not be fully normalized to make queries simpler or more efficient (eg, by requiring fewer joins).

Relations

one to one

  • describes the relationship between two or more elements in which each element uniquely identifies the other (likely to be true only in a specific context).
  • represented by elements in the same row
  • eg, name <–> mrn (but think about whether this is always true!)

Graphically:

one_to_one.png

And in our database:

select mrn, name from patients;
mrn name
U123456 Roger Rabbit
U654321 Jessica Krupnick

one to many

  • in the context of a database, is represented by a pair of tables, in which an element or row can be related to multiple rows in another table

one_to_many.png

In our database:

select name, mrn, accession
from patients
join results using(mrn)
group by accession
order by mrn;
name mrn accession
Roger Rabbit U123456 Acc1
Roger Rabbit U123456 Acc2
Jessica Krupnick U654321 Acc3
Jessica Krupnick U654321 Acc4

many to many

  • an entity in table A maps to more than one entity in table B, and vice-versa.
  • a many-to-many relation between two tables is represented via a third junction table or cross-reference table.
  • for example, each patient may see more than one physician, and each physician sees many patients.

many_to_many.png

In our relational database, this relation requires three tables, one for each of patients and physicians:

doc_id doc_name
P01 Dr. Spock
P02 Dr. Horrible
P03 Dr. Strangelove

And a junction table defining the many-to-many relationships:

select * from physician_patient;
doc_id mrn
P01 U123456
P01 U654321
P02 U654321
P03 U123456

Now we can view all physician-patient relationships using a three-way join

select *
from patients
join physician_patient using(mrn)
join physicians using(doc_id);
pkey mrn name dob sex doc_id doc_name
1 U123456 Roger Rabbit 1939-04-01 M P01 Dr. Spock
1 U123456 Roger Rabbit 1939-04-01 M P03 Dr. Strangelove
2 U654321 Jessica Krupnick 1924-10-27 F P01 Dr. Spock
2 U654321 Jessica Krupnick 1924-10-27 F P02 Dr. Horrible

Data types

Most databases and programming languages make a distinction between various data types (integers, real numbers, text, binary data, etc). Why is this important?

  • Operations may be defined for some data types but not others (eg, division makes sense for real numbers but not text).
  • Different data types require different amounts of space for storage. For example, in MySQL, a column containing an integer representation of true/false (eg, a boolean) requires only 1 byte per element, whereas the string "True" will typically occupy at least 4 bytes. This can become important when anticipating database requirements or managing large amounts of data.

In an SQL database, data types are defined along with the table schema:

.schema results
CREATE TABLE IF NOT EXISTS "results"(
"mrn" TEXT, "accession" TEXT, "date" TEXT, "battery_code" TEXT,
 "test_code" TEXT, "value" TEXT, "flag" TEXT);

SQL

SQL (originally SEQUEL for Structured English Query Language) is a programming language for managing relational databases. Although versions of SQL are defined in internationally-recognized standards, various dialects are used depending on the implementation. Many relational database programs are out there. Some examples of relational database products using SQL that you are likely to come across include:

  • PostgreSQL (free/open source)
  • SQLite (free/open source)
  • MySQL (free/open source)
  • SQL Server (Microsoft, one of its flagship products)
  • Various Oracle products (expensive and enterprise-y)

SQL was designed to be accessible to non-technical users!

Of these database engines, SQLite is probably the easiest to try out - unlike the others, the database consists of a single, portable file that can be accessed using either a command line interface or various GUI's available for your favorite platform (SQLite is found pretty much everywhere). For example, here is the SQLite database used for the examples in this post. If you're on a Mac, try this after downloading to your Downloads folder:

  • open Terminal.app (press CMD+SPACE and type Term…)
  • type this:
cd ~/Downloads
sqlite3 results.db

You should see something like this:

SQLite version 3.7.13 2012-07-17 17:46:21
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite>

Go ahead and try out some of the examples above. You can also download a GUI database browser (Wikipedia has a list) and try out your queries there.

Some examples of relational database operations using SQL

select

This query returns the entire table named results.

select * from results;
mrn accession date battery_code test_code value flag
U123456 Acc1 2014-06-02 BMP GLU 135.0 H
U123456 Acc1 2014-06-02 BMP K 4.0  
U123456 Acc2 2014-06-03 CMP GLU 90.0  
U123456 Acc2 2014-06-03 CMP K 2.7 L
U654321 Acc3 2014-06-02 CMP GLU 85.0  
U654321 Acc3 2014-06-02 CMP K 4.1  
U654321 Acc4 2014-06-03 BMP GLU 75.0  
U654321 Acc4 2014-06-03 BMP K 4.2  

Use a where clause to specify a subset of rows.

select * from results where test_code = 'GLU' order by date;
mrn accession date battery_code test_code value flag
U123456 Acc1 2014-06-02 BMP GLU 135.0 H
U654321 Acc3 2014-06-02 CMP GLU 85.0  
U123456 Acc2 2014-06-03 CMP GLU 90.0  
U654321 Acc4 2014-06-03 BMP GLU 75.0  

You can also specify a subset of columns.

select mrn, date, value from results where test_code = 'GLU' order by date;
mrn date value
U123456 2014-06-02 135.0
U654321 2014-06-02 85.0
U123456 2014-06-03 90.0
U654321 2014-06-03 75.0

join

Relations between tables are specified using a join clause. Here's a two-way join between two tables, patients and results.

select * from patients join results using(mrn);
pkey mrn name dob sex accession date battery_code test_code value flag
1 U123456 Roger Rabbit 1939-04-01 M Acc1 2014-06-02 BMP GLU 135.0 H
1 U123456 Roger Rabbit 1939-04-01 M Acc1 2014-06-02 BMP K 4.0  
1 U123456 Roger Rabbit 1939-04-01 M Acc2 2014-06-03 CMP GLU 90.0  
1 U123456 Roger Rabbit 1939-04-01 M Acc2 2014-06-03 CMP K 2.7 L
2 U654321 Jessica Krupnick 1924-10-27 F Acc3 2014-06-02 CMP GLU 85.0  
2 U654321 Jessica Krupnick 1924-10-27 F Acc3 2014-06-02 CMP K 4.1  
2 U654321 Jessica Krupnick 1924-10-27 F Acc4 2014-06-03 BMP GLU 75.0  
2 U654321 Jessica Krupnick 1924-10-27 F Acc4 2014-06-03 BMP K 4.2  
select patients.name, results.date, tests.test_name, results.value, results.flag
from results
join tests using(test_code)
join patients using(mrn)
where test_code = 'K';
name date test_name value flag
Roger Rabbit 2014-06-02 Potassium 4.0  
Roger Rabbit 2014-06-03 Potassium 2.7 L
Jessica Krupnick 2014-06-02 Potassium 4.1  
Jessica Krupnick 2014-06-03 Potassium 4.2  

group and aggregate

select name, test_code, min(value)
from patients join results using(mrn)
group by mrn, test_code;
name test_code min(value)
Roger Rabbit GLU 135.0
Roger Rabbit K 2.7
Jessica Krupnick GLU 75.0
Jessica Krupnick K 4.1

Hierarchical databases

Hierarchical databases organize data in a tree-like structure. Data is represented as a graph of one-to-many (parent -> child) relations.

hierarchical.png

  • hierarchical databases are arguably extremely well-suited for modeling healthcare data.
  • can be used to efficiently represent data that would otherwise require many tables in a well-normalized relational database.
  • depending on the structure of the hierarchy, certain queries can be extremely efficient: eg, all orders for Patient 1 can be found by traversing only the subtree containing data for that patient, which might represent only a tiny fraction of the entire database.
  • other sorts of queries can be extremely inefficient: eg, finding all results of a certain type might require traversal of the entire tree!
  • guess what: major healthcare applications such as the VA system (VistA), Sunquest FlexiLab, and many Epic products use a hierarchical database implemented using the MUMPS language.

Data management patterns and antipatterns

A basic understanding of database applications, relations, and data types can help guide good practices for managing data. It makes perfect sense to organize and collect (limited amounts of) data in a spreadsheet, but it helps a lot to anticipate that you may want to transfer that data into a database or use software for statistical analysis or producing graphical output (such as R). This is a lot easier if you keep a few rules in mind. Here are some tips. Note that Tidy Data by Hadley Wickham provides an excellent discussion of the ways in which data can be "tidy" or "messy."

name fields sensibly

  • When naming column headers, stick to lowercase letters and underscores only (numbers are ok after the first character).
bad good
Patient # patient_id
infection? is_infected
  • If you have multiple tables, be careful to name fields consistently (don't use "DOB" in one place and "birth_date" in another).
  • If possible, don't use SQL keywords for column names.

don't mix data types

Values in a column are expected to be the same data type. Some databases will produce an error if you try to mix data types (eq, putting a string in a column identified as an integer type). Others may attempt to coerce data to the same type, producing unexpected results. Even without an error, mixing data types greatly complicates writing queries. For example, how can we find values greater than 10 in this table?

run_id value
1 5
2 cancelled
3 11
4 < 2
5 4

In this case, it's better to create another column to contain the non-numeric values.

run_id value comment
1 5  
2   cancelled
3 11  
4   < 2
5 4  

tall, not wide

Excel and EMR applications have made us accustomed to viewing data as a square matrix.

test_name 2014-01-01 2014-01-02 2014-01-03 2014-01-04
Sodium 135.0 137.0   150.0
Potassium   3.5 4.0 3.7

This is great for some purposes: it allows us to easily compare the values of these two tests for this one patient over time. Or perhaps the square is transposed:

date Sodium Potassium
2014-01-01 135.0  
2014-01-02 137.0 3.5
2014-01-03   4.0
2014-01-04 150.0 3.7

However, both of these are lousy formats for storing or manipulating data:

  • they are likely to be sparse, where many cells are missing data, and requiring as many columns as there are dates (first table) or tests (second table) in the entire data set.
  • adding data for an addition test or patient is cumbersome.
  • it is not possible to associate additional attributes with each value.

Here's an improved representation of the above data:

test_name date value method comment
Sodium 2014-01-01 135.0 serum  
Sodium 2014-01-02 137.0 ABG  
Sodium 2014-01-04 150.0 serum confirmed in duplicate
Potassium 2014-01-02 3.5 ABG  
Potassium 2014-01-03 4.0 serum  
Potassium 2014-01-04 3.7 serum  

This table has a fixed number of columns. Adding new data is as simple as tacking rows onto the bottom. We can annotate values with arbitrary attributes, such as methods or comments. It's even easy to aggregate data from multiple patients by adding another column.

be consistent

Composing queries will be easier if you are consistent in your representations of the data. I've frequently come across data like this:

subject antibiotics growth antibiotic
1 Y + azith.
2 N - azithromycin
3 yes + penicillin
4 n + Penicillin
  • within a column, pick a value and stick with it - it's awfully hard to write a query that accommodates multiple variants of the same term or concept.
  • be consistent from column to column, as well. In the example above, we see both yes/no and +/- used as boolean values. How about this instead:
subject antibiotics growth antibiotic
1 yes yes azithromycin
2 no no azithromycin
3 yes yes penicillin
4 no yes penicillin

use multiple tables (normalize the data)

Here's where understanding relations is useful. When I ask for data from the laboratory system, I typically get results back that look something like this:

select results.*, b.battery_name, t.test_name, t.ref_range from results
join tests t using(test_code)
join batteries b using(battery_code);

(The data isn't actually stored like this in the database, but when the request for data includes all of the above fields, it winds up denormalized in the process of fitting it into a single output table).

In this table, test_code uniquely identifies test_name and ref_range, and battery_code uniquely identifies battery_name, so there's no need to include the last three columns in our table of results (imagine having millions of rows like this) - we can store these test and battery attributes in separate tables and perform joins as necessary.

It's worth having a look at the classic short article A Simple Guide to Five Normal Forms in Relational Database Theory

Use a standard format for dates and times

iso_8601.png

Figure 2: Obligatory xkcd (https://xkcd.com/1179/)

American humans might write today's date as 7/23/15, or perhaps 07/23/2015 or July 23, 2015; European humans might write it as 23 July 2015 or 23/7/15.

None of these formats are… (trying to be polite here) exactly optimal. For one thing, neither format results in alphanumeric sorting in chronological order. But much worse, it's hard to guess exactly what format the date is actually in, so parsing is a chore.

It turns out that there's a standard for representing dates and times that addresses these issues, and also allows the representation of time zone (which we won't get into right now): this standard is called ISO 8601. Using this standard, today would be written 2015-07-23. Please be nice and write dates like this. Follow the link to read more (you will learn that right now the ISO 8601 timestamp is something like 2015-07-23T16:55:02+17:00).