A practical introduction to databases and data management

Wed 11 June 2014 — Filed under lectures; tags: laboratory

Edit 2015-07-26: added a section on date formatting
Edit 2016-07-11: add reference to Tidy Data

This post contains the content of a lecture prepared for Laboratory Medicine residents as part of a series on medical informatics.

Objectives
Relational database terminology
Relations
Data types
SQL
Some examples of relational database operations using SQL
Hierarchical databases
Data management patterns and antipatterns

Objectives

be aware that relational databases are a thing and can be described using some terms.
get to know the different types of relations.
see some examples of SQL in action.
make a connection between the above and some useful habits (and harmful anti-patterns) for organizing laboratory and experimental data.

Relational database terminology

relational database: a tabular database with defined "relations" (which can be described in terms of relational algebra) among elements within and between tables.
row: contains a tuple or record representing a single item.
column: represents an attribute or field and corresponds to a labeled element of a record; all elements within a column are of the same data type.
table: a set of rows and columns.

984px-Relational_database_terms.svg.png

Figure 1: Components of a table from a relational database. Source: wikipedia

primary key: a field (or multiple fields in combination) that uniquely identify a row. Can either be natural (i.e., consisting of attributes that already exist in the table), or surrogate (i.e., a new attribute with no "real world" meaning created for the purpose). An example of a combination of attributes serving as a primary key: name + DOB

pkey	mrn	name	dob	sex
1	U123456	Roger Rabbit	1939-04-01	M
2	U654321	Jessica Krupnick	1924-10-27	F

index: a data structure that improves the efficiency of data manipulation by reorganizing or clustering elements, typically as a tree.
normalization: when data is arranged to eliminate redundancy by dividing fields into separate tables, a database is said to be normalized. There is a tradeoff between normalization and complexity: a database may not be fully normalized to make queries simpler or more efficient (eg, by requiring fewer joins).

Relations

For more details, see http://en.wikipedia.org/wiki/Cardinality_(data_modeling)

one to one

describes the relationship between two or more elements in which each element uniquely identifies the other (likely to be true only in a specific context).
represented by elements in the same row
eg, name <–> mrn (but think about whether this is always true!)

Graphically:

And in our database:

select mrn, name from patients;

mrn	name
U123456	Roger Rabbit
U654321	Jessica Krupnick

one to many

in the context of a database, is represented by a pair of tables, in which an element or row can be related to multiple rows in another table

In our database:

select name, mrn, accession
from patients
join results using(mrn)
group by accession
order by mrn;

name	mrn	accession
Roger Rabbit	U123456	Acc1
Roger Rabbit	U123456	Acc2
Jessica Krupnick	U654321	Acc3
Jessica Krupnick	U654321	Acc4

many to many

an entity in table A maps to more than one entity in table B, and vice-versa.
a many-to-many relation between two tables is represented via a third junction table or cross-reference table.
for example, each patient may see more than one physician, and each physician sees many patients.

In our relational database, this relation requires three tables, one for each of patients and physicians:

doc_id	doc_name
P01	Dr. Spock
P02	Dr. Horrible
P03	Dr. Strangelove

And a junction table defining the many-to-many relationships:

select * from physician_patient;

doc_id	mrn
P01	U123456
P01	U654321
P02	U654321
P03	U123456

Now we can view all physician-patient relationships using a three-way join

select *
from patients
join physician_patient using(mrn)
join physicians using(doc_id);

pkey	mrn	name	dob	sex	doc_id	doc_name
1	U123456	Roger Rabbit	1939-04-01	M	P01	Dr. Spock
1	U123456	Roger Rabbit	1939-04-01	M	P03	Dr. Strangelove
2	U654321	Jessica Krupnick	1924-10-27	F	P01	Dr. Spock
2	U654321	Jessica Krupnick	1924-10-27	F	P02	Dr. Horrible

Data types

Most databases and programming languages make a distinction between various data types (integers, real numbers, text, binary data, etc). Why is this important?

Operations may be defined for some data types but not others (eg, division makes sense for real numbers but not text).
Different data types require different amounts of space for storage. For example, in MySQL, a column containing an integer representation of true/false (eg, a boolean) requires only 1 byte per element, whereas the string "True" will typically occupy at least 4 bytes. This can become important when anticipating database requirements or managing large amounts of data.

In an SQL database, data types are defined along with the table schema:

.schema results

CREATE TABLE IF NOT EXISTS "results"(
"mrn" TEXT, "accession" TEXT, "date" TEXT, "battery_code" TEXT,
 "test_code" TEXT, "value" TEXT, "flag" TEXT);

SQL

SQL (originally SEQUEL for Structured English Query Language) is a programming language for managing relational databases. Although versions of SQL are defined in internationally-recognized standards, various dialects are used depending on the implementation. Many relational database programs are out there. Some examples of relational database products using SQL that you are likely to come across include:

PostgreSQL (free/open source)
SQLite (free/open source)
MySQL (free/open source)
SQL Server (Microsoft, one of its flagship products)
Various Oracle products (expensive and enterprise-y)

SQL was designed to be accessible to non-technical users!

Of these database engines, SQLite is probably the easiest to try out - unlike the others, the database consists of a single, portable file that can be accessed using either a command line interface or various GUI's available for your favorite platform (SQLite is found pretty much everywhere). For example, here is the SQLite database used for the examples in this post. If you're on a Mac, try this after downloading to your Downloads folder:

open Terminal.app (press CMD+SPACE and type Term…)
type this:

cd ~/Downloads
sqlite3 results.db

You should see something like this:

SQLite version 3.7.13 2012-07-17 17:46:21
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite>

Go ahead and try out some of the examples above. You can also download a GUI database browser (Wikipedia has a list) and try out your queries there.

Some examples of relational database operations using SQL

select

This query returns the entire table named results.

select * from results;

mrn	accession	date	battery_code	test_code	value	flag
U123456	Acc1	2014-06-02	BMP	GLU	135.0	H
U123456	Acc1	2014-06-02	BMP	K	4.0
U123456	Acc2	2014-06-03	CMP	GLU	90.0
U123456	Acc2	2014-06-03	CMP	K	2.7	L
U654321	Acc3	2014-06-02	CMP	GLU	85.0
U654321	Acc3	2014-06-02	CMP	K	4.1
U654321	Acc4	2014-06-03	BMP	GLU	75.0
U654321	Acc4	2014-06-03	BMP	K	4.2

Use a where clause to specify a subset of rows.

select * from results where test_code = 'GLU' order by date;

mrn	accession	date	battery_code	test_code	value	flag
U123456	Acc1	2014-06-02	BMP	GLU	135.0	H
U654321	Acc3	2014-06-02	CMP	GLU	85.0
U123456	Acc2	2014-06-03	CMP	GLU	90.0
U654321	Acc4	2014-06-03	BMP	GLU	75.0

You can also specify a subset of columns.

select mrn, date, value from results where test_code = 'GLU' order by date;

mrn	date	value
U123456	2014-06-02	135.0
U654321	2014-06-02	85.0
U123456	2014-06-03	90.0
U654321	2014-06-03	75.0

join

Relations between tables are specified using a join clause. Here's a two-way join between two tables, patients and results.

select * from patients join results using(mrn);

pkey	mrn	name	dob	sex	accession	date	battery_code	test_code	value	flag
1	U123456	Roger Rabbit	1939-04-01	M	Acc1	2014-06-02	BMP	GLU	135.0	H
1	U123456	Roger Rabbit	1939-04-01	M	Acc1	2014-06-02	BMP	K	4.0
1	U123456	Roger Rabbit	1939-04-01	M	Acc2	2014-06-03	CMP	GLU	90.0
1	U123456	Roger Rabbit	1939-04-01	M	Acc2	2014-06-03	CMP	K	2.7	L
2	U654321	Jessica Krupnick	1924-10-27	F	Acc3	2014-06-02	CMP	GLU	85.0
2	U654321	Jessica Krupnick	1924-10-27	F	Acc3	2014-06-02	CMP	K	4.1
2	U654321	Jessica Krupnick	1924-10-27	F	Acc4	2014-06-03	BMP	GLU	75.0
2	U654321	Jessica Krupnick	1924-10-27	F	Acc4	2014-06-03	BMP	K	4.2

select patients.name, results.date, tests.test_name, results.value, results.flag
from results
join tests using(test_code)
join patients using(mrn)
where test_code = 'K';

name	date	test_name	value	flag
Roger Rabbit	2014-06-02	Potassium	4.0
Roger Rabbit	2014-06-03	Potassium	2.7	L
Jessica Krupnick	2014-06-02	Potassium	4.1
Jessica Krupnick	2014-06-03	Potassium	4.2

group and aggregate

select name, test_code, min(value)
from patients join results using(mrn)
group by mrn, test_code;

name	test_code	min(value)
Roger Rabbit	GLU	135.0
Roger Rabbit	K	2.7
Jessica Krupnick	GLU	75.0
Jessica Krupnick	K	4.1

Hierarchical databases

Hierarchical databases organize data in a tree-like structure. Data is represented as a graph of one-to-many (parent -> child) relations.

hierarchical databases are arguably extremely well-suited for modeling healthcare data.
can be used to efficiently represent data that would otherwise require many tables in a well-normalized relational database.
depending on the structure of the hierarchy, certain queries can be extremely efficient: eg, all orders for Patient 1 can be found by traversing only the subtree containing data for that patient, which might represent only a tiny fraction of the entire database.
other sorts of queries can be extremely inefficient: eg, finding all results of a certain type might require traversal of the entire tree!
guess what: major healthcare applications such as the VA system (VistA), Sunquest FlexiLab, and many Epic products use a hierarchical database implemented using the MUMPS language.

Data management patterns and antipatterns

A basic understanding of database applications, relations, and data types can help guide good practices for managing data. It makes perfect sense to organize and collect (limited amounts of) data in a spreadsheet, but it helps a lot to anticipate that you may want to transfer that data into a database or use software for statistical analysis or producing graphical output (such as R). This is a lot easier if you keep a few rules in mind. Here are some tips. Note that Tidy Data by Hadley Wickham provides an excellent discussion of the ways in which data can be "tidy" or "messy."

name fields sensibly

When naming column headers, stick to lowercase letters and underscores only (numbers are ok after the first character).

bad	good
Patient #	patient_id
infection?	is_infected

If you have multiple tables, be careful to name fields consistently (don't use "DOB" in one place and "birth_date" in another).
If possible, don't use SQL keywords for column names.

don't mix data types

Values in a column are expected to be the same data type. Some databases will produce an error if you try to mix data types (eq, putting a string in a column identified as an integer type). Others may attempt to coerce data to the same type, producing unexpected results. Even without an error, mixing data types greatly complicates writing queries. For example, how can we find values greater than 10 in this table?

run_id	value
1	5
2	cancelled
3	11
4	< 2
5	4

In this case, it's better to create another column to contain the non-numeric values.

run_id	value	comment
1	5
2		cancelled
3	11
4		< 2
5	4

tall, not wide

Excel and EMR applications have made us accustomed to viewing data as a square matrix.

test_name	2014-01-01	2014-01-02	2014-01-03	2014-01-04
Sodium	135.0	137.0		150.0
Potassium		3.5	4.0	3.7

This is great for some purposes: it allows us to easily compare the values of these two tests for this one patient over time. Or perhaps the square is transposed:

date	Sodium	Potassium
2014-01-01	135.0
2014-01-02	137.0	3.5
2014-01-03		4.0
2014-01-04	150.0	3.7

However, both of these are lousy formats for storing or manipulating data:

they are likely to be sparse, where many cells are missing data, and requiring as many columns as there are dates (first table) or tests (second table) in the entire data set.
adding data for an addition test or patient is cumbersome.
it is not possible to associate additional attributes with each value.

Here's an improved representation of the above data:

test_name	date	value	method	comment
Sodium	2014-01-01	135.0	serum
Sodium	2014-01-02	137.0	ABG
Sodium	2014-01-04	150.0	serum	confirmed in duplicate
Potassium	2014-01-02	3.5	ABG
Potassium	2014-01-03	4.0	serum
Potassium	2014-01-04	3.7	serum

This table has a fixed number of columns. Adding new data is as simple as tacking rows onto the bottom. We can annotate values with arbitrary attributes, such as methods or comments. It's even easy to aggregate data from multiple patients by adding another column.

be consistent

Composing queries will be easier if you are consistent in your representations of the data. I've frequently come across data like this:

subject	antibiotics	growth	antibiotic
1	Y	+	azith.
2	N	-	azithromycin
3	yes	+	penicillin
4	n	+	Penicillin

within a column, pick a value and stick with it - it's awfully hard to write a query that accommodates multiple variants of the same term or concept.
be consistent from column to column, as well. In the example above, we see both yes/no and +/- used as boolean values. How about this instead:

subject	antibiotics	growth	antibiotic
1	yes	yes	azithromycin
2	no	no	azithromycin
3	yes	yes	penicillin
4	no	yes	penicillin

use multiple tables (normalize the data)

Here's where understanding relations is useful. When I ask for data from the laboratory system, I typically get results back that look something like this:

select results.*, b.battery_name, t.test_name, t.ref_range from results
join tests t using(test_code)
join batteries b using(battery_code);

(The data isn't actually stored like this in the database, but when the request for data includes all of the above fields, it winds up denormalized in the process of fitting it into a single output table).

In this table, test_code uniquely identifies test_name and ref_range, and battery_code uniquely identifies battery_name, so there's no need to include the last three columns in our table of results (imagine having millions of rows like this) - we can store these test and battery attributes in separate tables and perform joins as necessary.

It's worth having a look at the classic short article A Simple Guide to Five Normal Forms in Relational Database Theory

Use a standard format for dates and times

Figure 2: Obligatory xkcd (https://xkcd.com/1179/)

American humans might write today's date as 7/23/15, or perhaps 07/23/2015 or July 23, 2015; European humans might write it as 23 July 2015 or 23/7/15.

None of these formats are… (trying to be polite here) exactly optimal. For one thing, neither format results in alphanumeric sorting in chronological order. But much worse, it's hard to guess exactly what format the date is actually in, so parsing is a chore.

It turns out that there's a standard for representing dates and times that addresses these issues, and also allows the representation of time zone (which we won't get into right now): this standard is called ISO 8601. Using this standard, today would be written 2015-07-23. Please be nice and write dates like this. Follow the link to read more (you will learn that right now the ISO 8601 timestamp is something like 2015-07-23T16:55:02+17:00).

test_name	test_code	ref_range	range_low	range_hi
Glucose	GLU	62 - 125	62	125
Potassium	K	3.7 - 5.2	3.7	5.2

battery_code	battery_name
CMP	Comprehensive Metabolic Panel
BMP	Basic Metabolic Panel