Monday, May 18, 2020

Some advanced SQL to analyze COVID-19 data

Learn to write SQL
All of us are impacted by COVID-19. Looking at daily case numbers, the basic reproduction number or mortality rates has become a routine. But what is behind those numbers? How are they computed and how does SQL help? In this post, I want to discuss how some SQL analytics clauses help to dig through the data. When I teach database systems, I always try to show how it applies to life. Here, SQL may not directly help to cure the disease, but SQL is essential to compute and understand the statistics.

SQL for working with data

SQL, the Structured Query Language, is used by most database systems (Db2, MySQL, Oracle, PostgreSQL, SQL Server, SQLite, ...) as query language. Moreover, it can be directly applied to data, using programming libraries to in-memory data or something like the SQL Query service to analyze data managed with Cloud Object Storage.

Plot with COVID-19 data for Germany, showing rolling average

SQL Analytics

Most of us know how to write a simple SELECT ... FROM ... WHERE statement. But how do you easily put data from different days side by side? How do you compute rolling averages to smooth out reported values, e.g., to account for "administrative weekends"? See the above plot for raw vs. smoothed data for Germany. How do you make sure that all countries are represented, even if they did not report at the latest available date (MAX(date))? See the table below with mortality rates computed today based on the latest available data, not the current date.

The answer is to apply SQL window functions and to utilize common table expressions. They allow to partition and order data, then apply additional functions on top. I have created a public gist on GitHub with sample SQL statements to dig into COVID-19 data. It is too long to embed it here. Feel free to comment there or add to it. The two screenshots in this post are based on data processed with such SQL statements.

Mortality rate per 100,000 population

If you have feedback, suggestions, or questions about this post, please reach out to me on Twitter (@data_henrik) or LinkedIn.

The Gist: