Friday, March 11, 2016

Coincidence? CeBIT visitors and weather featuring Jupyter Notebooks, Spark and dashDB

[Screenshot: Jupyter Notebook via Bluemix]
Next week I am going to talk at the CeBIT fair in Hanover. As usual, I am interested in what the weather will be like. And with every conference or fair, a common question is about attendance. Why not combine the two and analyse past CeBIT weather and visitor counts for some Friday fun? Today I am going to look into Jupyter Notebooks on Apache Spark with some Open Data stored in dashDB, all available via IBM Bluemix.
(Note: I am in a hurry and don't have time for detailed steps today, but I am sharing the sources and will add steps later on.)

The screenshot on the right shows the result of what I am going to produce today. The source file for the notebook, the exported HTML file, input data, etc. can be found in this GitHub repository. If you came here for DB2 or dashDB you might wonder what Jupyter Notebooks are. Notebooks are interactive web pages with sections ("cells") that contain text or code. The text can be in different input formats, including Markdown. The code cells support various programming languages, can be edited inline, and are executed on demand. Basically, a notebook is an interactive, on-demand business/database report. And as you can see in the screenshot, the code can produce graphs.
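To give you an idea, here is a minimal sketch of what such a code cell could look like, rendering a chart inline (the numbers are made-up placeholders, not real CeBIT data):

```python
# Minimal notebook code cell: render a chart inline.
# The values below are made-up placeholders, not real CeBIT data.
%matplotlib inline
import matplotlib.pyplot as plt

years = [2012, 2013, 2014, 2015]
visitors = [10, 12, 9, 11]

plt.plot(years, visitors, marker='o')
plt.xlabel('Year')
plt.ylabel('Visitors')
plt.title('Sample inline chart')
plt.show()
```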

The IBM Analytics for Apache Spark service on Bluemix provides those analytic notebooks and it is the service I provisioned for my tests. Once you launch the service you can start off with sample notebooks or create them from scratch. I started with samples to get up to speed and then composed my own (see my notebook source on GitHub). It has several cells written in Python to set up a connection to dashDB/DB2, execute queries, fetch data and process that data within the notebook. The data is used to plot a couple of graphs.
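As a rough sketch of what such a connection cell can look like, here is one way to do it with the ibm_db driver and pandas. All connection values and the table name are placeholders; take the real values from your dashDB service credentials on Bluemix:

```python
# Sketch: connect to dashDB/DB2 from a Python notebook cell.
# All connection values and the table name are placeholders.
import ibm_db
import ibm_db_dbi
import pandas as pd

dsn = ("DATABASE=BLUDB;"
       "HOSTNAME=<your-dashdb-host>;"
       "PORT=50000;"
       "PROTOCOL=TCPIP;"
       "UID=<username>;"
       "PWD=<password>;")

conn = ibm_db_dbi.connect(dsn, "", "")

# Fetch a query result straight into a pandas data frame
weather = pd.read_sql("SELECT * FROM HANNOVER_WEATHER", conn)
weather.head()
```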

For my example I am using dashDB (a DB2-based service) that I provisioned on Bluemix as the data store. I used the LOAD wizard to create and fill one table holding historic CeBIT dates and visitor counts and another table with historic weather data for Hanover, Germany (obtained from Deutscher Wetterdienst). Within the notebook those tables are queried and the data is fetched into so-called data frames. The data frames are used to transform and shape the data as needed (a sketch of that step follows below) and serve as the source for the generated graphs. Within the notebook it is also possible to combine data frames, execute queries on them and more, something I didn't do today.
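Continuing from the connection cell above, the shaping step might look like this. The table and column names are hypothetical, and the merge at the end only illustrates the combining that I skipped today:

```python
# Sketch: fetch both tables and shape them with pandas.
# Table and column names are hypothetical.
cebit = pd.read_sql("SELECT * FROM CEBIT_VISITORS", conn)
weather = pd.read_sql("SELECT * FROM HANNOVER_WEATHER", conn)

# Reduce the daily weather records to one row per year
weather_by_year = (weather
                   .groupby('YEAR', as_index=False)
                   .agg({'TEMP_AVG': 'mean', 'PRECIPITATION': 'sum'}))

# Combining data frames is possible too, e.g. joining on the year
combined = cebit.merge(weather_by_year, on='YEAR')
combined.head()
```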

To get to my dashDB-based graphs in a Jupyter Notebook on IBM Analytics for Apache Spark I needed to work around some issues I ran into, including data type casts, naming of result columns, labeling of graphs, selecting columns as input for a graph, and more. For time reasons I refer you to the comments in the source code of my notebook.
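In the spirit of those comments, here is a sketch of what such workarounds typically look like, again with hypothetical column names and continuing from the cells above:

```python
# Sketch of typical workarounds; column names are hypothetical.
import matplotlib.pyplot as plt

# DB2 DECIMAL values may come back as Python Decimal objects,
# so cast them to float before computing or plotting with them
combined['TEMP_AVG'] = combined['TEMP_AVG'].astype(float)

# Name computed result columns in SQL so the data frame gets
# usable labels, e.g.: SELECT AVG(temperature) AS temp_avg ...

# Select the plotted columns explicitly and label the graph
ax = combined.plot(x='YEAR', y='VISITORS', marker='o')
ax.set_xlabel('Year')
ax.set_ylabel('CeBIT visitors')
plt.show()
```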

After all that introduction, here is the resulting graph. It shows that during a sunny and warm week with close to no rain there were fewer CeBIT attendees. A little rain, some sun and average temperatures yielded a high visitor count. So could it be that the weather-to-attendee relationship is bogus for computer fairs and may only hold for museums? Anyway, it was fun learning Jupyter Notebooks on Bluemix. Now I need to plot my weekend plans...
[Chart: Historic CeBIT Weather and Attendance]