Friday, October 24, 2014

Apply DB2 skills to Hadoop by using Big SQL on Bluemix

IBM Analytics for Hadoop
One of the many services offered on the Platform-as-a-Service (PaaS) IBM Bluemix is "IBM Analytics for Hadoop", basically InfoSphere BigInsights as a cloud service. Because it is a Big Data service and offers SQL capabilities, I was eager to test it. Here is a first report on how I got some queries running against data in my Hadoop file system in the cloud.

Software for download
After creating the Analytics for Hadoop service (which I left unbound), I launched the dashboard.
The first thing I noticed was a link to download software. It brought up a long list of drivers, Eclipse plugins, API packages, and so forth. If you want to explore the full power of BigInsights / Hadoop, you don't need to search for all the relevant and compatible software; it is just a click away. I passed on that offer and instead created new directories in the HDFS. I had read in the BigInsights tutorials that the directories are needed to hold my data and to create workbooks in the so-called BigSheets tool, which transforms data from plain files into something processable.

Creating the directories is done by clicking on the appropriate icon on top of the directory tree structure, picking the right parent directory, and specifying a name. Been there, done that. I then used another icon to invoke the GUI for file upload. A few minutes later I had uploaded two files with historic weather data (see my older blog entries) and was in the BigSheets section of the dashboard, ready to create a workbook. I named it "weather" and chose one of the files. Because the input file is not a CSV in the strict sense but uses a semicolon to separate the data, I had to apply the "Character Delimited Data with Text Qualifier" reader to it. I specified the semicolon as delimiter and also checked the "header included" option, i.e., the first row holds the column names.
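As a side note for the SQL-minded: the GUI reader is not the only way in. Big SQL can read such delimited files directly through an external table definition. Here is a minimal sketch, assuming hypothetical column names and a hypothetical HDFS path (the real files have more columns, and the BigSheets reader already took care of the header row for me):

    -- hypothetical external table over a semicolon-delimited file in HDFS
    CREATE EXTERNAL HADOOP TABLE weather_raw (
      station     VARCHAR(50),
      obs_date    VARCHAR(10),
      temperature DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
    STORED AS TEXTFILE
    LOCATION '/user/biadmin/weather';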

New BigInsights / Hadoop directory and uploaded files
After creating the workbook I clicked on the option "Create Table" and then experienced a "lesson learned":



So I created another directory, moved the files there, deleted the workbook, and wanted to create it again. However, I got an error message. Another lesson learned:



Ok, purge the deleted workbook, then recreate it. Done. I had read that I can share the data of a workbook for use by Big SQL and Hive by creating a table based on the workbook. In my test I created "sheets.weather" as a first table. I also created a second workbook based on my first, removed some columns, and created a second table "sheets.myweather" (not too creative in naming stuff...). Anyway, the main purpose of this exercise was to import data and to get ready for SQL queries.

Next in my journey was to actually use what is called Big SQL, running SQL queries against data in my Hadoop file system. The Analytics for Hadoop service on Bluemix offers a basic SQL web console for this. As an alternative, you can also issue queries from applications via JDBC (see the software download above) or use other tools. As I was short on time and didn't want to install any software, I decided to use the provided SQL console. When executing SQL statements you have the choice of "Big SQL" and "Big SQL V1". I picked the new "Big SQL" as it is based on the DB2 query compiler and runtime infrastructure. That means the regular DB2 system catalog is available, which comes in handy when you are unsure about available tables, column names, etc. As you can see from the first screenshot below, I started by querying the Big SQL (DB2) system catalog to retrieve my available table names. The second screenshot shows a simple SQL query against the uploaded weather data.
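If you want to retrace those steps, the statements looked roughly like the following. This is a sketch from memory; the schema name matches my workbook exports above, but the weather columns are hypothetical stand-ins for what is in the files:

    -- list the tables created from my BigSheets workbooks
    SELECT tabschema, tabname
    FROM syscat.tables
    WHERE tabschema = 'SHEETS';

    -- a simple query against the uploaded weather data
    SELECT *
    FROM sheets.myweather
    FETCH FIRST 10 ROWS ONLY;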

Overall, given the short amount of time I had, it was a nice experience. Having the opportunity to apply DB2 skills to data in a Hadoop file system lowered the barrier. Given that the data is already uploaded, I am sure I will try out other features of InfoSphere BigInsights / IBM Analytics for Hadoop, too. Have you tested it or Bluemix in general? You can sign up for a free trial (no credit card required) at http://bluemix.net.
DB2 catalog tables for Hadoop

Basic SQL query against Hadoop data


Monday, October 13, 2014

Node-RED: Simple "phoney" JSON entries in Cloudant

Node-RED Starter on IBM Bluemix
Want to automatically store data about who called you on which phone number? Last Friday I tried exactly that on IBM Bluemix and was amazed at how simple it is, with no real programming involved. All I needed was the Node-RED starter boilerplate (icon is on the right) on Bluemix and a new API of my mobile service provider.

The Node-RED boilerplate automatically creates a Node.js runtime environment on Bluemix and installs the Node-RED tool into it. In addition, a Cloudant JSON database is created. Once everything was deployed, I opened the Node-RED tool in a Web browser. It offers a basic set of different input and output methods, processing nodes, and the ability to connect them in a flow graph. One of the input nodes is a listener for http requests, which helps to react to Web service requests. I placed such an http input node on the work sheet and labeled it "phone" (see screenshot).
Node-RED tool on Bluemix

How did I obtain information about callers and the numbers they called? Well, I tapped my friends at a security agency. Actually, I used a new free service offering of one of my phone service providers. It is called "sipgate.io" and allows you to configure a URL/Web address that is accessed whenever someone calls one of the account's phone numbers (you could have multiple phone numbers). An http POST request is sent to the configured URL, with the caller's phone number and the called number included as payload. In my Node-RED application, the "phone" node would answer this request.

What I needed now was the data processing flow for the Web service request. On Friday I already tweeted the entire flow:



On the left we see the "phone" node as http input. Connected to it is the "ok" node, which sends an http response back, telling the phone company's Web service that we received the information. The other connected node is a "json" processor which translates the payload (who called which number) into a meaningful JSON object. That object is then passed on to the "calls" node, a Cloudant output node. All that was needed there was to select the Cloudant service on Bluemix and to configure the database name.
Cloudant Output Node, Node-RED on Bluemix
The magic itself was pressing the "Deploy" button on top of my Node-RED worksheet. It created the Node.js code for my Bluemix app. I was a little bit nervous about testing the app because I hadn't coded anything, just clicked things together. For my test I took the phone in my home office and dialed my mobile phone number. The result? A new, nice and shiny, "phoney" JSON document in Cloudant:
"Phoney" record in Cloudant
Almost too easy. Great stuff, but unfortunately no code to share... :) (You can find the flow on Github)

Tuesday, October 7, 2014

Starvation: Electronic books, DRM, the local library, and database locks

Over the past few days I ran into an interesting database problem. It boils down to resource management and database locks. One of my sons is an avid reader, and thus we have an ongoing flow of hardcopy and electronic books, most of them provided by the local public library (THANK YOU!). Recently, my son used the electronic library to place a reservation on a hard-to-get ebook. Yesterday, he received the email that the book was available exclusively to him (an intention lock) and had to be checked out within 48 hours (placing the exclusive lock). And so my problems began...
Trouble lending an ebook

There is a hard limit on the maximum number of checked-out ebooks per account. All electronic books are lent for 14 days without a way to return them earlier because of Digital Rights Management (DRM). If the account is maxed out, checking out a reserved book does not work. Pure (teenage) frustration. However, there is an exclusive lock on the book copy, so nobody else can borrow it either, making the book harder to get and (seemingly) even more popular. As a consequence, more reservation requests are placed, making the book even harder to borrow. In database theory this is called starvation or resource starvation. My advice to "read something else" is not considered a solution.
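To translate this into database terms, here is a small sketch in DB2-flavored SQL. The table and the two "sessions" are made up for illustration, but they show the same pattern: one holder keeps an exclusive lock, and everybody else waits:

    -- hypothetical table of ebook copies
    CREATE TABLE ebook_copies (
      copy_id INT NOT NULL PRIMARY KEY,
      title   VARCHAR(100),
      status  VARCHAR(20)
    );

    -- session A: the reservation exclusively locks the copy...
    UPDATE ebook_copies SET status = 'reserved' WHERE copy_id = 1;
    -- ...and holds the lock (no COMMIT) for up to 48 hours

    -- session B: every other interested reader now waits or times out
    SELECT status FROM ebook_copies WHERE copy_id = 1 FOR UPDATE;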

How could this software problem be solved? A change to DRM to allow earlier returns seems to be too complex. As there is also a low limit on open reservation requests per account, temporarily bumping up the number of books that can be borrowed per account would both solve the starvation effect and enhance the usability. It would even increase throughput (the average number of books out to readers), reduce lock waits (waiting to read a certain book), and improve customer feedback.

BTW: The locklist configuration in DB2 (similar to the number of books that can be borrowed per account) is adapted automatically by the Self-Tuning Memory Manager (STMM), for ease of use and great user/customer feedback.
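If you want to check that on your own system, the regular DB2 admin views (the same infrastructure Big SQL builds on, see above) show whether STMM is in charge; VALUE_FLAGS reports AUTOMATIC for self-tuned parameters:

    -- show the lock-related configuration parameters
    SELECT name, value, value_flags
    FROM sysibmadm.dbcfg
    WHERE name IN ('locklist', 'maxlocks');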