Thursday, November 18, 2021

On serverless data scraping, cloud object storage, MinIO and rclone

Building a data lake the serverless way
This fall, I started another side project. It involves mobility data and its analysis and visualization. As a first step, and still ongoing, I am building up a data lake (and maybe integrating it into a data fabric). For that, I regularly download data from various sources, then upload it to a Cloud Object Storage/S3 bucket. Once the data is in cloud storage, I process it further, often with tools like the MinIO client and rclone. Let me give you a quick overview and links to some background reading...

Serverless data scraping with Code Engine

To regularly download data from open data APIs or to scrape websites, you would typically use one or more virtual machines. As detailed in the blog Data Scraping Made Easy Thanks to IBM Code Engine Jobs, I use a serverless approach instead. Code Engine is a fully managed, serverless platform for containerized workloads. Code Engine distinguishes between applications and jobs. Applications (apps) serve HTTP requests, whereas jobs run once and then exit, similar to a batch job. This makes a Code Engine job a good fit for retrieving data and uploading it to storage.
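As a minimal sketch of the setup with the IBM Cloud CLI (the project name, job name, and container image are placeholders for your own values):

    # select the Code Engine project to work with
    ibmcloud ce project select --name my-project

    # register the container image holding the scraping script as a job
    ibmcloud ce job create --name data-scraper --image icr.io/my-namespace/data-scraper:latest

    # submit a single job run to test it
    ibmcloud ce jobrun submit --job data-scraper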

To run the job regularly, I can configure a periodic timer (cron) as the event producer that triggers the job run. The job is the script that contacts APIs or websites to retrieve data, possibly post-processes it, and then uploads the data to a Cloud Object Storage bucket (see diagram above). There, the data can later be accessed by SQL Query or by scripts in a notebook of an IBM Watson Studio analytics project. Accessing the data through other means (e.g., from other apps or scripts inside or outside IBM Cloud) is possible, too.
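Hooking up the timer is a single CLI call. Again a sketch with placeholder names; the schedule uses standard cron syntax, here firing once a day at 06:00 UTC:

    # create a cron subscription that triggers a run of the job every day
    ibmcloud ce subscription cron create --name daily-scrape \
      --destination data-scraper --destination-type job \
      --schedule '0 6 * * *'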

Manage S3 data with mc and rclone

Once the data is in cloud storage, how do you easily access and process it from your (local) machine? Most cloud object storage services support the Amazon S3 (Simple Storage Service) API, including IBM Cloud Object Storage. This has led to many tools and SDKs being available. Two of them are the MinIO client "mc" and rclone, which I use in my setup.
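Both tools can talk to any S3-compatible endpoint once they know the credentials. As an example, pointing mc at an IBM Cloud Object Storage bucket with HMAC credentials could look like the following; the alias, endpoint region, and keys are placeholders, and rclone offers a similar, interactive setup via "rclone config":

    # register the COS endpoint under the alias "mycos"
    mc alias set mycos https://s3.us-south.cloud-object-storage.appdomain.cloud \
      ACCESS_KEY_ID SECRET_ACCESS_KEY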

mc and rclone offer typical file system commands like ls, mv, cp, and cat, and even support mirroring/syncing of directories. In my blog Manage Your Cloud Object Storage Data with the MinIO Client and rclone, I discuss how to set them up and some useful commands. The screen recording below shows some of the mc commands in action.
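A few of the everyday commands, using the alias from above and a placeholder bucket and object name:

    # list buckets, then the objects in one bucket
    mc ls mycos
    mc ls mycos/mybucket

    # print an object to stdout, similar to cat on a local file
    mc cat mycos/mybucket/data/2021-11-18.json

    # copy an object to the local machine
    mc cp mycos/mybucket/data/2021-11-18.json .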

Use mc to sync between cloud and local storage
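What the recording shows boils down to a one-liner: mc mirror synchronizes a bucket (or a prefix within it) with a local directory. The names below are again placeholders:

    # mirror the bucket contents into a local directory
    # (add --watch to keep syncing as new objects arrive)
    mc mirror mycos/mybucket ./local-copy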

The tool rclone even has a built-in (experimental) browser UI. Both tools are available for all common operating systems.
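If you want to try that UI, rclone serves it from its remote-control daemon; this single command should be all that is needed:

    # launch the experimental web GUI in the default browser
    rclone rcd --rc-web-gui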

What's next?

Scraping and collecting data and moving it around are necessary steps before analyzing it. The latter includes rebuilding databases, running daily reports, performing interactive evaluation and experiments in Jupyter notebooks, and creating maps and other visualizations. But that is enough content for another blog post. Stay tuned...

If you have feedback, suggestions, or questions about this post, please reach out to me on Twitter (@data_henrik) or LinkedIn.