Building a data lake the serverless way
Serverless data scraping with Code Engine
To regularly download data from open data APIs or to scrape websites, you would typically use one or more virtual machines. As detailed in the blog Data Scraping Made Easy Thanks to IBM Code Engine Jobs, I used a serverless approach instead. Code Engine is a fully managed, serverless platform for containerized workloads. Code Engine distinguishes between applications and jobs: applications (apps) serve HTTP requests, whereas jobs run once and then exit, similar to a batch job. This makes a Code Engine job a good fit for retrieving data and uploading it to storage.
To run the job regularly, I can configure a periodic timer (cron) as event producer that triggers the job run. The job is the script that contacts APIs or websites to retrieve data, optionally post-processes it, and then uploads it to a Cloud Object Storage bucket (see diagram above). From there, the data can later be accessed by SQL Query or by scripts in a notebook of an IBM Watson Studio analytics project. Accessing the data through other means (e.g., from other apps or scripts inside or outside IBM Cloud) is possible, too.
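To make this concrete, here is a minimal sketch of what such a job could look like in Python. It is not the exact script from the referenced blog post: the API URL, bucket name, and environment variable names are placeholders, and it assumes the IBM COS SDK for Python (ibm_boto3) with credentials injected as environment variables, for example through Code Engine secrets.

```python
# Minimal sketch of a Code Engine job: fetch data from an API and
# upload it to a Cloud Object Storage bucket. URL, bucket, and env
# variable names are placeholders, not my actual configuration.
import datetime
import os

import requests
import ibm_boto3
from ibm_botocore.client import Config

# COS client, configured from environment variables (e.g., Code Engine secrets)
cos = ibm_boto3.client(
    "s3",
    ibm_api_key_id=os.environ["COS_APIKEY"],
    ibm_service_instance_id=os.environ["COS_INSTANCE_CRN"],
    config=Config(signature_version="oauth"),
    endpoint_url=os.environ["COS_ENDPOINT"],
)


def run():
    # Retrieve the data from a (hypothetical) open data API
    response = requests.get("https://example.com/open-data/api/records", timeout=60)
    response.raise_for_status()

    # Store the raw result under a date-based key for later analysis
    key = f"scraped/records-{datetime.date.today().isoformat()}.json"
    cos.put_object(
        Bucket=os.environ["COS_BUCKET"],
        Key=key,
        Body=response.content,
    )
    print(f"uploaded {key}")


if __name__ == "__main__":
    run()
```

Packaged as a container image or built from source by Code Engine, such a script runs as a job, does its work, and exits.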
Manage S3 data with mc and rclone
Once the data is in cloud storage, how do you easily access and process it from your (local) machine? Most cloud object storage services, including IBM Cloud Object Storage, support the Amazon S3 (Simple Storage Service) API. As a result, many tools and SDKs are available. Two of them are the MinIO client "mc" and rclone, which I use in my setup.
mc and rclone offer typical file system commands like ls, mv, cp, and cat, and they even support mirroring/syncing of directories. In my blog Manage Your Cloud Object Storage Data with the MinIO Client and rclone, I discuss how to set up both tools and show some useful commands. The screen recording below shows some of the mc commands in action.
Use mc to sync between cloud and local storage
The tool rclone even has a built-in (experimental) browser UI. Both tools are available for all the typical OS environments.
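Because both tools only rely on the S3 API, the same bucket is also reachable from a plain Python script or notebook. Below is a small, hypothetical sketch using the generic boto3 SDK with HMAC credentials; the bucket name, endpoint, and object keys are placeholders and not from my actual setup.

```python
# Sketch: list and download objects uploaded by the scraping job,
# using the generic AWS SDK (boto3) against the S3-compatible COS
# endpoint. Credentials, endpoint, and names below are placeholders.
import os

import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id=os.environ["COS_HMAC_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["COS_HMAC_SECRET_ACCESS_KEY"],
    endpoint_url="https://s3.eu-de.cloud-object-storage.appdomain.cloud",
)

bucket = "my-datalake-bucket"

# List everything the scraping job has uploaded so far
for obj in s3.list_objects_v2(Bucket=bucket, Prefix="scraped/").get("Contents", []):
    print(obj["Key"], obj["Size"])

# Download one object for local processing
s3.download_file(bucket, "scraped/records-2021-01-01.json", "records.json")
```

The same pattern works from apps or scripts inside or outside IBM Cloud, which is exactly the kind of additional access mentioned above.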
What's next?
Scraping and collecting data and moving it around are necessary steps before analyzing it. The analysis includes rebuilding databases, running daily reports, interactive evaluation and experiments in Jupyter notebooks, and creating maps and other visualizations. But that is enough content for another blog post. Stay tuned...
If you have feedback, suggestions, or questions about this post, please reach out to me on Twitter (@data_henrik) or LinkedIn.