From Chaos to Clarity: How I Ingested a Decade of Data from 400+ IoT Devices into InfluxDB

Eoin McCartan

As a Placement Software Engineer at Hamilton Robson, I recently tackled an issue that involved ingesting a decade’s worth of legacy sensor data from over 400 devices into InfluxDB. Having never worked with a dataset this large before, I found it a complex task with a lot of moving parts. I had to ensure no data was missed – the data was crucial to the client and formed part of the larger environmental monitoring system we were building.

Data Sources

We’d always planned for historical data ingestion after completing the MVP, but nothing could have prepared us for the reality. When the client finally shared their historical data files, I was shocked – 420 Excel spreadsheets, each packed with around 800,000 rows of sensor readings dating back to 2008. With a tight deadline, I had to make quick but crucial decisions about how to handle numerous data inconsistencies, such as naming convention issues, empty fields and inconsistent formats.

Data Preparation

After reviewing the Excel sheets, it quickly became apparent that a major clean-up was needed, and given the volume of data, processing it manually wasn’t an option. The first critical step was converting the Excel spreadsheets into CSV format: while Excel is great for readability, viewing and editing, CSV is more lightweight and efficient for processing in a data pipeline. Once converted, the files were uploaded to an Amazon S3 bucket for secure storage.
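As a rough illustration of that conversion step, here’s a minimal Python sketch of the Excel-to-CSV-and-upload idea (the folder, bucket name and file layout are placeholders rather than the real project values):

```python
# Minimal sketch: convert each Excel export to CSV, then upload to S3.
# Paths and the bucket name below are illustrative placeholders.
from pathlib import Path

import boto3
import pandas as pd

s3 = boto3.client("s3")
BUCKET = "sensor-history-archive"  # hypothetical bucket name

for xlsx_path in Path("exports").glob("*.xlsx"):
    # Read the spreadsheet and write a lightweight CSV alongside it
    df = pd.read_excel(xlsx_path)
    csv_path = xlsx_path.with_suffix(".csv")
    df.to_csv(csv_path, index=False)

    # Push the converted file up to S3 for secure storage
    s3.upload_file(str(csv_path), BUCKET, csv_path.name)
```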

Ingestion Attempts

With the data now clean and standardized, I faced the next big challenge of ingesting this large dataset. It took a bit of trial and error to get my ingestion pipeline working properly – things kept breaking or running too slowly. Here are some of the attempts I initially made before finding my solution:

1st Attempt – Node.js Service

For the initial attempt, I developed a Node.js service to batch-write data into InfluxDB using the `influxdb-client` library. The service processed the CSV files in a loop, with a delay between batch requests and a checkpoint system that recorded the CSV file name and the last successfully processed line. This setup was intended to allow for easy restarts and recovery of any missed data.
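The real service was written in Node.js, but for consistency with the later examples here is the same batch-write-plus-checkpoint pattern sketched in Python using the official InfluxDB client library (the connection details, column names and batch size are assumptions, not the production values):

```python
# Sketch of the batch-write + checkpoint pattern (the original was a Node.js service).
# URL, token, org, bucket, column names and batch size are placeholders.
import csv
import json
import time

from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="http://localhost:8086", token="TOKEN", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

BATCH_SIZE = 5000
CHECKPOINT_FILE = "checkpoint.json"

def save_checkpoint(file_name, line_no):
    # Record the file and last processed line so a restart can resume cleanly
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"file": file_name, "line": line_no}, f)

def load_checkpoint():
    try:
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"file": None, "line": 0}

def ingest(csv_file):
    checkpoint = load_checkpoint()
    start = checkpoint["line"] if checkpoint["file"] == csv_file else 0
    batch = []
    with open(csv_file, newline="") as f:
        for i, row in enumerate(csv.DictReader(f)):
            if i < start:
                continue  # already written on a previous run
            batch.append(
                Point("sensor_reading")
                .tag("device_id", row["device_id"])
                .field("value", float(row["value"]))
                .time(row["timestamp"])
            )
            if len(batch) >= BATCH_SIZE:
                write_api.write(bucket="sensors", record=batch)
                save_checkpoint(csv_file, i + 1)
                batch = []
                time.sleep(1)  # small delay between batch requests
    if batch:
        write_api.write(bucket="sensors", record=batch)
        save_checkpoint(csv_file, i + 1)
```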

However, the system ran into issues as InfluxDB began crashing unexpectedly. After extensive research, debugging and monitoring with CloudWatch metrics, I discovered that memory usage was gradually climbing to 100%, which led to the crashes.

The problem of excessive memory usage in InfluxDB is not uncommon and has been documented by other users experiencing the same issue. It appears InfluxDB’s garbage collection process and memory handling can lead to high memory consumption, as it stores all recent writes in memory before writing them to the disk.

I needed to find a different approach that wouldn’t eat up all the memory during the ingestion.

2nd Attempt – One Large Line Protocol File

For the second attempt, I developed a script to convert the CSV files into a single large line protocol file, to be uploaded through the Influx UI or CLI. Line protocol is InfluxDB’s own text-based format for writing data points. Since this was InfluxDB’s native way of handling data, I figured it would work better.
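For context, a single line protocol entry packs a measurement name, tags, fields and a timestamp into one line of text. Here is a tiny illustrative sketch of the kind of conversion involved (the measurement and column names are made up, not the client’s real schema):

```python
# Illustrative only: turn one CSV row into an InfluxDB line protocol entry.
# The "sensor_reading" measurement and column names are assumptions.
def row_to_line(row):
    measurement = "sensor_reading"
    tags = f"device_id={row['device_id']}"
    fields = f"value={float(row['value'])}"
    timestamp_ns = int(row["timestamp_ns"])  # nanosecond Unix timestamp
    return f"{measurement},{tags} {fields} {timestamp_ns}"

# Example output:
# sensor_reading,device_id=device-042 value=21.5 1220227200000000000
```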

However, InfluxDB quickly shut that idea down with a simple error message: “the file was too large to be uploaded successfully”. I tried both the UI and command line tools, but no luck.

3rd Attempt – Batched Line Protocol Files w/ Influx CLI

In the third attempt to manage data ingestion into InfluxDB, I developed a Python script to split and convert all CSV files into four smaller line protocol files. Python was perfect for this – it’s straightforward to use, gets the job done, and is really efficient when it comes to processing large amounts of data like this.
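A simplified sketch of that splitting step, reusing the hypothetical row_to_line helper from the previous example (the file names and round-robin split are illustrative; the real script’s logic may have differed):

```python
# Sketch: convert every cleaned CSV into line protocol, spreading the output
# across four files so no single upload is too large. Paths are placeholders.
import csv
from pathlib import Path

NUM_FILES = 4
outputs = [open(f"batch_{i}.lp", "w") for i in range(NUM_FILES)]

line_count = 0
for csv_path in Path("clean_csv").glob("*.csv"):
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            target = outputs[line_count % NUM_FILES]  # round-robin split
            target.write(row_to_line(row) + "\n")
            line_count += 1

for handle in outputs:
    handle.close()
```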

This approach aimed to address the memory issues encountered with previous attempts by reducing the size of each data batch. After creating the files, I realized trying to upload through our office internet wasn’t going to work – so I moved everything to an Amazon EC2 instance instead. This made more sense since the EC2 instance would be in the same VPC as InfluxDB, letting them communicate directly through the CLI.

I started uploading the first file using the Influx CLI, feeling confident about this new approach.

After its successful upload, I noticed that memory usage spiked to around 80%. Knowing that InfluxDB temporarily stores recently written data in memory before transferring it to disk, I decided to wait for memory usage to drop below 15% before proceeding with the next file.
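If you wanted to automate that wait instead of watching the dashboard, one option would be to poll the instance’s memory metric before each upload. A rough sketch, assuming the CloudWatch agent is publishing a mem_used_percent metric (the namespace, dimension and threshold are illustrative):

```python
# Rough sketch: block until the instance's memory usage (as reported by the
# CloudWatch agent's mem_used_percent metric) drops below a threshold.
# Namespace, dimensions and threshold values are assumptions.
import time
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

def wait_for_memory(instance_id, threshold=15.0, poll_seconds=60):
    while True:
        now = datetime.now(timezone.utc)
        stats = cloudwatch.get_metric_statistics(
            Namespace="CWAgent",
            MetricName="mem_used_percent",
            Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
            StartTime=now - timedelta(minutes=10),
            EndTime=now,
            Period=300,
            Statistics=["Average"],
        )
        datapoints = stats["Datapoints"]
        if datapoints:
            latest = max(datapoints, key=lambda d: d["Timestamp"])
            if latest["Average"] < threshold:
                return  # safe to start the next upload
        time.sleep(poll_seconds)
```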

This method proved effective: every file was uploaded to the database without issue, finally marking the completion of the data ingestion!

While this was just a one-time task to handle historical data (from now on we’ll only be dealing with live sensor data), it was a great learning experience. I got to experience the challenges of processing large datasets, learn about InfluxDB’s limitations and discover the importance of optimization.
