I often get frowns when I tell people that I like CSV files for storing time series data. The criticism is usually one of these three:
- CSV files are huge
- There is no standardised way to add metadata
- Different sample rates are hard to combine in one file
All of this is true. Yet the power of CSV becomes apparent when you look at data analysis from the practical perspective of an engineer: it is fast, easy to read and has rich tooling.
Example: I have a test setup with 17 signals measured at 1 kHz, which gives me a 100 MB file after measuring for 15 minutes. Asking for the data at row 500 000 takes only about 60 ms.
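As a rough illustration, here is what that kind of lookup might look like with pandas. The file name is an assumption, and the exact timing will depend on your disk and CPU:

```python
import pandas as pd

# Minimal sketch of a single-row lookup in a large CSV.
PATH = "measurement.csv"   # assumption: 17 signal columns sampled at 1 kHz

# Keep the header (line 0), skip the first 499 999 data rows,
# and parse only the single row we care about.
row = pd.read_csv(PATH, skiprows=range(1, 500_000), nrows=1)
print(row)
```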
This is certainly not a given for other file formats. Old MAT files, for example, need to load the whole dataset into RAM before you can extract the specific rows you need. XLSX files have a similar problem: an XLSX file is actually a ZIP archive wrapped around the data, so it has to be unpacked before you get to anything.
Easier to understand than the average README
CSV is just a text file. So it’s easy to take a quick peek inside to see what it contains.
Usually you use a script or data pipeline to process your data file, and you rely on it to make calculations or visualisations of the data. For some data files, this processing will crash.
When that happens, it is up to you to understand why. Values might be missing, text might be scrambled, the encoding might be wrong, … Looking at the raw data underneath often reveals the problem, and having an easy way to debug it yourself saves a lot of time.
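When a pipeline chokes, a few raw lines are often all you need to see. A minimal sketch of that kind of peek, with an assumed file name:

```python
# Print the first raw lines of the file; errors="replace" makes bad bytes
# visible instead of raising yet another exception.
with open("measurement.csv", encoding="utf-8", errors="replace") as f:
    for i, line in enumerate(f):
        print(i, repr(line))
        if i >= 10:   # the first handful of lines usually reveals the issue
            break
```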
Rich tooling that won’t make you poor
CSV files are everywhere: there is a huge pool of free libraries, storage and tools available to use.
For scripting, you can get started in any language. pandas (Python), readmatrix (MATLAB) and readr (R) are examples of excellent options. Most IDEs also have some syntax highlighting that makes it easier to interpret the columns.
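A typical first script in Python might look like this; the file and column names are made up for illustration:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Load the CSV and parse the time column (hypothetical column names).
df = pd.read_csv("measurement.csv", parse_dates=["timestamp"])

print(df.describe())                      # quick sanity check of the signals
df.plot(x="timestamp", y="engine_temp")   # visualise one channel
plt.show()
```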
Common data storage solutions can also import from CSV: Postgres has COPY, InfluxDB has write and SQLite has .import. Once your data is in a database, it opens the door to tooling like Grafana and Marple.
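For SQLite you can also do the import from a Python script instead of the .import command. A sketch, with assumed table and column names:

```python
import csv
import sqlite3

# Load a CSV into SQLite; the table layout is an assumption for illustration.
con = sqlite3.connect("measurements.db")
con.execute("CREATE TABLE IF NOT EXISTS signals (timestamp TEXT, engine_temp REAL)")

with open("measurement.csv", newline="") as f:
    reader = csv.DictReader(f)
    con.executemany(
        "INSERT INTO signals VALUES (?, ?)",
        ((r["timestamp"], r["engine_temp"]) for r in reader),
    )

con.commit()
con.close()
```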
CSV, what else? ☕
After having a taste of CSV, you might still be concerned about the large file size or how to add metadata properly.
HDF5 might be what you are looking for. Beware that it adds complexity over CSV, but in return you get smaller files and a more flexible structure. How to organise the data inside the file is up to you, so take the time to agree on a structure across your company or team. If you do it well, reading specific parts of the data from HDF5 can be faster than reading from a CSV.
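A sketch of what such a partial read can look like with pandas (this needs the PyTables package; the file name, key and row range are assumptions):

```python
import pandas as pd

# Convert once: format="table" allows querying row ranges later.
df = pd.read_csv("measurement.csv")
df.to_hdf("measurement.h5", key="signals", format="table")

# Later: read only rows 500 000-500 009 without touching the rest of the file.
part = pd.read_hdf("measurement.h5", key="signals", start=500_000, stop=500_010)
```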
Parquet is a second alternative. It compresses data even better than HDF5, which is why we mainly see engineering companies using it for long-term storage.
Both of these formats are well supported in libraries too. pandas, for example, has read_hdf and read_parquet for reading, and to_hdf and to_parquet for writing.
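A sketch of the Parquet round trip with pandas (this needs pyarrow or fastparquet installed; file and column names are assumptions):

```python
import pandas as pd

# Convert the CSV once to Parquet.
df = pd.read_csv("measurement.csv")
df.to_parquet("measurement.parquet")

# Parquet is columnar, so you can pull back just the columns you need.
subset = pd.read_parquet("measurement.parquet", columns=["timestamp", "engine_temp"])
```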