We Ditched Our Database For Files
We ditched our database technology for flat files at the smapp lab and it solved a lot of our problems.
When I started at the lab we had a single database that could store up to 12 terabytes (TB) of data. I was in charge of handling the transition to a new 40TB database system. We built some tools to migrate between the two; it involved a lot of dumps and a lot of data-integrity checks. The new database was intended to last a few years, but it was effectively full 6 months after we got everything set up. We had a lot of problems with mongodb (network issues, data clearing issues, data balancing issues, unreliable counting issues), and simply scaling up to a bigger version would have locked us into a technology that clearly wasn't working well for us (you don't want to do more of what isn't working). So we decided to adopt a new system that would meet most of our needs and leave options for dealing with the other cases. Here's what our needs were:
1 - we needed to store a ton of data without spending a ton of money
2 - we needed to be able to store all sorts of data types (heterogeneous data)
3 - we needed good python interface libraries, since python is our primary language at smapp
4 - we needed to be able to do date filtering
Here’s what I did:
I made a system that stores json objects in files, where each filename contains the name of the dataset and the date, and each file holds all the json objects collected on that day, one object per line. The datasets are organized by name, each in its own folder. We stream incoming data to a staging area, compress it (bz2), and then use rsync and rclone to copy the compressed data to our archive system (disk + tape backups).
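The write path above can be sketched in a few lines of python. This is a minimal illustration, not our production code: the function names (`append_record`, `compress_day`) are made up for the example, and the filename pattern is taken from the dataset listing shown below.

```python
import bz2
import json
import os
from datetime import datetime, timezone


def append_record(dataset_dir, dataset_name, obj, when=None):
    """Append one json object, on its own line, to the day's staging file.

    Filenames follow the pattern:
    <name>_data__MM_DD_YYYY__00_00_00__23_59_59.json
    """
    when = when or datetime.now(timezone.utc)
    fname = "{}_data__{:%m_%d_%Y}__00_00_00__23_59_59.json".format(dataset_name, when)
    path = os.path.join(dataset_dir, "data", fname)
    with open(path, "a") as f:
        f.write(json.dumps(obj) + "\n")


def compress_day(path):
    """Compress a finished day's file to .bz2 before it is rsync'd to the archive."""
    with open(path, "rb") as src, bz2.open(path + ".bz2", "wb") as dst:
        dst.write(src.read())
```

Because each line is a complete json document, a half-written file only ever loses its last line, and any unix tool (grep, wc, split) works on the decompressed data.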
Here is what a dataset looks like:
test_volume
├── data
│   ├── test_volume_data__06_06_2017__00_00_00__23_59_59.json.bz2
│   ├── test_volume_data__06_07_2017__00_00_00__23_59_59.json.bz2
│   ├── test_volume_data__06_08_2017__00_00_00__23_59_59.json.bz2
│   ├── test_volume_data__06_09_2017__00_00_00__23_59_59.json.bz2
│   ├── test_volume_data__06_10_2017__00_00_00__23_59_59.json.bz2
│   ├── test_volume_data__06_11_2017__00_00_00__23_59_59.json.bz2
│   ├── test_volume_data__06_12_2017__00_00_00__23_59_59.json.bz2
│   ├── test_volume_data__06_13_2017__00_00_00__23_59_59.json.bz2
│   ├── test_volume_data__06_14_2017__00_00_00__23_59_59.json.bz2
│   ├── test_volume_data__06_15_2017__00_00_00__23_59_59.json.bz2
│   ├── test_volume_data__06_16_2017__00_00_00__23_59_59.json.bz2
│   ├── test_volume_data__06_17_2017__00_00_00__23_59_59.json.bz2
│   ├── test_volume_data__06_18_2017__00_00_00__23_59_59.json.bz2
│   ├── test_volume_data__06_19_2017__00_00_00__23_59_59.json.bz2
│   ├── test_volume_data__06_20_2017__00_00_00__23_59_59.json.bz2
│   ├── test_volume_data__06_21_2017__00_00_00__23_59_59.json.bz2
│   ├── test_volume_data__06_22_2017__00_00_00__23_59_59.json.bz2
│   ├── test_volume_data__06_23_2017__00_00_00__23_59_59.json.bz2
│   ├── test_volume_data__06_24_2017__00_00_00__23_59_59.json.bz2
│   ├── test_volume_data__06_25_2017__00_00_00__23_59_59.json.bz2
│   ├── test_volume_data__06_26_2017__00_00_00__23_59_59.json.bz2
│   └── test_volume_data__06_27_2017__00_00_00__23_59_59.json.bz2
├── filters
│   └── filters.json
└── metadata
    ├── metadata.json
    └── metadata_limits.json
You can see all the data goes into the data folder; filters (the details we use to collect data) go into the filters folder; and metadata about the collection (failed attempts, logging) goes into the metadata folder. There are multiple copies of this structure. It could easily be replicated on something like amazon s3 or another cheap storage provider, and you can replicate to as many filesystems as you like. You don't need to get a whole database running somewhere, just a file system.
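This layout is also what makes date filtering (need #4) cheap: because the date is in the filename, you can select the days you want before decompressing anything. Here's a sketch of how a read might look; `records_between` is a hypothetical helper name, and the filename parsing assumes the pattern shown in the listing above.

```python
import bz2
import glob
import json
import os
from datetime import datetime


def records_between(dataset_dir, start, end):
    """Yield json objects from every daily file whose date falls in [start, end].

    Filtering happens on filenames alone, so files outside the
    requested range are never even decompressed.
    """
    pattern = os.path.join(dataset_dir, "data", "*.json.bz2")
    for path in sorted(glob.glob(pattern)):
        # filename: <name>_data__MM_DD_YYYY__00_00_00__23_59_59.json.bz2
        date_part = os.path.basename(path).split("__")[1]
        day = datetime.strptime(date_part, "%m_%d_%Y")
        if start <= day <= end:
            with bz2.open(path, "rt") as f:
                for line in f:
                    yield json.loads(line)
```

Since the function is a generator, a full scan never holds more than one line of one file in memory at a time.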
We like this system because we can store a ton of data this way, with 10x compression. We are not locked into any proprietary technology, truly open software: python, bzip2, rsync, rclone, unix permissions. We store exactly what we collect; we won't accidentally delete or transform data in a way that is untrackable or irreversible. Obviously this won't suit everyone's needs, but it's a decent system if you have a lot of data that you don't need real-time access to: data needs to be decompressed before use, and it cannot be queried as easily as if it were sitting in a database. We realized that in a research environment, where everyone wants to do something a bit different with the data, it is best to just keep a raw copy and then prepare the data from that raw copy for whatever specific project it will be used for.
Some may slyly remark that most databases are just files under the hood. That's true, but you usually don't have much control over what that file format is, which can raise serious questions about the validity or completeness of a dataset (e.g. mongo's bson format). There's nothing too crazy or novel about what we did, but I thought it might be interesting to walk through our choice to use an archive file system and free unix software to store our 'big data' in an age when pretty much everyone else is using a database.
Thanks for reading!