Incremental Databases
Evolving data from plain text logs, to recfiles, and finally to sqlite
2024-09-07
Contents
- Phase 1: Lines of text
- Phase 2: Tabular Data
- Interlude: A client
- Phase 3: Data Integrity
- Phase 4: Too big for rec
- Conclusion
- Resources
Phase 1: Lines of text
The simplest database I use on the regular (that isn't the filesystem) is a plain log of lines of text. This is how my todo list and my daily log work. It works great as long as your data isn't all that tabular.
In fact, I have a microblog powered by a text file formatted this way. It's just a timestamp followed by some text. It gets turned into HTML and RSS and published, and that's all I need for it.
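Each entry is one line, in the twtxt style: the timestamp, a tab, then the text. Something like this (an invented example):

```
2024-09-07T10:30:00-04:00	Migrating my reading log out of recutils today.
```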
Contrast this with my todo.txt, which has a lot of syntax:
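It looks roughly like this (these particular items are invented for illustration):

```
. 2024-09-01 write up the incremental databases post +blog P2
. water the plants +home
x 2024-08-30 2024-08-14 file the extension paperwork +finance P1
```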
I can view all my pending todos with a `grep '^\. ' todo.txt`.
Pros: Extremely easy to append a new line
Cons: Relies on "magical" knowledge of the format to do some queries. E.g. status characters are always at the beginning of the line, followed by an optional created-on date. Unless the item is completed, at which point it is followed by a completed-on date and an optional created-on date. This structure exists nowhere except in my head!
Phase 2: Tabular Data
My todo.txt syntax has evolved organically over time in order to make it convenient to search and query. Each item starts with a single character denoting its status: a dot for pending, an x for completed. Lines can have a created-on date and a completed-on date. They can have tags for context, represented by a plus sign followed by a word, and a priority, denoted by P1, P2, or P3.
This is still perfectly usable as is, and I can get all the particular information I want from it with a little grep and awk. But it is also becoming very tabular at this point: it can be represented by keys and fields.
This is the point of data evolution at which I might consider creating a recfile.
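For instance, a single item might become a record like this (the field names here are a sketch of a natural mapping):

```
item: write up the incremental databases post
status: pending
created: 2024-09-01
tag: blog
priority: P2
```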
Pros: Still very human readable, human writable. Highly queryable. Highly structured. No more "magical" structure.
Cons: More verbose. Can no longer update by appending a single line.
At this stage, I'd be happy with either a log or a recfile. The reason I'd want to advance to a recfile is if I want more complex querying and reporting, e.g. aggregate functions or grouping by values.
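For instance, counting pending items per tag becomes a one-liner; something like this ought to work (a sketch, field names as in the record above):

```sh
recsel -e "status = 'pending'" -G tag -p "tag,Count(item)" todo.rec
```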
Interlude: A client
I have a todo.sh script where I've been collecting all my queries: add a new item, mark an item as complete, list all pending, list all completed, list by priority, list by tag, etc.
This creates an abstraction called an "interface" between behavior (e.g. add a new todo item) and its implementation (e.g. format the string and prepend it to todo.txt). This is convenient because A) it allows for shorthand syntax: I can capture a new todo item by typing "t a something i really need to do" (where "t a" is short for "todo add"); and B) it allows me to change the implementation logic or the representation of the data behind the scenes without changing how I interact with the list at the interface level. It could be greps and awks, or it could be recsels. It doesn't matter.
This mostly addresses the cons of a recfile listed above. I can still "t a some item +context P1" and parse that string and insert it into a recfile.
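A stripped-down sketch of what that dispatch can look like (the real script has more subcommands, and I'm skipping the tag/priority parsing here):

```sh
#!/bin/sh
# todo.sh -- sketch of the interface layer (alias t=todo.sh for the shorthand).
# The storage behind it (todo.txt vs. todo.rec) can change without the
# commands changing.

TODO_FILE="${TODO_FILE:-$HOME/todo.rec}"

case "$1" in
    a|add)
        shift
        # "t a some item +context P1" -- everything lands in the item field
        # here; parsing out +tags and P1/P2/P3 is left out of this sketch.
        recins -f item -v "$*" -f status -v pending "$TODO_FILE"
        ;;
    ls|list)
        recsel -e "status = 'pending'" "$TODO_FILE"
        ;;
    done)
        shift
        # naive: mark anything whose item text matches the given pattern
        recset -e "item ~ '$1'" -f status -s done "$TODO_FILE"
        ;;
    *)
        echo "usage: todo.sh {add|list|done} ..." >&2
        exit 1
        ;;
esac
```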
Phase 3: Data Integrity
At this point the data is still very freeform. It is without a schema. I can just slap a new record in there with a text editor. I can add, remove, and update fields all willy-nilly. Maybe I want to create a brand new field for due dates. There are no rules. The limit does not exist!
The next phase in this incremental database journey is to clean up this lawless town. To add a schema and start enforcing types and restrictions. recutils supports this via record descriptors.
I could describe my todo records thusly:
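Something along these lines (the exact field names and enum values here are a sketch, not the real file):

```
%rec: Todo
%key: id
%auto: id created
%type: id int
%type: item line
%type: status enum pending done
%type: priority enum P1 P2 P3
%type: created date
%type: completed date
%type: tag line
%mandatory: item status
%allowed: id item status created completed priority tag
```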
Now I can still just edit the todo list manually if I want to. (And I will if I want to edit or rephrase the text of an item.) But I'm primarily using an abstraction, todo.sh, to interact with it. So I can start using recins(1) to add items, which will start to enforce the schema: auto-inserting an "id" and a "created", only allowing a status and a priority if the values are present in the enumeration. Rejecting any field that is not present in "%allowed", be it a typo or whatever. Refusing to insert without all of the mandatory fields.
Now we have data integrity.
And I can run some queries a little bit more easily than with a bunch of awk.
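For example (sketches, assuming the descriptor above):

```sh
# pending P1 items
recsel -t Todo -e "status = 'pending' && priority = 'P1'" todo.rec

# how many items have I completed?
recsel -t Todo -e "status = 'done'" -c todo.rec
```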
Phase 4: Too big for rec
recutils is optimized for small data. My todo list is plenty small enough for it. But a bigger dataset, like my goodreads.com reading history, is way too big for it.
I have about 700 books in this database. recsel can read it quickly enough, but recins and recset are incredibly slow when writing to it.
Fortunately, it is very easy to outgrow recutils:
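With rec2csv (which ships with recutils) and the sqlite3 shell, the whole move is roughly:

```sh
rec2csv books.rec > books.csv

sqlite3 books.db <<'EOF'
.mode csv
.import books.csv books
EOF
```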
That's it! Super quick, super easy. That's all you need to get your data into a SQL database.
The thing about SQL is that it is relational. Recfiles can be relational, but I don't often use them that way. In fact, if I'm doing a bunch of joins on different recfiles, that's a sign to me that my data is relational enough to deserve at least considering migrating to SQL. Either way, in my experience the data probably wants a little normalization at this point to prepare it for its new relational life.
If you wanted to be fancy, you could probably print the record descriptor with `recinf -d books.rec` and generate a schema with some kind of script. Here is a proof-of-concept awk script that accomplishes this:
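(What follows is a rough sketch of such a script: it assumes a single record type and one field per %type line, and maps only a few rec types; everything else falls back to TEXT.)

```awk
#!/usr/bin/awk -f
# rec2sql.awk -- turn the output of "recinf -d file.rec" into a CREATE TABLE.

/^%rec:/       { table = tolower($2) }
/^%key:/       { key = $2 }
/^%mandatory:/ { for (i = 2; i <= NF; i++) mandatory[$i] = 1 }
/^%type:/ {
    field = $2
    if ($3 == "int" || $3 == "range") type = "INTEGER"
    else if ($3 == "real")            type = "REAL"
    else                              type = "TEXT"   # date, enum, line, rec, ...
    types[field] = type
    order[++n] = field
}
END {
    printf "CREATE TABLE %s (\n", table
    for (i = 1; i <= n; i++) {
        f = order[i]
        col = sprintf("  %s %s", f, types[f])
        if (f == key)       col = col " PRIMARY KEY"
        if (f in mandatory) col = col " NOT NULL"
        printf "%s%s\n", col, (i < n ? "," : "")
    }
    print ");"
}
```

Run it as `recinf -d todo.rec | awk -f rec2sql.awk`.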
This generates something like this from todo.rec:
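(Output here assumes the descriptor sketched above.)

```sql
CREATE TABLE todo (
  id INTEGER PRIMARY KEY,
  item TEXT NOT NULL,
  status TEXT NOT NULL,
  priority TEXT,
  created TEXT,
  completed TEXT,
  tag TEXT
);
```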
You will probably still need to tweak the output. E.g. enum types should probably become references to a separate enumeration table, and rec types should definitely get turned into foreign keys.
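For the enum case, that tweak could look something like this (names invented):

```sql
CREATE TABLE status (name TEXT PRIMARY KEY);
INSERT INTO status (name) VALUES ('pending'), ('done');

-- and in the todo table:
--   status TEXT NOT NULL REFERENCES status(name)
```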
Anyway, the point is that it's really easy to move data from rec to sqlite.
Pros: Blazingly Fast. All the expressiveness of SQL.
Cons: No longer human readable or writable.
Aside: did you know that sqlite can answer queries in markdown format?
This query:
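(The column names here are guesses at what the import produced; .mode markdown needs a fairly recent sqlite3.)

```sh
sqlite3 books.db <<'EOF'
.mode markdown
SELECT title  AS Title,
       author AS Author,
       format AS Format,
       shelf  AS Shelf
  FROM books
 ORDER BY rowid DESC
 LIMIT 8;
EOF
```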
Gives me this table:
| Title | Author | Format | Shelf |
|---|---|---|---|
| How to Take over the World | Ryan North | Audiobook | currently-reading |
| I Must Be Dreaming | Roz Chast | Hardcover | read |
| The Other Significant Others | Rhaina Cohen | Audiobook | read |
| Justice League vs. Godzilla vs. Kong | Brian Buccellato | Hardcover | read |
| Better Living Through Birding | Christian Cooper | Audiobook | read |
| Braba: A Brazilian Comics Anthology | Rafael Grampá | Paperback | read |
| The Penguin Vol. 1: The Prodigal Bird | Tom King | Paperback | read |
| Space-Mullet Volume 1 | Daniel Warren Johnson | Paperback | read |
Conclusion
I'm a big fan of using as little database as possible when starting a new project. I think recutils is super great for prototyping and then throwing away.
Resources
https://twtxt.readthedocs.io/en/latest/user/intro.html#intro Twtxt, a "timestamp + line of text" microblogging format
https://www.gnu.org/software/recutils/ GNU recutils