đŸ‘©â€đŸ’» chrismanbrown.gitlab.io

Incremental Databases

Evolving data from plain text logs, to recfiles, and finally to sqlite

2024-09-07

Contents

  1. Phase 1: Lines of text
  2. Phase 2: Tabular Data
  3. Interlude: A client
  4. Phase 3: Data Integrity
  5. Phase 4: Too big for rec
  6. Conclusion

Phase 1: Lines of text

The simplest database I use on the regular (that isn’t the filesystem) is a plain log of lines of text. This is how my todo list and my daily log work. It’s great as long as your data isn’t all that tabular.

In fact, I have a microblog powered by a textfile formatted this way. It’s just a timestamp followed by some text. It gets turned into html and rss and published, and that’s all I need for it.
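
A couple of lines in that vein (entries invented for illustration):

2024-09-07 09:14 the first tomato of the season is always the best tomato
2024-09-07 18:05 reading on the porch. perfect weather for it
figure: a hypothetical microblog textfile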

Contrast this with my todo.txt, which has a lot of syntax:

x get a housewarming gift for hoodzilla +shopping
. book a hotel for the convention +travel P2
. 2024-09-06 can i get a discount on this bill if i ask nicely? +bills
x play D&D with friends
x 2024-09-03 2024-09-02 email the guy about the thing +project P1
figure: a few random and slightly edited lines from todo.txt

I can view all my pending todos with a grep '^\. ' todo.txt.

Pros: Extremely easy to append a new line

Cons: Relies on “magical” knowledge of the format to do some queries. e.g. status characters are always at the beginning of the line, followed by an optional created-on date. Unless the item is completed, at which point it is followed by a completed-on date and an optional created-on date. This structure exists nowhere except in my head!

Phase 2: Tabular Data

My todo.txt syntax has evolved organically over time in order to make it convenient to search and query. Each item starts with a single character denoting its status: a dot for pending, an x for completed. Each line can have a created-on date and a completed-on date. It can have tags for context, represented by a plus sign followed by a word. And it can have a priority, denoted by P1, P2, P3.

This is still perfectly usable as is, and I can get all the particular information I want from it with a little grep and awk. But it is also becoming very tabular at this point: it can be represented by keys and fields.
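
Here are the kinds of one-liners I mean, all leaning on those magical field positions:

> grep '^x ' todo.txt                # everything completed
> grep -c '^\. ' todo.txt            # count of pending items
> awk '/^\./ && /P1/' todo.txt       # pending items with priority P1
> grep '+travel' todo.txt            # anything tagged +travel
figure: example queries against todo.txt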

This is the point of data evolution at which I might consider creating a recfile.

status: complete
text: get a housewarming gift for hoodzilla
tags: shopping

status: todo
text: book a hotel for the convention
tags: travel
priority: P2

status: todo
created_on: 2024-09-06
text: can i get a discount on this bill if i ask nicely?
tags: bills

status: complete
text: play D&D with friends

status: complete
completed_on: 2024-09-03
created_on: 2024-09-02
text: email the guy about the thing
tags: project
priority: P1
figure: todo.txt but a recfile

Pros: Still very human readable, human writable. Highly queryable. Highly structured. No more “magical” structure.

Cons: More verbose. Can no longer update by appending a single line.

At this stage, I’d be happy with either a log or a recfile. The reason I’d want to advance to a recfile is if I want more complex querying and reporting. e.g. aggregate functions or grouping by values, etc.
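
For example, here is how those queries look against the recfile above. No more field positions, just field names:

> recsel -e "status = 'todo'" todo.rec                          # all pending items
> recsel -e "status = 'todo' && tags = 'travel'" todo.rec       # pending travel items
> recsel -p "Count(status)" -e "status = 'complete'" todo.rec   # how many items are done
figure: example recsel queries against todo.rec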

Interlude: A client

I have a todo.sh script where I’ve been collecting all my queries: add a new item, mark an item as complete. List all pending, list all completed, list by priority, list by tag, etc.

This creates an abstraction called an “interface” between behavior (e.g. add a new todo item) and its implementation (e.g. format the string and prepend it to todo.txt). This is convenient because A) it allows for shorthand syntax: I can capture a new todo item by typing “t a something i really need to do” (where “t a” is short for “todo add”); and B) it allows me to change the implementation logic or the representation of the data behind the scenes without changing how I interact with the list at the interface level. It could be greps and awks, or it could be recsels. It doesn’t matter.

This mostly addresses the cons of a recfile listed above. I can still “t a some item +context P1” and parse that string and insert it into a recfile.
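
A stripped-down sketch of the shape of it (not the real script, which also parses +tags and priorities out of the input):

#!/bin/sh
# todo.sh: one interface, swappable implementation
db="$HOME/todo.rec"

cmd="$1"; shift
case "$cmd" in
  a|add)  recins -f status -v todo -f text -v "$*" "$db" ;;
  d|done) recset -e "text ~ '$1'" -f status -s complete "$db" ;;
  l|list) recsel -e "status = 'todo'" "$db" ;;
  *)      echo "usage: todo [a]dd | [d]one | [l]ist" >&2; exit 1 ;;
esac
figure: a minimal todo.sh dispatcher

Aliased to “t”, that’s the whole “t a something i really need to do” flow.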

Phase 3: Data Integrity

At this point the data is still very freeform. It is without a schema. I can just slap a new record in there with a text editor. I can add, remove, and update fields all willy-nilly. Maybe I want to create a brand new field for due dates. There are no rules. The limit does not exist!

The next phase in this incremental database journey is to clean up this lawless town: add a schema and start enforcing types and restrictions. recutils supports this via record descriptors.

I could describe my todo records thusly:

%rec: todo
%doc: an item on my todo list
%key: id
%type: id int
%auto: id
%type: created,completed,due date
%auto: created
%type: status enum complete todo blocked deferred delegated note
%type: priority enum 1 2 3 0
%mandatory: id created status text
%allowed: id created completed due status text tag priority

id: 0
created: 2024-08-26
completed: 2024-09-02
status: complete
text: get a housewarming gift for hoodzilla
tag: shopping

...
figure: a record descriptor and a record from todo.rec

Now I can still just edit the todo list manually if I want to. (And I will if I want to edit or rephrase the text of an item.) But I’m primarily using an abstraction, todo.sh, to interact with it. So I can start using recins(1) to add items, which will start to enforce the schema: autoinserting an ‘id’ and a ‘created’, only allowing a status and a priority if the values are present in the enumeration, rejecting any field that is not present in ‘%allowed’ (be it a typo or whatever), and refusing to insert without all of the mandatory fields.
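
For example (field values invented):

> # id and created get autoinserted:
> recins -t todo -f status -v todo -f text -v "water the plants" -f tag -v chores todo.rec
> # rejected: 'someday' is not in the status enum:
> recins -t todo -f status -v someday -f text -v "learn the theremin" todo.rec
> # and recfix(1) verifies the whole file against its descriptor:
> recfix todo.rec
figure: recins(1) and recfix(1) enforcing the schema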

Now we have data integrity.

And I can run some queries a little bit more easily than with a bunch of awk.
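
For example, counting and grouping, which would take a pipeline or two of awk:

> recsel -t todo -e "priority = '1'" -p text todo.rec      # just the text of P1 items
> recsel -t todo -G status -p "status,Count(id)" todo.rec  # tally of items by status
figure: queries that are easier with recsel than with awk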

Phase 4: Too big for rec

recutils is optimized for small data. My todo list is plenty small enough for it. But a bigger dataset, like my goodreads.com reading history, is way too big for it.

I have about 700 books in this database. recsel reads it quickly enough, but recins and recset are incredibly slow when writing to it.

Fortunately, it is very easy to outgrow recutils:

> recsel books.rec | rec2csv > books.csv
> sqlite3 books.sqlite ".import --csv books.csv books"
> # test the import with a query:
> sqlite3 books.sqlite "select Title from books order by random() limit 3"
Good Omens: The Nice and Accurate Prophecies of Agnes Nutter, Witch
Half Empty
A Heartbreaking Work of Staggering Genius
figure: exporting data from rec database as a csv and importing it into a sqlite database

That’s it! Super quick, super easy. That’s all you need to get your data into a sql database.

The thing about sql is that it is relational. Recfiles can be relational too, but I don’t often use them that way. In fact, if I’m doing a bunch of joins on different recfiles, that’s a sign that my data is relational enough to at least consider migrating to sql. Either way, in my experience the data probably wants a little normalization at this point to prepare it for its new relational life.
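
For reference, a relational recfile looks something like this: a field of type ‘rec’ acts as a foreign key into another record set, and recsel’s -j flag joins on it, flattening the referenced record’s fields into the result. (A made-up author record set for illustration, not my actual books.rec.)

%rec: author
%key: id

%rec: book
%type: author rec author

> recsel -t book -j author books.rec
figure: a foreign key between record sets, and a join on it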

If you wanted to be fancy, you could probably print the record descriptor with ‘recinf -d books.rec’ and generate a schema with some kind of script. Here is a proof-of-concept awk script that accomplishes this:

#!/bin/awk -f
# rec2schema
# usage: recinf -d db.rec | rec2schema

$1 ~ "%rec:" {
  rec=$2
}

$1 ~ "%allowed:" {
  for(i=2;i<=NF;i++)
    column[$i] = $i
  next
}

$1 ~ "%type:" && $3 ~ "enum" {
  type[$2] = "text"
  opts[$2] = opts[$2] " check( " $2 " in ("
  for (i=4;i<=NF;i++) {
    opts[$2] = opts[$2] " '" $i "'"
    if (i != NF) opts[$2] = opts[$2] ","
  }
  opts[$2] = opts[$2] ") )"
  next
}

$1 ~ "%type:" {
  split($2,types,",")
  for(t in types) {
    k=types[t]
    type[k] = $3 
  }
  next
}

$1 ~ "%key:" {
  opts[$2] = opts[$2] " primary key"
  next
}

$1 ~ "%mandatory:" {
  opts[$2] = opts[$2] " not null"
  next
}

$1 ~ "%auto" {
  if (type[$2] == "date") {
    opts[$2] = opts[$2] " default current_timestamp"
  } else {
    opts[$2] = opts[$2] " autoincrement"
  }
  next
}

END {
  print "create table if not exists " rec " ("
  sep = ""
  for (c in column) {
    t = type[c] == "" ? "text" :
      type[c] == "date" ? "timestamp" :
      type[c] == "line" ? "text" :
      type[c] == "int" ? "integer" :
      type[c]
    # comma between column definitions, but none after the last
    printf "%s%s %s%s", sep, c, t, opts[c]
    sep = ",\n"
  }
  print "\n);"
}
figure: rec2schema.awk

This generates something like this from todo.rec:

create table if not exists todo (
id integer primary key autoincrement not null,
created timestamp default current_timestamp not null,
completed timestamp,
due timestamp,
text text not null,
tag text,
status text check( status in ( 'complete', 'todo', 'blocked', 'deferred', 'delegated', 'note') ) not null,
priority text check( priority in ( '1', '2', '3', '0') )
);
figure: an sql Create Table statement generated from a record descriptor with rec2schema.awk

You will probably still need to tweak the output. e.g. enum types should probably be references to a separate enumeration table. And rec types should definitely get turned into foreign keys.

> recsel books.rec | rec2csv > books.csv
> recinf -d books.rec | rec2schema > books.schema
> # (edit books.schema as needed)
> printf "%s\n" \
    ".read books.schema" \
    ".import --csv --skip 1 books.csv books" \
    | sqlite3 books.sqlite
figure: importing rows and a schema into a sqlite table from a recfile
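
For example, here is the enum tweak in practice: the shelf enum becomes its own little table, and books points at it through a foreign key. (This sketch assumes the imported books table has a Shelf column; the table and column names match the query further down.)

> printf "%s\n" \
    "create table exclusive_shelves (Id integer primary key, Shelf text unique);" \
    "insert into exclusive_shelves (Shelf) select distinct Shelf from books;" \
    "alter table books add column ExclusiveShelfId integer references exclusive_shelves(Id);" \
    "update books set ExclusiveShelfId = (select Id from exclusive_shelves e where e.Shelf = books.Shelf);" \
  | sqlite3 books.sqlite
figure: promoting an enum column to an enumeration table with a foreign key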

Anyway, the point is that it’s really easy to move data from rec to sqlite.

Pros: Blazingly Fast. All the expressiveness of SQL.

Cons: No longer human readable or writable.

Aside: did you know that sqlite can answer queries in markdown format?

This query:

> printf "%s\n" \
  ".mode markdown" \
  "select Title,Author,Format,Shelf \
    from books b \
    join authors a on b.AuthorId = a.Id \
    join exclusive_shelves e on b.ExclusiveShelfId = e.Id \
    order by b.Id desc \
    limit 8" \
  | sqlite3 books.sqlite
figure: a sqlite query designed to showcase “.mode markdown”

Gives me this table:

| Title                                  | Author                | Format    | Shelf             |
|----------------------------------------|-----------------------|-----------|-------------------|
| How to Take over the World             | Ryan North            | Audiobook | currently-reading |
| I Must Be Dreaming                     | Roz Chast             | Hardcover | read              |
| The Other Significant Others           | Rhaina Cohen          | Audiobook | read              |
| Justice League vs. Godzilla vs. Kong   | Brian Buccellato      | Hardcover | read              |
| Better Living Through Birding          | Christian Cooper      | Audiobook | read              |
| Braba: A Brazilian Comics Anthology    | Rafael GrampĂĄ         | Paperback | read              |
| The Penguin Vol. 1: The Prodigal Bird  | Tom King              | Paperback | read              |
| Space-Mullet Volume 1                  | Daniel Warren Johnson | Paperback | read              |

Conclusion

I’m a big fan of using as little database as possible when starting a new project. I think recutils is super great for prototyping and then throwing away.
