👩‍💻 chrismanbrown.gitlab.io

marking up abbreviations with awk

awk is pretty great at pattern matching and transforming text

2024-03-13

Here’s a little awk script that I found in UNIX Text Processing that I thought was pretty clever, and a good use of awk.

What it does is scan a text for known abbreviations and expand them.

Let’s make a little database:

AWK:Aho Weinberger Kernighan
CSS:California Style Sheets
HTML:Hyper Text Machine Learning
JS:JavaScript
JSX:JavaScript xXx
MDX:Markdown eXtreme
UNIX:Uniplexed Information and Computing Service
db/abbr.txt

Awk works like this according to one of its authors:

AWK reads the input a line at a time. A line is scanned for each pattern in the program, and for each pattern that matches, the associated action is executed.

- Alfred V. Aho

This is what that looks like:

awk 'BEGIN { actions before any matching occurs }
pattern1 { action }
pattern2 { more action }
END { actions after processing is complete }
' file.txt
example awk script
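
To make that skeleton concrete, here is a minimal runnable sketch. The sample text and the /awk/ pattern are made up for illustration; it just counts the lines that mention awk.

```shell
# Sample input: three lines, two of which mention awk.
input='awk is neat
sed is neat too
I like awk'

result=$(printf '%s\n' "$input" | awk '
BEGIN { count = 0 }                        # runs once, before any input is read
/awk/ { count++ }                          # runs for each line matching the regex
END   { print count " matching lines" }    # runs once, after the last line
')
echo "$result"   # prints: 2 matching lines
```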

In our awk script, we will have one pattern, the empty pattern, that matches (and acts on) every line.

#!/bin/zsh
awk '
{
  ## part 1: process data
  if (FILENAME == "db/abbr.txt") {
    ...
  }
  ## part 2: process input text
  for (a in abbr) {
    ...
  }
}' db/abbr.txt "$@"
bin/abbr.sh

This script is invoked like bin/abbr.sh myblogpost.md. It calls awk with the abbr database first, and then the script arguments. For each line in both files, it executes the code between the outermost curly braces.

In part 1, we use the builtin FILENAME to check the name of the file being processed. If that file is the database of abbreviations, we’ll process that line and add to an array of abbreviations and definitions.

if (FILENAME == "db/abbr.txt") {
  split($0, fields, ":")
  abbr[fields[1]] = fields[2]
  next
}
part 1: process data

The split function takes the string to split ($0 in this case, the entire line), the name of an array to split it into, and a character to split on (:).
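
Here is a small sketch of split on its own, fed one line of the database. The variable n is just a name I picked for the return value, which is the number of pieces.

```shell
result=$(printf 'UNIX:Uniplexed Information and Computing Service\n' | awk '
{
  n = split($0, fields, ":")   # n is the number of pieces produced
  print n
  print fields[1]              # the part before the colon
  print fields[2]              # the part after the colon
}')
echo "$result"
```

This prints 2, then UNIX, then the definition. One thing to keep in mind: a definition that itself contains a colon would be split into more than two pieces, and fields[2] would only hold the text up to the second colon.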

We then create an associative array abbr. Its key is the abbreviation and its value is the definition. When db/abbr.txt is done being processed, abbr will contain all of the abbreviations and definitions in the file!

Finally, the next keyword will prevent the rest of the awk script from being executed on this line.
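
As an aside, a common alternative to comparing FILENAME is the NR == FNR idiom, which is true only while awk is reading its first file. This is a different technique from the one in the script, sketched here with throwaway files whose names and contents are made up. It shows next keeping database lines away from the second rule.

```shell
# Throwaway files so the sketch is self-contained (hypothetical names).
tmp=$(mktemp -d)
printf 'JS:JavaScript\nAWK:Aho Weinberger Kernighan\n' > "$tmp/abbr.txt"
printf 'first line\nsecond line\n' > "$tmp/post.txt"

result=$(awk '
NR == FNR {                  # true only while reading the first file
  split($0, fields, ":")
  abbr[fields[1]] = fields[2]
  next                       # skip the rule below for database lines
}
{ print "text: " $0 }        # reached only for lines of the second file
END { print abbr["JS"] }     # the array is still populated at the end
' "$tmp/abbr.txt" "$tmp/post.txt")
echo "$result"
rm -r "$tmp"
```

Only the two lines of the post get the "text: " prefix, and the lookup at the end prints JavaScript. The idiom has a known weak spot: if the first file is empty, NR == FNR stays true for the second file too, so the FILENAME check is arguably the safer choice here.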

Part 2: “process input text” will only be run on lines that are not part of db/abbr.txt. That is, it will be run on your source file: myblogpost.md or whatever.

for (a in abbr) {
  for (f = 1; f <= NF; f++) 
    if (tolower($f) == tolower(a)) {
      $f = $f " (" abbr[a] ")"
    }
}
part 2: process input text

We begin in the outer for loop by iterating over abbr, the associative array we created when processing the dictionary of abbreviations.

In the inner for loop, we iterate over all of the fields (words) in a line. NF is a builtin that holds the number of fields (words) in a record (line). For each f (field/word), we check whether it matches the current abbreviation, ignoring case, and if so, append the definition to the word!

Note that awk has no explicit concatenation operator: strings are concatenated simply by writing them next to each other.
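
A quick sketch of concatenation by juxtaposition, using a made-up input line with extra spaces in it. It also shows a side effect worth knowing about: assigning to a field makes awk rebuild $0 from its fields, joined by single spaces.

```shell
# Note the run of three spaces before JS in the input.
result=$(printf 'I like   JS a lot\n' | awk '
{
  $3 = $3 " (" "JavaScript" ")"   # strings placed side by side are joined
  print $0                         # $0 was rebuilt, so the extra spaces are gone
}')
echo "$result"   # prints: I like JS (JavaScript) a lot
```

So the script will quietly normalize runs of whitespace on any line where it expands something.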

Here is the complete program.

#!/bin/zsh
awk '
{
  if (FILENAME == "db/abbr.txt") {
    split($0, fields, ":")
    abbr[fields[1]] = fields[2]
    next
  }
  for (a in abbr) {
    for (f = 1; f <= NF; f++) 
      if (tolower($f) == tolower(a)) {
        $f = $f " (" abbr[a] ")"
      }
  }
  print $0
}' db/abbr.txt "$@"
bin/abbr.sh, complete

Here is a sample source file.

JSX and MDX are part of the modern JS CSS HTML web ecosystem, and both trace their origins back to UNIX an operation system for phones from the 1960s. JSX is basically JS 2FAST2FURIOUS

And here is the result of running sh bin/abbr.sh sample.txt.

JSX (JavaScript xXx) and MDX (Markdown eXtreme) are part of the modern JS (JavaScript) CSS (California Style Sheets) HTML (Hyper Text Machine Learning) web ecosystem, and both trace their origins back to UNIX (Uniplexed Information and Computing Service) an operation system for phones from the 1960s. JSX (JavaScript xXx) is basically JS (JavaScript) 2FAST2FURIOUS

As a treat, maybe you only want to provide the definition the first time an abbreviation appears in the text. We can create a little guard clause to achieve this by setting abbr[a] to a after expanding the abbreviation once, and then checking for abbr[a] != a.

#!/bin/zsh
awk '
{
  if (FILENAME == "db/abbr.txt") {
    split($0, fields, ":")
    abbr[fields[1]] = fields[2]
    next
  }
  for (a in abbr) {
    if (abbr[a] != a)
      for (f = 1; f <= NF; f++) 
        if (tolower($f) == tolower(a) && abbr[a] != a) {
          $f = $f " (" abbr[a] ")"
          abbr[a] = a
        }
  }
  print $0
}' db/abbr.txt "$@"
we can have a little guard clause as a treat

This now results in:

JSX (JavaScript xXx) and MDX (Markdown eXtreme) are part of the modern JS (JavaScript) CSS (California Style Sheets) HTML (Hyper Text Machine Learning) web ecosystem, and both trace their origins back to UNIX (Uniplexed Information and Computing Service) an operation system for phones from the 1960s. JSX is basically JS 2FAST2FURIOUS

Neat!

This can trivially be edited to output the abbr HTML element instead of a parenthetical expansion. You could also enhance the script, for example so that it does not expand abbreviations inside code blocks.
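
Here is a rough sketch of the abbr element variant, not a finished tool. The temp files and the db variable (passed with -v so the path is not hard-coded) are my own assumptions, and the definition is not HTML-escaped, so a definition containing a double quote would break the title attribute.

```shell
# Throwaway database and post so the sketch is self-contained (hypothetical names).
tmp=$(mktemp -d)
printf 'JS:JavaScript\nCSS:California Style Sheets\n' > "$tmp/abbr.txt"
printf 'JS and CSS\n' > "$tmp/post.txt"

result=$(awk -v db="$tmp/abbr.txt" '
{
  if (FILENAME == db) {
    split($0, fields, ":")
    abbr[fields[1]] = fields[2]
    next
  }
  for (a in abbr) {
    for (f = 1; f <= NF; f++)
      if (tolower($f) == tolower(a)) {
        # wrap the word in an abbr element instead of appending a parenthetical
        $f = "<abbr title=\"" abbr[a] "\">" $f "</abbr>"
      }
  }
  print $0
}' "$tmp/abbr.txt" "$tmp/post.txt")
echo "$result"
rm -r "$tmp"
```

For the two-word sample this prints both JS and CSS wrapped in abbr elements, with the definition in the title attribute.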

That’s all!

The End.