Intro to Miller's programming language¶
In the Miller in 10 minutes page we took a tour of some of Miller's most-used verbs including cat
, head
, tail
, cut
, and sort
. These are analogs of familiar system commands, but empowered by field-name indexing and file-format awareness: the system sort
command only knows about lines and column names like 1,2,3,4
, while mlr sort
knows about CSV/TSV/JSON/etc records, and field names like color,shape,flag,index
.
We also caught a glimpse of Miller's put
and filter
verbs. These two are special since they let you express statements using Miller's programming language. It's a embedded domain-specific language since it's inside Miller: often referred to simply as the Miller DSL.
In the DSL reference page we have a complete reference to Miller's programming language. For now, let's take a quick look at key features -- you can use as few or as many features as you like.
Records and fields¶
Let's keep using the example.csv file:
mlr --c2p put '$cost = $quantity * $rate' example.csv
color shape flag k index quantity rate cost yellow triangle true 1 11 43.6498 9.8870 431.5655726 red square true 2 15 79.2778 0.0130 1.0306114 red circle true 3 16 13.8103 2.9010 40.063680299999994 red square false 4 48 77.5542 7.4670 579.0972113999999 purple triangle false 5 51 81.2290 8.5910 697.8383389999999 red square false 6 64 77.1991 9.5310 735.7846221000001 purple triangle false 7 65 80.1405 5.8240 466.738272 yellow circle true 8 73 63.9785 4.2370 271.0769045 yellow circle true 9 87 63.5058 8.3350 529.3208430000001 purple square false 10 91 72.3735 8.2430 596.5747605000001
When we type that, a few things are happening:
- We refer to fields in the input data using a dollar sign and then the field name, e.g.
$quantity
. (If a field name contains special characters like a dot or slash, just use curly braces:${field.name}
.) - The expression
$cost = $quantity * $rate
is executed once per record of the data file. Our example.csv has 10 records so this expression was executed 10 times, with the field names$quantity
and$rate
each time bound to the current record's values for those fields. - On the left-hand side we have the new field name
$cost
which didn't come from the input data. Assignments to new variables result in a new field being placed after all the other ones. If we'd assigned to an existing field name, it would have been updated in-place. - The entire expression is surrounded by single quotes (with an adjustment needed on Windows), to get it past the system shell. Inside those, only double quotes have meaning in Miller's programming language.
Multi-line statements, and statements-from-file¶
You can use more than one statement, separating them with semicolons, and optionally putting them on lines of their own:
mlr --c2p put '$cost = $quantity * $rate; $index = $index * 100' example.csv
color shape flag k index quantity rate cost yellow triangle true 1 1100 43.6498 9.8870 431.5655726 red square true 2 1500 79.2778 0.0130 1.0306114 red circle true 3 1600 13.8103 2.9010 40.063680299999994 red square false 4 4800 77.5542 7.4670 579.0972113999999 purple triangle false 5 5100 81.2290 8.5910 697.8383389999999 red square false 6 6400 77.1991 9.5310 735.7846221000001 purple triangle false 7 6500 80.1405 5.8240 466.738272 yellow circle true 8 7300 63.9785 4.2370 271.0769045 yellow circle true 9 8700 63.5058 8.3350 529.3208430000001 purple square false 10 9100 72.3735 8.2430 596.5747605000001
mlr --c2p put ' $cost = $quantity * $rate; # Here is how to make a comment $index *= 100 ' example.csv
color shape flag k index quantity rate cost yellow triangle true 1 1100 43.6498 9.8870 431.5655726 red square true 2 1500 79.2778 0.0130 1.0306114 red circle true 3 1600 13.8103 2.9010 40.063680299999994 red square false 4 4800 77.5542 7.4670 579.0972113999999 purple triangle false 5 5100 81.2290 8.5910 697.8383389999999 red square false 6 6400 77.1991 9.5310 735.7846221000001 purple triangle false 7 6500 80.1405 5.8240 466.738272 yellow circle true 8 7300 63.9785 4.2370 271.0769045 yellow circle true 9 8700 63.5058 8.3350 529.3208430000001 purple square false 10 9100 72.3735 8.2430 596.5747605000001
Anything from a #
character to end of line is a code comment.
One of Miller's key features is the ability to express data-transformation right there at the keyboard, interactively. But if you find yourself using expressions repeatedly, you can put everything between the single quotes into a file and refer to that using put -f
:
cat dsl-example.mlr
$cost = $quantity * $rate; # Here is how to make a comment $index *= 100
mlr --c2p put -f dsl-example.mlr example.csv
color shape flag k index quantity rate cost yellow triangle true 1 1100 43.6498 9.8870 431.5655726 red square true 2 1500 79.2778 0.0130 1.0306114 red circle true 3 1600 13.8103 2.9010 40.063680299999994 red square false 4 4800 77.5542 7.4670 579.0972113999999 purple triangle false 5 5100 81.2290 8.5910 697.8383389999999 red square false 6 6400 77.1991 9.5310 735.7846221000001 purple triangle false 7 6500 80.1405 5.8240 466.738272 yellow circle true 8 7300 63.9785 4.2370 271.0769045 yellow circle true 9 8700 63.5058 8.3350 529.3208430000001 purple square false 10 9100 72.3735 8.2430 596.5747605000001
This becomes particularly important on Windows. Quite a bit of effort was put into making Miller on Windows be able to handle the kinds of single-quoted expressions we're showing here, but if you get syntax-error messages on Windows using examples in this documentation, you can put the parts between single quotes into a file and refer to that using mlr put -f
-- or, use the triple-double-quote trick as described in the Miller on Windows page.
Out-of-stream variables, begin, and end¶
Above we saw that your expression is executed once per record -- if a file has a million records, your expression will be executed a million times, once for each record. But you can mark statements to only be executed once, either before the record stream begins, or after the record stream is ended. If you know about AWK, you might have noticed that Miller's programming language is loosely inspired by it, including the begin
and end
statements.
Above we also saw that names like $quantity
are bound to each record in turn.
To make begin
and end
statements useful, we need somewhere to put things that persist across the duration of the record stream, and a way to emit them. Miller uses out-of-stream variables (or oosvars for short) whose names start with an @
sigil, along with the emit
keyword to write them into the output record stream:
mlr --c2p --from example.csv put 'begin { @sum = 0 } @sum += $quantity; end {emit @sum}'
color shape flag k index quantity rate yellow triangle true 1 11 43.6498 9.8870 red square true 2 15 79.2778 0.0130 red circle true 3 16 13.8103 2.9010 red square false 4 48 77.5542 7.4670 purple triangle false 5 51 81.2290 8.5910 red square false 6 64 77.1991 9.5310 purple triangle false 7 65 80.1405 5.8240 yellow circle true 8 73 63.9785 4.2370 yellow circle true 9 87 63.5058 8.3350 purple square false 10 91 72.3735 8.2430 sum 652.7185
If you want the end-block output to be the only output, and not include the records from the input data, you can use mlr put -q
:
mlr --c2p --from example.csv put -q 'begin { @sum = 0 } @sum += $quantity; end {emit @sum}'
sum 652.7185
mlr --c2j --from example.csv put -q 'begin { @sum = 0 } @sum += $quantity; end {emit @sum}'
[ { "sum": 652.7185 } ]
mlr --c2j --from example.csv put -q ' begin { @count = 0; @sum = 0 } @count += 1; @sum += $quantity; end {emit (@count, @sum)} '
[ { "count": 10, "sum": 652.7185 } ]
We'll see in the documentation for stats1 that there's a lower-keystroking way to get counts and sums of things:
mlr --c2j --from example.csv stats1 -a sum,count -f quantity
[ { "quantity_sum": 652.7185, "quantity_count": 10 } ]
So, take this sum/count example as an indication of the kinds of things you can do using Miller's programming language.
Context variables¶
Also inspired by AWK, the Miller DSL has the following special context variables:
FILENAME
-- the filename the current record came from. Especially useful in things likemlr ... *.csv
.FILENUM
-- similarly, but integer 1,2,3,... rather than filename.NF
-- the number of fields in the current record. Note that if you assign$newcolumn = some value
thenNF
will increment.NR
-- starting from 1, counter of how many records processed so far.FNR
-- similar, but resets to 1 at the start of each file.
cat context-example.mlr
$nf = NF; $nr = NR; $fnr = FNR; $filename = FILENAME; $filenum = FILENUM; $newnf = NF;
cat data/a.csv
a,b,c 1,2,3 4,5,6
cat data/b.csv
a,b,c 7,8,9
mlr --c2p put -f context-example.mlr data/a.csv data/b.csv
a b c nf nr fnr filename filenum newnf 1 2 3 3 1 1 data/a.csv 1 8 4 5 6 3 2 2 data/a.csv 1 8 7 8 9 3 3 1 data/b.csv 2 8
Functions and local variables¶
You can define your own functions:
cat factorial-example.mlr
func factorial(n) { if (n <= 1) { return n } else { return n * factorial(n-1) } }
mlr --c2p --from example.csv put -f factorial-example.mlr -e '$fact = factorial(NR)'
color shape flag k index quantity rate fact yellow triangle true 1 11 43.6498 9.8870 1 red square true 2 15 79.2778 0.0130 2 red circle true 3 16 13.8103 2.9010 6 red square false 4 48 77.5542 7.4670 24 purple triangle false 5 51 81.2290 8.5910 120 red square false 6 64 77.1991 9.5310 720 purple triangle false 7 65 80.1405 5.8240 5040 yellow circle true 8 73 63.9785 4.2370 40320 yellow circle true 9 87 63.5058 8.3350 362880 purple square false 10 91 72.3735 8.2430 3628800
Note that here we used the -f
flag to put
to load our function
definition, and also the -e
flag to add another statement on the command
line. (We could have also put $fact = factorial(NR)
inside
factorial-example.mlr
but that would have made that file less flexible for our
future use.)
If-statements, loops, and local variables¶
Suppose you want to only compute sums conditionally -- you can use an if
statement:
cat if-example.mlr
begin { @count_of_red = 0; @sum_of_red = 0 } if ($color == "red") { @count_of_red += 1; @sum_of_red += $quantity; } end { emit (@count_of_red, @sum_of_red) }
mlr --c2p --from example.csv put -q -f if-example.mlr
count_of_red sum_of_red 4 247.84139999999996
Miller's else-if is spelled elif
.
As we'll see more of in the control-structures reference
page, Miller has a few kinds of
for-loops. In addition to the usual 3-part for (i = 0; i < 10; i += 1)
kind
that many programming languages have, Miller also lets you loop over
maps and arrays. We
haven't encountered maps and arrays yet in this introduction, but for now it
suffices to know that $*
is a special variable holding the current record as
a map:
cat for-example.mlr
for (k, v in $*) { print "KEY IS ". k . " VALUE IS ". v; } print
mlr --csv cat data/a.csv
a,b,c 1,2,3 4,5,6
mlr --csv --from data/a.csv put -qf for-example.mlr
KEY IS a VALUE IS 1 KEY IS b VALUE IS 2 KEY IS c VALUE IS 3 KEY IS a VALUE IS 4 KEY IS b VALUE IS 5 KEY IS c VALUE IS 6
Here we used the local variables k
and v
. Now we've seen four kinds of variables:
- Record fields like
$shape
- Out-of-stream variables like
@sum
- Local variables like
k
- Built-in context variables like
NF
andNR
If you're curious about scope and extent of local variables, you can read more in the section on variables.
Arithmetic¶
Numbers in Miller's programming language are intended to operate with the principle of least surprise:
- Internally, numbers are either 64-bit signed integers or double-precision floating-point.
- Sums, differences, and products of integers are also integers (so
2*3=6
not6.0
) -- unless the result of the operation would overflow a 64-bit signed integer in which case the result is automatically converted to float. (If you ever want integer-to-integer arithmetic, usex .+ y
,x .* y
, etc.) - Quotients of integers are integers if the division is exact, else floating-point: so
6/2=3
but7/2=3.5
.
You can read more about this in the arithmetic reference.
Absent data¶
In addition to types including string, number (int/float), maps, and arrays, Miller variables can also be absent. This is when a variable never had a value assigned to it. Miller's treatment of absent data is intended to make it easy for you to handle non-homogeneous data. We'll see more in the null-data reference but the basic idea is:
- Adding a number to absent gives the number back. This means you don't have to put
@sum = 0
in yourbegin
blocks. - Any variable which has the absent value is not assigned. This means you don't have to check presence of things from one record to the next.
For example, you can sum up all the $a
values across records without having to check whether they're present or not:
mlr --json cat absent-example.json
[ { "a": 1, "b": 2 }, { "c": 3 }, { "a": 4, "b": 5 } ]
mlr --json put '@sum_of_a += $a; end {emit @sum_of_a}' absent-example.json
[ { "a": 1, "b": 2 }, { "c": 3 }, { "a": 4, "b": 5 }, { "sum_of_a": 5 } ]