Miller in 10 minutes¶
Miller verbs¶
Let's take a quick look at some of the most useful Miller verbs -- file-format-aware, name-index-empowered equivalents of standard system commands.
For most of this section we'll use our example.csv.
mlr cat
is like system cat
(or type
on Windows) -- it passes the data through unmodified:
mlr --csv cat example.csv
color,shape,flag,k,index,quantity,rate yellow,triangle,true,1,11,43.6498,9.8870 red,square,true,2,15,79.2778,0.0130 red,circle,true,3,16,13.8103,2.9010 red,square,false,4,48,77.5542,7.4670 purple,triangle,false,5,51,81.2290,8.5910 red,square,false,6,64,77.1991,9.5310 purple,triangle,false,7,65,80.1405,5.8240 yellow,circle,true,8,73,63.9785,4.2370 yellow,circle,true,9,87,63.5058,8.3350 purple,square,false,10,91,72.3735,8.2430
But mlr cat
can also do format conversion -- for example, you can pretty-print in tabular format:
mlr --icsv --opprint cat example.csv
color shape flag k index quantity rate yellow triangle true 1 11 43.6498 9.8870 red square true 2 15 79.2778 0.0130 red circle true 3 16 13.8103 2.9010 red square false 4 48 77.5542 7.4670 purple triangle false 5 51 81.2290 8.5910 red square false 6 64 77.1991 9.5310 purple triangle false 7 65 80.1405 5.8240 yellow circle true 8 73 63.9785 4.2370 yellow circle true 9 87 63.5058 8.3350 purple square false 10 91 72.3735 8.2430
mlr head
and mlr tail
count records rather than lines. Whether you're getting the first few records or the last few, the CSV header is included either way:
mlr --csv head -n 4 example.csv
color,shape,flag,k,index,quantity,rate yellow,triangle,true,1,11,43.6498,9.8870 red,square,true,2,15,79.2778,0.0130 red,circle,true,3,16,13.8103,2.9010 red,square,false,4,48,77.5542,7.4670
mlr --csv tail -n 4 example.csv
color,shape,flag,k,index,quantity,rate purple,triangle,false,7,65,80.1405,5.8240 yellow,circle,true,8,73,63.9785,4.2370 yellow,circle,true,9,87,63.5058,8.3350 purple,square,false,10,91,72.3735,8.2430
mlr --icsv --ojson tail -n 2 example.csv
[ { "color": "yellow", "shape": "circle", "flag": "true", "k": 9, "index": 87, "quantity": 63.5058, "rate": 8.3350 }, { "color": "purple", "shape": "square", "flag": "false", "k": 10, "index": 91, "quantity": 72.3735, "rate": 8.2430 } ]
You can sort on a single field:
mlr --icsv --opprint sort -f shape example.csv
color shape flag k index quantity rate red circle true 3 16 13.8103 2.9010 yellow circle true 8 73 63.9785 4.2370 yellow circle true 9 87 63.5058 8.3350 red square true 2 15 79.2778 0.0130 red square false 4 48 77.5542 7.4670 red square false 6 64 77.1991 9.5310 purple square false 10 91 72.3735 8.2430 yellow triangle true 1 11 43.6498 9.8870 purple triangle false 5 51 81.2290 8.5910 purple triangle false 7 65 80.1405 5.8240
Or, you can sort primarily alphabetically on one field, then secondarily numerically descending on another field, and so on:
mlr --icsv --opprint sort -f shape -nr index example.csv
color shape flag k index quantity rate yellow circle true 9 87 63.5058 8.3350 yellow circle true 8 73 63.9785 4.2370 red circle true 3 16 13.8103 2.9010 purple square false 10 91 72.3735 8.2430 red square false 6 64 77.1991 9.5310 red square false 4 48 77.5542 7.4670 red square true 2 15 79.2778 0.0130 purple triangle false 7 65 80.1405 5.8240 purple triangle false 5 51 81.2290 8.5910 yellow triangle true 1 11 43.6498 9.8870
If there are fields you don't want to see in your data, you can use cut
to keep only the ones you want, in the same order they appeared in the input data:
mlr --icsv --opprint cut -f flag,shape example.csv
shape flag triangle true square true circle true square false triangle false square false triangle false circle true circle true square false
You can also use cut -o
to keep specified fields, but in your preferred order:
mlr --icsv --opprint cut -o -f flag,shape example.csv
flag shape true triangle true square true circle false square false triangle false square false triangle true circle true circle false square
You can use cut -x
to omit fields you don't care about:
mlr --icsv --opprint cut -x -f flag,shape example.csv
color k index quantity rate yellow 1 11 43.6498 9.8870 red 2 15 79.2778 0.0130 red 3 16 13.8103 2.9010 red 4 48 77.5542 7.4670 purple 5 51 81.2290 8.5910 red 6 64 77.1991 9.5310 purple 7 65 80.1405 5.8240 yellow 8 73 63.9785 4.2370 yellow 9 87 63.5058 8.3350 purple 10 91 72.3735 8.2430
Even though Miller's main selling point is name-indexing, sometimes you really want to refer to a field name by its positional index. Use $[[3]]
to access the name of field 3 or $[[[3]]]
to access the value of field 3:
mlr --icsv --opprint put '$[[3]] = "NEW"' example.csv
color shape NEW k index quantity rate yellow triangle true 1 11 43.6498 9.8870 red square true 2 15 79.2778 0.0130 red circle true 3 16 13.8103 2.9010 red square false 4 48 77.5542 7.4670 purple triangle false 5 51 81.2290 8.5910 red square false 6 64 77.1991 9.5310 purple triangle false 7 65 80.1405 5.8240 yellow circle true 8 73 63.9785 4.2370 yellow circle true 9 87 63.5058 8.3350 purple square false 10 91 72.3735 8.2430
mlr --icsv --opprint put '$[[[3]]] = "NEW"' example.csv
color shape flag k index quantity rate yellow triangle NEW 1 11 43.6498 9.8870 red square NEW 2 15 79.2778 0.0130 red circle NEW 3 16 13.8103 2.9010 red square NEW 4 48 77.5542 7.4670 purple triangle NEW 5 51 81.2290 8.5910 red square NEW 6 64 77.1991 9.5310 purple triangle NEW 7 65 80.1405 5.8240 yellow circle NEW 8 73 63.9785 4.2370 yellow circle NEW 9 87 63.5058 8.3350 purple square NEW 10 91 72.3735 8.2430
You can find the full list of verbs at the Verbs Reference page.
Filtering¶
You can use filter
to keep only records you care about:
mlr --icsv --opprint filter '$color == "red"' example.csv
color shape flag k index quantity rate red square true 2 15 79.2778 0.0130 red circle true 3 16 13.8103 2.9010 red square false 4 48 77.5542 7.4670 red square false 6 64 77.1991 9.5310
mlr --icsv --opprint filter '$color == "red" && $flag == "true"' example.csv
color shape flag k index quantity rate red square true 2 15 79.2778 0.0130 red circle true 3 16 13.8103 2.9010
Computing new fields¶
You can use put
to create new fields which are computed from other fields:
mlr --icsv --opprint put ' $ratio = $quantity / $rate; $color_shape = $color . "_" . $shape ' example.csv
color shape flag k index quantity rate ratio color_shape yellow triangle true 1 11 43.6498 9.8870 4.414868008496004 yellow_triangle red square true 2 15 79.2778 0.0130 6098.292307692308 red_square red circle true 3 16 13.8103 2.9010 4.760530851430541 red_circle red square false 4 48 77.5542 7.4670 10.386259541984733 red_square purple triangle false 5 51 81.2290 8.5910 9.455127458968688 purple_triangle red square false 6 64 77.1991 9.5310 8.099790158430384 red_square purple triangle false 7 65 80.1405 5.8240 13.760388049450551 purple_triangle yellow circle true 8 73 63.9785 4.2370 15.09995279679018 yellow_circle yellow circle true 9 87 63.5058 8.3350 7.619172165566886 yellow_circle purple square false 10 91 72.3735 8.2430 8.779995147397793 purple_square
When you create a new field, it can immediately be used in subsequent statements:
mlr --icsv --opprint --from example.csv put ' $y = $index + 1; $z = $y**2 + $k; '
color shape flag k index quantity rate y z yellow triangle true 1 11 43.6498 9.8870 12 145 red square true 2 15 79.2778 0.0130 16 258 red circle true 3 16 13.8103 2.9010 17 292 red square false 4 48 77.5542 7.4670 49 2405 purple triangle false 5 51 81.2290 8.5910 52 2709 red square false 6 64 77.1991 9.5310 65 4231 purple triangle false 7 65 80.1405 5.8240 66 4363 yellow circle true 8 73 63.9785 4.2370 74 5484 yellow circle true 9 87 63.5058 8.3350 88 7753 purple square false 10 91 72.3735 8.2430 92 8474
For put
and filter
we were able to type out expressions using a programming-language syntax.
See the Miller programming language page for more information.
Handling field names with spaces¶
Sometimes our data has field names with spaces, dots, etc in them.
For verbs such as sort
, cut
, etc., no special syntax is necessary:
mlr --csv cat spaces.csv
Zone,Total MWh 14,27.2 17,39.8 24,7.4 30,50.5
mlr --c2p sort -nr 'Total MWh' spaces.csv
Zone Total MWh 30 50.5 17 39.8 14 27.2 24 7.4
For put
and filter
expressions, use ${...}
:
mlr --c2p put '${Total KWh} = ${Total MWh} * 1000' spaces.csv
Zone Total MWh Total KWh 14 27.2 27200 17 39.8 39800 24 7.4 7400 30 50.5 50500
See also the section on field names.
Multiple input files¶
Miller takes all the files from the command line as an input stream. But it's format-aware, so it doesn't repeat CSV header lines. For example, with input files data/a.csv and data/b.csv, the system cat
command will repeat header lines:
cat data/a.csv
a,b,c 1,2,3 4,5,6
cat data/b.csv
a,b,c 7,8,9
cat data/a.csv data/b.csv
a,b,c 1,2,3 4,5,6 a,b,c 7,8,9
However, mlr cat
will not:
mlr --csv cat data/a.csv data/b.csv
a,b,c 1,2,3 4,5,6 7,8,9
Chaining verbs together¶
Often we want to chain queries together -- for example, sorting by a field and taking the top few values. We can do this using pipes:
mlr --csv sort -nr index example.csv | mlr --icsv --opprint head -n 3
color shape flag k index quantity rate purple square false 10 91 72.3735 8.2430 yellow circle true 9 87 63.5058 8.3350 yellow circle true 8 73 63.9785 4.2370
This works fine -- but Miller also lets you chain verbs together using the word then
. Think of this as a Miller-internal pipe that lets you use fewer keystrokes:
mlr --icsv --opprint sort -nr index then head -n 3 example.csv
color shape flag k index quantity rate purple square false 10 91 72.3735 8.2430 yellow circle true 9 87 63.5058 8.3350 yellow circle true 8 73 63.9785 4.2370
As another convenience, you can put the filename first using --from
. When you're interacting with your data at the command line, this makes it easier to up-arrow and append to the previous command:
mlr --icsv --opprint --from example.csv sort -nr index then head -n 3
color shape flag k index quantity rate purple square false 10 91 72.3735 8.2430 yellow circle true 9 87 63.5058 8.3350 yellow circle true 8 73 63.9785 4.2370
mlr --icsv --opprint --from example.csv \ sort -nr index \ then head -n 3 \ then cut -f shape,quantity
shape quantity square 72.3735 circle 63.5058 circle 63.9785
Sorts and stats¶
Now suppose you want to sort the data on a given column, and then take the top few in that ordering. You can use Miller's then
feature to pipe commands together.
Here are the records with the top three index
values:
mlr --icsv --opprint sort -nr index then head -n 3 example.csv
color shape flag k index quantity rate purple square false 10 91 72.3735 8.2430 yellow circle true 9 87 63.5058 8.3350 yellow circle true 8 73 63.9785 4.2370
Lots of Miller commands take a -g
option for group-by: here, head -n 1 -g shape
outputs the first record for each distinct value of the shape
field. This means we're finding the record with highest index
field for each distinct shape
field:
mlr --icsv --opprint sort -f shape -nr index then head -n 1 -g shape example.csv
color shape flag k index quantity rate yellow circle true 9 87 63.5058 8.3350 purple square false 10 91 72.3735 8.2430 purple triangle false 7 65 80.1405 5.8240
Statistics can be computed with or without group-by field(s):
mlr --icsv --opprint --from example.csv \ stats1 -a count,min,mean,max -f quantity -g shape
shape quantity_count quantity_min quantity_mean quantity_max triangle 3 43.6498 68.33976666666666 81.229 square 4 72.3735 76.60114999999999 79.2778 circle 3 13.8103 47.0982 63.9785
mlr --icsv --opprint --from example.csv \ stats1 -a count,min,mean,max -f quantity -g shape,color
shape color quantity_count quantity_min quantity_mean quantity_max triangle yellow 1 43.6498 43.6498 43.6498 square red 3 77.1991 78.01036666666666 79.2778 circle red 1 13.8103 13.8103 13.8103 triangle purple 2 80.1405 80.68475000000001 81.229 circle yellow 2 63.5058 63.742149999999995 63.9785 square purple 1 72.3735 72.3735 72.3735
If your output has a lot of columns, you can use XTAB format to line things up vertically for you instead:
mlr --icsv --oxtab --from example.csv \ stats1 -a p0,p10,p25,p50,p75,p90,p99,p100 -f rate
rate_p0 0.0130 rate_p10 2.9010 rate_p25 4.2370 rate_p50 8.2430 rate_p75 8.5910 rate_p90 9.8870 rate_p99 9.8870 rate_p100 9.8870
Unicode and internationalization¶
While Miller's function names, verb names, online help, etc. are all in English, Miller supports UTF-8 data. For example:
cat παράδειγμα.csv
χρώμα,σχήμα,σημαία,κ,δείκτης,ποσότητα,ρυθμός κίτρινο,τρίγωνο,αληθινό,1,11,43.6498,9.8870 κόκκινο,τετράγωνο,αληθινό,2,15,79.2778,0.0130 κόκκινο,κύκλος,αληθινό,3,16,13.8103,2.9010 κόκκινο,τετράγωνο,ψευδές,4,48,77.5542,7.4670 μοβ,τρίγωνο,ψευδές,5,51,81.2290,8.5910 κόκκινο,τετράγωνο,ψευδές,6,64,77.1991,9.5310 μοβ,τρίγωνο,ψευδές,7,65,80.1405,5.8240 κίτρινο,κύκλος,αληθινό,8,73,63.9785,4.2370 κίτρινο,κύκλος,αληθινό,9,87,63.5058,8.3350 μοβ,τετράγωνο,ψευδές,10,91,72.3735,8.2430
mlr --c2p filter '$σχήμα == "κύκλος"' παράδειγμα.csv
χρώμα σχήμα σημαία κ δείκτης ποσότητα ρυθμός κόκκινο κύκλος αληθινό 3 16 13.8103 2.9010 κίτρινο κύκλος αληθινό 8 73 63.9785 4.2370 κίτρινο κύκλος αληθινό 9 87 63.5058 8.3350
mlr --c2p sort -f σημαία παράδειγμα.csv
χρώμα σχήμα σημαία κ δείκτης ποσότητα ρυθμός κίτρινο τρίγωνο αληθινό 1 11 43.6498 9.8870 κόκκινο τετράγωνο αληθινό 2 15 79.2778 0.0130 κόκκινο κύκλος αληθινό 3 16 13.8103 2.9010 κίτρινο κύκλος αληθινό 8 73 63.9785 4.2370 κίτρινο κύκλος αληθινό 9 87 63.5058 8.3350 κόκκινο τετράγωνο ψευδές 4 48 77.5542 7.4670 μοβ τρίγωνο ψευδές 5 51 81.2290 8.5910 κόκκινο τετράγωνο ψευδές 6 64 77.1991 9.5310 μοβ τρίγωνο ψευδές 7 65 80.1405 5.8240 μοβ τετράγωνο ψευδές 10 91 72.3735 8.2430
mlr --c2p put '$форма = toupper($форма); $длина = strlen($цвет)' пример.csv
цвет форма флаг κ индекс количество скорость длина желтый ТРЕУГОЛЬНИК истина 1 11 43.6498 9.8870 6 красный КВАДРАТ истина 2 15 79.2778 0.0130 7 красный КРУГ истина 3 16 13.8103 2.9010 7 красный КВАДРАТ ложь 4 48 77.5542 7.4670 7 фиолетовый ТРЕУГОЛЬНИК ложь 5 51 81.2290 8.5910 10 красный КВАДРАТ ложь 6 64 77.1991 9.5310 7 фиолетовый ТРЕУГОЛЬНИК ложь 7 65 80.1405 5.8240 10 желтый КРУГ истина 8 73 63.9785 4.2370 6 желтый КРУГ истина 9 87 63.5058 8.3350 6 фиолетовый КВАДРАТ ложь 10 91 72.3735 8.2430 10
File formats and format conversion¶
Miller supports the following formats:
- CSV (comma-separared values)
- TSV (tab-separated values)
- JSON (JavaScript Object Notation)
- JSON Lines (JSON with one record per line, and no outermost
[...]
) - PPRINT (pretty-printed tabular)
- XTAB (vertical-tabular or sideways-tabular)
- NIDX (numerically indexed, label-free, with implicit labels
"1"
,"2"
, etc.) - DKVP (delimited key-value pairs).
What's a CSV file, really? It's an array of rows, or records, each being a list of key-value pairs, or fields: for CSV it so happens that all the keys are shared in the header line and the values vary from one data line to another.
For example, if you have:
shape,flag,index circle,1,24 square,0,36
then that's a way of saying:
shape=circle,flag=1,index=24 shape=square,flag=0,index=36
Other ways to write the same data:
CSV PPRINT shape,flag,index shape flag index circle,1,24 circle 1 24 square,0,36 square 0 36 JSON XTAB [ shape circle { flag 1 "shape": "circle", index 24 "flag": 1, "index": 24 shape square } flag 0 { index 36 "shape": "square", "flag": 0, "index": 36 } ] JSON Lines { "shape": "circle", "flag": 1, "index": 24 } { "shape": "square", "flag": 0, "index": 36 } DKVP shape=circle,flag=1,index=24 shape=square,flag=0,index=36
Anything we can do with CSV input data, we can do with any other format input data. And you can read from one format, do any record-processing, and output to the same format as the input, or to a different output format.
How to specify these to Miller:
- If you use
--csv
or--json
or--pprint
, etc., then Miller will use that format for input and output. - If you use
--icsv
and--ojson
(note the extrai
ando
) then Miller will use CSV for input and JSON for output, etc. See also Keystroke Savers for even shorter options like--c2j
.
You can read more about this at the File Formats page.
If all record values are numbers, strings, etc., then converting back and forth between CSV and JSON is a matter of specifying input-format and output-format flags:
mlr --json cat example.json
[ { "color": "yellow", "shape": "triangle", "flag": "true", "k": 1, "index": 11, "quantity": 43.6498, "rate": 9.8870 }, { "color": "red", "shape": "square", "flag": "true", "k": 2, "index": 15, "quantity": 79.2778, "rate": 0.0130 } ]
mlr --ijson --ocsv cat example.json
color,shape,flag,k,index,quantity,rate yellow,triangle,true,1,11,43.6498,9.8870 red,square,true,2,15,79.2778,0.0130
However, if JSON data has map-valued or array-valued fields, Miller gives you choices on how to convert these to CSV columns. For example, here's some JSON data with map-valued fields:
cat data/server-log.json
{ "hostname": "localhost", "pid": 12345, "req": { "id": 6789, "method": "GET", "path": "api/check", "host": "foo.bar", "headers": { "host": "bar.baz", "user-agent": "browser" } }, "res": { "status_code": 200, "header": { "content-type": "text", "content-encoding": "plain" } } }
We can convert this to CSV, or other tabular formats:
mlr --ijson --ocsv cat data/server-log.json
hostname,pid,req.id,req.method,req.path,req.host,req.headers.host,req.headers.user-agent,res.status_code,res.header.content-type,res.header.content-encoding localhost,12345,6789,GET,api/check,foo.bar,bar.baz,browser,200,text,plain
mlr --ijson --oxtab cat data/server-log.json
hostname localhost pid 12345 req.id 6789 req.method GET req.path api/check req.host foo.bar req.headers.host bar.baz req.headers.user-agent browser res.status_code 200 res.header.content-type text res.header.content-encoding plain
These transformations are reversible:
mlr --ijson --oxtab cat data/server-log.json | mlr --ixtab --ojson cat
[ { "hostname": "localhost", "pid": 12345, "req": { "id": 6789, "method": "GET", "path": "api/check", "host": "foo.bar", "headers": { "host": "bar.baz", "user-agent": "browser" } }, "res": { "status_code": 200, "header": { "content-type": "text", "content-encoding": "plain" } } } ]
See the flatten/unflatten page for more information.
Choices for printing to files¶
Often we want to print output to the screen. Miller does this by default, as we've seen in the previous examples.
Sometimes, though, we want to print output to another file. Just use > outputfilenamegoeshere
at the end of your command:
mlr --icsv --opprint cat example.csv > newfile.csv
# Output goes to the new file; # nothing is printed to the screen.
cat newfile.csv
color shape flag index quantity rate yellow triangle true 11 43.6498 9.8870 red square true 15 79.2778 0.0130 red circle true 16 13.8103 2.9010 red square false 48 77.5542 7.4670 purple triangle false 51 81.2290 8.5910 red square false 64 77.1991 9.5310 purple triangle false 65 80.1405 5.8240 yellow circle true 73 63.9785 4.2370 yellow circle true 87 63.5058 8.3350 purple square false 91 72.3735 8.2430
Other times we just want our files to be changed in-place: just use mlr -I
:
cp example.csv newfile.txt
cat newfile.txt
color,shape,flag,index,quantity,rate yellow,triangle,true,11,43.6498,9.8870 red,square,true,15,79.2778,0.0130 red,circle,true,16,13.8103,2.9010 red,square,false,48,77.5542,7.4670 purple,triangle,false,51,81.2290,8.5910 red,square,false,64,77.1991,9.5310 purple,triangle,false,65,80.1405,5.8240 yellow,circle,true,73,63.9785,4.2370 yellow,circle,true,87,63.5058,8.3350 purple,square,false,91,72.3735,8.2430
mlr -I --csv sort -f shape newfile.txt
cat newfile.txt
color,shape,flag,index,quantity,rate red,circle,true,16,13.8103,2.9010 yellow,circle,true,73,63.9785,4.2370 yellow,circle,true,87,63.5058,8.3350 red,square,true,15,79.2778,0.0130 red,square,false,48,77.5542,7.4670 red,square,false,64,77.1991,9.5310 purple,square,false,91,72.3735,8.2430 yellow,triangle,true,11,43.6498,9.8870 purple,triangle,false,51,81.2290,8.5910 purple,triangle,false,65,80.1405,5.8240
Also using mlr -I
you can bulk-operate on lots of files: e.g.:
mlr -I --csv cut -x -f unwanted_column_name *.csv
If you like, you can first copy off your original data somewhere else, before doing in-place operations.
Lastly, using tee
within put
, you can split your input data into separate files per one or more field names:
mlr --csv --from example.csv put -q 'tee > $shape.".csv", $*'
cat circle.csv
color,shape,flag,k,index,quantity,rate red,circle,true,3,16,13.8103,2.9010 yellow,circle,true,8,73,63.9785,4.2370 yellow,circle,true,9,87,63.5058,8.3350
cat square.csv
color,shape,flag,k,index,quantity,rate red,square,true,2,15,79.2778,0.0130 red,square,false,4,48,77.5542,7.4670 red,square,false,6,64,77.1991,9.5310 purple,square,false,10,91,72.3735,8.2430
cat triangle.csv
color,shape,flag,k,index,quantity,rate yellow,triangle,true,1,11,43.6498,9.8870 purple,triangle,false,5,51,81.2290,8.5910 purple,triangle,false,7,65,80.1405,5.8240
Alternatively, the split
verb can do the same thing:
mlr --csv --from example.csv split -g shape
cat split_circle.csv
color,shape,flag,k,index,quantity,rate red,circle,true,3,16,13.8103,2.9010 yellow,circle,true,8,73,63.9785,4.2370 yellow,circle,true,9,87,63.5058,8.3350
cat split_square.csv
color,shape,flag,k,index,quantity,rate red,square,true,2,15,79.2778,0.0130 red,square,false,4,48,77.5542,7.4670 red,square,false,6,64,77.1991,9.5310 purple,square,false,10,91,72.3735,8.2430
cat split_triangle.csv
color,shape,flag,k,index,quantity,rate yellow,triangle,true,1,11,43.6498,9.8870 purple,triangle,false,5,51,81.2290,8.5910 purple,triangle,false,7,65,80.1405,5.8240