Two-pass algorithms¶
Overview¶
Miller is a streaming record processor: verbs are performed once per record. (See the documentation on Miller's processing model for an introductory discussion.) This makes Miller particularly suitable for single-pass algorithms, allowing many of its verbs to process files that are (much) larger than the amount of RAM present in your system. (Of course, Miller verbs such as sort, tac, etc. must ingest and retain all input records before emitting any output records -- see the page on streaming processing and memory usage.) You can also use out-of-stream variables to perform multi-pass computations, at the price of retaining all input records in memory.
One of Miller's strengths is its compact notation: for example, given input of the form
head -n 5 ./data/medium
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729
you can simply do
mlr --oxtab stats1 -a sum -f x ./data/medium
x_sum 4986.019681679581
or
mlr --opprint stats1 -a sum -f x -g b ./data/medium
b   x_sum
pan 965.7636699425815
wye 1023.5484702619565
zee 979.7420161495838
eks 1016.7728571314786
hat 1000.192668193983
rather than the more tedious
mlr --oxtab put -q '
  @x_sum += $x;
  end { emit @x_sum }
' data/medium
x_sum 4986.019681679581
or
mlr --opprint put -q '
  @x_sum[$b] += $x;
  end { emit @x_sum, "b" }
' data/medium
b   x_sum
pan 965.7636699425815
wye 1023.5484702619565
zee 979.7420161495838
eks 1016.7728571314786
hat 1000.192668193983
The former (mlr stats1 et al.) has the advantages of being easier to type, less error-prone to type, and faster to run.
Nonetheless, out-of-stream variables (which I whimsically call oosvars), begin/end blocks, and emit statements give you the ability to implement logic -- if you wish to do so -- which isn't present in other Miller verbs. (If you find yourself often using the same out-of-stream-variable logic over and over, please file a request at https://github.com/johnkerl/miller/issues to get it implemented directly in Go as a Miller verb of its own.)
The following examples compute some things using oosvars which are already computable using Miller verbs, by way of providing food for thought.
Computation of percentages¶
For example, mapping numeric values down a column to the percentage between their min and max values is two-pass: on the first pass you find the min and max values, then on the second, map each record's value to a percentage.
mlr --from data/small --opprint put -q '
  # These are executed once per record, which is the first pass.
  # The key is to use NR to index an out-of-stream variable to
  # retain all the x-field values.
  @x_min = min($x, @x_min);
  @x_max = max($x, @x_max);
  @x[NR] = $x;

  # The second pass is in a for-loop in an end-block.
  end {
    for (nr, x in @x) {
      @x_pct[nr] = 100 * (x - @x_min) / (@x_max - @x_min);
    }
    emit (@x, @x_pct), "NR"
  }
'
NR x        x_pct
1  0.346791 25.66218352716956
2  0.758679 100
3  0.204603 0
4  0.381399 31.90825807289974
5  0.573288 66.54051068806446
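The same two-pass idea can be sketched outside Miller. Here is a minimal Python equivalent (function name and variable names are ours, not Miller's): one pass to find the min and max, a second to map each value to a percentage between them.

```python
def percentages(xs):
    # First pass: find the extremes (assumes xs is nonempty
    # and not all values are equal, so the span is nonzero).
    lo, hi = min(xs), max(xs)
    span = hi - lo
    # Second pass: map each value to its percentage of the span.
    return [100 * (x - lo) / span for x in xs]

# The x values from data/small, as shown above:
xs = [0.346791, 0.758679, 0.204603, 0.381399, 0.573288]
```

As in the Miller version, the min maps to 0 and the max to 100.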
Line-number ratios¶
Similarly, finding the total record count requires first reading through all the data:
mlr --opprint --from data/small put -q '
  @records[NR] = $*;
  end {
    for((Istring,k),v in @records) {
      int I = int(Istring);
      @records[I]["I"] = I;
      @records[I]["N"] = NR;
      @records[I]["PCT"] = 100*I/NR
    }
    emit @records,"I"
  }
' then reorder -f I,N,PCT
I N PCT a   b   i x        y
1 5 20  pan pan 1 0.346791 0.726802
2 5 40  eks pan 2 0.758679 0.522151
3 5 60  wye wye 3 0.204603 0.338318
4 5 80  eks wye 4 0.381399 0.134188
5 5 100 wye pan 5 0.573288 0.863624
Records having max value¶
The idea is to retain records having the largest value of n in the following data:
mlr --itsv --opprint cat data/maxrows.tsv
a      b      n score
purple red    5 0.743231
blue   purple 2 0.093710
red    purple 2 0.802103
purple red    5 0.389055
red    purple 2 0.880457
orange red    2 0.540349
purple purple 1 0.634451
orange purple 5 0.257223
orange purple 5 0.693499
red    red    4 0.981355
blue   purple 5 0.157052
purple purple 1 0.441784
red    purple 1 0.124912
orange blue   1 0.921944
blue   purple 4 0.490909
purple red    5 0.454779
green  purple 4 0.198278
orange blue   5 0.705700
red    red    3 0.940705
purple red    5 0.072936
orange blue   3 0.389463
orange purple 2 0.664985
blue   purple 1 0.371813
red    purple 4 0.984571
green  purple 5 0.203577
green  purple 3 0.900873
purple purple 0 0.965677
blue   purple 2 0.208785
purple purple 1 0.455077
red    purple 4 0.477187
blue   red    4 0.007487
Of course, the largest value of n isn't known until all the data have been read. Using an out-of-stream variable we can retain all records as they are read, then filter them at the end:
cat data/maxrows.mlr
# Retain all records
@records[NR] = $*;
# Track max value of n
@maxn = max(@maxn, $n);

# After all records have been read, loop through retained records
# and print those with the max n value.
end {
  for (nr in @records) {
    map record = @records[nr];
    if (record["n"] == @maxn) {
      emit record;
    }
  }
}
mlr --itsv --opprint put -q -f data/maxrows.mlr data/maxrows.tsv
a      b      n score
purple red    5 0.743231
purple red    5 0.389055
orange purple 5 0.257223
orange purple 5 0.693499
blue   purple 5 0.157052
purple red    5 0.454779
orange blue   5 0.705700
purple red    5 0.072936
green  purple 5 0.203577
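The retain-and-filter pattern is easy to express in any language. Here is a minimal Python sketch of the same approach (the function name and `key` parameter are illustrative, not part of Miller): buffer every record, find the maximum of n, then keep only the records that attain it.

```python
def max_rows(records, key="n"):
    # The max isn't known until all records have been seen,
    # so all records must be buffered first.
    maxn = max(r[key] for r in records)
    return [r for r in records if r[key] == maxn]
```

Unlike a streaming verb, this necessarily holds the whole input in memory.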
Feature-counting¶
Suppose you have some heterogeneous data like this:
{ "qoh": 29874, "rate": 1.68, "latency": 0.02 }
{ "name": "alice", "uid": 572 }
{ "qoh": 1227, "rate": 1.01, "latency": 0.07 }
{ "qoh": 13458, "rate": 1.72, "latency": 0.04 }
{ "qoh": 56782, "rate": 1.64 }
{ "qoh": 23512, "rate": 1.71, "latency": 0.03 }
{ "qoh": 9876, "rate": 1.89, "latency": 0.08 }
{ "name": "bill", "uid": 684 }
{ "name": "chuck", "uid2": 908 }
{ "name": "dottie", "uid": 440 }
{ "qoh": 0, "rate": 0.40, "latency": 0.01 }
{ "qoh": 5438, "rate": 1.56, "latency": 0.17 }
A reasonable question to ask is, how many occurrences of each field are there? And, what percentage of total row count has each of them? Since the denominator of the percentage is not known until the end, this is a two-pass algorithm:
for (key in $*) {
  @key_counts[key] += 1;
}
@record_count += 1;

end {
  for (key in @key_counts) {
    @key_fraction[key] = @key_counts[key] / @record_count
  }
  emit @record_count;
  emit @key_counts, "key";
  emit @key_fraction, "key"
}
Then
mlr --json put -q -f data/feature-count.mlr data/features.json
[
  { "record_count": 12 },
  { "key": "qoh", "key_counts": 8 },
  { "key": "rate", "key_counts": 8 },
  { "key": "latency", "key_counts": 7 },
  { "key": "name", "key_counts": 4 },
  { "key": "uid", "key_counts": 3 },
  { "key": "uid2", "key_counts": 1 },
  { "key": "qoh", "key_fraction": 0.6666666666666666 },
  { "key": "rate", "key_fraction": 0.6666666666666666 },
  { "key": "latency", "key_fraction": 0.5833333333333334 },
  { "key": "name", "key_fraction": 0.3333333333333333 },
  { "key": "uid", "key_fraction": 0.25 },
  { "key": "uid2", "key_fraction": 0.08333333333333333 }
]
mlr --ijson --opprint put -q -f data/feature-count.mlr data/features.json
record_count
12

key     key_counts
qoh     8
rate    8
latency 7
name    4
uid     3
uid2    1

key     key_fraction
qoh     0.6666666666666666
rate    0.6666666666666666
latency 0.5833333333333334
name    0.3333333333333333
uid     0.25
uid2    0.08333333333333333
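For comparison, the same count-then-divide logic can be sketched in a few lines of Python (names are ours, not Miller's): count key occurrences on the first pass, then divide by the record count at the end.

```python
from collections import Counter

def feature_counts(records):
    # First pass: count how many records contain each key.
    counts = Counter(k for r in records for k in r)
    n = len(records)
    # "End block": the denominator is only known after all records are read.
    fractions = {k: c / n for k, c in counts.items()}
    return counts, fractions
```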
Unsparsing¶
The previous section discussed how to fill in missing data fields within CSV having a full header line -- there, the list of all field names is present in the header. Next, let's look at a related problem: we have data where each record has various key names, but we want to produce rectangular output having the union of all key names.
There is a keystroke-saving verb for this: unsparsify. Here, we look at how to implement that in the DSL.
For example, suppose you have JSON input like this:
cat data/sparse.json
{"a":1,"b":2,"v":3}
{"u":1,"b":2}
{"a":1,"v":2,"x":3}
{"v":1,"w":2}
There are field names a, b, v, u, x, w in the data -- but not all in every record. Since we don't know the names of all the keys until we've read them all, this needs to be a two-pass algorithm. On the first pass, remember all the unique key names and all the records; on the second pass, loop through the records filling in absent values, then produce output. Use put -q since we don't want per-record output, only emitting output in the end block:
cat data/unsparsify.mlr
# First pass:
# Remember all unique key names:
for (k in $*) {
  @all_keys[k] = 1;
}
# Remember all input records:
@records[NR] = $*;

# Second pass:
end {
  for (nr in @records) {
    # Get the sparsely keyed input record:
    irecord = @records[nr];
    # Fill in missing keys with empty string:
    map orecord = {};
    for (k in @all_keys) {
      if (haskey(irecord, k)) {
        orecord[k] = irecord[k];
      } else {
        orecord[k] = "";
      }
    }
    # Produce the output:
    emit orecord;
  }
}
mlr --json put -q -f data/unsparsify.mlr data/sparse.json
[
  { "a": 1, "b": 2, "v": 3, "u": "", "x": "", "w": "" },
  { "a": "", "b": 2, "v": "", "u": 1, "x": "", "w": "" },
  { "a": 1, "b": "", "v": 2, "u": "", "x": 3, "w": "" },
  { "a": "", "b": "", "v": 1, "u": "", "x": "", "w": 2 }
]
mlr --ijson --ocsv put -q -f data/unsparsify.mlr data/sparse.json
a,b,v,u,x,w
1,2,3,,,
,2,,1,,
1,,2,,3,
,,1,,,2
mlr --ijson --opprint put -q -f data/unsparsify.mlr data/sparse.json
a b v u x w
1 2 3 - - -
- 2 - 1 - -
1 - 2 - 3 -
- - 1 - - 2
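A minimal Python sketch of the same unsparsify idea (function name and `fill` parameter are ours): collect the union of keys on a first pass, then rewrite each record against that union on a second pass.

```python
def unsparsify(records, fill=""):
    # First pass: union of all keys, in first-seen order
    # (dicts preserve insertion order in Python 3.7+).
    all_keys = {}
    for r in records:
        for k in r:
            all_keys[k] = True
    # Second pass: emit rectangular records, filling absent keys.
    return [{k: r.get(k, fill) for k in all_keys} for r in records]
```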
Mean without/with oosvars¶
mlr --opprint stats1 -a mean -f x data/medium
x_mean 0.49860196816795804
mlr --opprint put -q '
  @x_sum += $x;
  @x_count += 1;
  end {
    @x_mean = @x_sum / @x_count;
    emit @x_mean
  }
' data/medium
x_mean 0.49860196816795804
Keyed mean without/with oosvars¶
mlr --opprint stats1 -a mean -f x -g a,b data/medium
a   b   x_mean
pan pan 0.5133141190437597
eks pan 0.48507555383425127
wye wye 0.49150092785839306
eks wye 0.4838950517724162
wye pan 0.4996119901034838
zee pan 0.5198298297816007
eks zee 0.49546320772681596
zee wye 0.5142667998230479
hat wye 0.49381326184632596
pan wye 0.5023618498923658
zee eks 0.4883932942792647
hat zee 0.5099985721987774
hat eks 0.48587864619953547
wye hat 0.4977304763723314
pan eks 0.5036718595143479
eks eks 0.5227992666570941
hat hat 0.47993053101017374
hat pan 0.4643355557376876
zee zee 0.5127559183726382
pan hat 0.492140950155604
pan zee 0.4966041598627583
zee hat 0.46772617655014515
wye zee 0.5059066170573692
eks hat 0.5006790659966355
wye eks 0.5306035254809106
mlr --opprint put -q '
  @x_sum[$a][$b] += $x;
  @x_count[$a][$b] += 1;
  end{
    for ((a, b), v in @x_sum) {
      @x_mean[a][b] = @x_sum[a][b] / @x_count[a][b];
    }
    emit @x_mean, "a", "b"
  }
' data/medium
a   b   x_mean
pan pan 0.5133141190437597
pan wye 0.5023618498923658
pan eks 0.5036718595143479
pan hat 0.492140950155604
pan zee 0.4966041598627583
eks pan 0.48507555383425127
eks wye 0.4838950517724162
eks zee 0.49546320772681596
eks eks 0.5227992666570941
eks hat 0.5006790659966355
wye wye 0.49150092785839306
wye pan 0.4996119901034838
wye hat 0.4977304763723314
wye zee 0.5059066170573692
wye eks 0.5306035254809106
zee pan 0.5198298297816007
zee wye 0.5142667998230479
zee eks 0.4883932942792647
zee zee 0.5127559183726382
zee hat 0.46772617655014515
hat wye 0.49381326184632596
hat zee 0.5099985721987774
hat eks 0.48587864619953547
hat hat 0.47993053101017374
hat pan 0.4643355557376876
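The keyed accumulate-then-divide pattern, sketched in Python (names are illustrative, not Miller's): sums and counts are keyed by the group tuple, and the division happens only in the "end" step.

```python
def keyed_mean(records, groups=("a", "b"), field="x"):
    sums, counts = {}, {}
    # Per-record pass: accumulate keyed sums and counts.
    for r in records:
        g = tuple(r[k] for k in groups)
        sums[g] = sums.get(g, 0.0) + r[field]
        counts[g] = counts.get(g, 0) + 1
    # End step: divide each sum by its count.
    return {g: sums[g] / counts[g] for g in sums}
```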
Variance and standard deviation without/with oosvars¶
mlr --oxtab stats1 -a count,sum,mean,var,stddev -f x data/medium
x_count  10000
x_sum    4986.019681679581
x_mean   0.49860196816795804
x_var    0.08426974433144457
x_stddev 0.29029251511440074
cat variance.mlr
@n += 1;
@sumx += $x;
@sumx2 += $x**2;

end {
  @mean = @sumx / @n;
  @var = (@sumx2 - @mean * (2 * @sumx - @n * @mean)) / (@n - 1);
  @stddev = sqrt(@var);
  emitf @n, @sumx, @sumx2, @mean, @var, @stddev
}
mlr --oxtab put -q -f variance.mlr data/medium
n      10000
sumx   4986.019681679581
sumx2  3328.652400179729
mean   0.49860196816795804
var    0.08426974433144456
stddev 0.2902925151144007
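The variance formula above expands to the usual sample variance: (sumx2 - 2·mean·sumx + n·mean²)/(n-1). A minimal Python sketch of the same single-pass accumulation (function name is ours):

```python
import math

def one_pass_variance(xs):
    # Accumulate count, sum, and sum of squares in one pass.
    n = sumx = sumx2 = 0
    for x in xs:
        n += 1
        sumx += x
        sumx2 += x * x
    # End step: same algebra as the Miller DSL version;
    # (sumx2 - mean*(2*sumx - n*mean)) / (n-1) is the sample variance.
    mean = sumx / n
    var = (sumx2 - mean * (2 * sumx - n * mean)) / (n - 1)
    return mean, var, math.sqrt(var)
```

Note that this catastrophic-cancellation-prone form is fine for well-scaled data; Welford's algorithm is the numerically safer alternative.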
You can also do this keyed, of course, imitating the keyed-mean example above.
Min/max without/with oosvars¶
mlr --oxtab stats1 -a min,max -f x data/medium
x_min 0.00004509679127584487
x_max 0.999952670371898
mlr --oxtab put -q '
  @x_min = min(@x_min, $x);
  @x_max = max(@x_max, $x);
  end{emitf @x_min, @x_max}
' data/medium
x_min 0.00004509679127584487
x_max 0.999952670371898
Keyed min/max without/with oosvars¶
mlr --opprint stats1 -a min,max -f x -g a data/medium
a   x_min                  x_max
pan 0.00020390740306253097 0.9994029107062516
eks 0.0006917972627396018  0.9988110946859143
wye 0.0001874794831505655  0.9998228522652893
zee 0.0005486114815762555  0.9994904324789629
hat 0.00004509679127584487 0.999952670371898
mlr --opprint --from data/medium put -q '
  @min[$a] = min(@min[$a], $x);
  @max[$a] = max(@max[$a], $x);
  end{
    emit (@min, @max), "a";
  }
'
a   min                    max
pan 0.00020390740306253097 0.9994029107062516
eks 0.0006917972627396018  0.9988110946859143
wye 0.0001874794831505655  0.9998228522652893
zee 0.0005486114815762555  0.9994904324789629
hat 0.00004509679127584487 0.999952670371898
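The keyed running min/max is a single pass with one accumulator slot per group. A Python sketch of the same idea (names are illustrative):

```python
def keyed_min_max(records, group="a", field="x"):
    mins, maxs = {}, {}
    for r in records:
        g, x = r[group], r[field]
        # First sighting of a group seeds both accumulators.
        mins[g] = min(mins.get(g, x), x)
        maxs[g] = max(maxs.get(g, x), x)
    return mins, maxs
```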
Delta without/with oosvars¶
mlr --opprint step -a delta -f x data/small
a   b   i x        y        x_delta
pan pan 1 0.346791 0.726802 0
eks pan 2 0.758679 0.522151 0.411888
wye wye 3 0.204603 0.338318 -0.554076
eks wye 4 0.381399 0.134188 0.17679599999999998
wye pan 5 0.573288 0.863624 0.19188900000000003
mlr --opprint put '
  $x_delta = is_present(@last) ? $x - @last : 0;
  @last = $x
' data/small
a   b   i x        y        x_delta
pan pan 1 0.346791 0.726802 0
eks pan 2 0.758679 0.522151 0.411888
wye wye 3 0.204603 0.338318 -0.554076
eks wye 4 0.381399 0.134188 0.17679599999999998
wye pan 5 0.573288 0.863624 0.19188900000000003
Keyed delta without/with oosvars¶
mlr --opprint step -a delta -f x -g a data/small
a   b   i x        y        x_delta
pan pan 1 0.346791 0.726802 0
eks pan 2 0.758679 0.522151 0
wye wye 3 0.204603 0.338318 0
eks wye 4 0.381399 0.134188 -0.37728
wye pan 5 0.573288 0.863624 0.36868500000000004
mlr --opprint put '
  $x_delta = is_present(@last[$a]) ? $x - @last[$a] : 0;
  @last[$a] = $x
' data/small
a   b   i x        y        x_delta
pan pan 1 0.346791 0.726802 0
eks pan 2 0.758679 0.522151 0
wye wye 3 0.204603 0.338318 0
eks wye 4 0.381399 0.134188 -0.37728
wye pan 5 0.573288 0.863624 0.36868500000000004
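The keyed delta keeps one "last value" slot per group; the first record in each group gets a delta of 0, exactly as the is_present check does above. A Python sketch (names are ours):

```python
def keyed_delta(records, group="a", field="x"):
    last, out = {}, []
    for r in records:
        g, x = r[group], r[field]
        # First record in a group has no predecessor: delta is 0.
        out.append(x - last[g] if g in last else 0)
        last[g] = x
    return out
```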
Exponentially weighted moving averages without/with oosvars¶
mlr --opprint step -a ewma -d 0.1 -f x data/small
a   b   i x        y        x_ewma_0.1
pan pan 1 0.346791 0.726802 0.346791
eks pan 2 0.758679 0.522151 0.3879798
wye wye 3 0.204603 0.338318 0.36964211999999996
eks wye 4 0.381399 0.134188 0.37081780799999997
wye pan 5 0.573288 0.863624 0.3910648272
mlr --opprint put '
  begin{ @a=0.1 };
  $e = NR==1 ? $x : @a * $x + (1 - @a) * @e;
  @e = $e
' data/small
a   b   i x        y        e
pan pan 1 0.346791 0.726802 0.346791
eks pan 2 0.758679 0.522151 0.3879798
wye wye 3 0.204603 0.338318 0.36964211999999996
eks wye 4 0.381399 0.134188 0.37081780799999997
wye pan 5 0.573288 0.863624 0.3910648272
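The EWMA recurrence e[i] = alpha*x[i] + (1-alpha)*e[i-1], seeded with e[0] = x[0], is a strictly single-pass computation. A Python sketch (the function name is ours):

```python
def ewma(xs, alpha):
    out = []
    for i, x in enumerate(xs):
        # Seed with the first value, then blend each new value
        # against the previous smoothed value.
        e = x if i == 0 else alpha * x + (1 - alpha) * out[-1]
        out.append(e)
    return out
```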