Cookbook part 3: Stats with and without out-of-stream variables

Cookbook part 3: Stats with and without out-of-stream variables

Overview

One of Miller’s strengths is its compact notation: for example, given input of the form

 head -n 5 ../data/medium
 a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
 a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
 a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
 a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
 a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729

you can simply do

 mlr --oxtab stats1 -a sum -f x ../data/medium
 x_sum 4986.019682

or

 mlr --opprint stats1 -a sum -f x -g b ../data/medium
 b   x_sum
 pan 965.763670
 wye 1023.548470
 zee 979.742016
 eks 1016.772857
 hat 1000.192668

rather than the more tedious

 mlr --oxtab put -q '
   @x_sum += $x;
   end {
     emit @x_sum
   }
 ' data/medium
 x_sum 4986.019682

or

 mlr --opprint put -q '
   @x_sum[$b] += $x;
   end {
     emit @x_sum, "b"
   }
 ' data/medium
 b   x_sum
 pan 965.763670
 wye 1023.548470
 zee 979.742016
 eks 1016.772857
 hat 1000.192668

The former (mlr stats1 et al.) has the advantages of being easier to type, being less error-prone to type, and running faster.

Nonetheless, out-of-stream variables (which I whimsically call oosvars), begin/end blocks, and emit statements give you the ability to implement logic – if you wish to do so – which isn’t present in other Miller verbs. (If you find yourself often using the same out-of-stream-variable logic over and over, please file a request at https://github.com/johnkerl/miller/issues to get it implemented directly in C as a Miller verb of its own.)

The following examples compute some things using oosvars which are already computable using Miller verbs, by way of providing food for thought.

Mean without/with oosvars

 mlr --opprint stats1 -a mean -f x data/medium
 x_mean
 0.498602
 mlr --opprint put -q '
   @x_sum += $x;
   @x_count += 1;
   end {
     @x_mean = @x_sum / @x_count;
     emit @x_mean
   }
 ' data/medium
 x_mean
 0.498602

Keyed mean without/with oosvars

 mlr --opprint stats1 -a mean -f x -g a,b data/medium
 a   b   x_mean
 pan pan 0.513314
 eks pan 0.485076
 wye wye 0.491501
 eks wye 0.483895
 wye pan 0.499612
 zee pan 0.519830
 eks zee 0.495463
 zee wye 0.514267
 hat wye 0.493813
 pan wye 0.502362
 zee eks 0.488393
 hat zee 0.509999
 hat eks 0.485879
 wye hat 0.497730
 pan eks 0.503672
 eks eks 0.522799
 hat hat 0.479931
 hat pan 0.464336
 zee zee 0.512756
 pan hat 0.492141
 pan zee 0.496604
 zee hat 0.467726
 wye zee 0.505907
 eks hat 0.500679
 wye eks 0.530604
 mlr --opprint put -q '
   @x_sum[$a][$b] += $x;
   @x_count[$a][$b] += 1;
   end{
     for ((a, b), v in @x_sum) {
       @x_mean[a][b] = @x_sum[a][b] / @x_count[a][b];
     }
     emit @x_mean, "a", "b"
   }
 ' data/medium
 a   b   x_mean
 pan pan 0.513314
 pan wye 0.502362
 pan eks 0.503672
 pan hat 0.492141
 pan zee 0.496604
 eks pan 0.485076
 eks wye 0.483895
 eks zee 0.495463
 eks eks 0.522799
 eks hat 0.500679
 wye wye 0.491501
 wye pan 0.499612
 wye hat 0.497730
 wye zee 0.505907
 wye eks 0.530604
 zee pan 0.519830
 zee wye 0.514267
 zee eks 0.488393
 zee zee 0.512756
 zee hat 0.467726
 hat wye 0.493813
 hat zee 0.509999
 hat eks 0.485879
 hat hat 0.479931
 hat pan 0.464336

Variance and standard deviation without/with oosvars

 mlr --oxtab stats1 -a count,sum,mean,var,stddev -f x data/medium
 x_count  10000
 x_sum    4986.019682
 x_mean   0.498602
 x_var    0.084270
 x_stddev 0.290293
 cat variance.mlr
 @n += 1;
 @sumx += $x;
 @sumx2 += $x**2;
 end {
   @mean = @sumx / @n;
   @var = (@sumx2 - @mean * (2 * @sumx - @n * @mean)) / (@n - 1);
   @stddev = sqrt(@var);
   emitf @n, @sumx, @sumx2, @mean, @var, @stddev
 }
 mlr --oxtab put -q -f variance.mlr data/medium
 n      10000
 sumx   4986.019682
 sumx2  3328.652400
 mean   0.498602
 var    0.084270
 stddev 0.290293

You can also do this keyed, of course, imitating the keyed-mean example above.

Min/max without/with oosvars

 mlr --oxtab stats1 -a min,max -f x data/medium
 x_min 0.000045
 x_max 0.999953
 mlr --oxtab put -q '@x_min = min(@x_min, $x); @x_max = max(@x_max, $x); end{emitf @x_min, @x_max}' data/medium
 x_min 0.000045
 x_max 0.999953

Keyed min/max without/with oosvars

 mlr --opprint stats1 -a min,max -f x -g a data/medium
 a   x_min    x_max
 pan 0.000204 0.999403
 eks 0.000692 0.998811
 wye 0.000187 0.999823
 zee 0.000549 0.999490
 hat 0.000045 0.999953
 mlr --opprint --from data/medium put -q '
   @min[$a] = min(@min[$a], $x);
   @max[$a] = max(@max[$a], $x);
   end{
     emit (@min, @max), "a";
   }
 '
 a   min      max
 pan 0.000204 0.999403
 eks 0.000692 0.998811
 wye 0.000187 0.999823
 zee 0.000549 0.999490
 hat 0.000045 0.999953

Delta without/with oosvars

 mlr --opprint step -a delta -f x data/small
 a   b   i x                   y                   x_delta
 pan pan 1 0.3467901443380824  0.7268028627434533  0
 eks pan 2 0.7586799647899636  0.5221511083334797  0.411890
 wye wye 3 0.20460330576630303 0.33831852551664776 -0.554077
 eks wye 4 0.38139939387114097 0.13418874328430463 0.176796
 wye pan 5 0.5732889198020006  0.8636244699032729  0.191890
 mlr --opprint put '$x_delta = is_present(@last) ? $x - @last : 0; @last = $x' data/small
 a   b   i x                   y                   x_delta
 pan pan 1 0.3467901443380824  0.7268028627434533  0
 eks pan 2 0.7586799647899636  0.5221511083334797  0.411890
 wye wye 3 0.20460330576630303 0.33831852551664776 -0.554077
 eks wye 4 0.38139939387114097 0.13418874328430463 0.176796
 wye pan 5 0.5732889198020006  0.8636244699032729  0.191890

Keyed delta without/with oosvars

 mlr --opprint step -a delta -f x -g a data/small
 a   b   i x                   y                   x_delta
 pan pan 1 0.3467901443380824  0.7268028627434533  0
 eks pan 2 0.7586799647899636  0.5221511083334797  0
 wye wye 3 0.20460330576630303 0.33831852551664776 0
 eks wye 4 0.38139939387114097 0.13418874328430463 -0.377281
 wye pan 5 0.5732889198020006  0.8636244699032729  0.368686
 mlr --opprint put '$x_delta = is_present(@last[$a]) ? $x - @last[$a] : 0; @last[$a]=$x' data/small
 a   b   i x                   y                   x_delta
 pan pan 1 0.3467901443380824  0.7268028627434533  0
 eks pan 2 0.7586799647899636  0.5221511083334797  0
 wye wye 3 0.20460330576630303 0.33831852551664776 0
 eks wye 4 0.38139939387114097 0.13418874328430463 -0.377281
 wye pan 5 0.5732889198020006  0.8636244699032729  0.368686

Exponentially weighted moving averages without/with oosvars

 mlr --opprint step -a ewma -d 0.1 -f x data/small
 a   b   i x                   y                   x_ewma_0.1
 pan pan 1 0.3467901443380824  0.7268028627434533  0.346790
 eks pan 2 0.7586799647899636  0.5221511083334797  0.387979
 wye wye 3 0.20460330576630303 0.33831852551664776 0.369642
 eks wye 4 0.38139939387114097 0.13418874328430463 0.370817
 wye pan 5 0.5732889198020006  0.8636244699032729  0.391064
 mlr --opprint put '
   begin{ @a=0.1 };
   $e = NR==1 ? $x : @a * $x + (1 - @a) * @e;
   @e=$e
 ' data/small
 a   b   i x                   y                   e
 pan pan 1 0.3467901443380824  0.7268028627434533  0.346790
 eks pan 2 0.7586799647899636  0.5221511083334797  0.387979
 wye wye 3 0.20460330576630303 0.33831852551664776 0.369642
 eks wye 4 0.38139939387114097 0.13418874328430463 0.370817
 wye pan 5 0.5732889198020006  0.8636244699032729  0.391064