Null/empty/absent data¶
One of Miller's key features is its support for heterogeneous data. For example, take mlr sort
: if you try to sort on field hostname
when not all records in the data stream have a field named hostname
, it is not an error (although you could pre-filter the data stream using mlr having-fields --at-least hostname then sort ...
). Rather, records lacking one or more sort keys are simply output contiguously by mlr sort
.
The three types¶
Miller has three kinds of null data:
-
Empty (key present, value empty): a field name is present in a record (or in an out-of-stream variable) with empty value: e.g.
x=,y=2
in the data input stream, or assignment$x=""
or@x=""
inmlr put
. -
Absent (key not present): a field name is not present, e.g. input record is
x=1,y=2
and aput
orfilter
expression refers to$z
. Or, reading an out-of-stream variable which hasn't been assigned a value yet, e.g.mlr put -q '@sum += $x; end{emit @sum}'
ormlr put -q '@sum[$a][$b] += $x; end{emit @sum, "a", "b"}'
. -
JSON null: The main purpose of this is to support reading the
null
type in JSON files. The Miller programming language has anull
keyword as well, so you can also write the null type using$x = null
. Additionally, though, when you write past the end of an array, leaving gaps -- e.g. writinga[12]
when the arraya
has length 10 -- JSON-null is used to fill the gaps. See also the arrays page.
You can test these programmatically using the functions is_empty
/is_not_empty
, is_absent
/is_present
, and is_null
/is_not_null
. For the last pair, note that null means either empty or absent. Here is a full list of such functions:
mlr -f | grep is_
is_absent is_array is_bool is_boolean is_empty is_empty_map is_error is_float is_int is_map is_nan is_nonempty_map is_not_array is_not_empty is_not_map is_not_null is_null is_numeric is_present is_string
Rules for null-handling¶
- Records with one or more empty sort-field values sort after records with all sort-field values present:
mlr cat data/sort-null.dat
a=3,b=2 a=1,b=8 a=,b=4 x=9,b=10 a=5,b=7
mlr sort -n a data/sort-null.dat
a=1,b=8 a=3,b=2 a=5,b=7 a=,b=4 x=9,b=10
mlr sort -nr a data/sort-null.dat
a=,b=4 a=5,b=7 a=3,b=2 a=1,b=8 x=9,b=10
- Most functions/operators which have one or more empty arguments produce empty output: e.g.
echo 'x=2,y=3' | mlr put '$a=$x+$y'
x=2,y=3,a=5
echo 'x=,y=3' | mlr put '$a=$x+$y'
x=,y=3,a=3
echo 'x=,y=3' | mlr put '$a=log($x);$b=log($y)'
x=,y=3,a=,b=1.0986122886681096
with the exception that the min
and max
functions are special: if one argument is non-null, it wins:
echo 'x=,y=3' | mlr put '$a=min($x,$y);$b=max($x,$y)'
x=,y=3,a=3,b=
Likewise, empty works like 0 for addition and subtraction, and multiplication:
echo 'x=,y=3' | mlr put '$a = $x + $y; $b = $x - $y; $c = $x * $y'
x=,y=3,a=3,b=-3,c=3
This is intended to follow the arithmetic rule for absent data (explained next). In particular:
- For file formats allowing for heterogeneity in keys, e.g. JSON, you should be able to keep a running sum of some field, say
$x
. If a given record doesn't have$x
, then$x
will be absent for that record, and the sum should simply continue. - For CSV and TSV, which don't allow for hetrogeneity in keys, the only way a value can be missing is to be empty. Here, if a given record doesn't have
$x
, then$x
will be empty for that record, and the sum should simply continue.
cat missings.json
[ { "a": "red", "x": 7 }, { "a": "green", "z": 242, "w": "zdatsyg" }, { "a": "blue", "x": 9 } ]
mlr --ijson --from missings.json put -q 'begin { @sum = 0 } @sum += $x; end { print @sum }'
16
cat missings.csv
a,x,z,w red,7,, green,,242,zdatsyg blue,9,,
mlr --icsv --from missings.csv put -q 'begin { @sum = 0 } @sum += $x; end { print @sum }'
16
- Functions of absent variables (e.g.
mlr put '$y = log10($nonesuch)'
) evaluate to absent, and arithmetic/bitwise/boolean operators with both operands being absent evaluate to absent. Arithmetic operators with one absent operand return the other operand. More specifically, absent values act like zero for addition/subtraction, and one for multiplication: Furthermore, any expression which evaluates to absent is not stored in the left-hand side of an assignment statement:
echo 'x=2,y=3' | mlr put '$a=$u+$v; $b=$u+$y; $c=$x+$y'
x=2,y=3,b=3,c=5
echo 'x=2,y=3' | mlr put '$a=min($x,$v);$b=max($u,$y);$c=min($u,$v)'
x=2,y=3,a=2,b=3
- Likewise, for assignment to maps, absent-valued keys or values result in a skipped assignment.
The reasoning is as follows:
-
Absent stream-record values should not break accumulations, since Miller by design handles heterogeneous data: the running
@sum
inmlr put '@sum += $x'
should not be invalidated for records which have nox
. -
Absent out-of-stream-variable values are precisely what allow you to write
mlr put '@sum += $x'
. Otherwise you would have to writemlr put 'begin{@sum = 0}; @sum += $x'
-- which is tolerable -- but formlr put 'begin{...}; @sum[$a][$b] += $x'
you'd have to pre-initialize@sum
for all values of$a
and$b
in your input data stream, which is intolerable. -
The penalty for the absent feature is that misspelled variables can be hard to find: e.g. in
mlr put 'begin{@sumx = 10}; ...; update @sumx somehow per-record; ...; end {@something = @sum * 2}'
the accumulator is spelt@sumx
in the begin-block but@sum
in the end-block, where since it is absent,@sum*2
evaluates to 2. See also the section on DSL errors and transparency.
Absent-test functions¶
Since absent plus absent is absent (and likewise for other operators), accumulations such as @sum += $x
work correctly on heterogeneous data, as do within-record formulas if both operands are absent. If one operand is present, you may get behavior you don't desire. To work around this -- namely, to set an output field only for records which have all the inputs present -- you can use a pattern-action block with is_present
:
mlr cat data/het.dkvp
resource=/path/to/file,loadsec=0.45,ok=true record_count=100,resource=/path/to/file resource=/path/to/second/file,loadsec=0.32,ok=true record_count=150,resource=/path/to/second/file resource=/some/other/path,loadsec=0.97,ok=false
mlr put 'is_present($loadsec) { $loadmillis = $loadsec * 1000 }' data/het.dkvp
resource=/path/to/file,loadsec=0.45,ok=true,loadmillis=450 record_count=100,resource=/path/to/file resource=/path/to/second/file,loadsec=0.32,ok=true,loadmillis=320 record_count=150,resource=/path/to/second/file resource=/some/other/path,loadsec=0.97,ok=false,loadmillis=970
mlr put '$loadmillis = (is_present($loadsec) ? $loadsec : 0.0) * 1000' data/het.dkvp
resource=/path/to/file,loadsec=0.45,ok=true,loadmillis=450 record_count=100,resource=/path/to/file,loadmillis=0 resource=/path/to/second/file,loadsec=0.32,ok=true,loadmillis=320 record_count=150,resource=/path/to/second/file,loadmillis=0 resource=/some/other/path,loadsec=0.97,ok=false,loadmillis=970
Arithmetic rules¶
If you're interested in a formal description of how empty and absent fields participate in arithmetic, here's a table for plus (other arithmetic/boolean/bitwise operators are similar):
mlr help type-arithmetic-info
(+) | 1 2.5 (empty) (absent) (error) ------ + ------ ------ ------ ------ ------ 1 | 2 3.5 1 1 (error) 2.5 | 3.5 5 2.5 2.5 (error) (empty) | 1 2.5 (empty) (absent) (error) (absent) | 1 2.5 (absent) (absent) (error) (error) | (error) (error) (error) (error) (error)