Null/empty/absent data¶
One of Miller's key features is its support for heterogeneous data. For example, take mlr sort: if you try to sort on field hostname when not all records in the data stream have a field named hostname, it is not an error (although you could pre-filter the data stream using mlr having-fields --at-least hostname then sort ...). Rather, records lacking one or more sort keys are simply output contiguously by mlr sort.
The three types¶
Miller has three kinds of null data:
-
Empty (key present, value empty): a field name is present in a record (or in an out-of-stream variable) with empty value: e.g.
x=,y=2in the data input stream, or assignment$x=""or@x=""inmlr put. -
Absent (key not present): a field name is not present, e.g. input record is
x=1,y=2and aputorfilterexpression refers to$z. Or, reading an out-of-stream variable which hasn't been assigned a value yet, e.g.mlr put -q '@sum += $x; end{emit @sum}'ormlr put -q '@sum[$a][$b] += $x; end{emit @sum, "a", "b"}'. -
JSON null: The main purpose of this is to support reading the
nulltype in JSON files. The Miller programming language has anullkeyword as well, so you can also write the null type using$x = null. Additionally, though, when you write past the end of an array, leaving gaps -- e.g. writinga[12]when the arrayahas length 10 -- JSON-null is used to fill the gaps. See also the arrays page.
You can test these programmatically using the functions is_empty/is_not_empty, is_absent/is_present, and is_null/is_not_null. For the last pair, note that null means either empty or absent. Here is a full list of such functions:
mlr -f | grep is_
is_absent is_array is_bool is_boolean is_empty is_empty_map is_error is_float is_int is_map is_nan is_nonempty_map is_not_array is_not_empty is_not_map is_not_null is_null is_numeric is_present is_string
Rules for null-handling¶
- Records with one or more empty sort-field values sort after records with all sort-field values present:
mlr cat data/sort-null.dat
a=3,b=2 a=1,b=8 a=,b=4 x=9,b=10 a=5,b=7
mlr sort -n a data/sort-null.dat
a=1,b=8 a=3,b=2 a=5,b=7 a=,b=4 x=9,b=10
mlr sort -nr a data/sort-null.dat
a=,b=4 a=5,b=7 a=3,b=2 a=1,b=8 x=9,b=10
- Most functions/operators which have one or more empty arguments produce empty output: e.g.
echo 'x=2,y=3' | mlr put '$a=$x+$y'
x=2,y=3,a=5
echo 'x=,y=3' | mlr put '$a=$x+$y'
x=,y=3,a=3
echo 'x=,y=3' | mlr put '$a=log($x);$b=log($y)'
x=,y=3,a=,b=1.0986122886681096
with the exception that the min and max functions are special: if one argument is non-null, it wins:
echo 'x=,y=3' | mlr put '$a=min($x,$y);$b=max($x,$y)'
x=,y=3,a=3,b=
Likewise, empty works like 0 for addition and subtraction, and like 1 for multiplication:
echo 'x=,y=3' | mlr put '$a = $x + $y; $b = $x - $y; $c = $x * $y'
x=,y=3,a=3,b=-3,c=3
This is intended to follow the arithmetic rule for absent data (explained next). In particular:
- For file formats allowing for heterogeneity in keys, e.g. JSON, you should be able to keep a running sum of some field, say
$x. If a given record doesn't have$x, then$xwill be absent for that record, and the sum should simply continue. - For CSV and TSV, which don't allow for hetrogeneity in keys, the only way a value can be missing is to be empty. Here, if a given record doesn't have
$x, then$xwill be empty for that record, and the sum should simply continue.
cat missings.json
[
{ "a": "red", "x": 7 },
{ "a": "green", "z": 242, "w": "zdatsyg" },
{ "a": "blue", "x": 9 }
]
mlr --ijson --from missings.json put -q 'begin { @sum = 0 } @sum += $x; end { print @sum }'
16
cat missings.csv
a,x,z,w red,7,, green,,242,zdatsyg blue,9,,
mlr --icsv --from missings.csv put -q 'begin { @sum = 0 } @sum += $x; end { print @sum }'
16
- Functions of absent variables (e.g.
mlr put '$y = log10($nonesuch)') evaluate to absent, and arithmetic/bitwise/boolean operators with both operands being absent evaluate to absent. Arithmetic operators with one absent operand return the other operand. More specifically, absent values act like zero for addition/subtraction, and one for multiplication: Furthermore, any expression which evaluates to absent is not stored in the left-hand side of an assignment statement:
echo 'x=2,y=3' | mlr put '$a=$u+$v; $b=$u+$y; $c=$x+$y'
x=2,y=3,b=3,c=5
echo 'x=2,y=3' | mlr put '$a=min($x,$v);$b=max($u,$y);$c=min($u,$v)'
x=2,y=3,a=2,b=3
- Likewise, for assignment to maps, absent-valued keys or values result in a skipped assignment.
The reasoning is as follows:
-
Absent stream-record values should not break accumulations, since Miller by design handles heterogeneous data: the running
@suminmlr put '@sum += $x'should not be invalidated for records which have nox. -
Absent out-of-stream-variable values are precisely what allow you to write
mlr put '@sum += $x'. Otherwise you would have to writemlr put 'begin{@sum = 0}; @sum += $x'-- which is tolerable -- but formlr put 'begin{...}; @sum[$a][$b] += $x'you'd have to pre-initialize@sumfor all values of$aand$bin your input data stream, which is intolerable. -
The penalty for the absent feature is that misspelled variables can be hard to find: e.g. in
mlr put 'begin{@sumx = 10}; ...; update @sumx somehow per-record; ...; end {@something = @sum * 2}'the accumulator is spelt@sumxin the begin-block but@sumin the end-block, where since it is absent,@sum*2evaluates to 2. See also the section on DSL errors and transparency.
Absent-test functions¶
Since absent plus absent is absent (and likewise for other operators), accumulations such as @sum += $x work correctly on heterogeneous data, as do within-record formulas if both operands are absent. If one operand is present, you may get behavior you don't desire. To work around this -- namely, to set an output field only for records which have all the inputs present -- you can use a pattern-action block with is_present:
mlr cat data/het.dkvp
resource=/path/to/file,loadsec=0.45,ok=true record_count=100,resource=/path/to/file resource=/path/to/second/file,loadsec=0.32,ok=true record_count=150,resource=/path/to/second/file resource=/some/other/path,loadsec=0.97,ok=false
mlr put 'is_present($loadsec) { $loadmillis = $loadsec * 1000 }' data/het.dkvp
resource=/path/to/file,loadsec=0.45,ok=true,loadmillis=450 record_count=100,resource=/path/to/file resource=/path/to/second/file,loadsec=0.32,ok=true,loadmillis=320 record_count=150,resource=/path/to/second/file resource=/some/other/path,loadsec=0.97,ok=false,loadmillis=970
mlr put '$loadmillis = (is_present($loadsec) ? $loadsec : 0.0) * 1000' data/het.dkvp
resource=/path/to/file,loadsec=0.45,ok=true,loadmillis=450 record_count=100,resource=/path/to/file,loadmillis=0 resource=/path/to/second/file,loadsec=0.32,ok=true,loadmillis=320 record_count=150,resource=/path/to/second/file,loadmillis=0 resource=/some/other/path,loadsec=0.97,ok=false,loadmillis=970
Arithmetic rules¶
If you're interested in a formal description of how empty and absent fields participate in arithmetic, here's a table for +, &&, and||`. Notes:
- Other arithmetic, boolean, and bitwise operators besides
&&and||are similar to+. - The
&&and||obey short-circuiting semantics. That is: false && XisfalseandXis not evaluated even if it is a complex expression (maybe including function calls)true || XistrueandXis not evaluated even if it is a complex expression (maybe including function calls)- This means in particular that:
false && Xis false even ifXis an error, a non-boolean type, etc.true || Xis true even ifXis an error, a non-boolean type, etc.
mlr help type-arithmetic-info-extended
(+) | 1 2.5 true (empty) (absent) (error) ------ + ------ ------ ------ ------ ------ ------ 1 | 2 3.5 (error) 1 1 (error) 2.5 | 3.5 5 (error) 2.5 2.5 (error) true | (error) (error) (error) (error) (error) (error) (empty) | 1 2.5 (error) (empty) (absent) (error) (absent) | 1 2.5 (error) (absent) (absent) (error) (error) | (error) (error) (error) (error) (error) (error) (&&) | true false 3 (empty) (absent) (error) ------ + ------ ------ ------ ------ ------ ------ true | true false (error) (error) (absent) (error) false | false false false false false false 3 | (error) (error) (error) (error) (absent) (error) (empty) | true false (error) (error) (absent) (error) (absent) | true false (error) (absent) (absent) (error) (error) | (error) (error) (error) (error) (error) (error) (||) | true false 3 (empty) (absent) (error) ------ + ------ ------ ------ ------ ------ ------ true | true true true true true true false | true false (error) (error) (absent) (error) 3 | (error) (error) (error) (error) (absent) (error) (empty) | true false (error) (error) (absent) (error) (absent) | true false (error) (absent) (absent) (error) (error) | (error) (error) (error) (error) (error) (error)