Record-heterogeneity¶
We think of CSV tables as rectangular: if there are 17 columns in the header then there are 17 columns for every row; otherwise the data has a formatting error.
But heterogeneous data abound -- log-file entries, JSON documents, no-SQL databases such as MongoDB, etc. -- not to mention the data-cleaning opportunities we'll look at on this page. Miller offers several ways to handle data heterogeneity.
Terminology, examples, and solutions¶
Different kinds of heterogeneous data include ragged, irregular, and sparse.
Homogeneous/rectangular data¶
A homogeneous list of records is one in which all records have the same keys, in the same order. For example, here is a well-formed CSV file:
mlr --csv cat data/het/hom.csv
a,b,c
1,2,3
4,5,6
7,8,9
It has three records (written here using JSON Lines formatting):
mlr --icsv --ojsonl cat data/het/hom.csv
{"a": 1, "b": 2, "c": 3}
{"a": 4, "b": 5, "c": 6}
{"a": 7, "b": 8, "c": 9}
Here every row has the same keys, in the same order: a,b,c.
These are also sometimes called rectangular since if we pretty-print them we get a nice rectangle:
mlr --icsv --opprint cat data/het/hom.csv
a b c
1 2 3
4 5 6
7 8 9
Fillable data¶
A second example has some empty cells which could be filled:
mlr --csv cat data/het/fillable.csv
a,b,c
1,2,3
4,,6
,8,9
mlr --icsv --ojsonl cat data/het/fillable.csv
{"a": 1, "b": 2, "c": 3}
{"a": 4, "b": "", "c": 6}
{"a": "", "b": 8, "c": 9}
This example is still homogeneous, though: every row has the same keys, in the same order: a,b,c.
Empty values don't make the data heterogeneous.
Note however that we can use the fill-empty verb to make these values non-empty, if we like:
mlr --icsv --opprint fill-empty -v filler data/het/fillable.csv
a      b      c
1      2      3
4      filler 6
filler 8      9
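The fill-empty rule is simple enough to sketch in a few lines of Python -- an illustrative re-implementation, not Miller's actual code, with the filler string standing in for the -v argument:

```python
def fill_empty(records, filler="N/A"):
    # Replace empty-string values with the filler, leaving keys and order intact.
    return [{k: (filler if v == "" else v) for k, v in rec.items()}
            for rec in records]

records = [
    {"a": 1, "b": 2, "c": 3},
    {"a": 4, "b": "", "c": 6},
    {"a": "", "b": 8, "c": 9},
]
filled = fill_empty(records, filler="filler")
```

Only empty strings are touched: a present-but-empty cell is a different situation from an absent key, which is the sparse case discussed below.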
Ragged data¶
Next let's look at non-well-formed CSV files. For a third example:
cat data/het/ragged.csv
a,b,c
1,2,3
4,5
7,8,9,10
If you mlr --csv cat this, you'll get an error message:
mlr --csv cat data/het/ragged.csv
a,b,c
1,2,3
mlr: mlr: CSV header/data length mismatch 3 != 2 at filename data/het/ragged.csv row 3.
There are two kinds of raggedness here. Since CSV forms records by zipping the keys from the header line together with the values from each data line, the second record has a missing value for key c (which ought to be fillable), while the third record has a value 10 with no key for it.
Using the --allow-ragged-csv-input flag, we can fill values in too-short rows, and provide a key (the column number, starting with 1) for too-long rows:
mlr --icsv --ojson --allow-ragged-csv-input cat data/het/ragged.csv
[
{ "a": 1, "b": 2, "c": 3 },
{ "a": 4, "b": 5, "c": "" },
{ "a": 7, "b": 8, "c": 9, "4": 10 }
]
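The zipping rule, including what --allow-ragged-csv-input does for too-short and too-long rows, can be sketched as follows (an illustrative Python version, not Miller's implementation):

```python
def zip_ragged(header, values):
    # Pair header keys with row values: pad too-short rows with "",
    # and key extra values by their 1-based column number (e.g. "4").
    rec = {}
    for i, key in enumerate(header):
        rec[key] = values[i] if i < len(values) else ""
    for i in range(len(header), len(values)):
        rec[str(i + 1)] = values[i]
    return rec

header = ["a", "b", "c"]
short_row = zip_ragged(header, ["4", "5"])
long_row = zip_ragged(header, ["7", "8", "9", "10"])
```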
Irregular data¶
Here's another situation -- this file has, in some sense, the "same" data as our ragged.csv example above:
cat data/het/irregular.json
{"a": 1, "b": 2, "c": 3}
{"c": 6, "a": 4, "b": 5}
{"b": 8, "c": 9, "a": 7}
For example, on the second record, a is 4, b is 5, c is 6. But this data is heterogeneous because the keys a,b,c aren't in the same order in each record.
This kind of data arises often in practice. One reason is that, while many programming languages (including the Miller DSL) preserve insertion order in maps, others do not. So someone might have written {"a":4,"b":5,"c":6} in the source code, but the data may not have been printed in that order to a given data file.
We can use the regularize or sort-within-records verb to order the keys:
mlr --jsonl regularize data/het/irregular.json
{"a": 1, "b": 2, "c": 3}
{"a": 4, "b": 5, "c": 6}
{"a": 7, "b": 8, "c": 9}
The regularize verb tries to re-order subsequent rows to look like the first (whatever order that is); the sort-within-records verb simply uses alphabetical order (which is the same in the above example, where the first record has keys in the order a,b,c).
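A simplified sketch of the regularize behavior in Python (illustrative only, not Miller's code): remember the key order of the first record seen with a given key set, and re-order later records having the same keys to match.

```python
def regularize(records):
    # Map each distinct key set to the key order of the first record
    # that had it; re-emit later records with that set in that order.
    first_seen = {}
    out = []
    for rec in records:
        order = first_seen.setdefault(frozenset(rec), list(rec))
        out.append({k: rec[k] for k in order})
    return out

records = [
    {"a": 1, "b": 2, "c": 3},
    {"c": 6, "a": 4, "b": 5},
    {"b": 8, "c": 9, "a": 7},
]
result = regularize(records)
```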
Sparse data¶
Here's another frequently occurring situation -- quite often, systems will log data for items which are present, but won't log data for items which aren't.
mlr --json cat data/het/sparse.json
[
{ "host": "xy01.east", "status": "running", "volume": "/dev/sda1" },
{ "host": "xy92.west", "status": "running" },
{ "purpose": "failover", "host": "xy55.east", "volume": "/dev/sda1", "reimaged": true }
]
This data is called sparse (from the data-storage term).
We can use the unsparsify verb to make sure every record has the same keys:
mlr --json unsparsify data/het/sparse.json
[
{ "host": "xy01.east", "status": "running", "volume": "/dev/sda1", "purpose": "", "reimaged": "" },
{ "host": "xy92.west", "status": "running", "volume": "", "purpose": "", "reimaged": "" },
{ "host": "xy55.east", "status": "", "volume": "/dev/sda1", "purpose": "failover", "reimaged": true }
]
Since this data is now homogeneous (rectangular), it pretty-prints nicely:
mlr --ijson --opprint unsparsify data/het/sparse.json
host      status  volume    purpose  reimaged
xy01.east running /dev/sda1 -        -
xy92.west running -         -        -
xy55.east -       /dev/sda1 failover true
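The essence of unsparsify can be sketched in Python (illustrative, not Miller's implementation; note that it needs one pass over the records to collect the union of keys before any record can be emitted):

```python
def unsparsify(records, filler=""):
    # Collect the union of all keys, in first-encountered order ...
    all_keys = []
    seen = set()
    for rec in records:
        for k in rec:
            if k not in seen:
                seen.add(k)
                all_keys.append(k)
    # ... then give every record that full key list, filling absent fields.
    return [{k: rec.get(k, filler) for k in all_keys} for rec in records]

records = [
    {"host": "xy01.east", "status": "running", "volume": "/dev/sda1"},
    {"host": "xy92.west", "status": "running"},
    {"purpose": "failover", "host": "xy55.east", "volume": "/dev/sda1", "reimaged": True},
]
out = unsparsify(records)
```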
Reading and writing heterogeneous data¶
In the previous sections we saw different kinds of data heterogeneity, and ways to transform the data to make it homogeneous.
Non-rectangular file formats: JSON, XTAB, NIDX, DKVP¶
For these formats, record-heterogeneity comes naturally:
cat data/het/sparse.json
{ "host": "xy01.east", "status": "running", "volume": "/dev/sda1" }
{ "host": "xy92.west", "status": "running" }
{ "purpose": "failover", "host": "xy55.east", "volume": "/dev/sda1", "reimaged": true }
mlr --ijson --onidx --ofs ' ' cat data/het/sparse.json
xy01.east running /dev/sda1
xy92.west running
failover xy55.east /dev/sda1 true
mlr --ijson --oxtab cat data/het/sparse.json
host   xy01.east
status running
volume /dev/sda1

host   xy92.west
status running

purpose  failover
host     xy55.east
volume   /dev/sda1
reimaged true
mlr --ijson --odkvp cat data/het/sparse.json
host=xy01.east,status=running,volume=/dev/sda1
host=xy92.west,status=running
purpose=failover,host=xy55.east,volume=/dev/sda1,reimaged=true
Even then, we may wish to put like with like, using the group-like verb:
mlr --ijson --odkvp cat data/het.json
resource=/path/to/file,loadsec=0.45,ok=true
record_count=100,resource=/path/to/file
resource=/path/to/second/file,loadsec=0.32,ok=true
record_count=150,resource=/path/to/second/file
resource=/some/other/path,loadsec=0.97,ok=false
mlr --ijson --odkvp group-like data/het.json
resource=/path/to/file,loadsec=0.45,ok=true
resource=/path/to/second/file,loadsec=0.32,ok=true
resource=/some/other/path,loadsec=0.97,ok=false
record_count=100,resource=/path/to/file
record_count=150,resource=/path/to/second/file
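The group-like behavior -- batching records by their key list, in the order each distinct key list first appears -- can be sketched like this (illustrative Python, not Miller's code):

```python
def group_like(records):
    # Bucket records by their exact key tuple; Python dicts preserve
    # insertion order, so groups come out in first-appearance order.
    groups = {}
    for rec in records:
        groups.setdefault(tuple(rec), []).append(rec)
    return [rec for group in groups.values() for rec in group]

records = [
    {"resource": "/path/to/file", "loadsec": 0.45, "ok": True},
    {"record_count": 100, "resource": "/path/to/file"},
    {"resource": "/path/to/second/file", "loadsec": 0.32, "ok": True},
    {"record_count": 150, "resource": "/path/to/second/file"},
    {"resource": "/some/other/path", "loadsec": 0.97, "ok": False},
]
grouped = group_like(records)
```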
Rectangular file formats: CSV and pretty-print¶
CSV and pretty-print formats expect rectangular structure. But Miller lets you process non-rectangular data using CSV and pretty-print.
Miller simply prints a newline and a new header when there is a schema change -- where by schema we mean simply the list of record keys in the order they are encountered. When there is no schema change, you get CSV per se as a special case. Likewise, Miller reads heterogeneous CSV or pretty-print input the same way. The difference between CSV and CSV-lite is that the former is RFC-4180-compliant, while the latter readily handles heterogeneous data (which is non-compliant). For example:
cat data/het.json
{ "resource": "/path/to/file", "loadsec": 0.45, "ok": true }
{ "record_count": 100, "resource": "/path/to/file" }
{ "resource": "/path/to/second/file", "loadsec": 0.32, "ok": true }
{ "record_count": 150, "resource": "/path/to/second/file" }
{ "resource": "/some/other/path", "loadsec": 0.97, "ok": false }
mlr --ijson --opprint cat data/het.json
resource      loadsec ok
/path/to/file 0.45    true

record_count resource
100          /path/to/file

resource             loadsec ok
/path/to/second/file 0.32    true

record_count resource
150          /path/to/second/file

resource         loadsec ok
/some/other/path 0.97    false
mlr --ijson --opprint group-like data/het.json
resource             loadsec ok
/path/to/file        0.45    true
/path/to/second/file 0.32    true
/some/other/path     0.97    false

record_count resource
100          /path/to/file
150          /path/to/second/file
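That writer behavior -- a blank line and a fresh header whenever the key list changes -- can be sketched in Python (an illustrative model of the CSV-lite writer, not Miller's implementation):

```python
def write_csvlite(records):
    # Emit a header line whenever the current record's key list differs
    # from the previous one, separating schema blocks with a blank line.
    lines = []
    prev_keys = None
    for rec in records:
        keys = list(rec)
        if keys != prev_keys:
            if prev_keys is not None:
                lines.append("")
            lines.append(",".join(keys))
            prev_keys = keys
        lines.append(",".join(str(v) for v in rec.values()))
    return "\n".join(lines)

records = [
    {"resource": "/path/to/file", "loadsec": 0.45, "ok": "true"},
    {"record_count": 100, "resource": "/path/to/file"},
    {"resource": "/path/to/second/file", "loadsec": 0.32, "ok": "true"},
]
text = write_csvlite(records)
```

When all records share one schema, the blank-line branch never fires and the output is plain CSV, as the text above notes.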
Miller handles explicit header changes as just shown. If your CSV input contains ragged data -- implicit header changes, with no intervening blank line and new header line, as seen above -- you can use --allow-ragged-csv-input (or its keystroke-saver --ragged).
mlr --csv --ragged cat data/het/ragged.csv
a,b,c
1,2,3
4,5,

a,b,c,4
7,8,9,10
Processing heterogeneous data¶
Above we saw how to make heterogeneous data homogeneous, and then how to print heterogeneous data. As for other processing, record-heterogeneity is not a problem for Miller.
Miller operates on the specified fields and takes the rest along: for example, if you are sorting on the count field, then all records in the input stream must have a count field, but the other fields can vary -- and moreover the sorted-on field name(s) don't need to be in the same position on each line:
cat data/sort-het.dkvp
count=500,color=green
count=600
status=ok,count=250,hours=0.22
status=ok,count=200,hours=3.4
count=300,color=blue
count=100,color=green
count=450
mlr sort -n count data/sort-het.dkvp
count=100,color=green
status=ok,count=200,hours=3.4
status=ok,count=250,hours=0.22
count=300,color=blue
count=450
count=500,color=green
count=600
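The take-the-rest-along behavior can be modeled in a few lines of Python (illustrative only: sort numerically on the named field while each record keeps whatever other fields it happens to have):

```python
def sort_numeric(records, field):
    # Sort by the named field's numeric value; the records themselves
    # pass through untouched, extra fields and all.
    return sorted(records, key=lambda rec: float(rec[field]))

records = [
    {"count": 500, "color": "green"},
    {"count": 600},
    {"status": "ok", "count": 250, "hours": 0.22},
    {"status": "ok", "count": 200, "hours": 3.4},
    {"count": 300, "color": "blue"},
    {"count": 100, "color": "green"},
    {"count": 450},
]
ranked = sort_numeric(records, "count")
```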