Shapes of data¶
No output at all¶
Try od -xcv
and/or cat -e
on your file to check for non-printable characters.
If you're using Miller version less than 5.0.0 (try mlr --version
on your system to find out), when the line-ending-autodetect feature was introduced, please see http://johnkerl.org/miller-releases/miller-4.5.0/doc/index.html.
Fields not selected¶
Check the field-separators of the data, e.g. with the command-line head
program. Example: for CSV, Miller's default record separator is comma; if your data is tab-delimited, e.g. aTABbTABc
, then Miller won't find three fields named a
, b
, and c
but rather just one named aTABbTABc
. Solution in this case: mlr --fs tab {remaining arguments ...}
.
Also try od -xcv
and/or cat -e
on your file to check for non-printable characters.
Diagnosing delimiter specifications¶
Use the file
command to see if there are CR/LF terminators (in this case, there are not):
file data/colours.csv
data/colours.csv: Unicode text, UTF-8 text
Look at the file to find names of fields:
cat data/colours.csv
KEY;DE;EN;ES;FI;FR;IT;NL;PL;TO;TR masterdata_colourcode_1;Weiß;White;Blanco;Valkoinen;Blanc;Bianco;Wit;Biały;Alb;Beyaz masterdata_colourcode_2;Schwarz;Black;Negro;Musta;Noir;Nero;Zwart;Czarny;Negru;Siyah
Extract a few fields:
mlr --csv cut -f KEY,PL,TO data/colours.csv
Use XTAB output format to get a sharper picture of where records/fields are being split:
mlr --icsv --oxtab cat data/colours.csv
KEY;DE;EN;ES;FI;FR;IT;NL;PL;TO;TR masterdata_colourcode_1;Weiß;White;Blanco;Valkoinen;Blanc;Bianco;Wit;Biały;Alb;Beyaz KEY;DE;EN;ES;FI;FR;IT;NL;PL;TO;TR masterdata_colourcode_2;Schwarz;Black;Negro;Musta;Noir;Nero;Zwart;Czarny;Negru;Siyah
Using XTAB output format makes it clearer that KEY;DE;...;TR
is being treated as a single field name in the CSV header, and likewise each subsequent line is being treated as a single field value. This is because the default field separator is a comma but we have semicolons here. Use XTAB again with different field separator (--fs semicolon
):
mlr --icsv --ifs semicolon --oxtab cat data/colours.csv
KEY masterdata_colourcode_1 DE Weiß EN White ES Blanco FI Valkoinen FR Blanc IT Bianco NL Wit PL Biały TO Alb TR Beyaz KEY masterdata_colourcode_2 DE Schwarz EN Black ES Negro FI Musta FR Noir IT Nero NL Zwart PL Czarny TO Negru TR Siyah
Using the new field-separator, retry the cut:
mlr --csv --fs semicolon cut -f KEY,PL,TO data/colours.csv
KEY;PL;TO masterdata_colourcode_1;Biały;Alb masterdata_colourcode_2;Czarny;Negru
I assigned $9 and it's not 9th¶
Miller records are ordered lists of key-value pairs. For NIDX format, DKVP format when keys are missing, or CSV/CSV-lite format with --implicit-csv-header
, Miller will sequentially assign keys of the form 1
, 2
, etc. But these are not integer array indices: they're just field names taken from the initial field ordering in the input data, when it was originally read from the input file(s).
echo x,y,z | mlr --dkvp cat
1=x,2=y,3=z
echo x,y,z | mlr --dkvp put '$6="a";$4="b";$55="cde"'
1=x,2=y,3=z,6=a,4=b,55=cde
echo x,y,z | mlr --nidx cat
x,y,z
echo x,y,z | mlr --csv --implicit-csv-header cat
1,2,3 x,y,z
echo x,y,z | mlr --dkvp rename 2,999
1=x,999=y,3=z
echo x,y,z | mlr --dkvp rename 2,newname
1=x,newname=y,3=z
echo x,y,z | mlr --csv --implicit-csv-header reorder -f 3,1,2
3,1,2 z,x,y
Why doesn't mlr cut put fields in the order I want?¶
Example: columns rate,shape,flag
were requested but they appear here in the order shape,flag,rate
:
cat example.csv
color,shape,flag,k,index,quantity,rate yellow,triangle,true,1,11,43.6498,9.8870 red,square,true,2,15,79.2778,0.0130 red,circle,true,3,16,13.8103,2.9010 red,square,false,4,48,77.5542,7.4670 purple,triangle,false,5,51,81.2290,8.5910 red,square,false,6,64,77.1991,9.5310 purple,triangle,false,7,65,80.1405,5.8240 yellow,circle,true,8,73,63.9785,4.2370 yellow,circle,true,9,87,63.5058,8.3350 purple,square,false,10,91,72.3735,8.2430
mlr --csv cut -f rate,shape,flag example.csv
shape,flag,rate triangle,true,9.8870 square,true,0.0130 circle,true,2.9010 square,false,7.4670 triangle,false,8.5910 square,false,9.5310 triangle,false,5.8240 circle,true,4.2370 circle,true,8.3350 square,false,8.2430
The issue is that Miller's cut
, by default, outputs cut fields in the order they appear in the input data. This design decision was made intentionally to parallel the Unix/Linux system cut
command, which has the same semantics.
The solution is to use the -o
option:
mlr --csv cut -o -f rate,shape,flag example.csv
rate,shape,flag 9.8870,triangle,true 0.0130,square,true 2.9010,circle,true 7.4670,square,false 8.5910,triangle,false 9.5310,square,false 5.8240,triangle,false 4.2370,circle,true 8.3350,circle,true 8.2430,square,false
Numbering and renumbering records¶
The awk
-like built-in variable NR
is incremented for each input record:
cat example.csv
color,shape,flag,k,index,quantity,rate yellow,triangle,true,1,11,43.6498,9.8870 red,square,true,2,15,79.2778,0.0130 red,circle,true,3,16,13.8103,2.9010 red,square,false,4,48,77.5542,7.4670 purple,triangle,false,5,51,81.2290,8.5910 red,square,false,6,64,77.1991,9.5310 purple,triangle,false,7,65,80.1405,5.8240 yellow,circle,true,8,73,63.9785,4.2370 yellow,circle,true,9,87,63.5058,8.3350 purple,square,false,10,91,72.3735,8.2430
mlr --csv put '$nr = NR' example.csv
color,shape,flag,k,index,quantity,rate,nr yellow,triangle,true,1,11,43.6498,9.8870,1 red,square,true,2,15,79.2778,0.0130,2 red,circle,true,3,16,13.8103,2.9010,3 red,square,false,4,48,77.5542,7.4670,4 purple,triangle,false,5,51,81.2290,8.5910,5 red,square,false,6,64,77.1991,9.5310,6 purple,triangle,false,7,65,80.1405,5.8240,7 yellow,circle,true,8,73,63.9785,4.2370,8 yellow,circle,true,9,87,63.5058,8.3350,9 purple,square,false,10,91,72.3735,8.2430,10
However, this is the record number within the original input stream -- not after any filtering you may have done:
mlr --csv filter '$color == "yellow"' then put '$nr = NR' example.csv
color,shape,flag,k,index,quantity,rate,nr yellow,triangle,true,1,11,43.6498,9.8870,1 yellow,circle,true,8,73,63.9785,4.2370,8 yellow,circle,true,9,87,63.5058,8.3350,9
There are two good options here. One is to use the cat
verb with -n
:
mlr --csv filter '$color == "yellow"' then cat -n example.csv
n,color,shape,flag,k,index,quantity,rate 1,yellow,triangle,true,1,11,43.6498,9.8870 2,yellow,circle,true,8,73,63.9785,4.2370 3,yellow,circle,true,9,87,63.5058,8.3350
The other is to keep your own counter within the put
DSL:
mlr --csv filter '$color == "yellow"' then put 'begin {@n = 1} $n = @n; @n += 1' example.csv
color,shape,flag,k,index,quantity,rate,n yellow,triangle,true,1,11,43.6498,9.8870,1 yellow,circle,true,8,73,63.9785,4.2370,2 yellow,circle,true,9,87,63.5058,8.3350,3
The difference is a matter of taste (although mlr cat -n
puts the counter first).
Options for dealing with duplicate rows¶
If your data has records appearing multiple times, you can use uniq to show and/or count the unique records.
If you want to look at partial uniqueness -- for example, show only the first record for each unique combination of the account_id
and account_status
fields -- you might use mlr head -n 1 -g account_id,account_status
. Please also see head.
Rectangularizing data¶
Suppose you have a method (in whatever language) which is printing things of the form
outer=1 outer=2 outer=3
and then calls another method which prints things of the form
middle=10 middle=11 middle=12 middle=20 middle=21 middle=30 middle=31
and then, perhaps, that second method calls a third method which prints things of the form
inner1=100,inner2=101 inner1=120,inner2=121 inner1=200,inner2=201 inner1=210,inner2=211 inner1=300,inner2=301 inner1=312 inner1=313,inner2=314
with the result that your program's output is
outer=1 middle=10 inner1=100,inner2=101 middle=11 middle=12 inner1=120,inner2=121 outer=2 middle=20 inner1=200,inner2=201 middle=21 inner1=210,inner2=211 outer=3 middle=30 inner1=300,inner2=301 middle=31 inner1=312 inner1=313,inner2=314
The idea here is that middles starting with a 1 belong to the outer value of 1, and so on. (For example, the outer values might be account IDs, the middle values might be invoice IDs, and the inner values might be invoice line-items.) If you want all the middle and inner lines to have the context of which outers they belong to, you can modify your software to pass all those through your methods. Alternatively, don't refactor your code just to handle some ad-hoc log-data formatting -- instead, use the following to rectangularize the data. The idea is to use an out-of-stream variable to accumulate fields across records. Clear that variable when you see an outer ID; accumulate fields; emit output when you see the inner IDs.
mlr --from data/rect.txt put -q ' is_present($outer) { unset @r } for (k, v in $*) { @r[k] = v } is_present($inner1) { emit @r }'
outer=1,middle=10,inner1=100,inner2=101 outer=1,middle=12,inner1=120,inner2=121 outer=2,middle=20,inner1=200,inner2=201 outer=2,middle=21,inner1=210,inner2=211 outer=3,middle=30,inner1=300,inner2=301 outer=3,middle=31,inner1=312,inner2=301 outer=3,middle=31,inner1=313,inner2=314
See also the record-heterogeneity page; see in
particular the regularize
verb for a way to
do this with much less keystroking.