Quick links: Flags Verbs Functions Glossary Release docs

CSV, with and without headers¶

Headerless CSV on input or output¶

Sometimes we get CSV files which lack a header. For example, data/headerless.csv:

cat data/headerless.csv

John,23,present
Fred,34,present
Alice,56,missing
Carol,45,present

You can use Miller to add a header. The --implicit-csv-header applies positionally indexed labels:

mlr --csv --implicit-csv-header cat data/headerless.csv

1,2,3
John,23,present
Fred,34,present
Alice,56,missing
Carol,45,present

Following that, you can rename the positionally indexed labels to names with meaning for your context. For example:

mlr --csv --implicit-csv-header label name,age,status data/headerless.csv

name,age,status
John,23,present
Fred,34,present
Alice,56,missing
Carol,45,present

Likewise, if you need to produce CSV which is lacking its header, you can pipe Miller's output to the system command sed 1d, or you can use Miller's --headerless-csv-output option:

head -5 data/colored-shapes.dkvp | mlr --ocsv cat

color,shape,flag,i,u,v,w,x
yellow,triangle,1,56,0.632170,0.988721,0.436498,5.798188
red,square,1,80,0.219668,0.001257,0.792778,2.944117
red,circle,1,84,0.209017,0.290052,0.138103,5.065034
red,square,0,243,0.956274,0.746720,0.775542,7.117831
purple,triangle,0,257,0.435535,0.859129,0.812290,5.753095

head -5 data/colored-shapes.dkvp | mlr --ocsv --headerless-csv-output cat

yellow,triangle,1,56,0.632170,0.988721,0.436498,5.798188
red,square,1,80,0.219668,0.001257,0.792778,2.944117
red,circle,1,84,0.209017,0.290052,0.138103,5.065034
red,square,0,243,0.956274,0.746720,0.775542,7.117831
purple,triangle,0,257,0.435535,0.859129,0.812290,5.753095

Lastly, often we say "CSV" or "TSV" when we have positionally indexed data in columns which are separated by commas or tabs, respectively. In this case it's perhaps simpler to just use NIDX format which was designed for this purpose. (See also File Formats.) For example:

mlr --inidx --ifs comma --oxtab cut -f 1,3 data/headerless.csv

1 John
3 present

1 Fred
3 present

1 Alice
3 missing

1 Carol
3 present

Headerless CSV with duplicate field values¶

Miller's default treatment of CSV data, in the absence of any other adjectives, is always CSV with header. However, lots of folks think of CSV data -- comma-separated values -- as just that, values, with no header.

Here's some sample CSV data which is values-only, i.e. headerless:

cat data/nas.csv

-349801.10097848,4537221.43295653,2,1,NA,NA,NA,NA,NA
-338681.59578181,4537221.43295653,14,1,13.1,1,0.978,0.964,0.964
-334975.09404959,4537221.43295653,18,1,13.1,1,NA,NA,NA
-332195.21775042,4537221.43295653,21,1,13.1,1,0.978,0.974,0.96
-331268.59231736,4537221.43295653,22,1,13.1,1,0.978,0.978,0.962
-330341.96688431,4537221.43295653,23,1,13.1,1,0.978,0.978,0.962
-326635.46515209,4537221.43295653,27,1,13.1,2,0.978,0.972,0.958

There are clearly nine fields here, but if we try to have Miller parse it as CSV, we see something happened:

mlr --csv cat data/nas.csv

-349801.10097848,4537221.43295653,2,1,NA,NA_2,NA_3,NA_4,NA_5
-338681.59578181,4537221.43295653,14,1,13.1,1,0.978,0.964,0.964
-334975.09404959,4537221.43295653,18,1,13.1,1,NA,NA,NA
-332195.21775042,4537221.43295653,21,1,13.1,1,0.978,0.974,0.96
-331268.59231736,4537221.43295653,22,1,13.1,1,0.978,0.978,0.962
-330341.96688431,4537221.43295653,23,1,13.1,1,0.978,0.978,0.962
-326635.46515209,4537221.43295653,27,1,13.1,2,0.978,0.972,0.958

What happened?

Miller is (by central design) a mapping from name to value, rather than integer position to value as in most tools in the Unix toolkit such as sort, cut, awk, etc. And its default behavior with repeated column/field names is to append _2, _3, etc to dedupe them. So given input Yea=1,Yea=2 on the same input line, first Yea=1 is stored, then Yea_2=2. This is in the input-parser.

Here, the first data line is being seen as a header line, and the repeated NA values are being seen as duplicate keys that need to be deduplicated.

One solution is to use --implicit-csv-header, or its shorter alias --hi:

mlr --csv --hi cat data/nas.csv

1,2,3,4,5,6,7,8,9
-349801.10097848,4537221.43295653,2,1,NA,NA,NA,NA,NA
-338681.59578181,4537221.43295653,14,1,13.1,1,0.978,0.964,0.964
-334975.09404959,4537221.43295653,18,1,13.1,1,NA,NA,NA
-332195.21775042,4537221.43295653,21,1,13.1,1,0.978,0.974,0.96
-331268.59231736,4537221.43295653,22,1,13.1,1,0.978,0.978,0.962
-330341.96688431,4537221.43295653,23,1,13.1,1,0.978,0.978,0.962
-326635.46515209,4537221.43295653,27,1,13.1,2,0.978,0.972,0.958

Another solution is to use NIDX format:

mlr --inidx --ifs comma --ocsv cat data/nas.csv

1,2,3,4,5,6,7,8,9
-349801.10097848,4537221.43295653,2,1,NA,NA,NA,NA,NA
-338681.59578181,4537221.43295653,14,1,13.1,1,0.978,0.964,0.964
-334975.09404959,4537221.43295653,18,1,13.1,1,NA,NA,NA
-332195.21775042,4537221.43295653,21,1,13.1,1,0.978,0.974,0.96
-331268.59231736,4537221.43295653,22,1,13.1,1,0.978,0.978,0.962
-330341.96688431,4537221.43295653,23,1,13.1,1,0.978,0.978,0.962
-326635.46515209,4537221.43295653,27,1,13.1,2,0.978,0.972,0.958

Either way, since there is no explicit header, fields are named 1 through 9. We can use the label verb to apply more meaningful namees:

mlr --csv --hi cat then label xsn,ysn,x,y,t,a,e29,e31,e32 data/nas.csv

xsn,ysn,x,y,t,a,e29,e31,e32
-349801.10097848,4537221.43295653,2,1,NA,NA,NA,NA,NA
-338681.59578181,4537221.43295653,14,1,13.1,1,0.978,0.964,0.964
-334975.09404959,4537221.43295653,18,1,13.1,1,NA,NA,NA
-332195.21775042,4537221.43295653,21,1,13.1,1,0.978,0.974,0.96
-331268.59231736,4537221.43295653,22,1,13.1,1,0.978,0.978,0.962
-330341.96688431,4537221.43295653,23,1,13.1,1,0.978,0.978,0.962
-326635.46515209,4537221.43295653,27,1,13.1,2,0.978,0.972,0.958

mlr --inidx --ifs comma --ocsv cat then label xsn,ysn,x,y,t,a,e29,e31,e32 data/nas.csv

xsn,ysn,x,y,t,a,e29,e31,e32
-349801.10097848,4537221.43295653,2,1,NA,NA,NA,NA,NA
-338681.59578181,4537221.43295653,14,1,13.1,1,0.978,0.964,0.964
-334975.09404959,4537221.43295653,18,1,13.1,1,NA,NA,NA
-332195.21775042,4537221.43295653,21,1,13.1,1,0.978,0.974,0.96
-331268.59231736,4537221.43295653,22,1,13.1,1,0.978,0.978,0.962
-330341.96688431,4537221.43295653,23,1,13.1,1,0.978,0.978,0.962
-326635.46515209,4537221.43295653,27,1,13.1,2,0.978,0.972,0.958

Regularizing ragged CSV¶

Miller handles RFC-4180-compliant CSV: in particular, it's an error if the number of data fields in a given data line don't match the number of header lines. But in the event that you have a CSV file in which some lines have less than the full number of fields, you can use Miller to pad them out. The trick is to use NIDX format, for which each line stands on its own without respect to a header line.

cat data/ragged.csv

a,b,c
1,2,3
4,5
6,7,8,9

mlr --from data/ragged.csv --fs comma --nidx put '
  @maxnf = max(@maxnf, NF);
  @nf = NF;
  while(@nf < @maxnf) {
    @nf += 1;
    $[@nf] = ""
  }
'

a,b,c
1,2,3
4,5,
6,7,8,9

or, more simply,

mlr --from data/ragged.csv --fs comma --nidx put '
  @maxnf = max(@maxnf, NF);
  while(NF < @maxnf) {
    $[NF+1] = "";
  }
'

a,b,c
1,2,3
4,5,
6,7,8,9