Separators
Record, field, and pair separators
Miller has record separators, field separators, and pair separators. For example, given the following DKVP records:
cat data/a.dkvp
a=1,b=2,c=3
a=4,b=5,c=6
- the record separator is newline -- it separates records from one another;
- the field separator is `,` -- it separates fields (key-value pairs) from one another;
- and the pair separator is `=` -- it separates the key from the value within each key-value pair.
These are the default values, which you can override with flags such as `--ips` and `--ops` (below).
Not all file formats have all three of these: for example, CSV does not have a pair separator, since keys are on the header line and values are on each data line.
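To make the three roles concrete, here is a minimal Python sketch of splitting DKVP text by record, field, and pair separators. It is an illustration only, not Miller's implementation: `parse_dkvp` is a hypothetical helper, and keying a field that lacks a pair separator by its position is a choice made for this sketch.

```python
def parse_dkvp(text, irs="\n", ifs=",", ips="="):
    """Split DKVP text into a list of dicts (hypothetical helper, not Miller code)."""
    records = []
    for line in text.split(irs):
        if line == "":
            continue  # skip the empty piece left by a trailing record separator
        record = {}
        for position, field in enumerate(line.split(ifs), start=1):
            key, sep, value = field.partition(ips)
            if sep == "":
                # Assumption of this sketch: a field with no pair separator
                # is keyed by its 1-up position in the record.
                key, value = str(position), field
            record[key] = value
        records.append(record)
    return records


print(parse_dkvp("a=1,b=2,c=3\na=4,b=5,c=6\n"))
# [{'a': '1', 'b': '2', 'c': '3'}, {'a': '4', 'b': '5', 'c': '6'}]
```

Passing `ifs=";"` and `ips=":"` to the same function handles data written as `a:1;b:2;c:3`.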
Also, separators are not programmable for all file formats. For example, in JSON objects, the pair separator is `:` and the field separator is `,` -- we write `{"a":1,"b":2,"c":3}` -- but these aren't modifiable. If you do `mlr --json --ips : --ops '=' cat myfile.json` then you don't get `{"a"=1,"b"=2,"c"=3}`. This is because the pair separator `:` is part of the JSON specification.
Input and output separators
Miller lets you use the same separators for input and output (e.g. CSV input, CSV output), or change them between input and output (e.g. CSV input, JSON output) if you wish to transform your data in that way.
Miller uses the names `IRS` and `ORS` for the input and output record separators, `IFS` and `OFS` for the input and output field separators, and `IPS` and `OPS` for the input and output pair separators.
For example:
cat data/a.dkvp
a=1,b=2,c=3
a=4,b=5,c=6

mlr --ifs , --ofs ';' --ips = --ops : cut -o -f c,a,b data/a.dkvp
c:3;a:1;b:2
c:6;a:4;b:5

mlr --csv head -n 2 example.csv
color,shape,flag,k,index,quantity,rate
yellow,triangle,true,1,11,43.6498,9.8870
red,square,true,2,15,79.2778,0.0130

mlr --csv --ofs pipe head -n 2 example.csv
color|shape|flag|k|index|quantity|rate
yellow|triangle|true|1|11|43.6498|9.8870
red|square|true|2|15|79.2778|0.0130
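The CSV case, where only the output field separator changes, can be sketched with Python's standard `csv` module. This is an analogy rather than a description of Miller's CSV reader and writer; `change_csv_delimiter` is a hypothetical name, and the sample data is a shortened version of the rows above.

```python
import csv
import io


def change_csv_delimiter(text, ifs=",", ofs="|"):
    """Read CSV text with one field separator and rewrite it with another."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=ifs)
    out = io.StringIO(newline="")
    # lineterminator="\n" keeps the output record separator a plain newline.
    writer = csv.writer(out, delimiter=ofs, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


sample = "color,shape,flag\nyellow,triangle,true\nred,square,true\n"
print(change_csv_delimiter(sample), end="")
# color|shape|flag
# yellow|triangle|true
# red|square|true
```

There is no pair separator anywhere in this sketch, since the keys live on the header line.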
If your data has non-default separators and you don't want to change those between input and output, you can use `--rs`, `--fs`, and `--ps`. Setting `--fs :` is the same as setting `--ifs : --ofs :`, but with fewer keystrokes.
cat data/modsep.dkvp
a:1;b:2;c:3
a:4;b:5;c:6

mlr --fs ';' --ps : cut -o -f c,a,b data/modsep.dkvp
c:3;a:1;b:2
c:6;a:4;b:5
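As a rough model of the input/output split, the following Python sketch reads DKVP with one set of separators and writes it with another. All function names are hypothetical, `reorder` only imitates the field reordering of `cut -o -f`, and the reader assumes every field contains the pair separator.

```python
def read_dkvp(text, irs="\n", ifs=",", ips="="):
    """Parse DKVP text using the input separators."""
    records = []
    for line in text.split(irs):
        if line:
            record = {}
            for field in line.split(ifs):
                key, _, value = field.partition(ips)
                record[key] = value
            records.append(record)
    return records


def write_dkvp(records, ors="\n", ofs=",", ops="="):
    """Format records using the output separators."""
    return "".join(
        ofs.join(f"{key}{ops}{value}" for key, value in record.items()) + ors
        for record in records
    )


def reorder(record, names):
    """Keep only the named fields, in the order given."""
    return {name: record[name] for name in names if name in record}


# Different separators on input and output.
records = read_dkvp("a=1,b=2,c=3\na=4,b=5,c=6\n", ifs=",", ips="=")
records = [reorder(r, ["c", "a", "b"]) for r in records]
print(write_dkvp(records, ofs=";", ops=":"), end="")
# c:3;a:1;b:2
# c:6;a:4;b:5

# The same non-default separators on both sides.
fs, ps = ";", ":"
records = read_dkvp("a:1;b:2;c:3\na:4;b:5;c:6\n", ifs=fs, ips=ps)
records = [reorder(r, ["c", "a", "b"]) for r in records]
print(write_dkvp(records, ofs=fs, ops=ps), end="")
# c:3;a:1;b:2
# c:6;a:4;b:5
```

Keeping the reader and the writer separate is what lets the two sets of separators differ; using one value for both sides is just the special case where they coincide.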
Multi-character separators
All separators can be multi-character, except for file formats which don't allow parameterization (see below). For CSV, `IRS` must be `\n` and `IFS` must be a single character; CSV-lite doesn't have these restrictions.
mlr --ifs ';' --ips : --ofs ';;;' --ops := cut -o -f c,a,b data/modsep.dkvp
c:=3;;;a:=1;;;b:=2
c:=6;;;a:=4;;;b:=5
If your data has field separators which are one or more consecutive spaces, you can use `--ifs space --repifs`.
More generally, the `--repifs` flag means that multiple successive occurrences of the field separator count as one. For example, in CSV data we often signify nulls by empty strings, e.g. `2,9,,,,,6,5,4`. On the other hand, if the field separator is a space, it might be more natural to parse `2 4  5` the same as `2 4 5`: `--repifs --ifs ' '` lets this happen. In fact, the `--ipprint` option is internally implemented in terms of `--repifs`.
For example:
cat data/extra-spaces.txt
oh say can you
see by the dawn's
early light what so

mlr --ifs ' ' --repifs --inidx --oxtab cat data/extra-spaces.txt
1 oh
2 say
3 can
4 you

1 see
2 by
3 the
4 dawn's

1 early
2 light
3 what
4 so
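The effect of treating repeated separators as one can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Miller's code: `split_fields` is a hypothetical helper, and dropping empty fields at the start and end of the line is an assumption of the sketch.

```python
def split_fields(line, ifs=" ", repifs=False):
    """Split a line on the field separator, optionally collapsing repeats."""
    fields = line.split(ifs)
    if repifs:
        # A run of separators leaves empty strings between them;
        # discarding those makes the run count as a single separator.
        fields = [field for field in fields if field != ""]
    return fields


# Without collapsing, consecutive separators delimit empty (null) fields.
print(split_fields("2,9,,,,,6,5,4", ifs=","))
# ['2', '9', '', '', '', '', '6', '5', '4']

# With collapsing, extra spaces make no difference.
print(split_fields("2 4  5", repifs=True))
# ['2', '4', '5']

# Implicitly-indexed (NIDX-style) keys are the 1-up field positions.
fields = split_fields("oh say   can you", repifs=True)
print({str(i): value for i, value in enumerate(fields, start=1)})
# {'1': 'oh', '2': 'say', '3': 'can', '4': 'you'}
```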
Regular-expression separators
`IFS` and `IPS` can be regular expressions: use `--ifs-regex` or `--ips-regex` in place of `--ifs` or `--ips`, respectively.
For fields separated by one or more consecutive spaces, you can use either `--ifs space --repifs` or `--ifs-regex '( )+'`. (But that gets a little tedious, so there are aliases listed below.) Note, however, that `--ifs space --repifs` is about 3x faster than `--ifs-regex '( )+'` -- regular expressions are powerful, but slower.
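A minimal sketch of splitting on a regular-expression separator follows. Python's `re` module stands in for whatever regex engine Miller uses, so regex syntax details may differ, and `split_on_regex` is a hypothetical helper.

```python
import re


def split_on_regex(line, ifs_regex):
    """Split a line wherever the separator regex matches."""
    pattern = re.compile(ifs_regex)
    fields = []
    start = 0
    # finditer is used instead of re.split because re.split would also
    # return the text captured by parenthesized groups such as "( )+".
    for match in pattern.finditer(line):
        if match.end() == match.start():
            continue  # ignore empty matches
        fields.append(line[start:match.start()])
        start = match.end()
    fields.append(line[start:])
    return fields


print(split_on_regex("a  b   c", "( )+"))
# ['a', 'b', 'c']
print(split_on_regex("x \t y", "([ \t])+"))
# ['x', 'y']
```

A fixed-string split only has to search for one literal sequence, which is one plausible reason the non-regex form is the faster of the two.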
Aliases
Many things we'd like to write as separators need to be escaped from the shell -- e.g. `--ifs ';'` or `--ofs '|'`, and so on. You can use the following if you like:
mlr help list-separator-aliases
ascii_esc = "\x1b"
ascii_etx = "\x03"
ascii_fs = "\x1c"
ascii_gs = "\x1d"
ascii_null = "\x00"
ascii_rs = "\x1e"
ascii_soh = "\x01"
ascii_stx = "\x02"
ascii_us = "\x1f"
asv_fs = "\x1f"
asv_rs = "\x1e"
colon = ":"
comma = ","
cr = "\r"
crcr = "\r\r"
crlf = "\r\n"
crlfcrlf = "\r\n\r\n"
equals = "="
lf = "\n"
lflf = "\n\n"
newline = "\n"
pipe = "|"
semicolon = ";"
slash = "/"
space = " "
tab = "\t"
usv_fs = "\xe2\x90\x9f"
usv_rs = "\xe2\x90\x9e"
And for `--ifs-regex` and `--ips-regex`:
mlr help list-separator-regex-aliases
spaces = "( )+"
tabs = "(\t)+"
whitespace = "([ \t])+"
Note that `spaces`, `tabs`, and `whitespace` are already regexes, so you shouldn't use `--repifs` with them. (In fact, the `--repifs` flag is ignored when `--ifs-regex` is provided.)
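Conceptually, an alias is a name looked up in a table, with anything not in the table taken literally. The Python sketch below shows that idea with a subset of the aliases listed above; the lookup functions are hypothetical and say nothing about how Miller resolves aliases internally.

```python
# A subset of the aliases listed above.
SEPARATOR_ALIASES = {
    "colon": ":",
    "comma": ",",
    "crlf": "\r\n",
    "equals": "=",
    "newline": "\n",
    "pipe": "|",
    "semicolon": ";",
    "space": " ",
    "tab": "\t",
}

# Aliases that stand for regular expressions.
REGEX_ALIASES = {
    "spaces": "( )+",
    "tabs": "(\t)+",
    "whitespace": "([ \t])+",
}


def resolve_separator(text):
    """Return the separator for an alias name, or the text itself."""
    return SEPARATOR_ALIASES.get(text, text)


def resolve_regex_separator(text):
    """Return the regex for a regex alias name, or the text itself."""
    return REGEX_ALIASES.get(text, text)


print(repr(resolve_separator("pipe")))           # '|'
print(repr(resolve_separator(";;;")))            # ';;;'
print(repr(resolve_regex_separator("spaces")))   # '( )+'
```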
Command-line flags
Given the above, we have now seen the following flags:
--rs --irs --ors
--fs --ifs --ofs --repifs --ifs-regex
--ps --ips --ops --ips-regex
See also the separator-flags section.
DSL built-in variables
Miller exposes read-only built-in variables named `IRS`, `ORS`, `IFS`, `OFS`, `IPS`, and `OPS`. Unlike in AWK, you can't set these in begin-blocks -- their values indicate what you specified at the command line -- so their use is limited.
mlr --ifs , --ofs ';' --ips = --ops : --from data/a.dkvp put '$d = ">>>" . IFS . "|||" . OFS . "<<<"'
a:1;b:2;c:3;d:>>>,|||;<<<
a:4;b:5;c:6;d:>>>,|||;<<<
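The example reads more easily once you see that the new field's value is built from the input and output field separators, and is then written out using the output separators. A small Python imitation, where `add_separator_field` is a hypothetical function that mirrors only this one `put` expression:

```python
def add_separator_field(line, ifs=",", ofs=";", ips="=", ops=":"):
    """Parse one DKVP line, add field d built from IFS and OFS, and re-emit it."""
    record = dict(field.split(ips, 1) for field in line.split(ifs))
    # The separators are only read here, never assigned.
    record["d"] = ">>>" + ifs + "|||" + ofs + "<<<"
    return ofs.join(f"{key}{ops}{value}" for key, value in record.items())


print(add_separator_field("a=1,b=2,c=3"))
# a:1;b:2;c:3;d:>>>,|||;<<<
```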
Which separators apply to which file formats
Notes:
- CSV IRS and ORS must be newline, and CSV IFS must be a single character. (CSV-lite does not have these restrictions.)
- TSV IRS and ORS must be newline, and TSV IFS must be a tab. (TSV-lite does not have these restrictions.)
- See the CSV section for information about ASV and USV.
- JSON: ignores all separator flags from the command line.
- Headerless CSV overlaps quite a bit with NIDX format using comma for IFS. See also the page on CSV with and without headers.
- For XTAB, the record separator is a repetition of the field separator. For example, if one record has `x=1,y=2` and the next has `x=3,y=4`, and OFS is newline, then output lines are `x 1`, then `y 2`, then an extra newline, then `x 3`, then `y 4`. This means: to customize XTAB, set `OFS` rather than `ORS`.
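The XTAB note can be checked with a short Python sketch in which each field ends with OFS and one more OFS sits between records. `write_xtab` is a hypothetical helper, and it leaves out any alignment padding between keys and values.

```python
def write_xtab(records, ofs="\n", ops=" "):
    """Format records XTAB-style: one field per OFS, an extra OFS between records."""
    blocks = []
    for record in records:
        lines = [f"{key}{ops}{value}" for key, value in record.items()]
        blocks.append(ofs.join(lines) + ofs)
    # Joining the per-record blocks with OFS adds the extra separator
    # that marks the boundary between two records.
    return ofs.join(blocks)


print(write_xtab([{"x": 1, "y": 2}, {"x": 3, "y": 4}]), end="")
# x 1
# y 2
#
# x 3
# y 4
```

There is no ORS parameter at all in this sketch, which is why changing the field separator is the only way to change how records are delimited.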
| | RS | FS | PS |
|---|---|---|---|
| CSV | Always `\n`; not alterable * | Default `,`; must be single-character | None |
| TSV | Always `\n`; not alterable * | Default `\t`; must be single-character | None |
| CSV-lite | Default `\n` * | Default `,` | None |
| TSV-lite | Default `\n` * | Default `\t` | None |
| JSON | N/A; records are between `{` and `}` | Always `,`; not alterable | Always `:`; not alterable |
| DKVP | Default `\n` | Default `,` | Default `=` |
| NIDX | Default `\n` | Default space | None |
| XTAB | Not used; records are separated by an extra FS | `\n` * | Default: space with repeats |
| PPRINT | Default `\n` * | Space with repeats | None |
| Markdown | Always `\n`; not alterable * | One or more spaces, then `\|`, then one or more spaces; not alterable | None |

\* or `\r\n` on Windows