Parsing and formatting fields¶
Miller offers several ways to split strings into pieces (parsing them), and to put things together into a string (formatting them).
Splitting and joining with the same separator¶
One pattern we often have is items separated by the same separator, e.g. a field with value
1;2;3;4
-- with a ;
between every pair of items. There are several useful
DSL functions for splitting
a string into pieces, and joining pieces into a string.
For example, suppose we have a CSV file like this:
cat data/split1.csv
name,nicknames,codes Alice,"Allie,Skater","1,3,5" Robert,"Bob,Bobby,Biker","2,4,6"
mlr --icsv --ojson cat data/split1.csv
[ { "name": "Alice", "nicknames": "Allie,Skater", "codes": "1,3,5" }, { "name": "Robert", "nicknames": "Bob,Bobby,Biker", "codes": "2,4,6" } ]
Then we can use the splita
function to split the
nicknames
string into an array of strings:
mlr --icsv --ojson --from data/split1.csv put '$nicknames = splita($nicknames, ",")'
[ { "name": "Alice", "nicknames": ["Allie", "Skater"], "codes": "1,3,5" }, { "name": "Robert", "nicknames": ["Bob", "Bobby", "Biker"], "codes": "2,4,6" } ]
Likewise we can split the codes
field. Since these look like numbers, we can again use splita
which tries to type-infer ints and floats when it finds them -- or, we can use
splitax to ask for the string to be split up into
substrings, with no type inference:
mlr --icsv --ojson --from data/split1.csv put '$codes = splita($codes, ",")'
[ { "name": "Alice", "nicknames": "Allie,Skater", "codes": [1, 3, 5] }, { "name": "Robert", "nicknames": "Bob,Bobby,Biker", "codes": [2, 4, 6] } ]
mlr --icsv --ojson --from data/split1.csv put '$codes = splitax($codes, ",")'
[ { "name": "Alice", "nicknames": "Allie,Skater", "codes": ["1", "3", "5"] }, { "name": "Robert", "nicknames": "Bob,Bobby,Biker", "codes": ["2", "4", "6"] } ]
We can do operations on the array, then use joinv to put them back together:
mlr --icsv --ojson --from data/split1.csv put ' $codes = splita($codes, ","); # split into array of integers $codes = apply($codes, func(e) { return e * 100 }); # do math on the array of integers $codes = joinv($codes, ","); # join the updated array back into a string '
[ { "name": "Alice", "nicknames": "Allie,Skater", "codes": "100,300,500" }, { "name": "Robert", "nicknames": "Bob,Bobby,Biker", "codes": "200,400,600" } ]
mlr --csv --from data/split1.csv put ' $codes = splita($codes, ","); # split into array of integers $codes = apply($codes, func(e) { return e * 100 }); # do math on the array of integers $codes = joinv($codes, ","); # join the updated array back into a string '
name,nicknames,codes Alice,"Allie,Skater","100,300,500" Robert,"Bob,Bobby,Biker","200,400,600"
The full list of split functions includes splita, splitax, splitkv, splitkvx, splitnv, and splitnx. The flavors have to to with what the output is -- arrays or maps -- and whether or not type-inference is done.
The full list of join functions includes joink, joinv, and joinkv. Here the flavors have to do with whether we put array/map keys, values, or both into the resulting string.
Example: shortening hostnames¶
Suppose you want to just keep the first two components of the hostnames:
cat data/hosts.csv
host,status xy01.east.acme.org,up ab02.west.acme.org,down ac91.west.acme.org,up
Using the splita
and
joinv
functions, along with
array slicing, we get
mlr --csv --from data/hosts.csv put '$host = joinv(splita($host, ".")[1:2], ".")'
host,status xy01.east,up ab02.west,down ac91.west,up
Flatten/unflatten: representing arrays in CSV¶
In the above examples, when we split a string field into an array, we used JSON output. That's because JSON permits nested data structures. For CSV output, Miller uses, by default, a flatten/unflatten strategy: array-valued fields are turned into multiple CSV columns. For example:
mlr --icsv --ojson --from data/split1.csv put '$codes = splitax($codes, ",")'
[ { "name": "Alice", "nicknames": "Allie,Skater", "codes": ["1", "3", "5"] }, { "name": "Robert", "nicknames": "Bob,Bobby,Biker", "codes": ["2", "4", "6"] } ]
mlr --csv --from data/split1.csv put '$codes = splitax($codes, ",")'
name,nicknames,codes.1,codes.2,codes.3 Alice,"Allie,Skater",1,3,5 Robert,"Bob,Bobby,Biker",2,4,6
See the flatten/unflatten: converting between JSON and tabular formats¶ for more on this default behavior, including how to override it when you prefer.
Splitting and joining with different separators¶
The above is well and good when a string contains pieces with multiple instances of the same
separator. However sometimes we have input like 5-18:53:20
. Here we can use the more flexible
unformat and
format DSL functions. (As above, there's an
unformatx function if you want Miller to just split
the string into string pieces without trying to type-infer them.)
cat data/split2.csv
stamp,event 5-18:53:20,open 5-18:53:22,close 5-19:07:34,open 5-19:07:56,close
mlr --icsv --ojson --from data/split2.csv put '$pieces = unformat("{}-{}:{}:{}", $stamp)'
[ { "stamp": "5-18:53:20", "event": "open", "pieces": [5, 18, 53, 20] }, { "stamp": "5-18:53:22", "event": "close", "pieces": [5, 18, 53, 22] }, { "stamp": "5-19:07:34", "event": "open", "pieces": [5, 19, "07", 34] }, { "stamp": "5-19:07:56", "event": "close", "pieces": [5, 19, "07", 56] } ]
mlr --icsv --opprint --from data/split2.csv put ' pieces = unformat("{}-{}:{}:{}", $stamp); $description = format("{} day(s) {} hour(s) {} minute(s) {} seconds(s)", pieces[1], pieces[2], pieces[3], pieces[4]); '
stamp event description 5-18:53:20 open 5 day(s) 18 hour(s) 53 minute(s) 20 seconds(s) 5-18:53:22 close 5 day(s) 18 hour(s) 53 minute(s) 22 seconds(s) 5-19:07:34 open 5 day(s) 19 hour(s) 07 minute(s) 34 seconds(s) 5-19:07:56 close 5 day(s) 19 hour(s) 07 minute(s) 56 seconds(s)
Using regular expressions and capture groups¶
If you prefer regular expressions, those can be used in this context as well:
mlr --icsv --opprint --from data/split2.csv put ' if ($stamp =~ "(\d+)-(\d+):(\d+):(\d+)") { $description = "\1 day(s) \2 hour(s) \3 minute(s) \4 seconds(s)"; } '
stamp event description 5-18:53:20 open 5 day(s) 18 hour(s) 53 minute(s) 20 seconds(s) 5-18:53:22 close 5 day(s) 18 hour(s) 53 minute(s) 22 seconds(s) 5-19:07:34 open 5 day(s) 19 hour(s) 07 minute(s) 34 seconds(s) 5-19:07:56 close 5 day(s) 19 hour(s) 07 minute(s) 56 seconds(s)
Special case: timestamps¶
Timestamps are complex enough to merit their own handling: see the DSL datetime/timezone functions page. in particular the strptime and strftime functions.
Special case: dhms and seconds¶
For historical reasons, Miller has a way to represent seconds in a more human-readable format, using days, hours, minutes, and seconds. For example:
mlr --c2p --from data/sec2dhms.csv put '$dhms = sec2dhms($sec)'
sec dhms 1 1s 100 1m40s 10000 2h46m40s 1000000 11d13h46m40s
Please see sec2dhms and dhms2sec
Special case: financial values¶
One way to handle currencies is to sub out the currency marker (like $
) as well as commas:
echo 'd=$1234.56' | mlr put '$d = float(gsub(ssub($d, "$", ""), ",", ""))'
d=1234.56
Nesting and unnesting fields¶
Sometimes we want not to split strings into arrays, but rather, to use them to create multiple records.
For example:
mlr --c2p cat data/split1.csv
name nicknames codes Alice Allie,Skater 1,3,5 Robert Bob,Bobby,Biker 2,4,6
mlr --c2p nest --evar , -f nicknames data/split1.csv
name nicknames codes Alice Allie 1,3,5 Alice Skater 1,3,5 Robert Bob 2,4,6 Robert Bobby 2,4,6 Robert Biker 2,4,6
See documentation on the nest verb for general information on how to do this.