Quick links: Flags Verbs Functions Glossary Release docs

Flatten/unflatten: converting between JSON and tabular formats¶

Miller has long supported reading and writing multiple file formats including CSV and JSON, as well as converting back and forth between them. Two things new in Miller 6, though, are that arrays are now fully supported, and that record values are typed throughout Miller's processing chain from input through verbs to output -- which includes improved handling for maps and arrays as record values.

This raises the question, though, of how to handle maps and arrays as record values. For JSON files, this is easy -- JSON is a nested format where values can be maps or arrays, which can contain other maps or arrays, and so on, with the nesting happily indicated by curly braces:

cat data/map-values.json

{
  "a": 1,
  "b": {"x": 2, "y": 3}
}
{
  "a": 4,
  "b": {"x": 5, "y": 6}
}

cat data/map-values-nested.json

{
  "a": 1,
  "b": {"s": {"w": 2, "x": 3}, "t": {"y": 4, "z": 5}}
}
{
  "a": 6,
  "b": {"s": {"w": 7, "x": 8}, "t": {"y": 9, "z": 10}}
}

How can we represent these in CSV files?

Miller's non-JSON formats, such as CSV, are all non-nested -- a cell in a CSV row can't contain another entire row. As we'll see in this section, there are two main ways to flatten nested data structures down to individual CSV cells -- either by key-spreading (which is the default), or by JSON-stringifying:

Key-spreading is when the single map-valued field b={"x": 2, "y": 3} spreads into multiple fields b.x=2,b.y=3;
JSON-stringifying is when the single map-valued field "b": {"x": 2, "y": 3} becomes the single string-valued field b="{\"x\":2,\"y\":3}".

Miller intends to provide intuitive default behavior for these conversions, while also providing you with more control when you need it.

Converting maps between JSON and non-JSON¶

Let's first look at the default behavior with map-valued fields. Miller's default behavior is to spread the map values into multiple keys -- using Miller's flatsep separator, which defaults to . -- to join the original record key with map keys:

cat data/map-values.json

{
  "a": 1,
  "b": {"x": 2, "y": 3}
}
{
  "a": 4,
  "b": {"x": 5, "y": 6}
}

Flattened to CSV format:

mlr --ijson --ocsv cat data/map-values.json

a,b.x,b.y
1,2,3
4,5,6

Flattened to pretty-print format:

mlr --ijson --opprint cat data/map-values.json

a b.x b.y
1 2   3
4 5   6

Using flatten-separator : instead of the default .:

mlr --ijson --opprint --flatsep : cat data/map-values.json

a b:x b:y
1 2   3
4 5   6

If the maps are more deeply nested, each level of map keys is joined in:

cat data/map-values-nested.json

{
  "a": 1,
  "b": {"s": {"w": 2, "x": 3}, "t": {"y": 4, "z": 5}}
}
{
  "a": 6,
  "b": {"s": {"w": 7, "x": 8}, "t": {"y": 9, "z": 10}}
}

mlr --ijson --opprint cat data/map-values-nested.json

a b.s.w b.s.x b.t.y b.t.z
1 2     3     4     5
6 7     8     9     10

Unflattening is simply the reverse -- from non-JSON back to JSON:

cat data/map-values.json

{
  "a": 1,
  "b": {"x": 2, "y": 3}
}
{
  "a": 4,
  "b": {"x": 5, "y": 6}
}

mlr --ijson --ocsv cat data/map-values.json

a,b.x,b.y
1,2,3
4,5,6

mlr --ijson --ocsv cat data/map-values.json | mlr --icsv --ojson cat

[
{
  "a": 1,
  "b": {
    "x": 2,
    "y": 3
  }
},
{
  "a": 4,
  "b": {
    "x": 5,
    "y": 6
  }
}
]

Converting arrays between JSON and non-JSON¶

If the input data contains arrays, these are also flattened similarly: the 1-up array indices 1,2,3,... become string keys "1","2","3",...:

cat data/array-values.json

{
  "a": 1,
  "b": [2, 3]
}
{
  "a": 4,
  "b": [5, 6]
}

mlr --ijson --opprint cat data/array-values.json

a b.1 b.2
1 2   3
4 5   6

If the arrays are more deeply nested, each level of arrays keys is joined in:

cat data/array-values-nested.json

{
  "a": 1,
  "b": [[2, 3], [4, 5]]
}
{
  "a": 6,
  "b": [[7, 8], [9, 10]]
}

mlr --ijson --opprint cat data/array-values-nested.json

a b.1.1 b.1.2 b.2.1 b.2.2
1 2     3     4     5
6 7     8     9     10

In the nested-data examples shown here, nested map values are shown containing maps, and nested array values are shown containing arrays -- of course (even though not shown here) nested map values can contain arrays, and vice versa.

Unflattening arrays is, again, simply the reverse -- from non-JSON back to JSON:

cat data/array-values.json

{
  "a": 1,
  "b": [2, 3]
}
{
  "a": 4,
  "b": [5, 6]
}

mlr --ijson --ocsv cat data/array-values.json

a,b.1,b.2
1,2,3
4,5,6

mlr --ijson --ocsv cat data/array-values.json | mlr --icsv --ojson cat

[
{
  "a": 1,
  "b": [2, 3]
},
{
  "a": 4,
  "b": [5, 6]
}
]

Auto-inferencing of arrays on unflatten¶

Note that the CSV field names b.x and b.y aren't too different from b.1 and b.2. Miller has the heuristic that if it's unflattening and gets a map with keys "1", "2", etc. -- starting with "1", consecutively, and with no gaps -- it turns that back into an array. This is precisely to undo the flatten conversion. However, it may (or may not) be surprising:

cat data/consecutive.csv

a.1,a.2,a.3
4,5,6

mlr --c2j cat data/consecutive.csv

[
{
  "a": [4, 5, 6]
}
]

cat data/non-consecutive.csv

a.1,a.3,a.5
4,5,6

mlr --c2j cat data/non-consecutive.csv

[
{
  "a": {
    "1": 4,
    "3": 5,
    "5": 6
  }
}
]

Non-inferencing cases¶

An additional heuristic is that if a field name starts with a ., ends with a ., or has two or more consecutive . characters, no attempt is made to unflatten it on conversion from non-JSON to JSON.

cat data/flatten-dots.csv

a,b.,.c,.,d..e,f.g
1,2,3,4,5,6

mlr --icsv --oxtab cat data/flatten-dots.csv

a    1
b.   2
.c   3
.    4
d..e 5
f.g  6

mlr --icsv --ojson cat data/flatten-dots.csv

[
{
  "a": 1,
  "b.": 2,
  ".c": 3,
  ".": 4,
  "d..e": 5,
  "f": {
    "g": 6
  }
}
]
mlr: field name "b." contains separator "." but cannot be auto-unflattened; treating as a literal string. Use --no-auto-unflatten to suppress this warning.
mlr: field name ".c" contains separator "." but cannot be auto-unflattened; treating as a literal string. Use --no-auto-unflatten to suppress this warning.
mlr: field name "." contains separator "." but cannot be auto-unflattened; treating as a literal string. Use --no-auto-unflatten to suppress this warning.
mlr: field name "d..e" contains separator "." but cannot be auto-unflattened; treating as a literal string. Use --no-auto-unflatten to suppress this warning.

Manual control¶

To see what our options are for manually controlling flattening and unflattening (if the defaults aren't working for us in a particular situation), let's first look a little into how they're implemented.

There are two verbs called flatten and unflatten.
When the output format is not JSON, if you've specified mlr ... cat then sort ... (some chain of verbs) then Miller appends, in effect, then flatten to the end of the chain.
- This behavior is on by default but it can be suppressed using the --no-auto-flatten flag.
When the output format is JSON and the input format is not JSON, then (similarly) appends, in effect, then unflatten to the end of the chain.
- This behavior is on by default but it can be suppressed using the --no-auto-unflatten flag.

Note in particular that auto-flatten happens even when the input format and the output format are both non-JSON, e.g. even for CSV-to-CSV processing. This is because map-valued/array-valued fields can be produced using DSL statements:

cat data/hostnames.csv

host,status
apoapsis.east.our.org,up
nadir.west.our.org,down

Using JSON output, we can see that splita has produced an array-valued field named components:

mlr --icsv --ojson --from data/hostnames.csv put '$components = splita($host, ".")'

[
{
  "host": "apoapsis.east.our.org",
  "status": "up",
  "components": ["apoapsis", "east", "our", "org"]
},
{
  "host": "nadir.west.our.org",
  "status": "down",
  "components": ["nadir", "west", "our", "org"]
}
]

Using CSV output, with default auto-flatten, we get components.1 through components.4:

mlr --csv --from data/hostnames.csv put '$components = splita($host, ".")'

host,status,components.1,components.2,components.3,components.4
apoapsis.east.our.org,up,apoapsis,east,our,org
nadir.west.our.org,down,nadir,west,our,org

Using CSV output, without default auto-flatten, we get a JSON-stringified encoding of the components field:

mlr --csv --from data/hostnames.csv --no-auto-flatten put '$components = splita($host, ".")'

host,status,components
apoapsis.east.our.org,up,"[""apoapsis"", ""east"", ""our"", ""org""]"
nadir.west.our.org,down,"[""nadir"", ""west"", ""our"", ""org""]"

Now suppose we ran this

mlr --icsv --oxtab --from data/hostnames.csv --no-auto-flatten put '
  $a = splita($host, ".");
  $b = splita($host, ".");
'

host   apoapsis.east.our.org
status up
a      ["apoapsis", "east", "our", "org"]
b      ["apoapsis", "east", "our", "org"]

host   nadir.west.our.org
status down
a      ["nadir", "west", "our", "org"]
b      ["nadir", "west", "our", "org"]

into a file data/hostnames.xtab:

cat data/hostnames.xtab

host   apoapsis.east.our.org
status up
a      ["apoapsis", "east", "our", "org"]
b      ["apoapsis", "east", "our", "org"]

host   nadir.west.our.org
status down
a      ["nadir", "west", "our", "org"]
b      ["nadir", "west", "our", "org"]

This was written with --no-auto-unflatten so we need to manually revive the array-valued fields, if we choose -- here, we can JSON-parse the a field and leave b JSON-stringified:

mlr --ixtab --ojson json-parse -f a data/hostnames.xtab

[
{
  "host": "apoapsis.east.our.org",
  "status": "up",
  "a": ["apoapsis", "east", "our", "org"],
  "b": "[\"apoapsis\", \"east\", \"our\", \"org\"]"
},
{
  "host": "nadir.west.our.org",
  "status": "down",
  "a": ["nadir", "west", "our", "org"],
  "b": "[\"nadir\", \"west\", \"our\", \"org\"]"
}
]

See also the JSON parse and stringify section section for more on this -- for example, when Miller is producing SQL-query output from tables having one or more columns that contain JSON-encoded data.