Glossary¶
$*¶
All key-value pairs in the current record, as a map.
For example, if myfile.csv has header line a,b,c, and the third line after the header is 7,8,9, then the third record processed by Miller will be the ordered list of key-value pairs a=7, b=8, c=9, and $* will be (using JSON formatting) {"a": 7, "b": 8, "c": 9}.
@*¶
All out-of-stream variables, as a map. Synonymous with all.
For example, if out-of-stream variables @count = 3 and @sum = 55 have been assigned, then @* will be (using JSON formatting) {"count": 3, "sum": 55}.
absent¶
The data type obtained from accessing a missing key, e.g. $x when the current record has no field named x. See the null-data page.
all¶
All out-of-stream variables, as a map.
Synonymous with @*.
array¶
A list of values, indexable by integers starting with 1 for the first value.
auxents¶
Stands for auxiliary entry points. These are effectively separate programs, but bundled together inside the Miller executable for convenience. For example, mlr termcvt converts from CR-LF to LF format or vice versa. See the auxiliary-commands page for more information.
begin¶
A keyword in the Miller programming language indicating the start of a begin block within an instance of the put or filter verb. See begin/end blocks.
block¶
A group of statements between { and } in the Miller programming language, including if-statement bodies, for-loop bodies, begin-block bodies, end-block bodies, etc.
bool¶
A keyword for type declaration, used for variables taking boolean (true/false) values.
break¶
Used for exiting a for-loop or while-loop earlier than its top-of-loop continuation expression would have specified.
BZIP2 / .bz2¶
A data-compression format supported by Miller.
Files compressed using BZIP2 compression normally end in .bz2.
call¶
A keyword used for invoking a user-defined subroutine.
colorization¶
Miller uses configurable colors for some output to the terminal. See the output-colorization page for more information.
compression¶
A technique for having disk files take up less space. See the compressed-data page for information on how Miller handles this.
continue¶
Used for jumping to the next iteration of a for-loop or while-loop without executing the remaining loop-body statements on the current iteration.
CSV¶
Stands for comma-separated values. A popular file format for tabular data, which Miller supports.
Cygwin¶
A collection of GNU and open-source tools which provide functionality similar to a Linux distribution on Windows. Miller can run inside Cygwin, but does not need to. See Miller on Windows.
data line¶
Any line after the first (header) line of a CSV or TSV file. The header line contains the keys for all records in the file; data lines contain values to be zipped together with those keys to form records. See record.
Note that a data line can span more than one line of text, in the sense that it can contain embedded newlines within double quotes: see also RFC 4180 and the Miller CSV documentation.
delimiter¶
A delimiter is something that goes in between each item in a list of things.
For example, writing an array as [1,2,3,4,5], we can say that the comma character delimits the list items.
More specifically, in terms of Miller file formats, delimiter can be used as a synonym for separator.
division¶
Miller uses pythonic division for quotients of integers, with the exception that integer divided by integer is integer (not float) if the quotient can be represented exactly as an integer.
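The rule above can be modeled outside Miller. The following Python sketch is illustrative only: the function name miller_style_divide is hypothetical, it covers just the integer-by-integer case described here, and it does not attempt to model division by zero or integer overflow.

```python
def miller_style_divide(a: int, b: int):
    """Model of the int/int rule: exact quotients stay int, others become float.

    Hypothetical helper for illustration; this is not Miller's implementation.
    """
    if a % b == 0:
        return a // b   # exact quotient: integer result
    return a / b        # inexact quotient: floating-point result


print(miller_style_divide(6, 2))  # 3
print(miller_style_divide(7, 2))  # 3.5
```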
DKVP¶
Stands for delimited key-value pairs. A Miller-specific file format, with each line of a file being of the form x=1,y=2,z=3. For historical reasons, this is Miller's default format unless flags such as --csv are supplied. You can also make CSV your default format using a .mlrrc file.
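As a rough illustration of how a DKVP line maps to a record, here is a minimal Python sketch. The function name parse_dkvp_line and its parameter names are hypothetical, values are left as strings (Miller itself infers numeric types), and the sketch assumes every field contains the pair separator.

```python
def parse_dkvp_line(line: str, ifs: str = ",", ips: str = "=") -> dict:
    """Split one DKVP line into an ordered map of key-value pairs.

    ifs separates fields from one another; ips separates a key from its value.
    Assumes (for illustration) that every field contains the pair separator.
    """
    record = {}
    for pair in line.split(ifs):
        key, _, value = pair.partition(ips)
        record[key] = value
    return record


print(parse_dkvp_line("x=1,y=2,z=3"))  # {'x': '1', 'y': '2', 'z': '3'}
```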
do¶
A keyword which is used to indicate the start of a do-while loop in the Miller programming language.
DSL¶
Stands for domain-specific language. The Miller programming language is embedded within the put and filter verbs. It's a language with its own syntax and semantics; the Miller executable does not embed, say, Python or Lua as a language for put and filter statements. This makes the Miller programming language an embedded domain-specific language, or domain-specific language, or (more briefly) a DSL.
dump¶
A keyword in the Miller programming language which is used for printing variables to the screen (namely, to stdout). Largely synonymous with print, except that print with no arguments prints nothing, while dump with no arguments displays all currently defined out-of-stream variables.
See also dump statements.
edump¶
Same as dump, except it prints to stderr rather than stdout.
See also dump statements.
elif¶
A keyword which is used to indicate the else-if-part of an if-statement in the Miller programming language. In some languages this is elsif or else if; in Miller's programming language, elif.
else¶
A keyword which is used to indicate the else-part of an if-statement in the Miller programming language. See also elif.
emit, emitf, emitp¶
Three keywords in the Miller programming language for injecting new records into the record stream using the put or filter verbs.
See also the emit-statements section.
empty¶
Refers to the string with zero characters. For example, in a CSV file with header line a,b,c and data line ,, all three fields are empty; with data line 1,2, the first two fields (a and b) are not empty and the third field c is empty.
end¶
A keyword in the Miller programming language indicating the start of an end block within an instance of the put or filter verb. See begin/end blocks.
ENV¶
A keyword in the Miller programming language for accessing a readable/writable map of environment variables.
eprint¶
Same as print, except it prints to stderr rather than stdout.
eprintn¶
Same as printn, except it prints to stderr rather than stdout.
false¶
A keyword in the Miller programming language for the boolean literal; signified by False in Python; in some languages (such as C) signified by the zero integer value.
field¶
A single key-value pair within a record.
file format¶
A standard way for encoding information within a text file. Examples include CSV, TSV, and JSON. See the file-formats page for information on which file formats Miller handles.
FILENAME¶
A built-in variable in the Miller programming language referring to the name of the current file being processed as Miller streams through your data.
See the section on built-in variables.
FILENUM¶
A built-in variable in the Miller programming language referring to the one-up index of the current file being processed as Miller streams through your data.
See the section on built-in variables.
filter¶
Along with put, one of the Miller verbs which use the Miller programming language.
Also, a keyword which you can use within put statements: see the page on DSL filter statements.
See the DSL overview.
flatten¶
To convert map-valued and/or array-valued fields to something representable in CSV and other non-JSON file formats -- either by JSON-stringifying them or by key spreading. See the flatten/unflatten page.
See also unflatten.
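To make key spreading concrete, here is a minimal Python sketch of flattening nested maps into a single level. It is an illustration, not Miller's implementation: the function name flatten is hypothetical, the "." separator is an assumption made for this example, and only map-valued fields are handled (arrays and empty maps are passed through unchanged).

```python
def flatten(record: dict, sep: str = ".") -> dict:
    """Spread nested map keys into one level, joining key paths with sep.

    Illustrative sketch: the separator is an assumption of this example,
    and list values and empty maps are left as-is.
    """
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict) and value:
            # Recurse, then prefix each inner key with the outer key.
            for subkey, subvalue in flatten(value, sep).items():
                flat[f"{key}{sep}{subkey}"] = subvalue
        else:
            flat[key] = value
    return flat


nested = {"id": 1, "req": {"method": "GET", "size": {"bytes": 5}}}
print(flatten(nested))  # {'id': 1, 'req.method': 'GET', 'req.size.bytes': 5}
```

The flattened record has only scalar values, so it can be written as one row of a CSV file with the spread keys as column names.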
float¶
A floating-point number as a value in Miller records, and in the Miller programming language. Floats interconvert seamlessly with integers using Miller's arithmetic rules, so usually you only need to think of numbers, rather than ints and floats separately.
Also, float is a keyword for type declaration.
FNR¶
Like NR but resets to 1 at the start of each file in the input stream. If you have mlr ... a.csv b.csv where a.csv has 10 records and b.csv has 20, then FNR will be 10 on the last record of a.csv, then it will have value 1 on the first record of b.csv.
See also the section on built-in variables.
for¶
A keyword which is used to indicate the start of a for-loop in the Miller programming language.
format¶
See file format.
func¶
A keyword used for defining user-defined functions in the Miller programming language.
funct¶
A type declaration used for local variables, function arguments, and function return values which are (named) user-defined functions or (unnamed) function literals.
See the variables page for examples.
function¶
A bit of callable code in the Miller programming language which takes zero or more arguments, and optionally returns a value.
See the page on built-in functions to see functions which are present in Miller.
See the page on user-defined functions for how to write your own functions.
function literal¶
A function without a name, like func(a,b) {return a + 2*b + 7}, assigned to a local variable or passed to a higher-order function like apply or sort. See the section on function literals.
GZIP / .gz¶
A data-compression format supported by Miller.
Files compressed using GZIP compression normally end in .gz.
hashmap¶
See map.
header line¶
The first line of a CSV or TSV file. It contains the keys for all records in the file; subsequent lines contain values to be zipped together with those keys to form records. See record.
Note that a header line can span more than one line of text, in the sense that it can contain embedded newlines within double quotes: see also RFC 4180 and the Miller CSV section.
heterogeneity¶
Referring to data where not all records have the same keys, in the same order. See the record-heterogeneity page.
higher-order function¶
A function which takes another function as an argument, such as select or apply. See the page on higher-order functions.
homogeneity¶
Referring to data where all records have the same keys, in the same order. See the record-heterogeneity page.
if¶
A keyword which is used to indicate the start of an if-statement in the Miller programming language.
IFS¶
Stands for input field separator. See the separators page.
in¶
A keyword in the Miller programming language for single-variable for-loops and key-value for-loops.
in-place¶
Indicates that a file will be modified after processing. Miller's default mode is to read one or more files (or standard input on a pipe) and to write to standard output. This normally goes to the terminal, but can be redirected to another pipe, or an output file -- for example, mlr --csv sort myfile.csv (prints sorted output to the terminal), mlr --csv sort myfile.csv | some-other-command, or mlr --csv sort myfile.csv > newfile.csv.
In all these cases, the original myfile.csv is left unmodified.
But using Miller's -I flag, we can update the original file: e.g. mlr -I --csv sort myfile.csv won't print the sorted output to the terminal, but rather will write it back to myfile.csv.
See also the section on in-place mode.
int¶
A 64-bit signed integer as a value in Miller records, and in the Miller programming language. Ints interconvert seamlessly with floats using Miller's arithmetic rules, so usually you only need to think of numbers, rather than ints and floats separately.
Also, int is a keyword for type declaration.
IPS¶
Stands for input pair separator. See the separators page.
IRS¶
Stands for input record separator. See the separators page.
JSON¶
Stands for JavaScript object notation. A popular file format for tabular data supported by Miller.
JSON Lines¶
A file format related to JSON, supported by Miller. Key points are that every record is an object written on a single line, without needing to be wrapped in an outermost list. This format helps people interoperate with non-JSON-aware tools in the Unix toolkit which generally operate on lines.
key¶
The string index in a map. Also, the name of a field in a record.
keyword¶
A reserved name in the Miller programming language which you can't use for any other purpose. For example, if, for, and while are keywords; trying to define a local variable if = 3 will result in a parse error.
line¶
A subsequence of a text file in between line-ending symbols such as the special linefeed character. Tools in the Unix toolkit generally operate on lines; Miller is designed to do that (using the NIDX format flags), as well as non-line-oriented formats such as CSV, TSV, JSON, and others.
local variable¶
A variable in the Miller programming language whose extent is limited to the expression in which it appears; contrast out-of-stream variables which endure across the entire record stream. See the section on local variables.
manpage / manual page¶
A form of on-line help which is common in Unix-like operating systems, including MacOS and BSD variants.
If you've installed Miller using your system's package-install tools (versus say building Miller from source), you can probably see Miller's manual page using man mlr at a terminal prompt.
Regardless, you can find the same content within this documentation site.
map¶
A data structure in the Miller programming language containing an ordered sequence of key-value pairs. See the maps page for more information.
Note that Miller operates on records by treating them as maps.
.mlrrc¶
A file you can create, nominally in your home directory, to customize the default flag-settings used by Miller. For example, while Miller's default file format is DKVP, you can make the default format be CSV so that instead of mlr --csv sort myfile.csv you can simply do mlr sort myfile.csv. See the customization page.
MSYS2¶
MSYS2 is a collection of tools and libraries providing an easy-to-use environment for building, installing and running native Windows software. Miller on Windows no longer requires this as of Miller version 6.
M_E¶
A built-in variable in the Miller programming language referring to the mathematical constant e. The M is for math.
M_PI¶
A built-in variable in the Miller programming language referring to the mathematical constant π. The M is for math.
NF¶
Stands for number of fields. A read-only built-in variable in the Miller programming language which shows the number of fields in the current record.
NIDX¶
Stands for numerically indexed. This is a format directive telling Miller to process files one line at a time, splitting lines into fields, with the resulting fields indexed one-up as in the Unix toolkit.
See also the file-formats page.
NR¶
Stands for number of records. Unlike NF, which counts the total number of fields within the current record, NR counts the number of records read so far, counting upward from one, since Miller is streaming. So, on the first record NR will have value 1, on the second record NR will have value 2, and so on.
This is a running count across files, so if you have mlr ... a.csv b.csv where a.csv has 10 records and b.csv has 20, then NR will be 30 on the last record of b.csv.
See also FNR.
See also the section on built-in variables.
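The difference between NR and FNR can be sketched in a few lines of Python. This is a model of the counting behavior only, using the 10-record and 20-record files from the example; the function name number_records is hypothetical and the record contents are stand-ins.

```python
def number_records(files):
    """Yield (filename, NR, FNR) for each record across a sequence of files.

    NR counts records across all files; FNR restarts at 1 for each file.
    """
    nr = 0
    for filename, records in files:
        fnr = 0
        for _ in records:
            nr += 1
            fnr += 1
            yield filename, nr, fnr


# Stand-in inputs: a.csv with 10 records, b.csv with 20 records.
files = [("a.csv", range(10)), ("b.csv", range(20))]
rows = list(number_records(files))

print(rows[9])   # ('a.csv', 10, 10)  last record of a.csv
print(rows[10])  # ('b.csv', 11, 1)   first record of b.csv
print(rows[-1])  # ('b.csv', 30, 20)  last record of b.csv
```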
null¶
This term is used in various programming languages to indicate the absence of something:
for example, neither true nor false, but rather unspecified or no data available here.
Miller has more than one kind: see the page on null/empty/absent data.
num¶
The num keyword is used for type declaration in the Miller programming language. The num type encompasses both int and float. Ints and floats interconvert seamlessly using Miller's arithmetic rules, so usually you only need to think of numbers, rather than ints and floats separately.
OFS¶
Stands for output field separator. See the separators page.
one-up¶
A way of indexing arrays. If x=["a", "b", "c"], then using one-up indexing, x[1] is "a", x[2] is "b", and x[3] is "c". Miller uses one-up indexing.
Contrast zero-up indexing.
See also the arrays page, as well as the page on differences from other programming languages.
oosvar¶
A whimsical shorthand for out-of-stream variable.
OPS¶
Stands for output pair separator. See the separators page.
ORS¶
Stands for output record separator. See the separators page.
Out-of-stream variable¶
Variables, prefixed with the @ sigil, which persist their values across multiple records in the Miller programming language. See out-of-stream variables for more information.
PPRINT¶
A Miller-specific file format for key-value pairs, with columns vertically aligned for easy visual scanning.
print¶
A keyword in the Miller programming language for printing things to the terminal, with final newline printed for you.
See also printn which does not insert the final newline.
See also emit which inserts new records into the record stream.
printn¶
A keyword in the Miller programming language for printing things to the terminal, with no final newline printed for you.
See also print which does insert the final newline.
put¶
Along with filter, one of the Miller verbs which use the Miller programming language.
See the DSL overview.
ragged¶
Referring to data where not all records have the same number of keys, particularly in a malformed-CSV context. See the record-heterogeneity page.
record¶
An ordered list of key-value pairs.
Miller's fundamental streaming operation is to read one record at a time from input file(s) you specify, using some input format; transforming those records using one or more verbs you specify; then printing them out in some output format.
For CSV files, each record gets its keys from the file's header line, zipped together with values from a given data line. For example, if myfile.csv has header line a,b,c, and the third line after the header is 7,8,9, then the third record processed by Miller will be the ordered list of key-value pairs a=7, b=8, c=9.
For JSON files, each record is a JSON object which isn't nested inside another one.
See also the Miller command structure page.
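The zipping of header keys with data-line values can be illustrated with Python's standard csv module. This is a sketch of the idea rather than of Miller's reader: the first two data lines are made up for the example, and values stay as strings here, whereas Miller infers numeric types.

```python
import csv
import io

# Header line a,b,c followed by three data lines; the third is 7,8,9.
text = "a,b,c\n1,2,3\n4,5,6\n7,8,9\n"

reader = csv.reader(io.StringIO(text))
header = next(reader)                                  # keys for every record
records = [dict(zip(header, row)) for row in reader]   # one ordered map per data line

print(records[2])  # {'a': '7', 'b': '8', 'c': '9'}
```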
rectangular¶
Referring to data where all records have the same keys, in the same order. Synonymous with homogeneous. See the record-heterogeneity page.
REPL¶
Stands for read-evaluate-print loop, such as when you invoke python with no arguments: a place where you can type 1+2 and get 3. Miller has a REPL you can use.
return¶
A keyword in the Miller programming language which is used for returning control from a function to its caller, optionally returning a value from the function.
semicolon¶
Semicolons are used to delimit statements in the Miller programming language.
separator¶
Used in two senses:
(1) In some programming languages, such as C, C++, and Java, semicolons are required after every statement; in others such as Python, they're not required at all; in yet others, they're required in between statements but are optional after the last. Miller is in the third category, so we can say that semicolons are separators, not terminators, within the Miller programming language.
(2) Refers to character sequences which separate records from one another (like newlines, sometimes), fields from one another (like commas in CSV), and keys from values in key-value pairs (= or :, perhaps). See the separators page for more information.
sparse¶
Referring to data where not all records have the same keys. See the record-heterogeneity page.
stderr¶
A keyword in the Miller programming language for print, dump, and tee statements indicating that data are to be sent to the standard error.
stdout¶
A keyword in the Miller programming language for print, dump, and tee statements indicating that data are to be sent to the standard output.
str¶
A keyword for type declaration, indicating that a variable is intended to be of type string.
streaming¶
Refers to operations which can be done a record at a time, so (a) output is produced as input records arrive, before end of input stream, and (b) memory usage is typically bounded. The latter means that a streaming processor can operate on data files larger than system memory. Most of Miller's operations are streaming; some (such as sort) need to see all data before producing any output, and are non-streaming. Please see the page on Streaming processing and memory usage.
subr¶
A keyword used for defining a user-defined subroutine.
subroutine¶
A user-definable bit of code in the Miller programming language, intended to be called for its side effects rather than for returning a value.
tee¶
In Unix-like and other systems, a tee is a command which reads standard input and writes both standard output and a specified file -- duplicating its output. The name comes from the T-splitter used in plumbing whose shape looks like the capital letter T.
One particular use-case is to snapshot data at an intermediate point in a processing pipeline -- e.g. thing1 | thing2 | tee output2.dat | thing3 | thing4.
Miller has a tee in two places: (1) a verb you can insert into a Miller then-chain, and (2) an output statement in the Miller programming language. Using the latter, you have the additional option of using a tee-to file name which is variable, perhaps depending on the current record. For example, if you have a large file with an id column, you can split it into several files, one for each distinct id. See the section on tee statements for an example.
terminals¶
These include mlr help, mlr regtest, mlr repl, and mlr version. They aren't verbs but they can be preceded by various command-line flags. They're in contrast to auxents which are effectively standalone programs packaged with Miller.
terminator¶
Used in two senses:
(1) Refers to whichever character sequence terminates a line of text, such as newline/linefeed (LF) or a carriage-return-linefeed pair (CR/LF). See also https://en.wikipedia.org/wiki/Newline.
(2) In some programming languages, such as C, C++, and Java, semicolons are required after every statement; in others such as Python, they're not required at all; in yet others, they're required in between statements but are optional after the last. Miller is in the third category, so we can say that semicolons are separators, not terminators, within the Miller programming language.
toolkit¶
See Unix toolkit.
true¶
A keyword in the Miller programming language for the boolean literal; signified by True in Python; in some languages (such as C) signified by non-zero integer values.
TSV¶
Stands for tab-separated values. A popular file format for tabular data, which Miller supports.
UDF¶
A user-defined function in the Miller programming language.
unflatten¶
To undo the flatten operation, restoring map-valued and/or array-valued fields encoded in CSV and other non-JSON file formats for JSON output. See the flatten/unflatten page.
Unix toolkit¶
The term Unix toolkit refers to a collection of command-line programs present in Unix and Unix-like operating systems, BSD variants, MacOS, etc. Examples include awk, sed, grep, cat, and cut. Common characteristics include processing data files one line at a time, reading input from one or more files, reading input from standard input if no files are specified, writing output to standard output, and connecting the output of one program to the input of another using pipes. Miller is designed explicitly to work well in this paradigm alongside items in the Unix toolkit. Moreover, several of Miller's verbs are designed to imitate some of the programs in the Unix toolkit, but with the ability to operate on richer file formats such as CSV, TSV, JSON, and others.
unnamed function¶
See function literal.
unset¶
A keyword in the Miller programming language for removing the definition of a local or out-of-stream variable, or for removing a key from the current record.
See the DSL unset statements page.
unsparse¶
Transforming data so that all records have the same keys, by filling in default values. See the record-heterogeneity page.
value¶
The thing indexed by a key in a map. Miller values take one of Miller's data types. See also record.
var¶
A keyword for type declaration. It means a variable can have any type, which in itself is not useful; its usefulness comes from letting you declare a new variable, in an inner scope, of the same name as another in an outer scope.
variable¶
A way to access data by name within the Miller programming language. See the DSL variables page.
verb¶
One of the ways you ask Miller to transform your data as it processes it. Many of Miller's verbs such as sort and cut are file-format-aware analogues of tools in the Unix toolkit.
See the List of verbs page.
while¶
A keyword which is used to indicate the start of a while-loop, and also used in do-while loops, in the Miller programming language.
XTAB¶
Stands for transposed tabular. A Miller-specific file format for key-value pairs: it's a vertical-tabular format useful for looking at files with a large number of columns. Example: mlr --icsv --oxtab head -n 1 widefile.csv.
zero-up¶
A way of indexing arrays. If x=["a", "b", "c"], then using zero-up indexing, x[0] is "a", x[1] is "b", and x[2] is "c". Miller uses one-up indexing.
See also the arrays page, as well as the page on differences from other programming languages.
ZLIB / .z¶
A data-compression format supported by Miller.
Files compressed using ZLIB compression normally end in .z.
ZSTD / .zst¶
A data-compression format supported by Miller.
Files compressed using ZSTD compression normally end in .zst.