Streaming processing, and memory usage
What does streaming mean?
When we say that Miller is streaming, we mean that most operations need only a single record in memory at a time, rather than ingesting all input before producing any output.
This is in contrast to, say, the dataframes approach where you ingest all data, wait for end of file, then start manipulating the data.
Both approaches have their trade-offs: the dataframes approach requires that all data fit in system memory (which, as hardware gets larger over time, is less and less of a constraint); the streaming approach sometimes requires you to accumulate results on records (rows) as they arrive, rather than looping through them explicitly.
Since Miller takes the streaming approach when possible (see below for
exceptions), you can often operate on files which are larger than your system's
memory. It also means you can do
tail -f some-file | mlr --records-per-batch 1 --some-flags
and Miller will operate on records as they arrive, one at a time.
You don't have to wait for an end-of-file marker (which never arrives with
tail -f) to start seeing partial results. This also means that if you pipe Miller's
output to other streaming tools (like
sed, and so on), they
will also output partial results as data arrives.
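As a rough illustration of the difference, here is a minimal Python sketch. It is not Miller's implementation; the function names and the `slow_source` stand-in for `tail -f` are hypothetical. It shows that a streaming reader can hand back the first record after reading only the header and one data line, while a load-everything reader consumes the entire input first.

```python
import csv
from typing import Iterable, Iterator, List


def slow_source(log: List[str]) -> Iterator[str]:
    """Hypothetical line source standing in for `tail -f`; logs each line as it is read."""
    for line in ["x,y", "1,2", "3,4", "5,6"]:
        log.append("read " + line)
        yield line


def stream_records(lines: Iterable[str]) -> Iterator[dict]:
    """Streaming: yield each record as soon as its line has been read."""
    for record in csv.DictReader(lines):
        yield dict(record)


def load_all_records(lines: Iterable[str]) -> List[dict]:
    """Dataframe-style: consume the whole input before returning anything."""
    return [dict(record) for record in csv.DictReader(lines)]


if __name__ == "__main__":
    log: List[str] = []
    records = stream_records(slow_source(log))
    print(next(records))  # the first record is available...
    print(log)            # ...after only the header and one data line were read
```

With `load_all_records`, by contrast, nothing is returned until `slow_source` is exhausted, which would never happen with a true `tail -f`.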
The statements in the Miller programming language
(outside of optional begin and end blocks, which execute before and after all
records have been read, respectively) are implicit callbacks which are executed
once per record. For example, using
mlr --csv put '$z = $x + $y' myfile.csv,
$z = $x + $y will be executed 10,000 times if myfile.csv
has 10,000 records.
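The callback model can be sketched in Python as follows. This is a conceptual analogy under stated assumptions, not how Miller is implemented: `run_per_record`, `add_z`, and the shared `state` dictionary are hypothetical names chosen for illustration. The per-record function plays the role of the main DSL statements, and the optional `begin` and `end` functions play the role of the begin and end blocks.

```python
from typing import Callable, Iterable, Iterator, Optional


def run_per_record(
    records: Iterable[dict],
    per_record: Callable[[dict, dict], dict],
    begin: Optional[Callable[[dict], None]] = None,
    end: Optional[Callable[[dict], Optional[dict]]] = None,
) -> Iterator[dict]:
    """Call `per_record` once for each record; `begin` and `end` run once each."""
    state: dict = {}
    if begin is not None:
        begin(state)                          # before any record is read
    for record in records:
        yield per_record(dict(record), state)  # once per record, on a copy
    if end is not None:
        summary = end(state)                  # only after the last record
        if summary is not None:
            yield summary


def add_z(record: dict, state: dict) -> dict:
    """Analogue of the statement $z = $x + $y, plus a call counter."""
    record["z"] = record["x"] + record["y"]
    state["calls"] = state.get("calls", 0) + 1
    return record


if __name__ == "__main__":
    data = [{"x": 1, "y": 2}, {"x": 3, "y": 4}]
    for out in run_per_record(data, add_z, end=lambda s: {"calls": s["calls"]}):
        print(out)
```

Note that the summary record produced by `end` cannot appear until the input is exhausted, while each per-record result is available immediately.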
If you do wish to accumulate all records into memory and loop over them explicitly, you can do so -- see the page on operating on all records.
Streaming and non-streaming verbs
For those operations which require deeper retention, Miller retains only as
much data as needed. For example, the
tac verb (stream-reverse, backward spelling of
cat) must ingest and retain all records in memory
before emitting any -- the last input record may well end up being the first
one to be emitted.
Yet other verbs, such as
stats2, retain only summary arithmetic on the
records they visit. These are memory-friendly: memory usage is bounded. However,
they only produce output at the end of the record stream.
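The two retention patterns just described can be contrasted with a small Python sketch. It is only an analogy: `tac` here is a toy reimplementation, and `field_sums` is a hypothetical, much simpler accumulator than what stats2 computes. The point is that the first must hold every record, while the second holds a fixed number of running totals, yet both emit output only at end of stream.

```python
from typing import Iterable, Iterator, List


def tac(records: Iterable[dict]) -> Iterator[dict]:
    """Retain every record, then emit them in reverse order."""
    retained = list(records)          # memory grows with the input size
    yield from reversed(retained)


def field_sums(records: Iterable[dict], fields: List[str]) -> Iterator[dict]:
    """Retain only one running total per field; emit a single summary at the end."""
    sums = {field: 0 for field in fields}   # bounded state, independent of input size
    for record in records:
        for field in fields:
            sums[field] += record[field]
    yield {field + "_sum": total for field, total in sums.items()}


if __name__ == "__main__":
    data = [{"x": 1, "y": 10}, {"x": 2, "y": 20}, {"x": 3, "y": 30}]
    print(list(tac(iter(data))))
    print(list(field_sums(iter(data), ["x", "y"])))
```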
Fully streaming verbs
These don't retain any state from one record to the next. They are memory-friendly, and they don't wait for end of input to produce their output.
- bar -- if not auto-mode
- nest -- if not
- reshape -- if not long-to-wide
- unsparsify if invoked with
Non-streaming, retaining all records
These retain all records from one record to the next. They are memory-unfriendly, and they wait for end of input to produce their output.
- bar -- if auto-mode
- nest -- if
- reshape -- if long-to-wide
- uniq -- if invoked as mlr uniq -a -c
- unsparsify if invoked without
Non-streaming, retaining some records
These retain a bounded number of records from one record to the next. They are memory-friendly, but they wait for end of input to produce their output.
Non-streaming, retaining some state
These retain an amount of state from one record to the next, but less than if they were to retain all records in memory. They are variably memory-friendly -- depending on how many distinct values for the group-by keys exist in the input data -- and they wait for end of input to produce their output.
- stats1 -- except mlr stats1 -s, for incremental stats before end of stream
- uniq -- if not invoked as mlr uniq -a -c
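To make the dependence on distinct group-by values concrete, here is a minimal Python sketch of a grouped mean. The function `mean_by` and its field names are hypothetical, and this is not Miller's implementation; it only shows that the retained state has one small accumulator per distinct key, regardless of how many records share that key.

```python
from typing import Dict, Iterable, Iterator, List


def mean_by(records: Iterable[dict], value_field: str, group_field: str) -> Iterator[dict]:
    """Keep a [count, sum] pair per distinct group key; emit means at end of stream."""
    state: Dict[object, List[float]] = {}
    for record in records:
        acc = state.setdefault(record[group_field], [0, 0.0])
        acc[0] += 1
        acc[1] += record[value_field]
    # State size equals the number of distinct keys seen, not the number of records.
    for key, (count, total) in state.items():
        yield {group_field: key, value_field + "_mean": total / count}


if __name__ == "__main__":
    data = [
        {"a": "pan", "x": 1.0},
        {"a": "eks", "x": 4.0},
        {"a": "pan", "x": 3.0},
    ]
    for out in mean_by(iter(data), "x", "a"):
        print(out)
```

If nearly every record has its own key, the state approaches the size of the input; with a handful of keys, it stays tiny.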
For verbs that run DSL statements, such as put: any end blocks you provide will not be
executed until end of stream; otherwise these verbs don't wait for end of stream.
Similarly, if you write logic to retain all records
(see also the page on operating on all records),
these will be memory-unfriendly; otherwise they are memory-friendly.
Most simple operations such as
mlr put '$z = $x + $y' are fully streaming.
For the join verb, the main input files are streamed, but the join file (specified using
-f) is loaded into memory at the start.
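This half-streaming pattern can be sketched in Python as below. It is an illustrative analogy only: `join_streaming` and the sample field names are hypothetical, only paired records are emitted, and details of Miller's actual output (such as field ordering or unpaired-record options) are not modeled. The join file is indexed in memory up front, and each main-input record is then matched and emitted as it arrives.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List


def join_streaming(
    left_records: Iterable[dict],
    main_records: Iterable[dict],
    key: str,
) -> Iterator[dict]:
    """Load the join (left) input fully, then stream the main input against it."""
    left_by_key: Dict[object, List[dict]] = defaultdict(list)
    for left in left_records:                 # retained in memory at the start
        left_by_key[left[key]].append(left)
    for record in main_records:               # processed one record at a time
        for left in left_by_key.get(record[key], []):
            yield {**left, **record}          # emit each pairing immediately


if __name__ == "__main__":
    left = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
    main = iter([{"id": 2, "amount": 10}, {"id": 3, "amount": 5}, {"id": 1, "amount": 7}])
    for out in join_streaming(left, main, "id"):
        print(out)
```

Memory use is therefore governed by the size of the join file, not by the size of the main input.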