A note on the complexity of Miller's expression language¶
One of Miller's strengths is its brevity: it's much quicker -- and less
error-prone -- to type mlr stats1 -a sum -f x,y -g a,b
than having to track
summation variables as in awk
, or using Miller's out-of-stream
variables. And the more
language features Miller's put-DSL has (for-loops, if-statements, nested
control structures, user-defined functions, etc.) then the less powerful it
begins to seem: because of the other programming-language features it doesn't
have (classes, exceptions, and so on).
When I was originally prototyping Miller in 2015, the primary decision I had
was whether to hand-code in a low-level language like C or Rust or Go, with my
own hand-rolled DSL, or whether to use a higher-level language (like Python or
Lua or Nim) and let the put
statements be handled by the implementation
language's own eval
: the implementation language would take the place of a
DSL. Multiple performance experiments showed me I could get better throughput
using the former, by a wide margin. So Miller is Go under the hood with a
hand-rolled DSL.
I do want to keep focusing on what Miller is good at -- concise notation, low
latency, and high throughput -- and not add too much in terms of
high-level-language features to the DSL. That said, some sort of
customizability is a basic thing to want. As of 4.1.0 we have recursive
for
/while
/if
structures on about
the same complexity level as awk
; as of 5.0.0 we have user-defined
functions and map-valued
variables, again on about the same complexity level
as awk
along with optional type-declaration syntax; as of Miller 6 we have
full support for arrays. While I'm excited by these
powerful language features, I hope to keep new features focused on Miller's
sweet spot which is speed plus simplicity.