DSL user-defined functions

As of Miller 5.0.0 you can define your own functions, as well as subroutines.

User-defined functions

Here's the obligatory example of a recursive function, in this case one that computes the factorial:

mlr --opprint --from data/small put '
    func f(n) {
        if (is_numeric(n)) {
            if (n > 0) {
                return n * f(n-1);
            } else {
                return 1;
            }
        }
        # implicitly return absent-null if non-numeric
    }
    $ox = f($x + NR);
    $oi = f($i);
'
a   b   i x        y        ox                 oi
pan pan 1 0.346791 0.726802 0.4670549976810001 1
eks pan 2 0.758679 0.522151 3.6808304227112796 2
wye wye 3 0.204603 0.338318 1.7412477437471126 6
eks wye 4 0.381399 0.134188 18.588317372151177 24
wye pan 5 0.573288 0.863624 211.38663947090302 120

Properties of user-defined functions:

  • Function bodies start with func and a parameter list, defined outside of begin, end, or other func or subr blocks. (I.e. the Miller DSL has no nested functions.)

  • A function (uniqified by its name) may not be redefined: either by redefining a user-defined function, or by redefining a built-in function. However, functions and subroutines have separate namespaces: you can define a subroutine log (for logging messages to stderr, say) which does not clash with the mathematical log (logarithm) function.

  • Functions may be defined either before or after use -- there is an object-binding/linkage step at startup. More specifically, functions may be either recursive or mutually recursive.

  • Functions may be defined and called either within mlr filter or mlr put.

  • Argument values may be reassigned: they are not read-only.

  • When a return value is not explicitly returned, this results in a return value of absent-null. (In the example above, if there were records for which the argument to f is non-numeric, the assignments would be skipped.) See also the null-data reference page.

  • See the section on Local variables for information on scope and extent of arguments, as well as for information on the use of local variables within functions.

  • See the section on Expressions from files for information on the use of -f and -e flags.

User-defined subroutines

Example:

mlr --opprint --from data/small put -q '
  begin {
    @call_count = 0;
  }
  subr s(n) {
    @call_count += 1;
    if (is_numeric(n)) {
      if (n > 1) {
        call s(n-1);
      } else {
        print "numcalls=" . @call_count;
      }
    }
  }
  print "NR=" . NR;
  call s(NR);
'
NR=1
numcalls=1
NR=2
numcalls=3
NR=3
numcalls=6
NR=4
numcalls=10
NR=5
numcalls=15

Properties of user-defined subroutines:

  • Subroutine bodies start with subr and a parameter list, defined outside of begin, end, or other func or subr blocks. (I.e. the Miller DSL has no nested subroutines.)

  • A subroutine (uniqified by its name) may not be redefined. However, functions and subroutines have separate namespaces: you can define a subroutine log which does not clash with the mathematical log function.

  • Subroutines may be defined either before or after use -- there is an object-binding/linkage step at startup. More specifically, subroutines may be either recursive or mutually recursive. Subroutines may call functions.

  • Subroutines may be defined and called either within mlr put or mlr filter.

  • Subroutines have read/write access to $-variables and @-variables.

  • Argument values may be reassigned: they are not read-only.

  • See the section on local variables for information on scope and extent of arguments, as well as for information on the use of local variables within subroutines.

  • See the section on Expressions from files for information on the use of -f and -e flags.

Differences between functions and subroutines

Subroutines cannot return values, and they are invoked by the keyword call.

In hindsight, subroutines needn't have been invented. If foo is a function then you can write foo(1,2,3) while ignoring its return value, and that plays the role of a subroutine quite well.

Loading a library of functions

If you have a file with UDFs you use frequently, say my-functions.mlr, you can use --load or --mload to define them for your Miller scripts. For example, in your shell,

alias mlr='mlr --load ~/my-functions.mlr'

or

alias mlr='mlr --load /u/miller-udfs/'

See the miscellaneous-flags page for more information.

Function literals

You can define unnamed functions and assign them to variables, or pass them to functions.

See also the page on higher-order functions for more information on select, apply, reduce, fold, and sort.

For example:

mlr --c2p --from example.csv put '
  f = func(s, t) {
    return s . ":" . t;
  };
  $z = f($color, $shape);
'
color  shape    flag  k  index quantity rate   z
yellow triangle true  1  11    43.6498  9.8870 yellow:triangle
red    square   true  2  15    79.2778  0.0130 red:square
red    circle   true  3  16    13.8103  2.9010 red:circle
red    square   false 4  48    77.5542  7.4670 red:square
purple triangle false 5  51    81.2290  8.5910 purple:triangle
red    square   false 6  64    77.1991  9.5310 red:square
purple triangle false 7  65    80.1405  5.8240 purple:triangle
yellow circle   true  8  73    63.9785  4.2370 yellow:circle
yellow circle   true  9  87    63.5058  8.3350 yellow:circle
purple square   false 10 91    72.3735  8.2430 purple:square

Since a function literal is a value, you can also choose between two of them with an expression:

mlr --c2p --from example.csv put '
  a = func(s, t) {
    return s . ":" . t . " above";
  };
  b = func(s, t) {
    return s . ":" . t . " below";
  };
  f = $index >= 50 ? a : b;
  $z = f($color, $shape);
'
color  shape    flag  k  index quantity rate   z
yellow triangle true  1  11    43.6498  9.8870 yellow:triangle below
red    square   true  2  15    79.2778  0.0130 red:square below
red    circle   true  3  16    13.8103  2.9010 red:circle below
red    square   false 4  48    77.5542  7.4670 red:square below
purple triangle false 5  51    81.2290  8.5910 purple:triangle above
red    square   false 6  64    77.1991  9.5310 red:square above
purple triangle false 7  65    80.1405  5.8240 purple:triangle above
yellow circle   true  8  73    63.9785  4.2370 yellow:circle above
yellow circle   true  9  87    63.5058  8.3350 yellow:circle above
purple square   false 10 91    72.3735  8.2430 purple:square above

Note that you need a semicolon after the closing curly brace of the function literal.

Unlike named functions, function literals (also known as unnamed functions) have access to local variables defined in their enclosing scope. That's so you can do things like this:

mlr --c2p --from example.csv put '
  f = func(s, t, i) {
    if (i >= cap) {
      return s . ":" . t . " above";
    } else {
      return s . ":" . t . " below";
    }
  };
  cap = 10;
  $z = f($color, $shape, $index);
'
color  shape    flag  k  index quantity rate   z
yellow triangle true  1  11    43.6498  9.8870 yellow:triangle above
red    square   true  2  15    79.2778  0.0130 red:square above
red    circle   true  3  16    13.8103  2.9010 red:circle above
red    square   false 4  48    77.5542  7.4670 red:square above
purple triangle false 5  51    81.2290  8.5910 purple:triangle above
red    square   false 6  64    77.1991  9.5310 red:square above
purple triangle false 7  65    80.1405  5.8240 purple:triangle above
yellow circle   true  8  73    63.9785  4.2370 yellow:circle above
yellow circle   true  9  87    63.5058  8.3350 yellow:circle above
purple square   false 10 91    72.3735  8.2430 purple:square above

See the page on higher-order functions for more.