Regular expressions
Miller lets you use regular expressions (of the types accepted by Go) in the following contexts:
- In mlr filter with =~ or !=~, e.g. mlr filter '$url =~ "http.*com"'
- In mlr put with regextract, e.g. mlr put '$output = regextract($input, "[a-z][a-z][0-9][0-9]")'
- In mlr put with sub or gsub, e.g. mlr put '$url = sub($url, "http.*com", "")'
- In mlr having-fields, e.g. mlr having-fields --any-matching '^sda[0-9]'
- In mlr cut, e.g. mlr cut -r -f '^status$,^sda[0-9]'
- In mlr rename, e.g. mlr rename -r '^(sda[0-9]).*$,dev/\1'
- In mlr grep, e.g. mlr --csv grep 00188555487 myfiles*.csv
Points demonstrated by the above examples:
- There are no implicit start-of-string or end-of-string anchors; please use ^ and/or $ explicitly.
- Miller regexes are wrapped with double quotes rather than slashes.
- An i after the ending double quote, e.g. "http.*com"i, indicates a case-insensitive regex.
- Capture groups are wrapped with (...) rather than \(...\); use \( and \) to match against parentheses.
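Since Miller's regexes are those of the Go regexp package, the points above can be checked directly against that package. The following is a minimal sketch in Go, not Miller code: the helper matches is a hypothetical name, and Go's inline (?i) flag is used here as a stand-in for Miller's trailing i.

```go
package main

import (
	"fmt"
	"regexp"
)

// matches reports whether pattern matches anywhere in s.
// (Hypothetical helper for this sketch; not part of Miller.)
func matches(pattern, s string) bool {
	return regexp.MustCompile(pattern).MatchString(s)
}

func main() {
	// No implicit anchors: the pattern may match in the middle of the string.
	fmt.Println(matches(`sda[0-9]`, "xsda1x"))   // true
	fmt.Println(matches(`^sda[0-9]$`, "xsda1x")) // false

	// (?i) is Go's inline case-insensitive flag.
	fmt.Println(matches(`http.*com`, "HTTP://EXAMPLE.COM"))     // false
	fmt.Println(matches(`(?i)http.*com`, "HTTP://EXAMPLE.COM")) // true

	// Bare parentheses group; escaped parentheses match literally.
	fmt.Println(matches(`f\(x\)`, "f(x)")) // true
	fmt.Println(matches(`f(x)`, "f(x)"))   // false: this pattern matches "fx"
}
```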
Example:
cat data/regex-in-data.dat
name=jane,regex=^j.*e$
name=bill,regex=^b[ou]ll$
name=bull,regex=^b[ou]ll$

mlr filter '$name =~ $regex' data/regex-in-data.dat
name=jane,regex=^j.*e$
name=bull,regex=^b[ou]ll$
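The example above takes the regex from the data rather than from the command line. As a rough illustration of the same idea outside Miller, here is a sketch in Go using the regexp package; the record type and keepMatching function are hypothetical names, and skipping records whose regex fails to compile is a choice made for this sketch, not a statement about what Miller does.

```go
package main

import (
	"fmt"
	"regexp"
)

// record mirrors one line of data/regex-in-data.dat.
type record struct {
	name, regex string
}

// keepMatching returns the records whose name matches their own regex field.
func keepMatching(records []record) []record {
	var kept []record
	for _, r := range records {
		re, err := regexp.Compile(r.regex)
		if err != nil {
			continue // sketch-only choice: ignore records with an invalid regex
		}
		if re.MatchString(r.name) {
			kept = append(kept, r)
		}
	}
	return kept
}

func main() {
	records := []record{
		{"jane", "^j.*e$"},
		{"bill", "^b[ou]ll$"},
		{"bull", "^b[ou]ll$"},
	}
	// Prints the jane and bull records; bill does not match ^b[ou]ll$.
	for _, r := range keepMatching(records) {
		fmt.Printf("name=%s,regex=%s\n", r.name, r.regex)
	}
}
```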
Regex captures for the =~ operator
Regex captures of the form \0 through \9 are supported as follows:
- Captures have in-function context for sub and gsub. For example, the \1,\2 in the first call belong to the first sub, and the \1,\2,\3 in the second call belong to the second sub:
mlr put '$b = sub($a, "(..)_(...)", "\2-\1"); $c = sub($a, "(..)_(.)(..)", ":\1:\2:\3")'
- Captures endure for the entirety of a put for the =~ and !=~ operators. For example, here the \1,\2 are set by the =~ operator and are used by both subsequent assignment statements:
mlr put '$a =~ "(..)_(....)"; $b = "left_\1"; $c = "right_\2"'
- Each user-defined function has its own frame for captures. For example:
mlr -n put '
  func f() {
    if ("456 defg" =~ "([0-9]+) ([a-z]+)") {
      print "INNER: \1 \2";
    }
  }
  end {
    if ("123 abc" =~ "([0-9]+) ([a-z]+)") {
      print "OUTER PRE: \1 \2";
      f();
      print "OUTER POST: \1 \2";
    }
  }'
- The captures are not retained across multiple puts. For example, here the \1,\2 won't be expanded from the regex capture:
mlr put '$a =~ "(..)_(....)"' then {... something else ...} then put '$b = "left_\1"; $c = "right_\2"'
- Up to nine matches are supported: \1 through \9, while \0 is the entire match string; \15 is treated as \1 followed by an unrelated 5.
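To make the per-call scoping of captures in sub concrete, here is a sketch of a first-match-only substitution written against the Go regexp package. It is an illustration, not Miller's implementation: subFirst is a hypothetical helper, the sample input "ab_cde" is made up, and the replacement uses Go's ${1} template syntax rather than Miller's \1. Note that in a Go template $15 would mean group 15, so the Miller reading of \15 corresponds to ${1}5.

```go
package main

import (
	"fmt"
	"regexp"
)

// subFirst replaces only the first match of pattern in input, expanding
// ${N} references in repl from that match's own capture groups.
// (Hypothetical helper; the template syntax is Go's, not Miller's.)
func subFirst(input, pattern, repl string) string {
	re := regexp.MustCompile(pattern)
	loc := re.FindStringSubmatchIndex(input)
	if loc == nil {
		return input // no match: leave the input unchanged
	}
	var out []byte
	out = append(out, input[:loc[0]]...)
	out = re.ExpandString(out, repl, input, loc)
	out = append(out, input[loc[1]:]...)
	return string(out)
}

func main() {
	a := "ab_cde"
	// Each call sees only the captures of its own regex.
	fmt.Println(subFirst(a, `(..)_(...)`, "${2}-${1}"))         // cde-ab
	fmt.Println(subFirst(a, `(..)_(.)(..)`, ":${1}:${2}:${3}")) // :ab:c:de
	// Group 1 followed by a literal 5.
	fmt.Println(subFirst(a, `(..)_(...)`, "${1}5")) // ab5
}
```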
Resetting captures
If you use (...) in your regular expression, then up to 9 matches are supported for the =~ operator, and an arbitrary number of matches are supported for the strmatchx DSL function.
- Before any match is done, "\1" etc. in a string evaluate to themselves.
- After a successful match is done, "\1" etc. in a string evaluate to the matched substring.
- After an unsuccessful match is done, "\1" etc. in a string evaluate to the empty string.
- You can match against null to reset to the original state.
mlr repl
[mlr] "\1:\2"
"\1:\2"
[mlr] "abc" =~ "..."
true
[mlr] "\1:\2"
":"
[mlr] "abc" =~ "(.).(.)"
true
[mlr] "\1:\2"
"a:c"
[mlr] "abc" =~ "(.)x(.)"
false
[mlr] "\1:\2"
":"
[mlr] "abc" =~ null
[mlr] "\1:\2"
"\1:\2"
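The REPL session above can be read as a small state machine: unset, matched, failed, and back to unset. The sketch below models that behaviour in Go so the transitions are explicit. It is only a model of the documented results, not Miller's implementation; captureState and its methods are hypothetical names.

```go
package main

import (
	"fmt"
	"regexp"
)

// captureRef finds "\0" through "\9" inside a string.
var captureRef = regexp.MustCompile(`\\[0-9]`)

// captureState models the documented states of "\0".."\9".
type captureState struct {
	set    bool     // false until a match is attempted, or after reset
	groups []string // groups[i] replaces "\i"; nil after a failed match
}

// match attempts pattern against s and records the captures.
func (c *captureState) match(s, pattern string) bool {
	c.set = true
	c.groups = regexp.MustCompile(pattern).FindStringSubmatch(s)
	return c.groups != nil
}

// reset returns to the initial state, like matching against null.
func (c *captureState) reset() {
	c.set = false
	c.groups = nil
}

// interpolate expands "\0".."\9" in s according to the current state.
func (c *captureState) interpolate(s string) string {
	if !c.set {
		return s // before any match, "\1" etc. evaluate to themselves
	}
	return captureRef.ReplaceAllStringFunc(s, func(ref string) string {
		i := int(ref[1] - '0')
		if i < len(c.groups) {
			return c.groups[i]
		}
		return "" // failed match, or a group the regex does not define
	})
}

func main() {
	var c captureState
	fmt.Println(c.interpolate(`\1:\2`)) // \1:\2
	c.match("abc", `...`)
	fmt.Println(c.interpolate(`\1:\2`)) // :
	c.match("abc", `(.).(.)`)
	fmt.Println(c.interpolate(`\1:\2`)) // a:c
	c.match("abc", `(.)x(.)`)
	fmt.Println(c.interpolate(`\1:\2`)) // :
	c.reset()
	fmt.Println(c.interpolate(`\1:\2`)) // \1:\2
}
```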
The strmatch and strmatchx DSL functions
The =~ and !=~ operators have been in Miller for a long time, and they will continue to be supported. They do, however, have some deficiencies. As of Miller 6.11 and beyond, the strmatch and strmatchx functions provide more robust ways to do capturing.
First, some examples.
The strmatch function only returns a boolean result, and it doesn't set \0..\9:
mlr repl
[mlr] strmatch("abc", "....")
false
[mlr] strmatch("abc", "...")
true
[mlr] strmatch("abc", "(.).(.)")
true
[mlr] strmatch("[ab:3458]", "([a-z]+):([0-9]+)")
true
The strmatchx function also doesn't set \0..\9, but returns a map-valued result:
mlr repl
[mlr] strmatchx("abc", "....")
{
  "matched": false
}
[mlr] strmatchx("abc", "...")
{
  "matched": true,
  "full_capture": "abc",
  "full_start": 1,
  "full_end": 3
}
[mlr] strmatchx("abc", "(.).(.)")
{
  "matched": true,
  "full_capture": "abc",
  "full_start": 1,
  "full_end": 3,
  "captures": ["a", "c"],
  "starts": [1, 3],
  "ends": [1, 3]
}
[mlr] "[ab:3458]" =~ "([a-z]+):([0-9]+)"
true
[mlr] "\1"
"ab"
[mlr] "\2"
"3458"
[mlr] strmatchx("[ab:3458]", "([a-z]+):([0-9]+)")
{
  "matched": true,
  "full_capture": "ab:3458",
  "full_start": 2,
  "full_end": 8,
  "captures": ["ab", "3458"],
  "starts": [2, 5],
  "ends": [3, 8]
}
Notes:
- When there is no match, the result from strmatchx only has the "matched":false key/value pair.
- When there is a match with no captures, the result from strmatchx has the "matched":true key/value pair, as well as full_capture (taking the place of \0 set by =~), and full_start and full_end which =~ does not offer.
- When there is a match with captures, the result from strmatchx also has the captures array whose slots 1, 2, 3, ... are the same as would have been set by =~ via \1, \2, \3, .... However, strmatchx offers an arbitrary number of captures, not just \1..\9. Additionally, the starts and ends arrays are indices into the input string.
- Since you hold the return value from strmatchx, you can operate on it as you wish, instead of relying on the (function-scoped) globals \0..\9.
- The price paid is that using strmatchx does indeed tend to take more keystrokes than =~.
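The index fields in the examples above are 1-up and inclusive: for "[ab:3458]" the full match runs from position 2 to position 8. The sketch below shows how such a result can be derived from the 0-based, half-open offsets that Go's regexp package returns. It is an illustration only: matchResult and matchx are hypothetical names, the offsets are byte offsets (so the sketch assumes ASCII input), and reporting -1 for a group that did not participate is a choice made here rather than documented Miller behaviour.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchResult mirrors the keys shown in the strmatchx output above.
type matchResult struct {
	Matched     bool
	FullCapture string
	FullStart   int // 1-up index of the first matched byte
	FullEnd     int // 1-up index of the last matched byte (inclusive)
	Captures    []string
	Starts      []int
	Ends        []int
}

// matchx converts Go's 0-based half-open offsets to 1-up inclusive ones.
func matchx(input, pattern string) matchResult {
	re := regexp.MustCompile(pattern)
	loc := re.FindStringSubmatchIndex(input)
	if loc == nil {
		return matchResult{} // only Matched (false) is meaningful
	}
	res := matchResult{
		Matched:     true,
		FullCapture: input[loc[0]:loc[1]],
		FullStart:   loc[0] + 1,
		FullEnd:     loc[1],
	}
	// loc[2:], taken pairwise, holds the offsets of groups 1, 2, 3, ...
	for i := 2; i < len(loc); i += 2 {
		if loc[i] < 0 {
			// Sketch-only choice for a group that did not participate.
			res.Captures = append(res.Captures, "")
			res.Starts = append(res.Starts, -1)
			res.Ends = append(res.Ends, -1)
			continue
		}
		res.Captures = append(res.Captures, input[loc[i]:loc[i+1]])
		res.Starts = append(res.Starts, loc[i]+1)
		res.Ends = append(res.Ends, loc[i+1])
	}
	return res
}

func main() {
	fmt.Printf("%+v\n", matchx("abc", `....`))
	fmt.Printf("%+v\n", matchx("abc", `(.).(.)`))
	fmt.Printf("%+v\n", matchx("[ab:3458]", `([a-z]+):([0-9]+)`))
}
```

Because the result is an ordinary value, it can be stored, passed around, or inspected later, which is the advantage the notes above describe.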
More information
Regular expressions are those supported by the Go regexp package, which in turn are of type RE2 except for \C:
go doc regexp/syntax
package syntax // import "regexp/syntax"

Package syntax parses regular expressions into parse trees and compiles parse
trees into programs. Most clients of regular expressions will use the facilities
of package regexp (such as regexp.Compile and regexp.Match) instead of this
package.

# Syntax

The regular expression syntax understood by this package when parsing with the
Perl flag is as follows. Parts of the syntax can be disabled by passing
alternate flags to Parse.

Single characters:

    .              any character, possibly including newline (flag s=true)
    [xyz]          character class
    [^xyz]         negated character class
    \d             Perl character class
    \D             negated Perl character class
    [[:alpha:]]    ASCII character class
    [[:^alpha:]]   negated ASCII character class
    \pN            Unicode character class (one-letter name)
    \p{Greek}      Unicode character class
    \PN            negated Unicode character class (one-letter name)
    \P{Greek}      negated Unicode character class

Composites:

    xy             x followed by y
    x|y            x or y (prefer x)

Repetitions:

    x*             zero or more x, prefer more
    x+             one or more x, prefer more
    x?             zero or one x, prefer one
    x{n,m}         n or n+1 or ... or m x, prefer more
    x{n,}          n or more x, prefer more
    x{n}           exactly n x
    x*?            zero or more x, prefer fewer
    x+?            one or more x, prefer fewer
    x??            zero or one x, prefer zero
    x{n,m}?        n or n+1 or ... or m x, prefer fewer
    x{n,}?         n or more x, prefer fewer
    x{n}?          exactly n x

Implementation restriction: The counting forms x{n,m}, x{n,}, and x{n} reject
forms that create a minimum or maximum repetition count above 1000. Unlimited
repetitions are not subject to this restriction.

Grouping:

    (re)           numbered capturing group (submatch)
    (?P<name>re)   named & numbered capturing group (submatch)
    (?<name>re)    named & numbered capturing group (submatch)
    (?:re)         non-capturing group
    (?flags)       set flags within current group; non-capturing
    (?flags:re)    set flags during re; non-capturing

    Flag syntax is xyz (set) or -xyz (clear) or xy-z (set xy, clear z). The flags are:

    i              case-insensitive (default false)
    m              multi-line mode: ^ and $ match begin/end line in addition to begin/end text (default false)
    s              let . match \n (default false)
    U              ungreedy: swap meaning of x* and x*?, x+ and x+?, etc (default false)

Empty strings:

    ^              at beginning of text or line (flag m=true)
    $              at end of text (like \z not \Z) or line (flag m=true)
    \A             at beginning of text
    \b             at ASCII word boundary (\w on one side and \W, \A, or \z on the other)
    \B             not at ASCII word boundary
    \z             at end of text

Escape sequences:

    \a             bell (== \007)
    \f             form feed (== \014)
    \t             horizontal tab (== \011)
    \n             newline (== \012)
    \r             carriage return (== \015)
    \v             vertical tab character (== \013)
    \*             literal *, for any punctuation character *
    \123           octal character code (up to three digits)
    \x7F           hex character code (exactly two digits)
    \x{10FFFF}     hex character code
    \Q...\E        literal text ... even if ... has punctuation

Character class elements:

    x              single character
    A-Z            character range (inclusive)
    \d             Perl character class
    [:foo:]        ASCII character class foo
    \p{Foo}        Unicode character class Foo
    \pF            Unicode character class F (one-letter name)

Named character classes as character class elements:

    [\d]           digits (== \d)
    [^\d]          not digits (== \D)
    [\D]           not digits (== \D)
    [^\D]          not not digits (== \d)
    [[:name:]]     named ASCII class inside character class (== [:name:])
    [^[:name:]]    named ASCII class inside negated character class (== [:^name:])
    [\p{Name}]     named Unicode property inside character class (== \p{Name})
    [^\p{Name}]    named Unicode property inside negated character class (== \P{Name})

Perl character classes (all ASCII-only):

    \d             digits (== [0-9])
    \D             not digits (== [^0-9])
    \s             whitespace (== [\t\n\f\r ])
    \S             not whitespace (== [^\t\n\f\r ])
    \w             word characters (== [0-9A-Za-z_])
    \W             not word characters (== [^0-9A-Za-z_])

ASCII character classes:

    [[:alnum:]]    alphanumeric (== [0-9A-Za-z])
    [[:alpha:]]    alphabetic (== [A-Za-z])
    [[:ascii:]]    ASCII (== [\x00-\x7F])
    [[:blank:]]    blank (== [\t ])
    [[:cntrl:]]    control (== [\x00-\x1F\x7F])
    [[:digit:]]    digits (== [0-9])
    [[:graph:]]    graphical (== [!-~] == [A-Za-z0-9!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~])
    [[:lower:]]    lower case (== [a-z])
    [[:print:]]    printable (== [ -~] == [ [:graph:]])
    [[:punct:]]    punctuation (== [!-/:-@[-`{-~])
    [[:space:]]    whitespace (== [\t\n\v\f\r ])
    [[:upper:]]    upper case (== [A-Z])
    [[:word:]]     word characters (== [0-9A-Za-z_])
    [[:xdigit:]]   hex digit (== [0-9A-Fa-f])

Unicode character classes are those in unicode.Categories and unicode.Scripts.

func IsWordChar(r rune) bool
type EmptyOp uint8
    const EmptyBeginLine EmptyOp = 1 << iota ...
    func EmptyOpContext(r1, r2 rune) EmptyOp
type Error struct{ ... }
type ErrorCode string
    const ErrInternalError ErrorCode = "regexp/syntax: internal error" ...
type Flags uint16
    const FoldCase Flags = 1 << iota ...
type Inst struct{ ... }
type InstOp uint8
    const InstAlt InstOp = iota ...
type Op uint8
    const OpNoMatch Op = 1 + iota ...
type Prog struct{ ... }
    func Compile(re *Regexp) (*Prog, error)
type Regexp struct{ ... }
    func Parse(s string, flags Flags) (*Regexp, error)
One caveat: for strings in "regex position" (e.g. the second argument to sub or gsub, or after =~), "\t" means a backslash and a t, which is the right thing, whereas for strings in "non-regex position", e.g. anywhere else, "\t" becomes the tab character.
This is to say (if you're familiar with r-strings in Python) all strings in regex position are implicit r-strings. Generally this is the right thing and should cause little confusion. Note however that this means "\t"."\t" in the second argument to sub isn't the same as "\t\t".
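Go makes the same distinction explicit with its two kinds of string literal, which may help in picturing what "regex position" means. The sketch below is an analogy in Go, not Miller code, and matchesTab is a hypothetical helper: a raw literal keeps the backslash and the t as two characters and leaves it to the regex engine to interpret them, while an interpreted literal has already become a single tab character before the regex engine sees it.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchesTab reports whether pattern, compiled as a regex, matches a
// string containing a tab. (Hypothetical helper for this sketch.)
func matchesTab(pattern string) bool {
	return regexp.MustCompile(pattern).MatchString("a\tb")
}

func main() {
	raw := `\t`         // like a string in regex position: backslash, then t
	interpreted := "\t" // like a string anywhere else: one tab character

	fmt.Println(len(raw), len(interpreted)) // 2 1

	// Both reach the same match here, but by different routes: the regex
	// engine decodes the raw form, while the interpreted form is already a tab.
	fmt.Println(matchesTab(raw), matchesTab(interpreted)) // true true
}
```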