Matching multiple tokens with ellipsis metavariables
Using ellipsis (...
) to match a sequence of items (for example, arguments, statements, or fields) is one of the most common constructs in Semgrep rules. Likewise, using metavariables ($VAR) to capture values (such as variables, functions, arguments, classes, and methods) is extremely common and powerful for tracking the use of values across a code scope.
Introduction to ellipsis metavariables
Ellipses can be combined with metavariables to increase matching scope from a single item to a sequence of items, while capturing the values for later re-use.
Most commonly, ellipsis metavariables like $...ARGS
are used for purposes like matching multiple arguments to a function or items in an array.
However, they can also be used to match multiple word tokens. As part of Semgrep's pattern matching, it separates the analyzed language into tokens, which are single units that make up a larger text. Some tokens, typically alphanumeric tokens, are "words", and some are word separators (like punctuation and whitespace).
Using ellipsis metavariables to match multiple word tokens is especially helpful in Generic pattern matching mode. Because this mode is generic, it's not aware of the semantics of any particular language, and that comes with caveats and limitations.
In generic mode, a word token that can be matched by a metavariable is defined as a sequence of characters in the set [A-z0-9_]
. So ABC_DEF
is one token, and a metavariable such as $VAR
captures the entire sequence. However, ABC-DEF
is two tokens, and a metavariable such as $VAR
does not capture the entire sequence.
Capturing multiple tokens with ellipsis metavariables
Not all languages you might match using generic mode share the same definition of word tokens. If you're matching patterns in one of these languages, your metavariables might not match as much of a word token as you expect. For example, in HTML, "ABC-DEF" is a single token (perhaps an id
value).
If the language you're working with allows other characters in tokens, using ellipsis metavariables can prevent problems with metavariables matching too little of the pattern.
To match all of ABC-DEF
in generic
mode, use an ellipsis metavariable, like $...VAR
. Here is an example rule:
If you remove the ellipsis in the $...ID
variable, the second example no longer matches.
Alternative: try the Aliengrep experiment
To address some of the limitations of generic mode, the team is experimenting with a new mode called Aliengrep.
With Aliengrep, you can configure what characters are allowed as part of a word token, so that you could match the HTML example with a single metavariable. You can also have even more fun with ellipses.
Give it a try and share your thoughts!
Not finding what you need in this doc? Ask questions in our Community Slack group, or see Support for other ways to get help.