How to add support for a new language
This is about adding support for a new programming language in semgrep using the tree-sitter technology. While new languages should use tree-sitter, semgrep also supports some languages independently if there's a good legacy OCaml parser for them. Check for your language in pfff and if you see it in there, talk to us. Otherwise, let's get started.
Repositories involved directly:
- semgrep: the semgrep command line program;
- ocaml-tree-sitter-semgrep: language-specific setup, generates C/OCaml parsers for semgrep;
- new repo semgrep-X for the new language X: C/OCaml parser generated from ocaml-tree-sitter-semgrep by an admin.
Submodules overview (semgrep repo)
There are quite a few GitHub repositories involved in porting a language. Here is the file hierarchy of the semgrep repository:
/semgrep-core/src
├── ocaml-tree-sitter-core  # runtime library for tree-sitter parsers
├── pfff                    # non-tree-sitter parsers
└── tree-sitter-lang        # generated tree-sitter parsers
    ├── semgrep-java
    ...
    └── semgrep-ruby
When done with the work in ocaml-tree-sitter-semgrep, you'll need a new repo semgrep-X to host the generated parser code. Ask someone from the Semgrep team to create one for you. For this, they should use the template semgrep-lang-template when creating the repo.
The instructions for adding a language start in ocaml-tree-sitter-semgrep (as indicated below). Be careful that you are always in the correct repo!
Setup (ocaml-tree-sitter-semgrep repo)
As a model, you can use the existing setup for ruby or javascript. Our most complicated setup is for typescript and tsx.
Expedited setup
If you're lucky, the language you want to add can be added with the script add-simple-lang:
cd lang
./add-simple-lang --help
Then follow the instructions printed by --help.
This often works with languages that define a single dialect using a grammar.js file at the root of the project. If this simplified approach fails, use the Manual setup instructions below to understand what's going on or to set things up manually.
Manual setup
From the ocaml-tree-sitter repo, do the following:
- Create a lang/X folder.
- Make a test/ok directory. Inside the directory, create a simple hello-world program for the language you are porting. Name the program hello-world.<ext>.
- Now make a file called extensions.txt and list all the file extensions (.rb, .kt, etc.) for your language in the file.
- Create a file called fyi.list with all the informational files, such as semgrep-grammars/src/tree-sitter-X/LICENSE, semgrep-grammars/src/tree-sitter-X/grammar.js, semgrep-grammars/src/semgrep-X/grammar.js, etc., to bundle with the final OCaml/C project.
- Link the Makefile.common to a Makefile in the directory with:
ln -s ../Makefile.common Makefile
- Create a test corpus. You can do this by:
  - Running most-starred-for-language in order to gather projects on which to run parsing stats. Run with the following command: ./scripts/most-starred-for-language <lang> <github_username> <api_key>
  - Using GitHub advanced search to find the most starred or most forked repositories.
- Copy the generated projects.txt file into the lang/X directory.
- Add in extra projects and extra input sets as you see necessary.
Here's the file hierarchy for Ruby:
lang/ruby                # language name of the form [a-z][a-z0-9]*
├── extensions.txt       # standard name. Required for stats.
├── fyi.list             # list of informational files to copy. Recommended.
├── Makefile -> ../Makefile.common
├── projects.txt         # standard name. Required for stats.
└── test                 # sample input files
    ├── ok               # contains input files supported by the current grammar
    │   ├── comment.rb
    │   ├── ex1.rb
    │   ├── ex2.rb
    │   ├── hello.rb
    │   └── poly.rb
    └── xfail            # contains input files that are expected to fail
        └── rating.rb
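The layout conventions above can be checked mechanically. The following is a minimal sketch, not part of ocaml-tree-sitter: the function name checkLangFolder and the set of entries it checks are assumptions drawn from the manual setup steps, and the folder listing is passed in as an array of relative paths rather than read from disk.

```javascript
// Hypothetical helper (not provided by ocaml-tree-sitter): checks a lang/X
// folder listing against the conventions described in this section.
const LANG_NAME = /^[a-z][a-z0-9]*$/; // language name of the form [a-z][a-z0-9]*

// Assumption: these are the entries created by the manual setup steps.
const REQUIRED_ENTRIES = ["extensions.txt", "projects.txt", "Makefile", "test/ok"];

function checkLangFolder(name, entries) {
  const problems = [];
  if (!LANG_NAME.test(name)) {
    problems.push(`invalid language name: ${name}`);
  }
  for (const required of REQUIRED_ENTRIES) {
    if (!entries.includes(required)) {
      problems.push(`missing ${required}`);
    }
  }
  return problems; // an empty array means the layout looks complete
}
```

For example, calling checkLangFolder("ruby", [...]) with the entries shown in the Ruby hierarchy returns an empty list, while a name such as "C#" is rejected because it does not match the naming pattern.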
To test a language in ocaml-tree-sitter, you must build the ocaml-tree-sitter OCaml code generator, run it to produce a parser, then run some tests for the parser. Full instructions for this are given in updating-a-grammar under "Testing". The short instructions are:
- For the first time, build everything with ./scripts/rebuild-everything.
- Subsequently, work from the lang/X folder and run make and make test.
The fyi.list file
The fyi.list file was created to specify informational files that should accompany the generated files. These files are typically:
- the source grammar, most often a single grammar.js file.
- the licensing conditions, usually specified in a LICENSE file.
Example:
# Comments are allowed on their own line.
# Blank lines are ok.
# Each path is relative to ocaml-tree-sitter/lang
semgrep-grammars/src/tree-sitter-ruby/LICENSE
semgrep-grammars/src/tree-sitter-ruby/grammar.js
semgrep-grammars/src/semgrep-ruby/grammar.js
The files listed in fyi.list end up in a fyi folder in tree-sitter-lang. For example, see ruby/fyi.
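To make the file format concrete, here is a minimal sketch of how such a list could be read, assuming only what the example shows: comments sit on their own line and start with #, blank lines are ignored, and every other line is a path. The function name parseFyiList is hypothetical; the actual tooling in ocaml-tree-sitter may process the file differently.

```javascript
// Hypothetical reader for the fyi.list format described above.
// Assumption: a comment is a line whose first non-blank character is '#'.
function parseFyiList(text) {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line !== "" && !line.startsWith("#"));
}

const example = [
  "# Comments are allowed on their own line.",
  "",
  "semgrep-grammars/src/tree-sitter-ruby/LICENSE",
  "semgrep-grammars/src/tree-sitter-ruby/grammar.js",
].join("\n");

console.log(parseFyiList(example).length); // 2
```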
Extending the original grammar with semgrep syntax
This is best done after everything else is set up. Some constructs such as semgrep metavariables ($FOO) may already be valid constructs in the language, in which case there's nothing to do. Some support for the semgrep ellipsis ... usually needs to be added as well.
You'll need to learn how to create tree-sitter grammars.
- Work from semgrep-grammars/src/semgrep-X and use make and make test to build and test.
- Add new test cases to test/corpus/semgrep.text.
- Edit grammar.js.
- Refer to the original grammar in semgrep-grammars/src/tree-sitter-X to determine which rules to extend.
For an example of how to extend a language, you can:
- Look at what was done for the semgrep extensions of other languages in their respective semgrep-* folders.
- Look at how tree-sitter-typescript extends the javascript grammar. This is the file common/define-grammar.js in the tree-sitter-typescript repo.
Avoiding parsing conflicts is the trickiest part. Asking for help is encouraged.
💡 A note on the JavaScript syntax that's heavily used to define and extend grammars:
When possible, we prefer the shorthand notation for anonymous functions made of a single expression:
(x) => x
which is the same as
(x) => { return x; }
which is itself the same as
function(x) { return x; }
When extending any rule with an alternate choice such as $.ellipsis, the simpler way is this one:
expression: ($, previous) => choice(previous, $.ellipsis),
However, if the previous rule is known to be a choice(), we can avoid one level of nesting and append to the original list of choices, which is done as follows:
expression: ($, previous) => choice(...previous.members, $.ellipsis),
Whether to use one or the other is a matter of taste.
Finally, on rare occasions where the rule body is more than a single expression, you'll have to use the curly-brace/return syntax:
expression: ($, previous) => {
  if (semgrep_ext)
    return choice(...previous.members, $.ellipsis);
  else
    return previous;
},
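The difference between wrapping the previous rule and appending to its members can be seen outside of tree-sitter with a small stand-in. In this sketch, choice() is a simplified imitation written for illustration (it only records its arguments under a members field) and the rule objects are made up; the real tree-sitter DSL function does more.

```javascript
// Simplified stand-in for tree-sitter's choice(), for illustration only.
function choice(...members) {
  return { type: "CHOICE", members };
}

// Made-up rules standing in for the original grammar and for $.ellipsis.
const ellipsis = { type: "SYMBOL", name: "ellipsis" };
const previous = choice(
  { type: "SYMBOL", name: "identifier" },
  { type: "SYMBOL", name: "number" }
);

// Form 1: wrap the previous rule, which adds one level of nesting.
const nested = choice(previous, ellipsis);

// Form 2: append to the previous rule's own list of alternatives.
const flat = choice(...previous.members, ellipsis);

console.log(nested.members.length); // 2: the old choice and the ellipsis
console.log(flat.members.length);   // 3: identifier, number, ellipsis
```

Both forms accept the same inputs; the flat form only works when previous really is a choice(), since other rule kinds have no members list to spread.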
Parsing statistics
From a language's folder such as lang/csharp, two targets are available to exercise the generated parser:
- make test: runs on test/ok and test/xfail
- make stat: downloads the code specified in projects.txt and parses the files whose extension matches those in extensions.txt, reporting parsing success in the form of a CSV file.
For gathering a good test corpus, you can use GitHub Search or the script provided in scripts/most-starred-for-language.py. For GitHub searches, filter by programming language and use a constraint to select large projects, such as "> 100 forks". Collect the repository URLs and put them into projects.txt.
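The file selection performed for parsing statistics can be pictured as a simple suffix filter. This is only a sketch: it assumes extensions.txt holds whitespace-separated extensions with a leading dot (such as .rb), and the function name selectFiles is hypothetical. The actual selection is done by the ocaml-tree-sitter build scripts and may differ.

```javascript
// Hypothetical illustration of matching files against extensions.txt.
// Assumption: extensions are whitespace-separated and include the leading dot.
function selectFiles(paths, extensionsTxt) {
  const extensions = extensionsTxt.split(/\s+/).filter((ext) => ext !== "");
  return paths.filter((path) => extensions.some((ext) => path.endsWith(ext)));
}

const selected = selectFiles(
  ["lib/app.rb", "README.md", "tasks/build.rake"],
  ".rb\n.rake\n"
);
console.log(selected.length); // 2
```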
Publishing generated parsers
After you have pushed your ocaml-tree-sitter changes to the main branch, do the following:
- Check that the original grammar.js and src/scanner.c/.cc (if applicable) look clean and have minimal external dependencies.
- In ocaml-tree-sitter/lang/Makefile, add the language under 'SUPPORTED_LANGUAGES' and 'STAT_LANGUAGES'.
- In the ocaml-tree-sitter/lang directory, run ./release X --dry-run. If this looks good, please ask someone from the Semgrep team to publish the code using ./release X.
Troubleshooting
Various errors can occur along the way.
Compilation errors in C or C++ are usually due to a missing source file scanner.c or scanner.cc, or a grammar with a name that doesn't match the name inside the scanner file. JavaScript files may also be missing, in particular in the case of grammars that extend existing grammars such as C++ for C or TypeScript for JavaScript. Check for require() calls in grammar.js and learn how this NodeJS primitive resolves paths.
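One quick way to see which JavaScript files a grammar depends on is to list the string arguments of its require() calls. The sketch below is a rough, regex-based illustration rather than a real JavaScript parser: the function name findRequires and the sample source are made up, and calls with computed arguments are not detected.

```javascript
// Hypothetical helper: lists the literal paths passed to require() in a source string.
function findRequires(source) {
  const pattern = /require\(\s*(['"])([^'"]+)\1\s*\)/g;
  return Array.from(source.matchAll(pattern), (match) => match[2]);
}

// Made-up grammar.js content that extends another grammar.
const sample = [
  'const C = require("../tree-sitter-c/grammar");',
  "module.exports = grammar(C, { name: 'cpp' });",
].join("\n");

console.log(findRequires(sample)); // [ '../tree-sitter-c/grammar' ]
```

Paths starting with ./ or ../ are resolved by Node.js relative to the file containing the require() call, so a missing dependency usually means that the referenced grammar was not copied to the expected relative location.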
There may also be errors when generating or compiling OCaml code. These are likely bugs in ocaml-tree-sitter and they should be reported or fixed right away.
Here are some known types of parsing errors:
- A syntax error. The input program is in the wrong syntax or uses a recent feature that's not supported yet: make test or directly the parse_X program will show the tree produced by tree-sitter with one or more ERROR nodes.
- A "reparsing" error. It's an error generated after the first successful parsing pass by the tree-sitter parser, during the reparsing pass by the OCaml code performed by the generated Parse.ml file. The error message should tell you something like "cannot interpret tree-sitter's output", with details on what code failed to match what pattern. This is most likely a bug in ocaml-tree-sitter.
- A segmentation fault. This could be due to a bug in the OCaml/tree-sitter C bindings and should be fixed. A simple test case that reproduces the problem would be nice. See https://github.com/semgrep/ocaml-tree-sitter-semgrep/issues/65
Parsing errors that are due to an incomplete or incorrect grammar should be recorded, and eventually reported and/or fixed in the upstream project.
We keep failing test cases in a fail/ folder, preferably in the form of the minimal program suitable for a bug report, with a comment describing what was expected and what's going on.
pfff
Pfff defines a list of programming languages, some of which have parsers in pfff itself. Others are tree-sitter parsers which are otherwise independent from pfff. You need to add the new language to the list of languages in pfff.
Look under Adding a Language in pfff for step-by-step instructions.
semgrep-core
Now that you have added your new language 'X' to pfff, do the following:
- Add the new pfff submodule to semgrep-core.
- In Check_pattern.ml, add 'X' to lang_has_no_dollar_ids. If the grammar has no dollar identifiers, add it above 'true'. Otherwise, add it above 'false'.
- In synthesizing/Pretty_print_generic.ml, add 'X' to the appropriate functions:
  - print_bool
  - if_stmt
  - while_stmt
  - do_while
  - for_stmt
  - def_stmt
  - return
  - break
  - continue
  - literal
- In parsing/Test_parsing.ml, add 'X' to dump_tree_sitter_cst_lang. You can look at the other languages as a reference for what code to add.
- Create a file parsing/Parse_X_tree_sitter.ml. Add basic functionality to define the function parse and import the module Parse_tree_sitter_helpers. You can look at the csharp and kotlin files to get a better idea of how to define the parse function, but this file should contain something similar to:
module H = Parse_tree_sitter_helpers
let parse file =
  H.wrap_parser
    (fun () ->
       Parallel.backtrace_when_exn := false;
       Parallel.invoke Tree_sitter_X.Parse.file file ()
    )
- In parsing/tree_sitter/dune, add tree-sitter-lang.X.
- Write a basic test case for your language in tests/X/hello-world.X. This can just be a hello-world function.
- Test that the command semgrep-core/bin/semgrep-core -dump_tree_sitter_cst test/X/hello-world prints out a CST for your language.
Legal concerns
Be thankful to the authors of the original code, keep license notices clearly visible, and make it easy to get back to the original projects:
- Make sure to preserve the LICENSE files. These should be listed in the fyi.list file.
- For sample input in test/, consider Public Domain ("The Unlicense") files or write your own, for simplicity. GitHub Search allows you to filter projects by license and by programming language.
See also
Not finding what you need in this doc? Ask questions in our Community Slack group, or see Support for other ways to get help.