Running an application security program involves many detection tools, often with thousands of signatures each. The abundance of tooling and the rapid pace of CVEs being published every year can lead to a massive number of false positive alerts. Even worse than the endless hours spent triaging false positives is the fact that these tools often aren’t sophisticated enough, or lack the context, to find many of the actual vulnerabilities hiding within your codebase, and most would agree that a false negative is worse than a false positive. On top of their mediocre detection capabilities, many of these tools are also very expensive and don’t let you customize their detection signatures.

It’s clear that the common web application threats of the past (XSS and SQLi) are becoming harder to find in production environments, and much of this is due to improvements in detection tooling. However, new vulnerability classes, or mutations of previous ones, are constantly being discovered. Take a look at any of the high impact bug bounty reports disclosed in recent years: it’s unlikely that any off-the-shelf security tool would have caught these bugs. Although every once in a while a ridiculously easy to spot vulnerability surfaces, this isn’t the norm. Instead, a deep understanding of the specific application and many iterations of manual testing are often required to find these issues.

Luckily, in the past few years the community has made massive contributions to open source tools that solve many of these issues.

In this series of posts I’ll demo a few open source tools that allow fine-grained customization of detection signatures, so that scans can be tailored to your specific application:

Part 1: Semgrep - SAST

“If a developer has to convince their manager to spend a few million dollars on advanced security tools each time they change jobs, the future is bleak.” - Semgrep philosophy

Semgrep is a code analysis tool that’s popular for its speed and easy-to-use rules engine. Originally developed as an internal tool at Facebook back in 2009, Semgrep has since been open sourced and is now maintained by r2c. The name is shorthand for “Semantic Grep”, which hints at how it works: in simple terms, it’s a pattern matching engine. This might sound like many other basic linting or grepping tools, but what makes Semgrep stand out is that its rules are robust and flexible enough to describe what the code is doing, not just what the text looks like. Pair this with a rule syntax that is dead simple, and the tool is excellent for customizing detection capabilities specific to your application.

The Semgrep engine runs locally, so the source code of your application is never exposed to a third party, unlike many SaaS-based AST tools that require you to upload the source. In addition to protecting your intellectual property, running locally avoids the latency of sending code over the wire or fetching rules on every scan. This makes Semgrep a great choice for integration in earlier stages of the SDLC, such as the IDE, git commit hooks, or your CI/CD pipeline. Security tools have a bad reputation for slowing down the build process, but that’s not an issue with Semgrep.
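
As one possible way to wire this into a git commit hook, here is a minimal sketch of a `.pre-commit-config.yaml` for the pre-commit framework. It assumes the hook definition published in the Semgrep repository (hook id `semgrep`). The repository URL, the pinned `rev`, and the local `rules/` directory are assumptions for illustration, so check them against the current Semgrep documentation before relying on this.

```yaml
# .pre-commit-config.yaml (sketch; the values below are placeholders)
repos:
  - repo: https://github.com/semgrep/semgrep
    rev: v1.0.0  # placeholder tag; pin to a current release instead
    hooks:
      - id: semgrep
        # --config points at a hypothetical local directory of custom rules;
        # --error makes the hook exit non-zero when there are findings
        args: ["--config", "rules/", "--error"]
```

With a file like this in place, running `pre-commit install` once registers the hook, and the scan then runs locally on each commit.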

Semgrep supports most programming languages (20+). It also includes a registry of rulesets for these languages, along with sets tailored for popular web application frameworks like Django and Express.

Example rule: log4j Tainted Argument

Before I get into an example, I recommend that anyone who has the time read through the Semgrep documentation. It’s easy to understand and provides excellent sample rules and code to try out, as well as an interactive playground for testing.

For this post I’ll skip past the rudimentary examples and present one that’s designed to detect a vulnerability anyone in the security community should be able to understand: log4j

The rule used here, log4j2 Tainted Argument, is pre-configured in the Semgrep playground, where it can be run against the sample test code provided.

I’ll try to break down what this rule is doing and highlight some of the Semgrep-specific syntax being used here:

- pattern: $LOGGER.$METHOD(...);
- pattern-inside: |
    import org.apache.log4j.$PKG;
    $LOGGER = $PKG. ... ;
- pattern-not: $LOGGER.$METHOD("...");

We only need to cover two of the basic building blocks of Semgrep’s pattern syntax to understand this rule:

  1. Metavariables

“Metavariables are an abstraction to match code when you don’t know the value or contents ahead of time, similar to capture groups in regular expressions.”

Metavariables are denoted by “$”, so when the rule includes this statement:

import org.apache.log4j.$PKG;

  • $PKG is essentially a placeholder for whatever name is imported from org.apache.log4j, so the rule doesn’t need to know it ahead of time

  2. Ellipses

“The ellipsis operator (...) abstracts away a sequence of zero or more arguments, statements, parameters, fields, characters, etc.”

In our example we see the following:

- pattern: $LOGGER.$METHOD(...)

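  • This will match any method call on any object, with any number of arguments: $LOGGER stands for the object, $METHOD for the method name, and the ellipsis for the argument list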

- pattern-not: $LOGGER.$METHOD("...")

  • This will exclude calls whose only argument is a hardcoded string from being flagged; inside quotes, the ellipsis matches any string contents. The log4j vulnerability is only exploitable when user input is logged through it, so by excluding hardcoded strings we can reduce false positives.

Now that we understand how this pattern is constructed it’s easy to see how the rule is mapped to the test code provided:

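$LOGGER.$METHOD(...); lines up with the logging call in the test code, whose argument is "Request User Agent:" + userAgent. Because that argument is a string concatenated with a variable rather than a single hardcoded string, the pattern-not clause doesn’t exclude it.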
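
To run these patterns outside the playground they need to sit inside a complete rule file. The following is a minimal sketch of what that could look like: the three patterns are copied unchanged from the rule shown above, while the rule id, message, and severity are placeholders I’ve assumed for illustration and are not taken from the playground rule.

```yaml
rules:
  - id: log4j-tainted-argument-sketch  # hypothetical id
    languages: [java]
    severity: WARNING                  # assumed severity
    message: A non-literal value is passed to a log4j logging call; check whether it can contain user input.
    # All entries under "patterns" must hold for a finding to be reported
    patterns:
      # Any method call on any object, with any arguments
      - pattern: $LOGGER.$METHOD(...);
      # ...but only where a log4j import is present and the object was assigned from it
      - pattern-inside: |
          import org.apache.log4j.$PKG;
          $LOGGER = $PKG. ... ;
      # ...and not when the only argument is a hardcoded string
      - pattern-not: $LOGGER.$METHOD("...");
```

Saved under a name of your choosing (say, log4j-rule.yaml), a file like this can be passed to the CLI with `semgrep --config log4j-rule.yaml` followed by the path to scan.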

How can we utilize Semgrep?

The example I demonstrated is a unique use case that highlights the flexibility of Semgrep. Most organizations would likely identify this vulnerability through a dependency scanner (SCA) tool. While that approach works to identify the existence of log4j in your application, it doesn’t tell you whether the logger is actually being used anywhere, and more specifically whether it is being used to log arbitrary user input.

It’s important to note that with any form of static analysis, the detection capabilities are never going to be perfect. Even in this example there could definitely be false positives detected. For example your application could exclusively be logging a variable that does not contain user input and this rule would still flag it. It’s also possible that there could be an implementation that logs user input but in a way that doesn’t fit the pattern we defined.
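
One way to narrow the first kind of false positive is Semgrep’s taint mode, which only reports a sink when data from a declared source reaches it. The following is a rough sketch, not the rule from the playground: the rule id, the choice of `getHeader` on an `HttpServletRequest` as the source of user input, and the list of logging method names are all assumptions for illustration and would need adapting to your application.

```yaml
rules:
  - id: log4j-user-input-logged-sketch  # hypothetical id
    mode: taint
    languages: [java]
    severity: WARNING
    message: Request header data reaches a log4j logging call.
    # Assumed source of user input: HTTP request headers
    pattern-sources:
      - pattern: (HttpServletRequest $REQ).getHeader(...)
    # Sinks: logging-style method calls in files that import from log4j
    pattern-sinks:
      - patterns:
          - pattern-inside: |
              import org.apache.log4j.$PKG;
              ...
          - pattern: $LOGGER.$METHOD(...)
          # Restrict $METHOD to an assumed set of logging method names
          - metavariable-regex:
              metavariable: $METHOD
              regex: ^(trace|debug|info|warn|error|fatal)$
```

This trades one limitation for another: variables that never hold user input are no longer flagged, but input arriving through a source that isn’t listed would be missed.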

An efficient way to streamline the process of defining custom rules for your application is to leverage vulnerabilities you’ve discovered through other means, such as DAST tooling or manual penetration testing. For each vulnerability that your existing SAST tools didn’t discover, think about how a pattern could be defined to catch future occurrences. Given how simple Semgrep’s rules are, it doesn’t take significant effort or a seasoned software engineer to build them.