March 2024
This past summer Interrupt Labs partnered with the Defence Industry Internship Program (DIIP) to host two interns looking to dive into the world of vulnerability research. This article showcases the work of one of these interns, Samman Palihapitiya, who developed a Binary Ninja plugin called `semgrep_bn` that allows researchers to run Semgrep queries natively in Binary Ninja. We’ve open-sourced the plugin here.
Introduction
As a vulnerability research (VR) intern at Interrupt Labs I got a crash course in code auditing, reverse engineering, and exploit development as part of their Vulnerability Researcher Development Program (VRDP). Afterwards, I set about applying the skills I developed to an 8-week project. As someone new to VR, I found the process of analysing binary code - especially when it came to stripped binaries - pretty overwhelming. There was a clear gap between having a foundational understanding of bug hunting and doing it in practice. To bridge this gap, for my project I developed a simple Binary Ninja plugin: `semgrep_bn`. The plugin leverages the power of Semgrep for code scanning, presents findings in a digestible format, and enhances the overall efficiency of static binary analysis.
Semgrep is a lightweight static analysis tool with support for several popular programming languages. Users write `grep`-like queries that are applied over code. However, unlike `grep`, Semgrep understands the structure and semantics of code; e.g., searching for “2” would return a match on `x = 1; y = x + 1;`, because Semgrep can work out that `x + 1` evaluates to 2. Given the prevalence of powerful decompilers in popular interactive disassembly tools (e.g., Ghidra, Binary Ninja, IDA Pro), Semgrep seemed like a natural way to augment the vulnerability research process. This article describes my approach for integrating Semgrep with Binary Ninja.
Design
Dumping pseudo-C
Semgrep does not support assembly languages, so I first needed to lift the disassembled code into something that Semgrep could handle. Binary Ninja provides a “pseudo-C” representation, but it does not (at the time of writing) expose an API to extract the pseudo-C code directly. Walking the high-level intermediate representation (HLIL) to produce my own pseudo-C was an option, but I decided to reuse the following (simpler) approach from the PCDump-bn plugin:
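# imports needed by this snippet; `bv` (the BinaryView) and `function` come from the plugin context
from binaryninja.enums import DisassemblyOption
from binaryninja.function import DisassemblySettings
from binaryninja.lineardisassembly import LinearViewCursor, LinearViewObject
# configure pseudo-C (addresses hidden, wait for IL analysis to finish)
settings = DisassemblySettings()
settings.set_option(DisassemblyOption.ShowAddress, False)
settings.set_option(DisassemblyOption.WaitForIL, True)
obj = LinearViewObject.language_representation(bv, settings)
cursor = LinearViewCursor(obj)
# get function body
cursor.seek_to_address(function.highest_address)
body = bv.get_next_linear_disassembly_lines(cursor)
# get function header
cursor.seek_to_address(function.highest_address)
header = bv.get_previous_linear_disassembly_lines(cursor)
# join header and body lines into a single string
pseudo_c = "\n".join(str(line.contents) for line in header + body)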
While this approach successfully dumps the pseudo-C at a function level, simply stitching these per-function dumps together does not create a valid C file. I ran into several issues that I needed to deal with: invalid identifiers, global variable declarations, and “syntax errors” in the pseudo-C dump.
Invalid Identifiers
The C language requires that identifiers (e.g., function names, variables, etc.) contain only letters, digits, and underscores, and that they start with either a letter or an underscore. However, in Binary Ninja, users can call things whatever they want. This means that our pseudo-C dump may contain identifiers that are invalid according to the C language. To resolve this issue, I added a preprocessing step that checks every identifier and renames those that are not valid before the pseudo-C code is dumped.
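A minimal sketch of such a renaming step is shown here. It is not the plugin's actual code: the helper names `sanitize_identifier` and `sanitize_all` are hypothetical, and the policy of replacing every disallowed character with an underscore is an assumption made for illustration.

```python
import re

# Anything other than an ASCII letter, digit, or underscore is not allowed
# in a C identifier.
_INVALID_CHARS = re.compile(r"[^A-Za-z0-9_]")


def sanitize_identifier(name: str) -> str:
    """Return a C-compatible version of `name` (hypothetical helper)."""
    cleaned = _INVALID_CHARS.sub("_", name)
    # Identifiers may not be empty or begin with a digit.
    if not cleaned or cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


def sanitize_all(names: list[str]) -> dict[str, str]:
    """Map each original name to a unique, valid identifier."""
    renamed: dict[str, str] = {}
    used: set[str] = set()
    for name in names:
        candidate = sanitize_identifier(name)
        base, suffix = candidate, 1
        # Two different names can collapse to the same identifier, so
        # disambiguate with a numeric suffix.
        while candidate in used:
            candidate = f"{base}_{suffix}"
            suffix += 1
        used.add(candidate)
        renamed[name] = candidate
    return renamed
```

For example, `sanitize_identifier("std::vector<int>::push_back")` yields `std__vector_int___push_back`. The sketch does not check for C keywords, which a fuller version would probably also need to rename.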
Global Variables
As any C programmer knows, variables must be declared before they can be used. Thus, to generate “valid” C, I added a simple symbol analysis that ensured global variables were declared in the pseudo-C dump. This analysis leveraged Binary Ninja’s API to grab the names and types of “data symbols” that could be prepended to the pseudo-C dump generated above. While not strictly necessary for Semgrep, this is still useful for other source-based tools that require more “correct” C.
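The prepending step can be sketched roughly as follows. In the plugin the names and types come from Binary Ninja's data symbols; here they are passed in as plain `(type, name)` tuples so the example stays self-contained, and the helper name `declare_globals` is hypothetical.

```python
def declare_globals(globals_: list[tuple[str, str]], pseudo_c: str) -> str:
    """Prepend one declaration per (type, name) pair to a pseudo-C dump."""
    seen: set[str] = set()
    declarations: list[str] = []
    for type_name, var_name in globals_:
        # Declare each global only once, even if it is listed repeatedly.
        if var_name in seen:
            continue
        seen.add(var_name)
        declarations.append(f"{type_name} {var_name};")
    if not declarations:
        return pseudo_c
    return "\n".join(declarations) + "\n\n" + pseudo_c
```

This simple formatting only works for types that can be written entirely before the variable name; array and function-pointer types would need a declarator-aware formatter.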
Correcting Syntax Errors with Tree-Sitter
With global variables successfully declared, I soon encountered a new challenge: Semgrep was skipping parts of the pseudo-C dump containing “invalid” code. For example, I found that Binary Ninja would annotate functions as `__pure` or indicate that they are `__tailcall` recursive. These annotations caused issues with Semgrep since they weren’t strictly valid C. I decided to use Tree-sitter to investigate these issues.
Tree-sitter is a parsing framework designed to work on a large swathe of programming languages. It parses code into concrete syntax trees representing the syntactic structure of the code, enabling querying and manipulation of code at a (slightly!) higher level of abstraction. In particular, Tree-sitter provides a small declarative language for pattern matching elements in the syntax tree. Using this technique, I could locate and strip these annotations with the following query:
(function_definition
  type: (primitive_type)
  declarator: (function_declarator
    declarator: (identifier)
    parameters: (parameter_list)
    . (identifier) @annotation))
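Once a query like this has produced its `@annotation` captures, the stripping itself comes down to deleting byte ranges from the source. The sketch below shows only that step, under the assumption that each capture has been reduced to a non-overlapping, half-open `(start_byte, end_byte)` pair; the helper name `strip_spans` and the sample function are illustrative and not taken from the plugin.

```python
def strip_spans(source: bytes, spans: list[tuple[int, int]]) -> bytes:
    """Delete each half-open [start, end) byte range from `source`."""
    result = bytearray(source)
    # Work from the end of the file backwards so that earlier offsets stay
    # valid after each deletion. Spans are assumed not to overlap.
    for start, end in sorted(spans, reverse=True):
        del result[start:end]
    return bytes(result)


# Illustrative input: a function header carrying a trailing annotation.
sample = b"int32_t f(int32_t a) __pure\n{\n    return a;\n}\n"
start = sample.index(b" __pure")
cleaned = strip_spans(sample, [(start, start + len(b" __pure"))])
```

Working in bytes rather than characters matches the byte offsets that Tree-sitter nodes report, which matters as soon as the dump contains non-ASCII text.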
There was a period where I was trying to identify the reasons Semgrep overlooked specific code. I would run the problematic code through the Tree-sitter Playground to devise and refine queries that could accurately match and resolve these issues.
While Tree-sitter may seem like overkill for removing these annotations, it provides a powerful mechanism for further code analysis and transformation. As we deploy the plugin across a diverse range of binary executables, I anticipate encountering new scenarios where Semgrep may overlook certain lines of code due to annotations or naming conventions used in the Binary Ninja pseudo-C.
Presenting Results
Now we have some “valid” C code that we can run Semgrep over. As mentioned previously, Semgrep uses user-defined queries (sometimes referred to as rules). Fortunately, there are several open-source collections of queries that can be used for vulnerability research and code auditing, such as Semgrep’s rule repository and 0xdea’s repository. The plugin provides a pop-up to specify a Semgrep query file, runs the Semgrep query (or queries) over a Binary Ninja project, and then displays the results in a tab in Binary Ninja.
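As a rough illustration of the run-and-collect step, the sketch below invokes the Semgrep command-line tool with its `--config` and `--json` options and reduces the JSON report to the fields a results tab might display. The function names and the choice of fields are assumptions for this example rather than the plugin's implementation, and `run_semgrep` requires Semgrep to be installed and on the `PATH`.

```python
import json
import subprocess


def run_semgrep(rules_path: str, target_path: str) -> str:
    """Run the Semgrep CLI over `target_path` and return its JSON report."""
    completed = subprocess.run(
        ["semgrep", "--config", rules_path, "--json", target_path],
        capture_output=True,
        text=True,
        check=False,
    )
    return completed.stdout


def parse_findings(report: str) -> list[dict]:
    """Reduce a Semgrep JSON report to the fields needed for display."""
    findings = []
    for result in json.loads(report).get("results", []):
        findings.append({
            "rule": result["check_id"],
            "line": result["start"]["line"],  # 1-based line in the scanned file
            "message": result.get("extra", {}).get("message", ""),
        })
    return findings
```

Keeping the parsing separate from the subprocess call means the display logic can be exercised against a saved report without running Semgrep.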
Recall that in the pseudo-C dumping code from PCDump-bn our `LinearViewObject` was configured to exclude addresses in the disassembly settings. Therefore, to map Semgrep results to specific addresses we iterate through each function and capture the function’s start address and its pseudo-C representation, storing these as paired tuples. Upon completing the Semgrep analysis, the script uses this data to find the addresses within the binary. This enables direct click-through from Semgrep results to addresses in the disassembly view.
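One way to make this lookup concrete is to record the line on which each function's dump begins and then binary-search a reported line number against that list. The sketch below assumes the per-function dumps are joined with single newlines into one file; the helper names and the line-based lookup are illustrative and may differ from how the plugin resolves addresses.

```python
from bisect import bisect_right


def build_line_index(
    functions: list[tuple[int, str]],
) -> tuple[str, list[int], list[int]]:
    """Join (start_address, pseudo_c) pairs and record each first line."""
    first_lines: list[int] = []
    addresses: list[int] = []
    chunks: list[str] = []
    next_line = 1  # Semgrep reports 1-based line numbers
    for address, pseudo_c in functions:
        first_lines.append(next_line)
        addresses.append(address)
        chunks.append(pseudo_c)
        # Chunks are joined with "\n", so the next function starts on the
        # line directly after this one's last line.
        next_line += pseudo_c.count("\n") + 1
    return "\n".join(chunks), first_lines, addresses


def address_for_line(line: int, first_lines: list[int], addresses: list[int]) -> int:
    """Return the start address of the function whose dump contains `line`."""
    position = bisect_right(first_lines, line) - 1
    if position < 0:
        raise ValueError(f"line {line} precedes the first function")
    return addresses[position]
```

If anything is written to the file ahead of the function dumps, the reported line numbers need to be shifted by the length of that prefix before the lookup.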
Having access to Semgrep results within Binary Ninja enables quick and simple vulnerability analysis, particularly with queries written to detect complex vulnerabilities like use-after-free and double-free. These vulnerabilities can be hard to spot for an inexperienced researcher, as they are generally not identifiable by a single line of code but rather require semantic analysis across the binary. Semgrep excels in this regard, and with `semgrep_bn`, analysts can quickly be directed to these potential security issues.
Conclusion
Overall, `semgrep_bn` delivered the functionality that I had sought to support my vulnerability research journey. The plugin enables quick and seamless use of Semgrep analyses within Binary Ninja, making the ramp up on performing vulnerability research a lot easier. Furthermore, using several techniques to generate improved pseudo-C decompilation opens the door to leveraging other source-code analysis tools. This was an unexpected outcome for me, but definitely something that could be utilised for other vulnerability research activities in the future.
`semgrep_bn` has proven to be a useful tool in my vulnerability research learning, and I hope it can be of use to others as well. Once again, the code for the plugin has been open-sourced and is available here.