Interrupt Labs is often asked to undertake dynamic analysis of stripped closed-source binaries. Even after recovering a lot of information through static analysis (often with IDA), the lack of symbols makes dynamic analysis challenging (using IDA’s debugging features isn’t always possible). In one particular case, we were presented with a stripped VxWorks image and managed to recover many symbols statically but were unable to use them to debug outside of IDA. I was aware that GDB could load symbols from .sym files and wondered if it was possible to extract the information from an IDA Database (.idb or .i64) into a .sym file. This post details my journey into the internals of these two file formats and will cover:
What a .sym file is.
What information is needed to create a .sym file.
What challenges arise in extracting that information from an IDA Database.
What challenges arise in creating a .sym file with the extracted information.
How the resultant .sym file was tested.
A detailed appendix with information about the IDA Database format.
The tool created as a result of this research is available here
What's a .sym?
My first challenge was working out what a VxWorks .sym file is. I wasn’t provided with any examples and could find no definitive documentation. After some searching I discovered the following from the “VxWorks Kernel Programmer Guide 6.2”:
A vxWorks.sym file is created during image building if INCLUDE_STANDALONE_SYM_TBL is not set (otherwise the symbols are bundled with the image).
objcopy somehow has the ability to create .sym files.
With this knowledge, I began searching GitHub and eventually found a repository with VxWorks build tools. This enabled me to find the link between INCLUDE_STANDALONE_SYM_TBL and objcopy:
Here is the relevant section from objcopy’s man page:
--extract-symbol
Keep the file's section flags and symbols but remove all section data.
Specifically, the option:
removes the contents of all sections;
sets the size of every section to zero; and
sets the file's start address to zero.
This option is used to build a .sym file for a VxWorks kernel.
This suggested that a .sym file was really just an ELF file with the non-debugging stuff removed. Now I just needed to find some examples for confirmation and reference.
I set out searching GitHub again and eventually wrote a Python script to download all files named vxworks.sym.
Script
from itertools import count
from os import environ
from pathlib import Path
from time import sleep
from urllib.request import urlretrieve

import requests

try:
    PAT = environ["GITHUB_PAT"]
except KeyError:  # a missing environment variable raises KeyError
    print("Set the GITHUB_PAT environment variable to a personal access token from:")
    print("https://github.com/settings/tokens")
    exit(1)

DOWNLOAD_FOLDER = (Path(__file__).parent.parent / "vxworks" / "syms").resolve()
DOWNLOAD_FOLDER.mkdir(parents=True, exist_ok=True)

# Track blob hashes so that identical files are only downloaded once.
downloaded_hashes = set()
for page in count(1):
    results = requests.get(
        "https://api.github.com/search/code",
        headers={"Authorization": f"Token {PAT}"},
        params={"q": "filename:vxworks.sym", "per_page": "100", "page": str(page)},
    ).json()["items"]
    if len(results) == 0:
        break
    for result in results:
        if (
            result["name"].lower() == "vxworks.sym"
            and result["sha"] not in downloaded_hashes
        ):
            downloaded_hashes.add(result["sha"])
            urlretrieve(
                result["html_url"].replace("/blob/", "/raw/"),
                DOWNLOAD_FOLDER / f"""{result["sha"]}.sym""",
            )
            print(result["sha"])
    # Pause between pages to avoid hitting GitHub's API rate limit.
    sleep(60)
After deduplication, I ended up with 64 examples. Here is the readelf -S output for one of them:
As expected, all the sections are empty except for:
.shstrtab – Contains strings for section names.
.symtab – Contains the symbol table.
.strtab – Contains strings for symbol names.
What to Extract?
My next task was working out what I needed to extract from the IDA Database file to create the .sym file. Luckily the ELF format is very well documented. Every symbol should have:
A name (stored in .strtab).
A value (address).
A type (either STT_OBJECT for a variable or STT_FUNC for a function).
A section reference (an index into the ELF’s section header table).
This meant that I needed to extract section information to create the section header table, along with function and variable information for the symbols themselves.
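As a rough illustration of the fields listed above, the following sketch packs a single symbol table entry with Python's struct module. The helper name pack_symbol and the example values are hypothetical and not taken from the tool. The field order and constants come from the ELF specification, and global binding is assumed for simplicity.

```python
import struct

# Constants from the ELF specification.
STB_GLOBAL = 1
STT_OBJECT = 1
STT_FUNC = 2


def pack_symbol(name_offset, value, size, sym_type, section_index,
                is_64bit=False, little_endian=True):
    """Pack one symbol table entry (hypothetical helper).

    name_offset is the offset of the symbol's name inside .strtab and
    section_index is an index into the section header table.
    """
    endian = "<" if little_endian else ">"
    info = (STB_GLOBAL << 4) | (sym_type & 0xF)  # binding in the high nibble
    other = 0  # default visibility
    if is_64bit:
        # Elf64_Sym: st_name, st_info, st_other, st_shndx, st_value, st_size
        return struct.pack(endian + "IBBHQQ", name_offset, info, other,
                           section_index, value, size)
    # Elf32_Sym: st_name, st_value, st_size, st_info, st_other, st_shndx
    return struct.pack(endian + "IIIBBH", name_offset, value, size, info,
                       other, section_index)


# Example: a function symbol at 0x401136 in section 1 of a 64-bit file.
entry = pack_symbol(1, 0x401136, 0, STT_FUNC, 1, is_64bit=True)
```

Note that the 32-bit and 64-bit layouts order their fields differently, so the two cases cannot share a single format string.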
IDA Database Parsing
The next step was to parse the IDA Database and extract the necessary information. This wasn’t easy since I didn’t have access to the IDA SDK, but luckily there were some open-source tools that I could use for reference.
There were two sections of the IDA Database that interested me:
NAM – Contains the addresses of names. Has a simple array structure.
ID0 – Contains almost all other useful information. Has a complex B-tree structure.
Whilst parsing NAM was simple, parsing ID0 required many steps:
Extracting the B-tree structures.
Converting the B-tree structures to a more Python-friendly format.
Writing a ranged B-tree search algorithm.
Implementing IDA’s proprietary “NetNode” system.
Writing the correct “NetNode” queries to extract the required information.
.sym Creation
The final problem was the creation of the .sym file. Building a program to read and write ELF files was fairly easy thanks to extensive documentation and copious examples. The challenge came from the difference between IDA section metadata and ELF section metadata.
IDA sections have:
A name.
A start address.
A size.
Three flags: read, write and execute (having none of these indicates unknown permissions).
ELF sections have:
A name.
A start address.
A size.
Over 10 flags.
A type.
To resolve this difference I needed to make a few educated guesses:
For flags:
Start with any that can be inferred from the section name.
Add SHF_ALLOC if the section name is not recognised.
Add SHF_WRITE if the section’s write flag is set (or it has no flags set).
Add SHF_EXECINSTR if the section’s execute flag is set (or it has no flags set).
For type:
If it can be inferred from the section name, use that.
Otherwise use SHT_PROGBITS.
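The guesses above could be expressed roughly as follows. This is a minimal sketch and not the tool's actual code: the function name, the KNOWN_SECTIONS table and its contents are assumptions made for illustration, while the numeric constants are the standard values from the ELF specification.

```python
# Standard ELF section flag and type values.
SHF_WRITE = 0x1
SHF_ALLOC = 0x2
SHF_EXECINSTR = 0x4
SHT_PROGBITS = 1
SHT_NOBITS = 8

# Hypothetical lookup table: section name -> (type, flags).
KNOWN_SECTIONS = {
    ".text": (SHT_PROGBITS, SHF_ALLOC | SHF_EXECINSTR),
    ".data": (SHT_PROGBITS, SHF_ALLOC | SHF_WRITE),
    ".rodata": (SHT_PROGBITS, SHF_ALLOC),
    ".bss": (SHT_NOBITS, SHF_ALLOC | SHF_WRITE),
}


def guess_type_and_flags(name, readable, writable, executable):
    """Guess an ELF section type and flags from IDA segment metadata."""
    known = KNOWN_SECTIONS.get(name)
    # Unrecognised names fall back to an allocated PROGBITS section.
    sh_type, flags = known if known else (SHT_PROGBITS, SHF_ALLOC)
    # No permission flags at all means the permissions are unknown.
    unknown_permissions = not (readable or writable or executable)
    if writable or unknown_permissions:
        flags |= SHF_WRITE
    if executable or unknown_permissions:
        flags |= SHF_EXECINSTR
    return sh_type, flags
```

Treating unknown permissions as both writable and executable errs on the side of being permissive, which suits a file that is only ever used for symbol lookup.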
Testing
To test I started by writing a simple C program:
#include <stdio.h>

int number = 0;

void or() {
    number |= 1;
    printf("Number: %d\n", number);
}

void shift() {
    number <<= 1;
    printf("Number: %d\n", number);
}

int main() {
    or();
    shift();
    return 0;
}
And compiled it (stripping symbols):
gcc test.c -no-pie -s -o test
Next, I had an IDA Database created for the program with the global variable number and function or labelled. I then used the tool to convert the database to a .sym file:
After loading the program into GDB I ran the following:
(gdb) symbol-file test.sym
Reading symbols from test.sym...
(No debugging symbols found in test.sym)
(gdb) b or
Breakpoint 1 at 0x401136
(gdb) r
Starting program: /mnt/hgfs/symbols-converter/vxworks/test
Breakpoint 1, 0x0000000000401136 in ?? ()
(gdb) p (int) number
$1 = 0
(gdb) si 10
0x000000000040115f in ?? ()
(gdb) p (int) number
$2 = 1
Which shows that the .sym file can be loaded and allows debugging as expected.
Extension
I ended up extending the tool by adding a few more features:
An option to include automatic (sub_) function names.
An option to input from a Ghidra XML export rather than an IDA Database.
An option to export to JSON or text rather than a .sym file.
Appendix: IDA Database Format
This appendix was created with reference to databases created in IDA Pro 7.7. Integers are unsigned and little-endian unless otherwise specified.
Header
| Offset | Size | Field | Purpose |
| --- | --- | --- | --- |
| 0x00 | 4 | magic | Should be either IDA0, IDA1 or IDA2. IDA0 and IDA1 imply that the file has a 32-bit word size, IDA2 implies that it has a 64-bit word size. |
| 0x06 | 8 | id0_offset | Offset to the ID0 section from the start of the file. |
| 0x0e | 8 | id1_offset | Offset to the ID1 section from the start of the file. |
| 0x1a | 4 | signature | Should be 0xaabbccdd. |
| 0x1e | 2 | version | Should be 6. |
| 0x20 | 8 | nam_offset | Offset to the NAM section from the start of the file. |
| 0x28 | 8 | seg_offset | Offset to the SEG section from the start of the file. |
| 0x30 | 8 | til_offset | Offset to the TIL section from the start of the file. |
| 0x38 | 4 | id0_checksum | CRC32 checksum of the ID0 section. |
| 0x3c | 4 | id1_checksum | CRC32 checksum of the ID1 section. |
| 0x40 | 4 | nam_checksum | CRC32 checksum of the NAM section. |
| 0x44 | 4 | seg_checksum | CRC32 checksum of the SEG section. |
| 0x48 | 4 | til_checksum | CRC32 checksum of the TIL section. |
| 0x4c | 8 | id2_offset | Offset to the ID2 section from the start of the file. |
| 0x54 | 4 | id2_checksum | CRC32 checksum of the ID2 section. |
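A minimal sketch of reading this header with Python's struct module is shown below. The names HEADER_FORMAT, IdbHeader, parse_idb_header and word_size are hypothetical. The format string assumes that the fields are packed without alignment, that the bytes the table does not describe can simply be skipped, and that id1_offset sits at 0x0e and id2_checksum at 0x54, as the sizes of the preceding fields imply.

```python
import struct
from collections import namedtuple

# "2x" and "4x" skip bytes that the table does not describe.
HEADER_FORMAT = "<4s2xQQ4xIHQQQIIIIIQI"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 0x58 bytes

IdbHeader = namedtuple("IdbHeader", [
    "magic", "id0_offset", "id1_offset", "signature", "version",
    "nam_offset", "seg_offset", "til_offset",
    "id0_checksum", "id1_checksum", "nam_checksum", "seg_checksum",
    "til_checksum", "id2_offset", "id2_checksum",
])


def parse_idb_header(data):
    """Unpack and sanity-check the file header (hypothetical helper)."""
    header = IdbHeader._make(struct.unpack_from(HEADER_FORMAT, data))
    if header.magic not in (b"IDA0", b"IDA1", b"IDA2"):
        raise ValueError("not an IDA Database")
    if header.signature != 0xAABBCCDD:
        raise ValueError("unexpected signature")
    return header


def word_size(header):
    """Return the file's word size in bytes: 8 for IDA2, otherwise 4."""
    return 8 if header.magic == b"IDA2" else 4
```

The word size returned here is needed later on, since net node IDs and indices are stored with that width.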
Section Header
All sections have the following header. The section contents start immediately after.
| Offset | Size | Field | Purpose |
| --- | --- | --- | --- |
| 0x00 | 1 | compression_method | 0 for no compression. 2 for Zlib compression. |
| 0x01 | 8 | section_length | The length of the section (before decompression). |
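Reading a section then amounts to unpacking these nine bytes and, if necessary, decompressing what follows. The sketch below assumes the whole file is already in memory as a bytes object. The name read_section is hypothetical.

```python
import struct
import zlib

SECTION_HEADER = struct.Struct("<BQ")  # compression_method, section_length


def read_section(data, offset):
    """Return the (decompressed) contents of the section at `offset`."""
    method, length = SECTION_HEADER.unpack_from(data, offset)
    start = offset + SECTION_HEADER.size
    body = data[start:start + length]
    if len(body) != length:
        raise ValueError("section is truncated")
    if method == 0:
        return body
    if method == 2:
        return zlib.decompress(body)
    raise ValueError(f"unknown compression method {method}")
```

For example, read_section(data, header.id0_offset) would return the raw ID0 contents, given a header parsed from the same file.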
ID0
Header
| Offset | Size | Field | Purpose |
| --- | --- | --- | --- |
| 0x00 | 4 | next_free_index | The index of the next free page. |
| 0x04 | 2 | page_size | The number of bytes occupied by a single page. |
| 0x06 | 4 | root_page_index | The index of the root page. |
| 0x0a | 4 | record_count | The number of non-dead records. |
| 0x0e | 4 | page_count | The number of non-dead pages. |
| 0x13 | 9 | magic | Should be B-tree v2. |
page_size - 0x1c bytes of padding follow the header.
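The following sketch reads the ID0 header and slices out a page by index. It assumes that the header and its padding occupy page 0, so that page N starts at N * page_size from the start of the section. That is an inference from the padding rule above and is not stated explicitly. All names are hypothetical.

```python
import struct
from collections import namedtuple

# "1x" skips the single undescribed byte at 0x12; the total is 0x1c bytes.
ID0_HEADER = struct.Struct("<IHIII1x9s")

Id0Header = namedtuple("Id0Header", [
    "next_free_index", "page_size", "root_page_index",
    "record_count", "page_count", "magic",
])


def parse_id0_header(section):
    """Unpack the header at the start of a decompressed ID0 section."""
    header = Id0Header._make(ID0_HEADER.unpack_from(section))
    if header.magic != b"B-tree v2":
        raise ValueError("not a B-tree v2 section")
    return header


def get_page(section, header, index):
    """Return the raw bytes of page `index` (assumes page 0 is the header)."""
    start = index * header.page_size
    return section[start:start + header.page_size]
```

With this in place, get_page(section, header, header.root_page_index) gives the starting point for any B-tree search.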
B-Tree
Introduction
The contents of ID0 are laid out as a B-tree. A B-tree is similar to a binary search tree except that each page (collection of records) may have more than two children. Each record has a key (shown in the diagram below) and a value.
Page
Every page starts with the following header:
| Offset | Size | Field | Purpose |
| --- | --- | --- | --- |
| 0x00 | 4 | first_page_index | The index of the first (left-most) child page. If this is 0 then the page is a leaf page, otherwise, it is an index page. |
| 0x04 | 2 | count | The number of records in the page. |

After this, there is an array of count record meta structures, one per record.
Index Record Meta
| Offset | Size | Field | Purpose |
| --- | --- | --- | --- |
| 0x00 | 4 | page_index | The index of the child page to the right of the record. |
| 0x04 | 2 | record_offset | The offset from the start of the page to the record. |
Leaf Record Meta
| Offset | Size | Field | Purpose |
| --- | --- | --- | --- |
| 0x00 | 4 | indent | The number of bytes to prepend to this record’s key from the start of the last (next-left) record’s key. |
| 0x04 | 2 | record_offset | The offset from the start of the page to the record. |
Record
Each field follows immediately from the last.
| Size | Field | Purpose |
| --- | --- | --- |
| 2 | key_length | The length of the record’s key (without indent). |
| key_length | key | The record’s key. |
| 2 | value_length | The length of the record’s value. |
| value_length | value | The record’s value. |
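Putting the page header, record meta and record layouts together, a page could be decoded roughly as follows. This is a sketch based only on the tables above. The function names are hypothetical and each record is returned as a (child_page_index, key, value) tuple, with None in place of the child index for leaf pages.

```python
import struct

PAGE_HEADER = struct.Struct("<IH")  # first_page_index, count
RECORD_META = struct.Struct("<IH")  # page_index or indent, record_offset


def read_record(page, offset):
    """Read the length-prefixed key and value stored at `offset`."""
    (key_length,) = struct.unpack_from("<H", page, offset)
    offset += 2
    key = page[offset:offset + key_length]
    offset += key_length
    (value_length,) = struct.unpack_from("<H", page, offset)
    offset += 2
    return key, page[offset:offset + value_length]


def parse_page(page):
    """Return (first_page_index, [(child_page_index, key, value), ...])."""
    first_page_index, count = PAGE_HEADER.unpack_from(page, 0)
    is_leaf = first_page_index == 0
    records = []
    previous_key = b""
    for i in range(count):
        meta_offset = PAGE_HEADER.size + i * RECORD_META.size
        first_field, record_offset = RECORD_META.unpack_from(page, meta_offset)
        key, value = read_record(page, record_offset)
        if is_leaf:
            # first_field is the indent: reuse that many leading bytes of
            # the previous record's full key.
            key = previous_key[:first_field] + key
            records.append((None, key, value))
        else:
            # first_field is the child page to the right of this record.
            records.append((first_field, key, value))
        previous_key = key
    return first_page_index, records
```

Expanding the indented keys while parsing means later search code can compare full keys without having to look at neighbouring records.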
Net Nodes
Introduction
Net nodes are IDA’s method of grouping records related to something (often an address). Each net node has an integer node ID. It may also have a string node ID which can be resolved to the integer node ID.
The records inside a net node are identified by a tag (single byte value) and index (4 or 8-byte value). The B-tree structure makes it efficient to find all records with a given tag.
String Node ID
String node IDs can be resolved to integer node IDs by searching the B-tree for a record with the key:
N<string_node_id>
The record’s value gives the integer node ID.
Name
Some nodes have a string name. This can be found by searching the B-tree for a record with the key:
.<node_id>N
node_id is big-endian with a size matching the file’s word size (see Header). The record’s value gives the name.
Records
You can find a record with a specific index and tag by searching for the key:
.<node_id><tag><index>
Both node_id and index are big-endian with sizes matching the file’s word size.
All records with a given tag can be found by performing a ranged search on the B-tree.
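The sketch below shows how such keys might be built and how a ranged search behaves. It assumes the record key layout is a literal ".", then the node ID, the single tag byte and the index. To stay short, it searches a sorted list of (key, value) pairs standing in for the B-tree rather than walking real pages. All function names are hypothetical.

```python
import struct
from bisect import bisect_left


def pack_word(value, word_size):
    """Pack an integer as a big-endian word of 4 or 8 bytes."""
    return struct.pack(">Q" if word_size == 8 else ">I", value)


def name_key(node_id, word_size):
    """Key of the record holding a node's name."""
    return b"." + pack_word(node_id, word_size) + b"N"


def record_key(node_id, tag, index, word_size):
    """Key of the record with the given tag and index (assumed layout)."""
    return b"." + pack_word(node_id, word_size) + tag + pack_word(index, word_size)


def records_with_tag(sorted_records, node_id, tag, word_size):
    """Ranged search: every (key, value) whose key starts with the prefix."""
    prefix = b"." + pack_word(node_id, word_size) + tag
    keys = [key for key, _ in sorted_records]
    start = bisect_left(keys, prefix)  # first key >= prefix
    found = []
    for key, value in sorted_records[start:]:
        if not key.startswith(prefix):
            break
        found.append((key, value))
    return found
```

Because the keys are big-endian and compared bytewise, records sharing a node ID and tag are adjacent and ordered by index, which is what makes the ranged search efficient.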
Variable Length Integers
Record values often use IDA’s proprietary variable-length integer formats.
Up to two bytes (T):
If the first byte begins with 0b11, the value is stored in the following two bytes.
Else if the first byte begins with 0b1, the value is stored in the remainder of the first byte and the following byte.
Else the value is stored in the first byte.
Up to four bytes (U):
If the first byte begins with 0b111, the value is stored in the following four bytes.
Else if the first byte begins with 0b11, the value is stored in the remainder of the first byte and the following three bytes.
Else if the first byte begins with 0b1, the value is stored in the remainder of the first byte and the following byte.
Else the value is stored in the first byte.
Up to eight bytes (V) – Stored as two consecutive Us. The first U is the lower four bytes, and the second is the upper four bytes.
Up to word size (W) – U if the file’s word size is 32 bits and V if it is 64 bits.
All are big-endian.
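A minimal decoder for the unsigned formats might look like the sketch below. Each function returns the decoded value together with the offset of the next unread byte. It assumes that the 0b11 form of U occupies four bytes in total (the rest of the first byte plus three more), which is the only reading that keeps the value within four bytes. The function names are hypothetical.

```python
def decode_t(data, offset=0):
    """Decode an up-to-two-byte integer; return (value, next_offset)."""
    first = data[offset]
    if (first & 0xC0) == 0xC0:
        return int.from_bytes(data[offset + 1:offset + 3], "big"), offset + 3
    if first & 0x80:
        return ((first & 0x3F) << 8) | data[offset + 1], offset + 2
    return first, offset + 1


def decode_u(data, offset=0):
    """Decode an up-to-four-byte integer; return (value, next_offset)."""
    first = data[offset]
    if (first & 0xE0) == 0xE0:
        return int.from_bytes(data[offset + 1:offset + 5], "big"), offset + 5
    if (first & 0xC0) == 0xC0:
        # Assumption: five bits of the first byte plus three more bytes.
        rest = int.from_bytes(data[offset + 1:offset + 4], "big")
        return ((first & 0x1F) << 24) | rest, offset + 4
    if first & 0x80:
        return ((first & 0x3F) << 8) | data[offset + 1], offset + 2
    return first, offset + 1


def decode_v(data, offset=0):
    """Decode two consecutive Us: the lower half first, then the upper."""
    low, offset = decode_u(data, offset)
    high, offset = decode_u(data, offset)
    return low | (high << 32), offset


def decode_w(data, offset, word_size):
    """Decode a word-sized integer: U for 4-byte words, V for 8-byte."""
    if word_size == 8:
        return decode_v(data, offset)
    return decode_u(data, offset)
```

Returning the next offset lets a caller decode several consecutive fields from one record value by threading the offset through each call.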
Analysis
Types use the letters from Variable Length Integers. Upper-case means unsigned and lower-case means signed.
Segments (Sections)
Segment information is found in the $ segs net node. Every record with the tag S has the format:
| Type | Field | Description |
| --- | --- | --- |
| W | start | The start address of the segment. |
| W | length | The length of the segment. |
| W | name_index | The index of the segment’s name in $ segstrings (covered later). |
| W | class_index | The index of the segment’s class in $ segstrings (covered later). |