Converting IDA DB to VxWorks .sym

by Joseph B, July 2022

Goal

Interrupt Labs is often asked to undertake dynamic analysis of stripped closed-source binaries. Even after recovering a lot of information through static analysis (often with IDA), the lack of symbols makes dynamic analysis challenging (using IDA’s debugging features isn’t always possible). In one particular case, we were presented with a stripped VxWorks image and managed to recover many symbols statically but were unable to use them to debug outside of IDA. I was aware that GDB could load symbols from .sym files and wondered if it was possible to extract the information from an IDA Database (.idb or .i64) into a .sym file. This post details my journey into the internals of these two file formats and will cover:

  • What a .sym file is.
  • What information is needed to create a .sym file.
  • What challenges arise in extracting that information from an IDA Database.
  • What challenges arise in creating a .sym file with the extracted information.
  • How the resultant .sym file was tested.
  • A detailed appendix with information about the IDA Database format.

The tool created as a result of this research is available here

What's a .sym?

My first challenge was working out what a VxWorks .sym file is. I wasn’t provided with any examples and could find no decisive documentation. After some searching I discovered the following from the “VxWorks Kernel Programmer Guide 6.2”:

  • A vxWorks.sym file is created during image building if INCLUDE_STANDALONE_SYM_TBL is not set (otherwise the symbols are bundled with the image).
  • objcopy somehow has the ability to create .sym files.

With this knowledge, I began searching GitHub and eventually found a repository with VxWorks build tools. This enabled me to find the link between INCLUDE_STANDALONE_SYM_TBL and objcopy:


ifneq   ($(findstring INCLUDE_NET_SYM_TBL, $(COMPONENTS)),)
define MAKE_SYM
	$(BINXSYM) $@ $@.sym
endef
define MAKE_SYM_CVT
	$(LDOUT_SYMS) $@.sym
endef
endif


BINXSYM		= $(ENV_BIN)$(


EXTRACT_SYM_FLAG= --extract-symbol

Here is the relevant section from objcopy’s man page:


--extract-symbol
    Keep the file's section flags and symbols but remove all section data.
    Specifically, the option: 
    *<removes the contents of all sections;>
    *<sets the size of every section to zero; and >
    *<sets the file's start address to zero.>
    This option is used to build a  .sym file for a VxWorks kernel.

This suggested that a .sym file was really just an ELF file with the non-debugging stuff removed. Now I just needed to find some examples for confirmation and reference.

I set out searching GitHub again and eventually wrote a Python script to download all files named vxworks.sym.

Script
from itertools import count
from os import environ
from pathlib import Path
from time import sleep
from urllib.request import urlretrieve

import requests

# The GitHub code search API requires an authenticated request.
try:
    PAT = environ["GITHUB_PAT"]
except KeyError:  # a missing environment variable raises KeyError
    print("Set the GITHUB_PAT environment variable to a personal access token from:")
    print("https://github.com/settings/tokens")
    raise SystemExit(1)

DOWNLOAD_FOLDER = (Path(__file__).parent.parent / "vxworks" / "syms").resolve()
DOWNLOAD_FOLDER.mkdir(parents=True, exist_ok=True)

# Blob hashes seen so far, used to skip duplicate files.
downloaded_hashes = set()

# Walk the paginated search results until an empty page is returned.
for page in count(1):
    results = requests.get(
        "https://api.github.com/search/code",
        headers={"Authorization": f"Token {PAT}"},
        params={"q": "filename:vxworks.sym", "per_page": "100", "page": str(page)},
    ).json()["items"]
    if len(results) == 0:
        break
    for result in results:
        if (
            result["name"].lower() == "vxworks.sym"
            and result["sha"] not in downloaded_hashes
        ):
            downloaded_hashes.add(result["sha"])
            # Turn the HTML view URL into a raw download URL.
            urlretrieve(
                result["html_url"].replace("/blob/", "/raw/"),
                DOWNLOAD_FOLDER / f"""{result["sha"]}.sym""",
            )
            print(result["sha"])
    # Pause between pages to stay within the search API rate limit.
    sleep(60)


After deduplication, I ended up with 64 examples. Here is the readelf -S output for one of them:


  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            00000000 000000 000000 00      0   0  0
  [ 1] .text             PROGBITS        00000000 0000c0 000000 00  AX  0   0 64
  [ 2] .wrs_build_vars   PROGBITS        00000000 0000c0 000000 00  WA  0   0  1
  [ 3] .sdata2           PROGBITS        00000000 0000c0 000000 00  AX  0   0  8
  [ 4] .data             PROGBITS        00000000 0000c0 000000 00  WA  0   0  8
  [ 5] .tls_data         PROGBITS        00000000 0000c0 000000 00  WA  0   0  1
  [ 6] .tls_vars         PROGBITS        00000000 0000c0 000000 00  WA  0   0  1
  [ 7] .sdata            PROGBITS        00000000 0000c0 000000 00  WA  0   0  8
  [ 8] .sbss             NOBITS          00000000 0000c0 000000 00  WA  0   0  8
  [ 9] .bss              NOBITS          00000000 0000c0 000000 00  WA  0   0 16
  [10] .boot             PROGBITS        00000000 0000c0 000000 00      0   0  1
  [11] .reset            PROGBITS        00000000 0000c0 000000 00      0   0  1
  [12] .debug_frame      PROGBITS        00000000 0000c0 000000 00      0   0  1
  [13] .PPC.EMB.apuinfo  PROGBITS        00000000 0000c0 000000 00      0   0  1
  [14] .shstrtab         STRTAB          00000000 0000c0 000090 00      0   0  1
  [15] .symtab           SYMTAB          00000000 0003f8 01e460 10     16 3271  4
  [16] .strtab           STRTAB          00000000 01e858 01fa38 00      0   0  1

As expected, all the sections are empty except for:

  • .shstrtab – Contains strings for section names.
  • .symtab – Contains the symbol table.
  • .strtab – Contains strings for symbol names.

What to Extract?

My next task was working out what I needed to extract from the IDA Database file to create the .sym file. Luckily the ELF format is very well documented. Every symbol should have:

  • A name (stored in .strtab).
  • A value (address).
  • A type (either STT_OBJECT for a variable or STT_FUNC for a function).
  • A section reference (an index into the ELF’s section header table).

This meant that I needed to extract section information to create the section header table, along with function and variable information for the symbols themselves.
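
As a rough illustration of those four fields, here is a minimal sketch of how one 32-bit ELF symbol table entry and its string table could be packed. It is not the code from the tool: the helper names build_strtab and pack_elf32_sym are hypothetical, and giving every symbol global binding is an assumption made for simplicity.

```python
import struct

STB_GLOBAL = 1   # symbol binding (assumed for every symbol in this sketch)
STT_OBJECT = 1   # symbol type: variable
STT_FUNC = 2     # symbol type: function


def build_strtab(names):
    """Return (.strtab bytes, {name: offset}). Offset 0 is the empty string."""
    table = bytearray(b"\x00")
    offsets = {}
    for name in names:
        offsets[name] = len(table)
        table += name.encode() + b"\x00"
    return bytes(table), offsets


def pack_elf32_sym(name_offset, value, size, sym_type, shndx, endian="<"):
    """Pack one 16-byte Elf32_Sym entry.

    Field order: st_name, st_value, st_size, st_info, st_other, st_shndx.
    """
    info = (STB_GLOBAL << 4) | sym_type
    return struct.pack(endian + "IIIBBH", name_offset, value, size, info, 0, shndx)


strtab, offsets = build_strtab(["or", "number"])
entry = pack_elf32_sym(offsets["or"], 0x401136, 0, STT_FUNC, 1)
```

The 16-byte entry size matches the ES column (10 in hexadecimal) shown for .symtab in the readelf output above. A 64-bit ELF uses a 24-byte entry with a different field order.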

IDA Database Parsing

The next step was to parse the IDA Database and extract the necessary information. This wasn’t easy since I didn’t have access to the IDA SDK, but luckily there were some open-source tools that I could use for reference.

There were two sections of the IDA Database that interested me:

  • NAM – Contains the addresses of names. Has a simple array structure.
  • ID0 – Contains almost all other useful information. Has a complex B-tree structure.

Whilst parsing NAM was simple, parsing ID0 required many steps:

  • Extracting the B-tree structures.
  • Converting the B-tree structures to a more Python-friendly format.
  • Writing a ranged B-tree search algorithm.
  • Implementing IDA’s proprietary “NetNode” system.
  • Writing the correct “NetNode” queries to extract the required information.

.sym Creation

The final problem was the creation of the .sym file. Building a program to read and write ELF files was fairly easy thanks to extensive documentation and copious examples. The challenge came from the difference between IDA section metadata and ELF section metadata.

IDA sections have:

  • A name.
  • A start address.
  • A size.
  • Three flags: read, write and execute (having none of these indicates unknown permissions).

ELF sections have:

  • A name.
  • A start address.
  • A size.
  • Over 10 flags.
  • A type.

To resolve this difference I needed to make a few educated guesses:

  • For flags:
    • Start with any that can be inferred from the section name.
    • Add SHF_ALLOC if the section name is not recognised.
    • Add SHF_WRITE if the section’s write flag is set (or it has no flags set).
    • Add SHF_EXECINSTR if the section’s execute flag is set (or it has no flags set).
  • For type:
    • If it can be inferred from the section name, use that.
    • Otherwise use SHT_PROGBITS.
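
The guesses above can be sketched roughly as follows. This is an illustration rather than the tool's actual code: the KNOWN_SECTIONS table and the guess_section function are hypothetical, and the table only covers three well-known names.

```python
# ELF section flag and type constants from the ELF specification.
SHF_WRITE = 0x1
SHF_ALLOC = 0x2
SHF_EXECINSTR = 0x4
SHT_PROGBITS = 1
SHT_NOBITS = 8

# Hypothetical lookup of (type, flags) for recognised section names.
KNOWN_SECTIONS = {
    ".text": (SHT_PROGBITS, SHF_ALLOC | SHF_EXECINSTR),
    ".data": (SHT_PROGBITS, SHF_ALLOC | SHF_WRITE),
    ".bss": (SHT_NOBITS, SHF_ALLOC | SHF_WRITE),
}


def guess_section(name, readable, writable, executable):
    """Guess an ELF (type, flags) pair from an IDA section's name and permissions."""
    # Unrecognised names fall back to PROGBITS with SHF_ALLOC.
    sh_type, flags = KNOWN_SECTIONS.get(name, (SHT_PROGBITS, SHF_ALLOC))
    # No permission bits at all means IDA does not know the permissions.
    unknown = not (readable or writable or executable)
    if writable or unknown:
        flags |= SHF_WRITE
    if executable or unknown:
        flags |= SHF_EXECINSTR
    return sh_type, flags
```

For example, guess_section("seg000", False, False, False) yields SHT_PROGBITS with all three flags set, which is the most permissive guess for a section IDA knows nothing about.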

Testing

To test I started by writing a simple C program:

#include <stdio.h>
int number = 0;
void or() {
    number |= 1;
    printf("Number: %d\n", number);
}
void shift() {
    number <<= 1;
    printf("Number: %d\n", number);
}
int main() {
    or();
    shift();
    return 0;
}

And compiled it (stripping symbols):

gcc test.c -no-pie -s -o test

Next, I had an IDA Database created for the program with the global variable number and function or labelled. I then used the tool to convert the database to a .sym file:

python -m sc -i test.i64 -s test.sym -e little --type EXEC --machine X86_64

After loading the program into GDB I ran the following:

(gdb) symbol-file test.sym 
Reading symbols from test.sym...
(No debugging symbols found in test.sym)
(gdb) b or
Breakpoint 1 at 0x401136
(gdb) r
Starting program: /mnt/hgfs/symbols-converter/vxworks/test 
Breakpoint 1, 0x0000000000401136 in ?? ()
(gdb) p (int) number
$1 = 0
(gdb) si 10
0x000000000040115f in ?? ()
(gdb) p (int) number
$2 = 1

This shows that the .sym file can be loaded and allows debugging as expected.

Extension

I ended up extending the tool by adding a few more features:

  • An option to include automatic (sub_) function names.
  • An option to input from a Ghidra XML export rather than an IDA Database.
  • An option to export to JSON or text rather than a .sym file.

It is available here.

Appendix – IDB

Introduction

This appendix describes the structure of an IDA Database (.idb or .i64 file). IDA databases can contain up to six sections:

  • ID0 – Contains most useful information.
  • ID1 – Contains flags for each byte in the binary.
  • NAM – Contains addresses of names.
  • SEG – Unknown.
  • TIL – Contains information about data types.
  • ID2 – Unknown.

Only ID0 and NAM will be covered here.

Most of this information was obtained through reviewing the code of the following amazing projects:

This appendix was created with reference to databases created in IDA Pro 7.7. Integers are unsigned and little-endian unless otherwise specified.

Header

| Offset | Size | Field | Purpose |
| --- | --- | --- | --- |
| 0x00 | 4 | magic | Should be either IDA0, IDA1 or IDA2. IDA0 and IDA1 imply that the file has a 32-bit word size, IDA2 implies that it has a 64-bit word size. |
| 0x06 | 8 | id0_offset | Offset to the ID0 section from the start of the file. |
| 0x0e | 8 | id1_offset | Offset to the ID1 section from the start of the file. |
| 0x1a | 4 | signature | Should be 0xaabbccdd. |
| 0x1e | 2 | version | Should be 6. |
| 0x20 | 8 | nam_offset | Offset to the NAM section from the start of the file. |
| 0x28 | 8 | seg_offset | Offset to the SEG section from the start of the file. |
| 0x30 | 8 | til_offset | Offset to the TIL section from the start of the file. |
| 0x38 | 4 | id0_checksum | CRC32 checksum of the ID0 section. |
| 0x3c | 4 | id1_checksum | CRC32 checksum of the ID1 section. |
| 0x40 | 4 | nam_checksum | CRC32 checksum of the NAM section. |
| 0x44 | 4 | seg_checksum | CRC32 checksum of the SEG section. |
| 0x48 | 4 | til_checksum | CRC32 checksum of the TIL section. |
| 0x4c | 8 | id2_offset | Offset to the ID2 section from the start of the file. |
| 0x54 | 4 | id2_checksum | CRC32 checksum of the ID2 section. |

Section Header

All sections have the following header. The section contents start immediately after.

| Offset | Size | Field | Purpose |
| --- | --- | --- | --- |
| 0x00 | 1 | compression_method | 0 for no compression. 2 for Zlib compression. |
| 0x01 | 8 | section_length | The length of the section (before decompression). |
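
A minimal sketch of reading one section, assuming the header is a 1-byte compression method followed by an 8-byte little-endian length, with the contents immediately after. The function name read_section is hypothetical, and the offset argument would come from one of the *_offset fields in the file header.

```python
import struct
import zlib


def read_section(data, offset):
    """Return the (decompressed) contents of the section starting at offset."""
    # "<BQ" is a 1-byte method plus an 8-byte length, with no alignment padding.
    method, length = struct.unpack_from("<BQ", data, offset)
    start = offset + 9
    body = data[start:start + length]
    if method == 2:
        body = zlib.decompress(body)
    elif method != 0:
        raise ValueError(f"unexpected compression method {method}")
    return body
```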

ID0

Header

| Offset | Size | Field | Purpose |
| --- | --- | --- | --- |
| 0x00 | 4 | next_free_index | The index of the next free page. |
| 0x04 | 2 | page_size | The number of bytes occupied by a single page. |
| 0x06 | 4 | root_page_index | The index of the root page. |
| 0x0a | 4 | record_count | The number of non-dead records. |
| 0x0e | 4 | page_count | The number of non-dead pages. |
| 0x13 | 9 | magic | Should be B-tree v2. |

page_size - 0x1c bytes of padding follow the header.

B-Tree

Introduction

The contents of ID0 are laid out as a B-tree. A B-tree is similar to a binary search tree except that each page (collection of records) may have more than two children. Each record has a key (shown in the diagram below) and a value.

[Figure: B-tree diagram]

Page

Every page starts with the following header:

| Offset | Size | Field | Purpose |
| --- | --- | --- | --- |
| 0x00 | 4 | first_page_index | The index of the first (left-most) child page. If this is 0 then the page is a leaf page, otherwise, it is an index page. |
| 0x04 | 2 | count | The number of records in the page. |

After this, there is an array of count record meta structures.

Index Record Meta

| Offset | Size | Field | Purpose |
| --- | --- | --- | --- |
| 0x00 | 4 | page_index | The index of the child page to the right of the record. |
| 0x04 | 2 | record_offset | The offset from the start of the page to the record. |

Leaf Record Meta

| Offset | Size | Field | Purpose |
| --- | --- | --- | --- |
| 0x00 | 4 | indent | The number of bytes to prepend to this record’s key from the start of the last (next-left) record’s key. |
| 0x04 | 2 | record_offset | The offset from the start of the page to the record. |

Record

Each field follows immediately from the last.

| Size | Field | Purpose |
| --- | --- | --- |
| 2 | key_length | The length of the record’s key (without indent). |
| key_length | key | The record’s key. |
| 2 | value_length | The length of the record’s value. |
| value_length | value | The record’s value. |
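
Putting the page header, leaf record meta and record layouts together, a leaf page could be decoded along these lines. This is a sketch that assumes the field sizes listed in this appendix (little-endian, 6-byte page header, 6-byte record meta). The function name parse_leaf_page is hypothetical.

```python
import struct


def parse_leaf_page(page):
    """Return the (key, value) pairs stored in a single leaf page."""
    first_page_index, count = struct.unpack_from("<IH", page, 0)
    if first_page_index != 0:
        raise ValueError("not a leaf page")
    records = []
    previous_key = b""
    for i in range(count):
        # Each leaf record meta is a 4-byte indent and a 2-byte record offset.
        indent, record_offset = struct.unpack_from("<IH", page, 6 + 6 * i)
        (key_length,) = struct.unpack_from("<H", page, record_offset)
        key_start = record_offset + 2
        # Rebuild the full key from the shared prefix of the previous key.
        key = previous_key[:indent] + page[key_start:key_start + key_length]
        value_start = key_start + key_length
        (value_length,) = struct.unpack_from("<H", page, value_start)
        value = page[value_start + 2:value_start + 2 + value_length]
        records.append((key, value))
        previous_key = key
    return records
```

Index pages can be handled the same way, except that keys are stored in full and the first field of each record meta is a child page index rather than an indent.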

Net Nodes

Introduction

Net nodes are IDA’s method of grouping records related to something (often an address). Each net node has an integer node ID. It may also have a string node ID which can be resolved to the integer node ID.

The records inside a net node are identified by a tag (single byte value) and index (4 or 8-byte value). The B-tree structure makes it efficient to find all records with a given tag.

String Node ID

String node IDs can be resolved to integer node IDs by searching the B-tree for a record with the key:

N<string_node_id>

The record’s value gives the integer node ID.

Name

Some nodes have a string name. This can be found by searching the B-tree for a record with the key:

.<node_id>N

node_id is big-endian with a size matching the file’s word size (see Header). The record’s value gives the name.

Records

You can find a record with a specific index and tag by searching for the key:

.<node_id><tag><index>

Both node_id and index are big-endian with sizes matching the file’s word size.

All records with a given tag can be found by performing a ranged search on the B-tree.
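
A small sketch of building these keys, assuming a key is laid out as a '.' byte, the big-endian node ID, a one-byte tag and then the big-endian index. The helper names make_key and tag_range are hypothetical. The pair returned by tag_range gives inclusive lower and upper bounds for a ranged search over one tag.

```python
def make_key(node_id, tag=None, index=None, word_size=4):
    """Build a B-tree key for a net node, optionally with a tag and index."""
    key = b"." + node_id.to_bytes(word_size, "big")
    if tag is not None:
        key += tag.encode("ascii")
        if index is not None:
            key += index.to_bytes(word_size, "big")
    return key


def tag_range(node_id, tag, word_size=4):
    """Return the lowest and highest possible keys for one tag of a net node."""
    highest_index = (1 << (8 * word_size)) - 1
    low = make_key(node_id, tag, 0, word_size)
    high = make_key(node_id, tag, highest_index, word_size)
    return low, high


name_key = make_key(0x401136, "N")       # key holding the node's name
low, high = tag_range(0x401136, "S")     # bounds for every record tagged S
```

Because the node ID and index are big-endian, comparing keys byte by byte sorts records in numeric order, which is what makes the ranged search work.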

Variable Length Integers

Record values often use IDA’s proprietary variable-length integer formats.

  • Up to two bytes (T):
    • If the first byte begins with 0b11, the value is stored in the following two bytes.
    • Else if the first byte begins with 0b1, the value is stored in the remainder of the first byte and the following byte.
    • Else the value is stored in the first byte.
  • Up to four bytes (U):
    • If the first byte begins with 0b111, the value is stored in the following four bytes.
    • Else if the first byte begins with 0b11, the value is stored in the remainder of the first byte and the following three bytes.
    • Else if the first byte begins with 0b1, the value is stored in the remainder of the first byte and the following byte.
    • Else the value is stored in the first byte.
  • Up to eight bytes (V) – Stored as two consecutive Us. The first U is the lower four bytes, and the second is the upper four bytes.
  • Up to word size (W) – U if the file’s word size is 32 bits and V if it is 64 bits.

All are big-endian.
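
The unsigned formats could be decoded roughly as shown below. This is a sketch, not the tool's code: the function names are hypothetical, and it assumes that the 0b11 form of U occupies four bytes in total (the prefix byte plus three more). Each function returns the decoded value and the offset of the next unread byte.

```python
def unpack_t(data, offset=0):
    """Decode an up-to-two-byte integer (T)."""
    first = data[offset]
    if (first & 0xC0) == 0xC0:
        return int.from_bytes(data[offset + 1:offset + 3], "big"), offset + 3
    if first & 0x80:
        return ((first & 0x3F) << 8) | data[offset + 1], offset + 2
    return first, offset + 1


def unpack_u(data, offset=0):
    """Decode an up-to-four-byte integer (U)."""
    first = data[offset]
    if (first & 0xE0) == 0xE0:
        return int.from_bytes(data[offset + 1:offset + 5], "big"), offset + 5
    if (first & 0xC0) == 0xC0:
        rest = int.from_bytes(data[offset + 1:offset + 4], "big")
        return ((first & 0x1F) << 24) | rest, offset + 4
    if first & 0x80:
        return ((first & 0x3F) << 8) | data[offset + 1], offset + 2
    return first, offset + 1


def unpack_v(data, offset=0):
    """Decode an up-to-eight-byte integer (V): low U first, then high U."""
    low, offset = unpack_u(data, offset)
    high, offset = unpack_u(data, offset)
    return (high << 32) | low, offset


def unpack_w(data, offset=0, word_size=4):
    """Decode a word-sized integer (W) for a 4-byte or 8-byte word size."""
    return unpack_u(data, offset) if word_size == 4 else unpack_v(data, offset)
```

Returning the next offset lets a caller decode the fields of a record one after another without tracking lengths separately.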

Analysis

Types use the letters from Variable Length Integers. Upper-case means unsigned and lower-case means signed.

Segments (Sections)

Segment information is found in the $ segs net node. Every record with the tag S has the format:

| Type | Field | Description |
| --- | --- | --- |
| W | start | The start address of the segment. |
| W | length | The length of the segment. |
| W | name_index | The index of the segment’s name in $ segstrings (covered later). |
| W | class_index | The index of the segment’s class in $ segstrings (covered later). |
| W | org_base | Dependent on the processor. |
| U | flags | Detailed here. |
| U | alignment_codes | Unknown. |
| U | combination_codes | Unknown. |
| U | permissions | Flags. 1 is read, 2 is write, 4 is execute. 0 means unknown flags. |
| U | bitness | The number of bits used for segment addressing. 0 is 16-bits, 1 is 32-bits, 2 is 64-bits. |
| U | type | Determines how the kernel deals with the segment. |
| U | selector | A unique value used to identify the segment. |
| U | colour | The segment’s colour. Subtract one for the RGBA value. |

The $ segstrings net node has a record with tag S at index 0. This is an array of:

<string_length><string>

Where the string length is a single byte.
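
A minimal sketch of splitting that record value into its strings. The function name parse_segstrings is hypothetical, and how name_index and class_index map onto the resulting list is not covered here.

```python
def parse_segstrings(value):
    """Split a $ segstrings record value into its length-prefixed strings."""
    strings = []
    offset = 0
    while offset < len(value):
        length = value[offset]  # single-byte length prefix
        raw = value[offset + 1:offset + 1 + length]
        strings.append(raw.decode("utf-8", errors="replace"))
        offset += 1 + length
    return strings
```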

Functions

Function information is found in the $ funcs net node. Every record with the tag S begins with:

| Type | Field | Description |
| --- | --- | --- |
| W | start | The start address of the function chunk. |
| W | length | The length of the function chunk. |
| T | flags | Flags. 0x8000 means this is a tail chunk (it is a head chunk otherwise). |

Head chunks then have the following:

| Type | Field | Description |
| --- | --- | --- |
| W | frame | The node ID of the frame net node. |
| W | locals_size | The size of the local variables (bytes). |
| T | registers_size | The size of the saved registers (bytes). |
| W | arguments_size | The size of the stack arguments (bytes). |

And tail chunks have:

| Type | Field | Description |
| --- | --- | --- |
| w | parent_offset | The offset to the head chunk. Subtract this from start to get the address of the head chunk. |
| U | referer_count | The number of referrers referencing this chunk. |
Every function has one head chunk and zero or more tail chunks. The name of the head chunk is the function name.

The information documented here is only part of the function information that is available.

NAM

Header

| Offset 32 | Offset 64 | Size | Field | Purpose |
| --- | --- | --- | --- | --- |
| 0x00 | 0x00 | 4 | magic | Should be VA* followed by a null byte. |
| 0x08 | 0x08 | 4 | non_empty | 0 if the section is empty, 1 otherwise. |
| 0x10 | 0x10 | 4 | page_count | The number of pages (size 0x2000) occupied by the section. |
| 0x18 | 0x1c | 4 | name_count | The number of name addresses in the section. If the file’s word size is 64-bits then this number needs to be halved. |

If the file’s word size is 32-bits then 0x1fe4 bytes of padding follow the header and if it is 64-bits then 0x1fe0 bytes of padding follow it.

Name Addresses

An array of name_count integers, each matching the file’s word size. To resolve the integers to strings, see Name.
