Converting IDA DB to VxWorks .sym

Joseph B

July 2022

Goal

Interrupt Labs is often asked to undertake dynamic analysis of stripped closed-source binaries. Even after recovering a lot of information through static analysis (often with IDA), the lack of symbols makes dynamic analysis challenging (using IDA’s debugging features isn’t always possible). In one particular case, we were presented with a stripped VxWorks image and managed to recover many symbols statically but were unable to use them to debug outside of IDA. I was aware that GDB could load symbols from .sym files and wondered if it was possible to extract the information from an IDA Database (.idb or .i64) into a .sym file. This post details my journey into the internals of these two file formats and will cover:

What a .sym file is.
What information is needed to create a .sym file.
What challenges arise in extracting that information from an IDA Database.
What challenges arise in creating a .sym file with the extracted information.
How the resultant .sym file was tested.
A detailed appendix with information about the IDA Database format.

The tool created as a result of this research is available here

What's a .sym?

My first challenge was working out what a VxWorks .sym file is. I wasn’t provided with any examples and could find no decisive documentation. After some searching I discovered the following from the “VxWorks Kernel Programmer Guide 6.2”:

A vxWorks.sym file is created during image building if INCLUDE_STANDALONE_SYM_TBL is not set (otherwise the symbols are bundled with the image).
objcopy somehow has the ability to create .sym files.

With this knowledge, I began searching GitHub and eventually found a repository with VxWorks build tools. This enabled me to find the link between INCLUDE_STANDALONE_SYM_TBL and objcopy:


ifneq   ($(findstring INCLUDE_NET_SYM_TBL, $(COMPONENTS)),)
define MAKE_SYM
	$(BINXSYM) $@ $@.sym
endef
define MAKE_SYM_CVT
	$(LDOUT_SYMS) $@.sym
endef
endif


BINXSYM		= $(ENV_BIN)$(


EXTRACT_SYM_FLAG= --extract-symbol

Here is the relevant section from objcopy's man page:


--extract-symbol
    Keep the file's section flags and symbols but remove all section data.
    Specifically, the option: 
    *<removes the contents of all sections;>
    *<sets the size of every section to zero; and >
    *<sets the file's start address to zero.>
    This option is used to build a .sym file for a VxWorks kernel.

This suggested that a .sym file was really just an ELF file with the non-debugging stuff removed. Now I just needed to find some examples for confirmation and reference.

I set out searching GitHub again and eventually wrote a Python script to download all files named vxworks.sym.

Script

import sys
from itertools import count
from os import environ
from pathlib import Path
from time import sleep
from urllib.request import urlretrieve

import requests

# A missing environment variable raises KeyError (not IndexError).
try:
    PAT = environ["GITHUB_PAT"]
except KeyError:
    print("Set the GITHUB_PAT environment variable to a personal access token from:")
    print("https://github.com/settings/tokens")
    sys.exit(1)

# Downloads are saved next to the project, named after their Git blob hash.
DOWNLOAD_FOLDER = (Path(__file__).parent.parent / "vxworks" / "syms").resolve()
DOWNLOAD_FOLDER.mkdir(parents=True, exist_ok=True)

# Blob hashes seen so far, used to skip identical files in other repositories.
downloaded_hashes = set()

for page in count(1):
    # Fetch one page of code-search results.
    results = requests.get(
        "https://api.github.com/search/code",
        headers={"Authorization": f"Token {PAT}"},
        params={"q": "filename:vxworks.sym", "per_page": "100", "page": str(page)},
    ).json()["items"]
    if len(results) == 0:
        break
    for result in results:
        if (
            result["name"].lower() == "vxworks.sym"
            and result["sha"] not in downloaded_hashes
        ):
            downloaded_hashes.add(result["sha"])
            # The "raw" URL serves the file contents rather than the HTML viewer.
            urlretrieve(
                result["html_url"].replace("/blob/", "/raw/"),
                DOWNLOAD_FOLDER / f"""{result["sha"]}.sym""",
            )
            print(result["sha"])
    # Pause between pages to stay within the search API rate limit.
    sleep(60)

After deduplication, I ended up with 64 examples. Here is the readelf -S output for one of them:


  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            00000000 000000 000000 00      0   0  0
  [ 1] .text             PROGBITS        00000000 0000c0 000000 00  AX  0   0 64
  [ 2] .wrs_build_vars   PROGBITS        00000000 0000c0 000000 00  WA  0   0  1
  [ 3] .sdata2           PROGBITS        00000000 0000c0 000000 00  AX  0   0  8
  [ 4] .data             PROGBITS        00000000 0000c0 000000 00  WA  0   0  8
  [ 5] .tls_data         PROGBITS        00000000 0000c0 000000 00  WA  0   0  1
  [ 6] .tls_vars         PROGBITS        00000000 0000c0 000000 00  WA  0   0  1
  [ 7] .sdata            PROGBITS        00000000 0000c0 000000 00  WA  0   0  8
  [ 8] .sbss             NOBITS          00000000 0000c0 000000 00  WA  0   0  8
  [ 9] .bss              NOBITS          00000000 0000c0 000000 00  WA  0   0 16
  [10] .boot             PROGBITS        00000000 0000c0 000000 00      0   0  1
  [11] .reset            PROGBITS        00000000 0000c0 000000 00      0   0  1
  [12] .debug_frame      PROGBITS        00000000 0000c0 000000 00      0   0  1
  [13] .PPC.EMB.apuinfo  PROGBITS        00000000 0000c0 000000 00      0   0  1
  [14] .shstrtab         STRTAB          00000000 0000c0 000090 00      0   0  1
  [15] .symtab           SYMTAB          00000000 0003f8 01e460 10     16 3271  4
  [16] .strtab           STRTAB          00000000 01e858 01fa38 00      0   0  1

As expected, all the sections are empty except for:

.shstrtab – Contains strings for section names.
.symtab – Contains the symbol table.
.strtab – Contains strings for symbol names.

What to Extract?

My next task was working out what I needed to extract from the IDA Database file to create the .sym file. Luckily the ELF format is very well documented. Every symbol should have:

A name (stored in .strtab).
A value (address).
A type (either STT_OBJECT for a variable or STT_FUNC for a function).
A section reference (an index into the ELF’s section header table).

This meant that I needed to extract section information to create the section header table, along with function and variable information for the symbols themselves.
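To make the four symbol fields concrete, here is a minimal sketch of how one 32-bit ELF symbol table entry and its .strtab name could be serialised. The helper names (pack_elf32_sym, build_strtab) are hypothetical and are not taken from the tool, and giving every symbol global binding is an assumption made for illustration.

```python
import struct

STB_GLOBAL = 1  # binding: assumed here for every exported symbol
STT_OBJECT = 1  # type: data object (variable)
STT_FUNC = 2    # type: function


def build_strtab(names):
    """Return (.strtab bytes, {name: offset}). Offset 0 is the empty string."""
    table = b"\x00"
    offsets = {}
    for name in names:
        offsets[name] = len(table)
        table += name.encode() + b"\x00"
    return table, offsets


def pack_elf32_sym(name_offset, value, size, sym_type, section_index, endian="<"):
    """Pack one 16-byte Elf32_Sym: st_name, st_value, st_size, st_info, st_other, st_shndx."""
    info = (STB_GLOBAL << 4) | (sym_type & 0xF)
    return struct.pack(endian + "IIIBBH", name_offset, value, size, info, 0, section_index)
```

The 16-byte entry size matches the ES column (10 in hexadecimal) that readelf reports for .symtab in the example above; a 64-bit ELF uses a different field order and a 24-byte entry.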

IDA Database Parsing

The next step was to parse the IDA Database and extract the necessary information. This wasn’t easy since I didn’t have access to the IDA SDK, but luckily there were some open-source tools that I could use for reference.

There were two sections of the IDA Database that interested me:

NAM – Contains the addresses of names. Has a simple array structure.
ID0 – Contains almost all other useful information. Has a complex B-tree structure.

Whilst parsing NAM was simple, parsing ID0 required many steps:

Extracting the B-tree structures.
Converting the B-tree structures to a more Python-friendly format.
Writing a ranged B-tree search algorithm.
Implementing IDA’s proprietary “NetNode” system.
Writing the correct “NetNode” queries to extract the required information.

.sym Creation

The final problem was the creation of the .sym file. Building a program to read and write ELF files was fairly easy thanks to extensive documentation and copious examples. The challenge came from the difference between IDA section metadata and ELF section metadata.

IDA sections have:

A name.
A start address.
A size.
Three flags: read, write and execute (having none of these indicates unknown permissions).

ELF sections have:

A name.
A start address.
A size.
Over 10 flags.
A type.

To resolve this difference I needed to make a few educated guesses:

For flags:

Start with any that can be inferred from the section name.
Add SHF_ALLOC if the section name is not recognised.
Add SHF_WRITE if the section’s write flag is set (or it has no flags set).
Add SHF_EXECINSTR if the section’s execute flag is set (or it has no flags set).

For type:

If it can be inferred from the section name, use that.
Otherwise use SHT_PROGBITS.
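The guessing rules above can be sketched as a single function. This is an illustration rather than the tool's actual code: the KNOWN_SECTIONS table and the function name are hypothetical, and the IDA permissions are passed in as plain booleans so that no particular bit encoding is assumed.

```python
SHT_PROGBITS = 1
SHT_NOBITS = 8
SHF_WRITE = 0x1
SHF_ALLOC = 0x2
SHF_EXECINSTR = 0x4

# Hypothetical lookup of well-known section names: name -> (type, flags).
KNOWN_SECTIONS = {
    ".text": (SHT_PROGBITS, SHF_ALLOC | SHF_EXECINSTR),
    ".data": (SHT_PROGBITS, SHF_ALLOC | SHF_WRITE),
    ".bss": (SHT_NOBITS, SHF_ALLOC | SHF_WRITE),
}


def guess_elf_section(name, read=False, write=False, execute=False):
    """Return (sh_type, sh_flags) guessed from an IDA segment's name and permissions."""
    if name in KNOWN_SECTIONS:
        sh_type, sh_flags = KNOWN_SECTIONS[name]
    else:
        # Unrecognised name: default type, and assume the section is loaded.
        sh_type, sh_flags = SHT_PROGBITS, SHF_ALLOC
    unknown_permissions = not (read or write or execute)
    if write or unknown_permissions:
        sh_flags |= SHF_WRITE
    if execute or unknown_permissions:
        sh_flags |= SHF_EXECINSTR
    return sh_type, sh_flags
```

Treating unknown permissions as both writable and executable errs on the permissive side, which is harmless for a file that is only used as a symbol source.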

Testing

To test I started by writing a simple C program:

#include <stdio.h>
int number = 0;
void or() {
    number |= 1;
    printf("Number: %d\n", number);
}
void shift() {
    number <<= 1;
    printf("Number: %d\n", number);
}
int main() {
    or();
    shift();
    return 0;
}

And compiled it (stripping symbols):

gcc test.c -no-pie -s -o test

Next, I had an IDA Database created for the program with the global variable number and function or labelled. I then used the tool to convert the database to a .sym file:

python -m sc -i test.i64 -s test.sym -e little --type EXEC --machine X86_64

After loading the program into GDB I ran the following:

(gdb) symbol-file test.sym 
Reading symbols from test.sym...
(No debugging symbols found in test.sym)
(gdb) b or
Breakpoint 1 at 0x401136
(gdb) r
Starting program: /mnt/hgfs/symbols-converter/vxworks/test 
Breakpoint 1, 0x0000000000401136 in ?? ()
(gdb) p (int) number
$1 = 0
(gdb) si 10
0x000000000040115f in ?? ()
(gdb) p (int) number
$2 = 1

This shows that the .sym file can be loaded by GDB and that the recovered symbols can be used for breakpoints and variable inspection as expected.

Extension

I ended up extending the tool by adding a few more features:

An option to include automatic (sub_) function names.
An option to input from a Ghidra XML export rather than an IDA Database.
An option to export to JSON or text rather than a .sym file.

It is available here.

Appendix – IDB

Introduction

This appendix describes the structure of an IDA Database (.idb or .i64 file). IDA databases can contain up to six sections:

ID0 – Contains most useful information.
ID1 – Contains flags for each byte in the binary.
NAM – Contains addresses of names.
SEG – Unknown.
TIL – Contains information about data types.
ID2 – Unknown.

Only ID0 and NAM will be covered here.

Most of this information was obtained through reviewing the code of the following amazing projects:

This appendix was created with reference to databases created in IDA Pro 7.7. Integers are unsigned and little-endian unless otherwise specified.

Header

Offset	Size	Field	Purpose
0x00	4	magic	Should be either IDA0, IDA1 or IDA2. IDA0 and IDA1 imply that the file has a 32-bit word size, IDA2 implies that it has a 64-bit word size.
0x06	8	id0_offset	Offset to the ID0 section from the start of the file.
0x0e	8	id1_offset	Offset to the ID1 section from the start of the file.
0x1a	4	signature	Should be 0xaabbccdd.
0x1e	2	version	Should be 6.
0x20	8	nam_offset	Offset to the NAM section from the start of the file.
0x28	8	seg_offset	Offset to the SEG section from the start of the file.
0x30	8	til_offset	Offset to the TIL section from the start of the file.
0x38	4	id0_checksum	CRC32 checksum of the ID0 section.
0x3c	4	id1_checksum	CRC32 checksum of the ID1 section.
0x40	4	nam_checksum	CRC32 checksum of the NAM section.
0x44	4	seg_checksum	CRC32 checksum of the SEG section.
0x48	4	til_checksum	CRC32 checksum of the TIL section.
0x4c	8	id2_offset	Offset to the ID2 section from the start of the file.
0x54	4	id2_checksum	CRC32 checksum of the ID2 section.
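As a rough illustration, the file header can be read with Python's struct module. This sketch assumes the fields do not overlap, so id1_offset sits directly after id0_offset at 0x0e and id2_checksum directly after id2_offset at 0x54; the bytes at 0x04 and 0x16 that the table does not describe are skipped. The names IDB_HEADER, IdbHeader and parse_idb_header are hypothetical.

```python
import struct
from collections import namedtuple

# "<" = little-endian, no alignment padding; "x" skips undocumented bytes.
IDB_HEADER = struct.Struct("<4s2xQQ4xIHQQQIIIIIQI")

IdbHeader = namedtuple(
    "IdbHeader",
    "magic id0_offset id1_offset signature version "
    "nam_offset seg_offset til_offset "
    "id0_checksum id1_checksum nam_checksum seg_checksum til_checksum "
    "id2_offset id2_checksum",
)


def parse_idb_header(data):
    """Return (header, word_size_in_bytes) for the start of an .idb/.i64 file."""
    header = IdbHeader._make(IDB_HEADER.unpack_from(data, 0))
    if header.magic not in (b"IDA0", b"IDA1", b"IDA2"):
        raise ValueError("unexpected magic: %r" % header.magic)
    if header.signature != 0xAABBCCDD:
        raise ValueError("unexpected signature: %#x" % header.signature)
    word_size = 8 if header.magic == b"IDA2" else 4
    return header, word_size
```

The returned word size is needed later, because net node IDs, record indexes and NAM entries are all sized by it.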

Section Header

All sections have the following header. The section contents start immediately after.

Offset	Size	Field	Purpose
0x00	1	compression_method	0 for no compression. 2 for Zlib compression.
0x01	8	section_length	The length of the section (before decompression).
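Reading a section then comes down to parsing this 9-byte header and optionally inflating the body. The following is a minimal sketch with a hypothetical function name; it treats section_length as the number of stored (possibly compressed) bytes, as the table describes.

```python
import struct
import zlib


def read_section(data, offset):
    """Return the decompressed contents of the section whose header is at offset."""
    method, length = struct.unpack_from("<BQ", data, offset)
    start = offset + 9  # 1-byte method + 8-byte length
    body = data[start:start + length]
    if method == 0:
        return body
    if method == 2:
        return zlib.decompress(body)
    raise ValueError("unknown compression method: %d" % method)
```

The offset argument would come from the file header, for example id0_offset or nam_offset.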

ID0

Header

Offset	Size	Field	Purpose
0x00	4	next_free_index	The index of the next free page.
0x04	2	page_size	The number of bytes occupied by a single page.
0x06	4	root_page_index	The index of the root page.
0x0a	4	record_count	The number of non-dead records.
0x0e	4	page_count	The number of non-dead pages.
0x13	9	magic	Should be B-tree v2.

page_size - 0x1c bytes of padding follow the header.
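A short sketch of reading this header follows. Because the header plus its padding fills exactly one page, the sketch assumes that page N starts at byte N * page_size of the section; that inference, like the names used, is an assumption rather than something stated above.

```python
import struct
from collections import namedtuple

# One undocumented byte at 0x12 is skipped before the 9-byte magic.
ID0_HEADER = struct.Struct("<IHIII1x9s")

Id0Header = namedtuple(
    "Id0Header",
    "next_free_index page_size root_page_index record_count page_count magic",
)


def parse_id0_header(section):
    """Parse the header at the start of a decompressed ID0 section."""
    header = Id0Header._make(ID0_HEADER.unpack_from(section, 0))
    if header.magic != b"B-tree v2":
        raise ValueError("unexpected ID0 magic: %r" % header.magic)
    return header


def get_page(section, header, index):
    """Return the raw bytes of page `index` (assumed to start at index * page_size)."""
    start = index * header.page_size
    return section[start:start + header.page_size]
```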

B-Tree

Introduction

The contents of ID0 are laid out as a B-tree. A B-tree is similar to a binary search tree except that each page (collection of records) may have more than two children. Each record has a key (shown in the diagram below) and a value.

Page

Every page starts with the following header:

Offset	Size	Field	Purpose
0x00	4	first_page_index	The index of the first (left-most) child page. If this is 0 then the page is a leaf page, otherwise, it is an index page.
0x04	2	count	The number of records in the page.

After this, there is an array of count record meta structures. Their layout depends on whether the page is an index page or a leaf page.

Index Record Meta

Offset	Size	Field	Purpose
0x00	4	page_index	The index of the child page to the right of the record.
0x04	2	record_offset	The offset from the start of the page to the record.

Leaf Record Meta

Offset	Size	Field	Purpose
0x00	4	indent	The number of bytes to prepend to this record’s key from the start of the last (next-left) record’s key.
0x04	2	record_offset	The offset from the start of the page to the record.

Record

Each field follows immediately from the last.

Size	Field	Purpose
2	key_length	The length of the record’s key (without indent).
key_length	key	The record’s key.
2	value_length	The length of the record’s value.
value_length	value	The record’s value.
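Putting the page header, record meta and record layouts together, one page can be decoded roughly as follows. This is a sketch that follows the field sizes in the tables above (including the 4-byte indent); the function name and the tuple layout it returns are hypothetical.

```python
import struct


def parse_page(page):
    """Return (first_page_index, [(key, value, child_page_index_or_None), ...])."""
    first_page_index, count = struct.unpack_from("<IH", page, 0)
    is_leaf = first_page_index == 0
    records = []
    previous_key = b""
    for i in range(count):
        # Each record meta is 6 bytes and the array starts right after the 6-byte header.
        meta, record_offset = struct.unpack_from("<IH", page, 6 + 6 * i)
        (key_length,) = struct.unpack_from("<H", page, record_offset)
        key_start = record_offset + 2
        key = page[key_start:key_start + key_length]
        value_offset = key_start + key_length
        (value_length,) = struct.unpack_from("<H", page, value_offset)
        value = page[value_offset + 2:value_offset + 2 + value_length]
        if is_leaf:
            # Leaf keys are prefix-compressed against the previous full key.
            key = previous_key[:meta] + key
            previous_key = key
            child = None
        else:
            child = meta  # page to the right of this record
        records.append((key, value, child))
    return first_page_index, records
```

Walking the whole tree then means visiting first_page_index, followed by each record and its right-hand child page in order.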

Net Nodes

Introduction

Net nodes are IDA’s method of grouping records related to something (often an address). Each net node has an integer node ID. It may also have a string node ID which can be resolved to the integer node ID.

The records inside a net node are identified by a tag (single byte value) and index (4 or 8-byte value). The B-tree structure makes it efficient to find all records with a given tag.

String Node ID

String node IDs can be resolved to integer node IDs by searching the B-tree for a record with the key:

N<string_node_id>

The record’s value gives the integer node ID.

Name

Some nodes have a string name. This can be found by searching the B-tree for a record with the key:

.<node_id>N

node_id is big-endian with a size matching the file's word size (see Header). The record's value gives the name.

Records

You can find a record with a specific index and tag by searching for the key:

.<node_id><tag><index>

Both node_id and index are big-endian with sizes matching the file's word size.

All records with a given tag can be found by performing a ranged search on the B-tree.
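The sketch below shows how such keys might be built and how a ranged search finds every record with a given tag. It assumes the record key is a literal dot, the big-endian node ID, the one-byte tag and the big-endian index, and that keys are ordered by plain byte comparison. A sorted Python list stands in for the B-tree, and all names are hypothetical.

```python
import bisect


def node_key(node_id, word_size, tag=None, index=None):
    """Build a key (or key prefix) for a net node record."""
    key = b"." + node_id.to_bytes(word_size, "big")
    if tag is not None:
        key += tag.encode("ascii")
        if index is not None:
            key += index.to_bytes(word_size, "big")
    return key


def records_with_tag(sorted_records, node_id, tag, word_size):
    """Return [(index_bytes, value), ...] for every record of node_id with this tag.

    sorted_records is a list of (key, value) pairs sorted by key.
    """
    prefix = node_key(node_id, word_size, tag)
    keys = [key for key, _ in sorted_records]
    start = bisect.bisect_left(keys, prefix)
    matches = []
    for key, value in sorted_records[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so the range has ended
        matches.append((key[len(prefix):], value))
    return matches
```

On the real B-tree the same idea applies: descend to the first key that is not less than the prefix, then iterate in order until a key no longer starts with it.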

Variable Length Integers

Record values often use IDA’s proprietary variable-length integer formats.

Up to two bytes (T):

If the first byte begins with 0b11, the value is stored in the following two bytes.
Else if the first byte begins with 0b1, the value is stored in the remainder of the first byte and the following byte.
Else the value is stored in the first byte.

Up to four bytes (U):

If the first byte begins with 0b111, the value is stored in the following four bytes.
Else if the first byte begins with 0b11, the value is stored in the remainder of the first byte and the following three bytes.
Else if the first byte begins with 0b1, the value is stored in the remainder of the first byte and the following byte.
Else the value is stored in the first byte.

Up to eight bytes (V) – Stored as two consecutive Us. The first U is the lower four bytes, and the second is the upper four bytes.
Up to word size (W) – U if the file's word size is 32 bits and V if it is 64 bits.

All are big-endian.
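A minimal decoder for these formats might look like the following. It assumes that in the 0b11 case of U the value occupies the rest of the first byte plus the following three bytes (four bytes in total), and the function names are hypothetical. Each function returns the decoded value and the position of the next unread byte.

```python
def unpack_t(buf, pos=0):
    """Decode a T (up to two bytes of value)."""
    first = buf[pos]
    if first & 0xC0 == 0xC0:
        return int.from_bytes(buf[pos + 1:pos + 3], "big"), pos + 3
    if first & 0x80:
        return ((first & 0x7F) << 8) | buf[pos + 1], pos + 2
    return first, pos + 1


def unpack_u(buf, pos=0):
    """Decode a U (up to four bytes of value)."""
    first = buf[pos]
    if first & 0xE0 == 0xE0:
        return int.from_bytes(buf[pos + 1:pos + 5], "big"), pos + 5
    if first & 0xC0 == 0xC0:
        rest = int.from_bytes(buf[pos + 1:pos + 4], "big")
        return ((first & 0x1F) << 24) | rest, pos + 4
    if first & 0x80:
        return ((first & 0x7F) << 8) | buf[pos + 1], pos + 2
    return first, pos + 1


def unpack_v(buf, pos=0):
    """Decode a V: two consecutive Us, lower four bytes first."""
    low, pos = unpack_u(buf, pos)
    high, pos = unpack_u(buf, pos)
    return (high << 32) | low, pos


def unpack_w(buf, pos, word_size):
    """Decode a W: U for 32-bit files (word_size 4), V for 64-bit files (word_size 8)."""
    return unpack_u(buf, pos) if word_size == 4 else unpack_v(buf, pos)
```

Returning the new position makes it easy to decode the consecutive fields of a record value one after another.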

Analysis

Types use the letters from Variable Length Integers. Upper-case means unsigned and lower-case means signed.

Segments (Sections)

Segment information is found in the $ segs net node. Every record with the tag S has the format:

Type	Field	Description
W	start	The start address of the segment.
W	length	The length of the segment.
W	name_index	The index of the segment’s name in $ segstrings (covered later).
W	class_index	The index of the segment’s class in $ segstrings (covered later).
W	org_base	Dependent on the processor.
U	flags	Detailed here.
U	alignment_codes	Unknown.
U	combination_codes	Unknown.
U	permissions	Flags. 1 is read, 2 is write, 4 is execute. 0 means unknown flags.
U	bitness	The number of bits used for segment addressing. 0 is 16-bits, 1 is 32-bits, 2 is 64-bits.
U	type	Determines how the kernel deals with the segment.
U	selector	A unique value used to identify the segment.
U	colour	The segment’s colour. Subtract one for the RGBA value.

The $ segstrings net node has a record with tag S at index 0. This is an array of:

<string_length><string>

Where the string length is a single byte.
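Decoding that array is a simple loop over length-prefixed strings, sketched below with a hypothetical function name. How name_index and class_index map onto positions in the resulting list is not covered here, so the sketch only splits the blob.

```python
def parse_segstrings(blob):
    """Split a blob of <string_length><string> entries into a list of strings."""
    strings = []
    pos = 0
    while pos < len(blob):
        length = blob[pos]  # single-byte length prefix
        raw = blob[pos + 1:pos + 1 + length]
        strings.append(raw.decode("utf-8", errors="replace"))
        pos += 1 + length
    return strings
```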

Functions

Function information is found in the $ funcs net node. Every record with the tag S begins with:

Type	Field	Description
W	start	The start address of the function chunk.
W	length	The length of the function chunk.
T	flags	Flags. 0x8000 means this is a tail chunk (it is a head chunk otherwise).

Head chunks then have the following:

Type	Field	Description
W	frame	The node ID of the frame net node.
W	locals_size	The size of the local variables (bytes).
T	registers_size	The size of the saved registers (bytes).
W	arguments_size	The size of the stack arguments (bytes).

And tail chunks have:

Type	Field	Description
w	parent_offset	The offset to the head chunk. Subtract this from start to get the address of the head chunk.
U	referer_count	The number of referrers referencing this chunk.

Every function has one head chunk and zero or more tail chunks. The name of the head chunk is the function name.

The information documented here is only part of the function information that is available.

NAM

Header

Offset 32	Offset 64	Size	Field	Purpose
0x00	0x00	4	magic	Should be VA* followed by a null byte.
0x08	0x08	4	non_empty	0 if the section is empty, 1 otherwise.
0x10	0x10	4	page_count	The number of pages (size 0x2000) occupied by the section.
0x18	0x1c	4	name_count	The number of name addresses in the section. If the file’s word size is 64-bits then this number needs to be halved.

If the file’s word size is 32-bits then 0x1fe4 bytes of padding follow the header and if it is 64-bits then 0x1fe0 bytes of padding follow it.

Name Addresses

An array of name_count integers, each matching the file's word size. To resolve the integers to strings, see Name.
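As a sketch of how the NAM section could be read: the header and its padding add up to 0x2000 bytes for both word sizes, so the address array is assumed to start at offset 0x2000 of the decompressed section. The function name is hypothetical.

```python
import struct


def parse_nam(section, word_size):
    """Return the list of named addresses from a decompressed NAM section."""
    if section[:4] != b"VA*\x00":
        raise ValueError("unexpected NAM magic: %r" % section[:4])
    count_offset = 0x18 if word_size == 4 else 0x1C
    (name_count,) = struct.unpack_from("<I", section, count_offset)
    if word_size == 8:
        name_count //= 2  # 64-bit files store twice the real count
    item = "I" if word_size == 4 else "Q"
    return list(struct.unpack_from("<%d%s" % (name_count, item), section, 0x2000))
```

Each returned address can then be used as a node ID to look up its name record in ID0.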
