Binary Analysis Process

By now we know how ELF files are structured, but the question remains: how do we analyse them?

A possible analysis flow is:

  • File analysis (file, nm, ldd, content visualization, foremost, binwalk).

  • Static Analysis (disassemblers and decompilers).

  • Behavioural Analysis (strace, LD_PRELOAD).

  • Dynamic Analysis (debuggers).

Identifying a file

Files should be seen as containers (this includes ELF files).

  • May have the expected content type.

    • But it may have unexpected behaviour (e.g. bug or malware).

  • May have unexpected, additional content (e.g. polyglots).

    • Commonly used in DRM schemes and malware to hide binary blobs.

Files should not be trusted.

  • Both the expected and additional content may be malicious.

  • Static analysis is safe (as long as nothing is executed).

  • Dynamic analysis is not safe. Sandboxes and VMs must be used.

Questions to answer

  • What type of file do we have?

    • Are there hidden contents?

  • What is the architecture?

    • Is it 64- or 32-bit? ARM7/ARM9/ARM9E/ARM10?

  • Where is the starting address?

  • What does the main function do?

  • What will the program do?

Some basic tools go a long way.

  • file: tries to identify the type of file.

    • Only applies to the top-level container: file cannot look into enclosed binary blobs.

    • binwalk and foremost complement file by searching for embedded content.

  • xxd: hexdumps the file, allowing patterns to be spotted quickly.

    • Piping into less helps to page through the content in the terminal.

  • strings: prints sequences of printable characters found in the file.

    • By default, only sequences of at least 4 characters are shown (adjustable with -n).

  • ldd: prints shared object dependencies.

    • The libraries registered in the ELF are required at load time (typically to resolve dynamically linked symbols).

  • nm: dumps symbols from .symtab (or .dynsym with -D).
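
As a first-pass sketch, a triage session might look like this (./sample is a hypothetical binary; output abbreviated and illustrative):

    $ file sample            # identify the top-level container
    sample: ELF 64-bit LSB pie executable, x86-64, dynamically linked, ...
    $ binwalk sample         # search for embedded or appended content
    $ xxd sample | less      # hexdump; page through and look for patterns
    $ strings -n 8 sample    # printable sequences of at least 8 characters
    $ ldd ./sample           # shared object dependencies; note that ldd may
                             # invoke the target's loader: trusted files only
    $ nm sample              # symbols from .symtab (empty if stripped)
    $ nm -D sample           # dynamic symbols from .dynsym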

Disassembler Basics: ghidra

  • ghidra is an open-source tool for disassembly and static analysis, developed by the NSA and released to the public.

    • The development branch has support for Dynamic Analysis (should be released “soon”).

  • Works on Windows, Linux and macOS (Java-based).

  • Not the most important tool (IDA is), but it is gaining huge traction: it is free, very powerful, supports a huge number of platforms, and has a fine decompiler.

CFGs

It is useful to think of machine code as a graph structure, called a control-flow graph (CFG).

A node in a CFG is a group of adjacent instructions called a basic block:

  • The only jumps into a basic block are to the first instruction.

  • The only jumps out of a basic block are from the last instruction.

  • I.e., a basic block always executes as a unit.

Edges between blocks represent possible jumps.

Basic block a dominates basic block b if every path to b passes through a first. Strictly dominates if a != b.

Basic block b post-dominates a if every path through a also passes through b later.
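
As a minimal sketch, take int f(int x) { if (x > 0) x = x * 2; return x + 1; } compiled and disassembled (hypothetical, trimmed objdump-style output with byte encodings omitted and annotations added; addresses illustrative):

    $ gcc -O1 -c tiny.c && objdump -d tiny.o
    0000000000000000 <f>:
       0:  test  %edi,%edi       # block A (entry): ends at the conditional jump
       2:  jle   6 <f+0x6>       # edges: A -> B (fall through), A -> C (taken)
       4:  add   %edi,%edi       # block B: its only predecessor is A
       6:  lea   0x1(%rdi),%eax  # block C: join point of both paths
       9:  ret
    # A dominates B and C: every path from the entry passes through A first.
    # C post-dominates A: every path through A later passes through C.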

Disassembly

The disassembly process involves analyzing the binary and converting binary code to assembly.

  • But “binary” is just a sequence of bytes that must be interpreted in the context of a given architecture.

  • Conversion depends on many factors, including compiler and flags.

The process is not perfect and may mislead RE analysts. It may:

  • Present instructions that do not exist.

  • Ignore instructions that are in the binary code.

Main approaches:

  • Linear Disassembly.

  • Recursive Disassembly.

Linear Disassembly

The simplest approach to analyzing a program: iterate over all code segments, disassembling the binary code as opcodes are found.

Start at some address and follow the binary.

  • Entry point or other point in the binary file.

  • The entry point may not be known.

Works best with:

  • binary blobs such as from firmware (start at the beginning).

  • objects which do not have data at the beginning.

  • architectures that use variable-length instructions (x86).

It is vital to define the correct initial address for disassembly.

An offset error will result in invalid or wrong instructions being decoded.

Linear disassembly will also try to disassemble data from the binary as if it were actual code.

Linear Disassembly is oblivious to the actual Program Flow.

With x86, because opcodes have variable length, decoding tends to auto-synchronize after a few instructions, but the first instructions will be decoded incorrectly or missed.
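
The offset problem is easy to reproduce; a sketch, assuming a hypothetical raw x86-64 blob named blob.bin:

    $ objdump -D -b binary -m i386:x86-64 blob.bin | head -20
    # shift the start offset by a single byte:
    $ dd if=blob.bin of=blob1.bin bs=1 skip=1
    $ objdump -D -b binary -m i386:x86-64 blob1.bin | head -20
    # the first instructions now decode wrongly; on x86 the stream
    # tends to resynchronize after a few instructions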

Issues

With ELF files in x86, linear disassembly tends to be useful.

  • Compilers typically do not emit inline data, and the process rapidly synchronizes.

  • Still, padding and alignment bytes may decode as spurious instructions.

With PE files, compilers may emit inline data, so Linear Disassembly is not adequate.

  • Every time data is found, disassembly becomes desynchronized.

Other architectures (e.g. ARM) and raw binary objects are usually not suited for Linear Disassembly.

  • Obfuscation may include code as data, which is loaded dynamically.

  • Fixed-length instruction sets will not easily resynchronize once decoding starts at a wrong offset.

So why is it useful?

Code in the binary blob may be executed with a dynamic call.

  • Some JMP/CALL with an address that is computed at run time and unknown to the static analyzer.

Linear Disassembly will disassemble everything:

  • whether or not it is ever called, which may uncover hidden program code.

  • even if the binary blob is not a structured executable (boot sectors, firmware).

Readily available with simple tools: objdump and gdb.

  • gdb's memory examine command (x/i) also uses Linear Disassembly.
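
For example (./sample and the address 0x401000 are hypothetical):

    $ objdump -d sample      # linear sweep over the code sections
    $ objdump -D sample      # linear sweep over ALL sections, data included
    $ gdb ./sample
    (gdb) x/16i 0x401000     # examine 16 instructions at a raw address
    (gdb) x/16i main         # or starting at a symbol, if symbols exist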

Recursive Disassembly

A more complex approach that disassembles code from an initial point, while following the control flow.

  • That is: follows jmp, call and ret.

As long as the start point is correct, or it synchronizes rapidly, flow can be fully recovered.

  • This is the standard process for more complex tools such as ghidra and IDA.

It skips over inline data, as no discovered instruction transfers execution to such an address.

  • Well… control flow can easily be forged with an indirect call such as ((void (*)(int, char)) ptr)(i, c).
