Low-level languages

Machine Code

Each CPU has a specific instruction set.

  • It comes with rules governing instruction structure and execution flow.

When a program is compiled to “binary”, the high-level logic is converted to a sequence of instructions.

  • This sequence may be executed by a family of CPUs or a single model.

  • Running this sequence on another CPU may involve binary translation (conversion).

Humans typically cannot read binary instructions directly, but the instructions can always be translated into Assembly.

  • Good: this lets us read binary code.

  • Bad: each CPU has its own variant of Assembly, and Assembly is not simple to read.
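The byte-to-Assembly translation described above can be sketched as a table lookup. This is a toy, not a real disassembler: the names ONE_BYTE_OPCODES and toy_disassemble are made up for illustration, and the table covers only a handful of one-byte x86-64 opcodes. Real disassemblers must also handle variable-length instructions, prefixes, and operands.

```python
# Illustrative subset of one-byte x86-64 opcodes (assumption: 64-bit mode).
ONE_BYTE_OPCODES = {
    0x55: "push rbp",
    0x5D: "pop rbp",
    0x90: "nop",
    0xC3: "ret",
    0xCC: "int3",
}


def toy_disassemble(code: bytes) -> list:
    """Translate each byte into a mnemonic, one listing line per byte.

    Treats every byte as a complete instruction, which only holds for
    the opcodes in the table above.
    """
    lines = []
    for offset, byte in enumerate(code):
        mnemonic = ONE_BYTE_OPCODES.get(byte, f"(unknown byte 0x{byte:02x})")
        lines.append(f"{offset:04x}: {byte:02x}  {mnemonic}")
    return lines


# A minimal function body: save frame pointer, do nothing, restore, return.
listing = toy_disassemble(bytes.fromhex("55 90 5d c3"))
for line in listing:
    print(line)
```

The same four bytes would mean something else entirely on a different CPU architecture, which is why the lookup table is specific to one instruction set.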

For compiled programs, the RE tasks involve extracting information from the sequence of Assembly instructions.

  • Disassembly is automatic; the rest frequently is not.

Reconstruction is never perfect!

  • Different levels of abstraction: e.g., it is not trivial to recover C++ class structure and OOP relations from Assembly code.

  • Different compilers generate different assembly for the same source code.

  • The same compiler may generate different assembly for the same source code.

    • Optimization flags, CPU matching, protection mechanisms, target object type…

Bytecode

Some languages are compiled into bytecode, which is not machine code.

  • Intermediate language that is processed by a VM or framework.

  • .NET, Java, Python, JS, Lisp, Lua, OCaml, Tcl, FoxPro, WebAssembly.
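As a small illustration, taking Python as one of the languages listed, the standard dis module can show the bytecode that CPython generates for a function. The function add is a made-up example; the exact opcode names and byte values differ between CPython versions, so no specific output is shown here.

```python
import dis


def add(a, b):
    return a + b


# Raw bytecode bytes stored in the function's code object.
raw = add.__code__.co_code

# Symbolic names of the VM instructions (not CPU instructions).
opnames = [ins.opname for ins in dis.get_instructions(add)]

print(raw.hex())
print(opnames)
```

These instructions target the Python VM, not a CPU: no processor can execute them directly.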

Bytecode contains a compact (optimized) representation of the higher layer structures.

  • The framework/VM executes the bytecode on the target CPU.

  • The same bytecode can usually be executed on multiple CPUs, provided there is a native VM implementation.

    • The Java motto: Write Once, Run Anywhere.
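A minimal sketch of this idea, using CPython rather than Java: source is compiled to a code object, serialized to bytes with the standard marshal module, then loaded back and executed by the VM. The variable names are illustrative. Note the assumption: CPython bytecode is independent of the CPU, but it is tied to the Python version that produced it.

```python
import marshal

source = "result = sum(n * n for n in range(4))"

# Compile to bytecode; nothing is executed yet.
code = compile(source, "<example>", "exec")

# Serialize the code object to bytes, as it could be shipped to another machine.
blob = marshal.dumps(code)

# On the receiving side, the VM loads the bytes and executes them.
namespace = {}
exec(marshal.loads(blob), namespace)

print(namespace["result"])  # 0 + 1 + 4 + 9 = 14
```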

Bytecode makes information easier to extract, provided a tool exists for that bytecode format.

  • May recover classes, function names, and even comments (but not always).

  • Traditional decompiling tools will not process bytecode (that easily).
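To illustrate how much survives in bytecode, the sketch below compiles a small Python snippet and walks the resulting code objects to recover class, function, and variable names. The helper walk and the Shape example are made up for illustration. In CPython, names are kept but source comments are dropped by the compiler, which matches the "but not always" caveat above.

```python
import types

source = '''
class Shape:
    def area(self, width, height):
        # this comment is dropped by the compiler
        return width * height
'''

top = compile(source, "<example>", "exec")


def walk(code):
    """Yield a code object and every code object nested in its constants."""
    yield code
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from walk(const)


# Map each recovered code object name to its local variable names.
recovered = {c.co_name: c.co_varnames for c in walk(top)}

for name, variables in recovered.items():
    print(name, variables)
```

Machine code compiled from C would normally keep none of these local names unless debug symbols are included.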
