Skip to content

A Beginner's Guide to Fracture

Fredidly edited this page Dec 6, 2014 · 10 revisions

PAGE IN PROGRESS

This document will give you a step-by-step overview of the process by which Fracture decompiles a binary. It will also provide links to the important files, functions, variables, etc. in each step for those wanting a deeper understanding.

Step 1 - Important LLVM Classes

Important Code:

Description:

In order to decompile into IR, Fracture must utilize some important LLVM classes. These classes are how LLVM represents code at various levels of abstraction. We will not go into much detail, but it is important to have a basic understanding of each. From lowest-level to highest-level they are as follows:

  • MCInst: Represents a single machine instruction. Contains the target-specific opcode and operand list for the instruction. LLVM can convert an MCInst directly into machine code.
  • MachineInstr: Represents a single machine instruction. Similar to the MCInst, contains the target-specific opcode and operand list for the instruction. However, it contains additional information as well, such as an instruction descriptor which stores attributes of the instruction.
  • MachineBasicBlock: Represents a basic block of machine instructions. The high level definition of a basic block is a segment of code with only one entry point and one exit point. For us, that means a basic block is sequence of instructions without any branches or jumps, except for the last instruction. The last instruction in a block can be a branch or jump. A MachineBasicBlock contains a list of MachineInstr's. In a well formed MachineBasicBlock, this list will conform to the rules of a basic block.
  • MachineFunction: Represents a function made up of machine instructions. Contains a list of MachineBasicBlock's. Also contains all the target-specific information required to compile to that target.
  • SelectionDAG: Graphical representation of a basic block.
  • BasicBlock: Represents a basic block of IR instructions. See MachineBasicBlock for the definition of a basic block.
  • Function:
  • Module:

Step 2 - Initialization and Command Line Parsing

Important Code:

Description:

Fracture begins by calling some LLVM initialization functions. It then uses an LLVM command line parsing function to get the input file name, as well as handle any options specified. Next, Fracture's internal commands are initialized using initializeCommands.

At this point, Fracture loads the input file into the global variableTempExectuable using the loadBinary function. loadBinary also initializes the global MCDirector, Disassembler, and Decompiler, which we will come to later.

Now the main initialization phase is over. main passes control off to the runShell function of the Commands class. runShell continually prompts the user for input, processes that input, and calls the appropriate command functions from fracture-cl.cpp. Note: the Commands class enables piping in the Fracture command line using the '>', '<', and '|' operators.

Step 3 - Disassembly

Important Code:

Description:

The first step in the trip from binary to IR is disassembly. Fracture disassembles one function at a time using disassemble. The Funtions member variable keeps track of all the functions that have been disassembled so far. If the current function has already been encountered, simply return it. If not, create an empty MachineFunction and add it to Funtions.

The disassembly process begins by using the binary object file that was stored in TempExecutable during the initialization step. LLVM has a built in disassembler class called MCDisassembler. This class is capable of being fed instructions in binary and producing MCInst objects that correctly represent those instructions. So, starting at the beginning of the function, Fracture begins to pump the bits stored in TempExecutable into MCDisassembler.

As MCDisassembler produces MCInsts, each one is manually converted into a MachineInstr. These MachineInstr's are added, in order, to a MachineBasicBlock. This process continues until a terminating branch instruction is added to the MachineBasicBlock. By definition, this means the end of the basic block has been reached. So you now have a complete representation of a basic block stored in LLVM objects.

Fracture then adds this MachineBasicBlock to the MachineFuntion that it created in the beginning of this step. It checks the last instruction of the basic block. If it is a return instruction, then you know the end of the function has been reached. If not, then starting from the location directly following the end of the basic block, build up another MachineBasicBlock and add it to the MachineFunction. Continue this procedure until that return instruction is found (or an error condition is hit?).

Once that return is hit, you now have a complete, target-specific representation of an entire function stored in LLVM objects. This MachineFunction is returned and the disassembly process is complete.

Step 4 - What is a SelectionDAG?

Important Code:

Description:

At this point we must delve a little deeper into one of the LLVM classes. Namely, the SelectionDAG. DAGs are Directed Acyclic Graphs. They look like this. In LLVM, they look like this (should have a picture of a SelectionDAG here). A SelectionDAG is a representation of a basic block in LLVM. It consists of SDNodes. There are a few relevant types of SDNodes used in a SelectionDAG:

  1. Instruction nodes have a list of operands on the top, and result values on the bottom. The name of the instruction is in the middle. In Fracture, the operands will either be an immediate value, a value copied from a register, or the chain. The result values will either be the chain or a value to be copied to a register. The idea is that the given instruction will be performed on the non-chain operands, and any return values will be placed in the non-chain results (to be copied to a register).
  2. CopyFromReg (CFR) nodes have two operands: the chain and a register. They have two results: a value and the chain. They denote that the value of the given register is being copied and placed into the result.
  3. CopyToReg (C2R) nodes have three operands: the chain, a register, and a value. They have one result: the chain. They denote that the value of the given register is being replaced with the value operand.
  4. Register nodes are only used by the CFR nodes and the C2R nodes. They indicate which register the operation is using. There is one register node for each individual register used in the SelectionDAG. If multiple CFRs and C2Rs use the same register, they will all point to the same node.
  5. The EntryToken node is the first node in the graph and represents the start of the basic block. There is only one EntryToken per SelectionDAG.
  6. The GraphRoot node is the last node in the graph and represents the end of the basic block. There is only one GraphRoot per SelectrionDAG.

The chain represents the flow of control. Starting from the EntryToken, the chain can be followed all the way down to the GraphRoot. This is the order in which the instructions must execute for the program to work correctly. The instructions that do not involve the chain can be executed in any order, as long as all of their operands have executed in the right order.

In LLVM, a SelectionDAG is built from IR code, and used to select the optimal target instructions to implement the IR (hence SelectionDAG). In Fracture, the SelectionDAG will be built from the target instructions and used to generate IR.

Step 5 - Create the Machine DAG

Important Code:

Description:

As you recall from step 3, we are now in possession of a well formed MachineFunction. This MachineFunction contains a list of MachineBasicBlocks. We must convert each of these MachineBasicBlocks into a SelectionDAG. To handle this conversion, there is an appropriately named function called createDAGfromMachineBasicBlock.

How much detail should I go into here?

Step 6 - Create the IR DAG

Important Code:

Description: