Last updated at Wed, 03 Mar 2021 16:18:01 GMT

Like most binary analysis tools, an emphasis is placed on the generation of graphs that describe the behavior and relationships of different aspects of a program, function, or component. For instance, a Control Flow Graph (CFG) describes the paths that could be taken from an entry node (such as a function's entry point) to a terminator node (such as where a function returns).

For the purpose of this post, we'll use the following disassembly of a function that was taken from an executable:

000000B9 push ebp
000000BA mov ebp,esp
000000BC push ecx
000000BD mov eax,[ebp 0x8]
000000C0 mov cl,[eax]
000000C2 mov [ebp-0x4],cl
000000C5 mov edx,[ebp 0x8]
000000C8 movsx eax,byte [edx]
000000CB cmp eax,0x63
000000D0 jnz 0xdd
000000D2 mov cl,[ebp-0x4]
000000D5 add cl,0x5
000000D8 mov [ebp-0x4],cl
000000DB jmp short 0xfe
000000DD mov edx,[ebp 0x8]
000000E0 movsx eax,byte [edx]
000000E3 cmp eax,0x64
000000E8 jnz 0xf5
000000EA mov cl,[ebp-0x4]
000000ED add cl,0xa
000000F0 mov [ebp-0x4],cl
000000F3 jmp short 0xfe
000000F5 mov dl,[ebp-0x4]
000000F8 add dl,0x1
000000FB mov [ebp-0x4],dl
000000FE movsx eax,byte [ebp-0x4]
00000102 mov esp,ebp
00000104 pop ebp
00000105 ret

Aside from the standard CFG, it is also possible to generate many other graphs that help describe the behavior of this example function. For instance, one might be interested in understanding how data propagates within the context of a function or a set of functions. This would be useful in cases where you want to see what areas of code come in contact with data that you control, either directly or indirectly. One of the most obvious applications for this is fuzzing, where you can immediately identify the areas of code that are affected by a buffer that you introduce. This type of graph will be referred to as a Data Dependency Graph (DDG). It will show how memory reads and writes are affected both in terms of what address they read from or write to and in terms of what values they read or write. Here's a DDG that was generated for this example function:




In this graph, the blue edges represent a value dependency. This means that the instruction derives the value that is written to a location in memory from the instruction that the edge points to. The red edges represent an address dependency. This means that the instruction derives the address it is reading from or writing to from the instruction that the edge points to. As we can see from the graph, all of the instructions either directly or indirectly depend on the value assigned to ebp for the address that they operate on in the case of this function.

Another type of graph that may be useful is a Code Dependency Graph (CDG). In this graph, the execution order dependencies of instructions can be seen such that it's obvious that instruction X must execute before instruction Y for a specific code path. These dependencies are determined by looking at which instruction was last to affect a register, an area of pseudo-virtual memory, the stack, or a processor flag that a given instruction depends on. For instance, if a function has an xor eax, eax followed by an inc eax, it's obvious that the inc must follow the xor because the xor was the last to modify the value of eax. This simple logic can be used to determine execution order dependencies. One of the ways in which this type of graph can be useful is when researching things like instruction re-ordering for the purpose of generating code that has an equivalent behavior but different form (also referred to as polymorphism). A great example of where a graph like this was used was during the development of the Shikata Ga Nai encoder in the Metasploit Framework. For this example, though, a graph that describes the code dependencies of the example function is shown below:




In this graph, the blue edges represent a register dependency. This means that the dependent instruction expects to derive the value of a specific register from the instruction pointed to by the edge. The red edges represent a memory dependency. This means that the instruction depends on a specific portion of pseudo-virtual memory being set by the instruction pointed to by the edge. The green edges represent a processor flag dependency. This means that the dependent function is expected to derive the state of the processor flag from the instruction it depends on. What all this means, in the end, is that instructions can be moved around in any order, so long as they do not invalidate the dependencies of another instruction by doing so.

Again, these types of graphs have been described and used before by others, so don't take this post as to claim that they are something original. Instead, it's simply meant to illustrate the types of things that MSRT will be designed to support. In fact, the code responsible for collecting the data that is used to generate these graphs is encapsulated in two separate analysis modules, both of which currently total around 100 lines of code each. With that said, the library still has a long way to go.