Last updated at Sat, 19 Aug 2017 17:00:24 GMT

Shellcode is an exercise in trade-offs.

To be really flexible and fit in the most exploits, shellcode must be small.  On the other side of the scale, there are certain features that you need or want, each adding to the size. For instance, doing DNS resolution in the first stage payload is useful, but (in Windows) requires adding 80 bytes to the stager. So we have to balance size, which is very important for compatibility with some exploits that have limited buffers to work with, with features and reliability, which are important for world domination.

Metasploit's existing stagers are usually small enough, and our encoders get rid of bad characters for you, so it doesn't make sense to spend a lot of time writing your own payloads or optimizing to get rid of pesky bytes. Sometimes, though, a few bytes can make a big difference.  For example, the exploit for MS08-067, ms08_067_netapi.rb has a buffer size of 400 bytes; we'll come back to this in a moment.

block_api

block_api is a brilliant piece of kit that rummages through MZ headers looking for the function we want to call so that finding a pointer to, say, "InternetOpenA" which is portable and reliable on all versions of Windows. Because all of our Windows payloads use it, a win here can make all our Windows payloads a little bit smaller.

The first win comes from the fact that x86 has several ways of addressing memory. This is the original code:

   add eax, edx           ; Add the modules base address
   mov eax, [eax+120]     ; Get export tables RVA

The mov instruction here is using the "mov reg, r/m" form, which allows us to do some simple math on a register (adding 120) without having to store the result. The "r/m" argument is a little more flexible, though. It was intended to be able to reference tables and can take "[ base scale*index disp32 ]", where base would normally be the beginning of the table, scale would be the size of its elements, index would be the index of the element you want, and disp32 the offset within that element.

Armed with this knowledge, both of the above instructions can be condensed into one:

  mov eax, [eax+edx+120]    ; Get export tables RVA

for a one-byte saving. Every byte is sacred, after all.

The second reduction comes from the fact that the designers of x86 intended the ECX register to be used as a loop counter and thus added several instructions that treat it specially. jecxz is one that allows us to jump without having to explicitly test a register. By using ECX for our EAT pointer instead of EAX, we can turn this:

  test eax, eax          ; Test if no export address table is present
  jz get_next_mod1       ; If no EAT present, process the next module

into this:

  jecxz get_next_mod1    ; If no EAT present, process the next module

saving another 2 bytes.

reverse_http(s)

That's all great for improving your shellcode golf score, but the big win came in the reverse_http and reverse_https payloads.

The first thing I noticed about reverse_http is that it used the time-honored tradition of jmp'ing ahead, then call'ing backwards and popping the return address to get the address of a string on the stack (in this case, it was the hostname and URI to callback to). Then it would store that value in a register and later push it as an argument to a function. Since the call instruction already puts the value on the stack, I simply rearranged it to be in the argument setup instead of beforehand, e.g. something like this pseudocode:

  jmp get_uri
got_uri:
  pop ebx
  push eax
  push eax
  push ebx
  call ...
get_uri:
  call got_uri
  db "/12345", 0x00

became this:

  push eax
  push eax
  jmp get_uri
got_uri:
  call ...
get_uri:
  call got_uri
  db "/12345", 0x00

If we have an instruction like this:

  mov esi, eax

and we don't need eax to keep its value (in this case we don't), we can replace this it with

  xchg esi, eax

and save another byte.

The next savings came from a need for zeros. In almost every function call made in reverse_http(s), we need a zero (or a NULL) for at least one argument.  In fact, we need zero so often that it makes sense to just save it in a register and use it over and over. Previously this was ad-hoc, done at the beginning of each function with a different register. By zeroing one register at the beginning and using it throughout the payload, we can save even more space.

Before this change, reverse_https was 368 bytes unencoded. Encoding with x86/call4_dword_xor knocks it up to 392 bytes. With x86/jmp_call_additive, it is 397 bytes. With the added stuff that ms08_067_netapi needs for fixing up the stack, it comes to 404 and 409, respectively. If you'll recall, that's too large for ms08_067_netapi. The encoders that produce decoding stubs with less overhead also produce badchars for this exploit, so shikata_ga_nai encoding (for instance) will not work.

After this change, reverse_https is 350 bytes unencoded. With all of ms08_067_netapi's restrictions, we can now encode it to a size that will fit. And there was much rejoicing.

Happy hacking.