Last updated at Thu, 04 Nov 2021 19:47:45 GMT

What is this thing?

Researchers at the University of Cambridge and the University of Edinburgh recently published a paper on an attack technique they call “Trojan Source.” The attack targets a weakness in Unicode, the text-encoding standard that allows computers to handle text across many different languages, to trick compilers into emitting binaries that do not match the logic visible in source code. In other words, what a developer or security analyst sees in source code can differ from what the compiler interprets, which makes the attack hard to spot by inspection alone. The weakness arises from Unicode’s bidirectional (“BiDi”) algorithm and affects most compilers or, perhaps more accurately, most editing and code review tooling: these tools rest on the fundamental assumption that source code will be compiled the way it is displayed to the human eye.

How the attack works.

It is possible, and often necessary, to have both left-to-right and right-to-left glyphs appear in the same sentence. A classic example from O’Reilly’s “Unicode Explained” book shows Arabic embedded in an English sentence and the direction readers familiar with both languages will read the section in:

The official Unicode site also has additional information and examples.

There are a few options available to creators when a document or section of a document needs to support bidirectional content, one of which is to insert “invisible” control characters that dictate the directionality of the text that follows them. This is the mechanism the “Trojan Source” attack abuses. Let’s use one of the examples from the paper to illustrate what’s going on.
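
For reference, here is a minimal Python sketch of the control characters in question. It lists the nine explicit directional formatting characters from the Unicode Bidirectional Algorithm and counts how many are hiding in a comment line reconstructed from the paper's example. The names `BIDI_CONTROLS` and `describe` are illustrative choices made here, not part of any tool mentioned in this post.

```python
import unicodedata

# The nine explicit directional formatting characters (embeddings,
# overrides, isolates, and their terminators).
BIDI_CONTROLS = (
    "\u202A",  # LRE
    "\u202B",  # RLE
    "\u202C",  # PDF
    "\u202D",  # LRO
    "\u202E",  # RLO
    "\u2066",  # LRI
    "\u2067",  # RLI
    "\u2068",  # FSI
    "\u2069",  # PDI
)


def describe(ch):
    """Return the code point and official Unicode name of a character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in BIDI_CONTROLS:
    print(describe(ch))

# A comment line with RLO, LRI, PDI, and LRI embedded in it. None of the
# four characters is drawn on screen, but all of them reach the compiler.
comment = "/*\u202E } \u2066if (isAdmin)\u2069 \u2066 begin admins only */"
hidden = [ch for ch in comment if ch in BIDI_CONTROLS]
print(f"{len(comment)} characters in the line, {len(hidden)} of them invisible")
```

The point of the sketch is only that these characters are ordinary code points in the file, so any tool that inspects the raw text rather than the rendered text can see them.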

The screenshot above is from the GitHub repository associated with the paper and shows C source code that looks like it should not print anything when compiled and run. (Also note the very explicit safety banner, which you should take very seriously in any source code where you see it displayed.)

When we copy that code from the browser and paste it into the popular Sublime Text editor with the Gremlins package installed and enabled, we can see the attempted shenanigans pretty clearly:

The line number sidebar shows where sneaky directives have been inserted, and the usually invisible content is explicitly highlighted and not interpreted, so you can see what’s actually getting compiled. In this case, one is always “admin” when they run this program. The bottom line is that you cannot fully trust just your eyes without some assistance.
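
The same "highlight it instead of interpreting it" idea can be approximated in a few lines of Python. This is a rough sketch, not how the Gremlins package is implemented; `reveal` is a hypothetical helper name, and the sample line is reconstructed from the paper's example.

```python
# The nine explicit BiDi formatting characters, as a set for fast lookup.
BIDI_CONTROLS = frozenset(
    "\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069"
)


def reveal(text):
    """Replace each BiDi control character with a visible <U+XXXX> tag."""
    return "".join(
        f"<U+{ord(ch):04X}>" if ch in BIDI_CONTROLS else ch
        for ch in text
    )


line = "/*\u202E } \u2066if (isAdmin)\u2069 \u2066 begin admins only */"
print(reveal(line))
# prints: /*<U+202E> } <U+2066>if (isAdmin)<U+2069> <U+2066> begin admins only */
```

With the directives rendered as plain tags, the text appears in the order the compiler reads it: the comment closes before `if (isAdmin)` ever appears.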

Note that the Linux cat command (available on Windows via the Windows Subsystem for Linux and on macOS by installing the GNU version of the utility) can also be used to display these invisible gremlins:

cat -A -v commentint-out.c
#include <stdio.h>$
#include <stdbool.h>$
int main() {$
    bool isAdmin = false;$
    /*M-bM-^@M-. } M-bM-^AM-&if (isAdmin)M-bM-^AM-) M-bM-^AM-& begin admins only */$
        printf("You are an admin.\n");$
    /* end admins only M-bM-^@M-. { M-bM-^AM-&*/$
    return 0;$
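
The `M-b`, `M-^@`, and `M-.` sequences are how cat spells out non-ASCII bytes: `M-` means "add 0x80" and `^` marks a control character. The following Python sketch reverses that notation to recover the actual code points. `decode_cat_v` is a hypothetical helper written for this post, and it assumes the original text contains no literal `M-` or `^` sequences, which it could not tell apart from the notation.

```python
import re

# One token of `cat -v` output: M-^X, M-x, ^X, or any other single character.
TOKEN = re.compile(r"M-\^(.)|M-(.)|\^(.)|(.)", re.DOTALL)


def decode_cat_v(shown):
    """Turn `cat -v` notation back into the raw bytes it represents."""
    out = bytearray()
    for meta_ctrl, meta, ctrl, plain in TOKEN.findall(shown):
        if meta_ctrl:    # M-^X: control character with the high bit set
            out.append(0x80 | (ord(meta_ctrl) ^ 0x40))
        elif meta:       # M-x: printable character with the high bit set
            out.append(0x80 | ord(meta))
        elif ctrl:       # ^X: plain control character
            out.append(ord(ctrl) ^ 0x40)
        else:            # anything else is shown as-is
            out += plain.encode("utf-8")
    return bytes(out)


shown = ("/*M-bM-^@M-. } M-bM-^AM-&if (isAdmin)M-bM-^AM-) "
         "M-bM-^AM-& begin admins only */")
text = decode_cat_v(shown).decode("utf-8")
print([f"U+{ord(ch):04X}" for ch in text if ord(ch) > 0x7F])
# prints: ['U+202E', 'U+2066', 'U+2069', 'U+2066']
```

So `M-bM-^@M-.` is the byte sequence E2 80 AE, which is the UTF-8 encoding of U+202E (right-to-left override), and the other two sequences are U+2066 and U+2069 (the isolate controls). The trailing `$` that `-A` adds at each line end is not handled here and should be stripped first.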

Unfortunately, GitHub’s safety banner and code-editor plugins do not scale very well. Thankfully, Red Hat has come to the rescue with a simple Python script which can help us identify potential issues across an entire codebase with relative ease. It should also be possible to use this script in pre-commit hooks or in CI/CD workflows to prevent malicious code from entering into production.
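
To give a sense of what such a check involves, here is a minimal sketch of a directory scanner. To be clear, this is not the Red Hat script; it is an illustration written for this post, and `scan_file`, `scan_tree`, and `main` are hypothetical names. It assumes files are UTF-8 text, so binary files in the tree may produce false positives.

```python
import os
import sys

# The nine explicit BiDi formatting characters.
BIDI_CONTROLS = frozenset(
    "\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069"
)


def scan_file(path):
    """Yield (line_number, column, code_point) for each BiDi control found."""
    # errors="replace" keeps the scan going on bytes that are not valid UTF-8.
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line_no, line in enumerate(handle, start=1):
            for col, ch in enumerate(line, start=1):
                if ch in BIDI_CONTROLS:
                    yield line_no, col, f"U+{ord(ch):04X}"


def scan_tree(root):
    """Walk a directory tree and collect every finding as a tuple."""
    findings = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip version-control metadata.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            for line_no, col, code in scan_file(path):
                findings.append((path, line_no, col, code))
    return findings


def main(argv):
    """Print findings and return 1 if any exist, so a hook or CI job fails."""
    root = argv[1] if len(argv) > 1 else "."
    findings = scan_tree(root)
    for path, line_no, col, code in findings:
        print(f"{path}:{line_no}:{col}: BiDi control character {code}")
    return 1 if findings else 0
```

Ending the file with `sys.exit(main(sys.argv))` would turn it into a command-line check whose non-zero exit status blocks a commit or fails a pipeline step. A real deployment would likely also want an allowlist for files that legitimately contain right-to-left text.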

CVSSv3 9.8?! Orly?!

While this isn’t really a “vulnerability” in the traditional sense of the word, it’s been assigned CVE-2021-42574 and given a “Critical” CVSSv3 score of 9.8. (The “PetitPotam” attack chain targeting Windows domains is another example of a technique that was recently assigned a CVE.) It’s a little puzzling why CVE-2021-42574 merited a “Critical” severity score, though. According to our calculations, this weakness should be more like a 5.6 on the CVSSv3 scale.

Should I be super scared?

It’s an interesting attack, and its universality is certainly attention-grabbing. With that said, there are some caveats to both novelty and exploitability. Attack techniques that leverage Unicode’s text expression aren’t new. The CVSS score assigned to this is overblown. To exploit this weakness, an attacker would need to have direct access to developers’ workstations, source code management system, or CI pipelines. If an attacker has direct access to your source code management system, frankly, you probably have bigger problems than this attack. Note that said “attacker” could be a legitimate, malicious insider; those types of attackers are notoriously difficult to fully defend against.

What should I do?

You should apply patches from vendors whose products you rely on just as you normally would, keeping in mind that because this flaw is present in so many tooling implementations, you could apply many patches and still be considered “vulnerable” in other implementations. The better thing to do would be to apply a fairly straightforward mitigation: disallow BiDi directives in your code base if your code and comments are written in a single text direction (for example, only English or only Arabic).

As noted above, you should absolutely heed the Unicode safety warnings (if available) in any source code repositories you use, and strongly consider using something like the aforementioned Red Hat Unicode directionality directive checker-script in source code control and continuous integration and deployment workflows.

We advise prioritizing truly critical patches and limiting service and system exposure before worrying about source code-level attacks that require local or physical access.

