WorstFit: Unveiling Hidden Transformers in Windows ANSI!


This is a cross-post from DEVCORE. The research was first published at Black Hat Europe 2024. Personally, I would like to thank splitline, the co-author of this research article, whose help and ideas were invaluable. Please also give him a big shout-out!

TL;DR

The research unveils a new attack surface in Windows by exploiting Best-Fit, an internal charset conversion feature. Through our work, we successfully transformed this feature into several practical attacks, including Path Traversal, Argument Injection, and even RCE, affecting numerous well-known applications!

Given that the root cause spans compiler behavior, the C/C++ runtime, and developers’ mistakes, we also discuss the challenges of pushing fixes within the open-source ecosystem.

Get the latest update and slides on our website! → https://worst.fit/

Let’s imagine this: you’re a pentester, and your target website is running the following code. Can you pop a calc.exe with that?

```php
<?php
$url = 'https://example.tld/' . $_GET['path'] . '.txt';
system('wget.exe -q ' . escapeshellarg($url));
```

You can have a quick try on your own. The PHP code uses a secure way to spawn the command. Looks a bit hard, right?

Well, today, we would like to present a new technique to break through it!

Outline

- Decoding the Windows Encodings
  - The Early Days: ANSI and Code Pages
  - The Unicode Era: UTF-16
  - The Dual Era of Encoding
- It was the Best of Fit
- It was the Worst of Fit – The novel attack surface on Windows
  - The nightmare of East-Asia - CVE-2024-4577
  - Filename Smuggling
  - Argument Splitting
  - Environment Variable Confusion
- The Dusk–or Dawn–of the WorstFit
- Epilogue
  - Mitigations
  - Conclusion

Decoding the Windows Encodings

If you are a Windows user, you’re probably aware that the Windows operating system supports Unicode. This means we can seamlessly put emojis, áccènted letters, 𝒻𝒶𝓃𝒸𝓎 𝕤𝕪𝕞𝕓𝕠𝕝𝕤, and CJK characters pretty much anywhere — like file names, file contents, or even environment variables. But have you ever wondered how Windows manages to handle those non-ASCII characters?

Well, to describe this, let’s first dive into the history of encoding in Windows to understand how it all works.


The Early Days: ANSI and Code Pages

Code Page – Language
1250 – Central/Eastern European languages (e.g. Polish, Czech)
1251 – Cyrillic-based languages (e.g. Russian, Bulgarian)
1252 – Western European languages (e.g. English, German, French)
1253 – Greek
1254 – Turkish
1255 – Hebrew
1256 – Arabic
1257 – Baltic languages (e.g. Estonian, Latvian, Lithuanian)
1258 – Vietnamese
932 – Japanese
936 – Simplified Chinese
949 – Korean
950 – Traditional Chinese
874 – Thai

Windows initially used ANSI encoding, which relied on code pages such as those shown above, using 8 to 16 bits to represent a single character. While these mappings were effective for certain languages, they were unable to accommodate mixed or universal character sets.

For instance, back in the day, as a Taiwanese user, if my Japanese friend sent me an article written on their Windows computer, I’d probably end up with a scrambled mess of mojibake, because my code page 950 system couldn’t properly interpret the Japanese 932 code page.
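The mojibake scenario above can be reproduced with Python’s built-in code page codecs. This is just an illustrative sketch: the same raw bytes, written under code page 932 and read back under code page 950, no longer decode to the original text.

```python
# Japanese "Hello" as it would be typed on a CP932 (Japanese) system.
japanese = "こんにちは"

# The sender's CP932 machine stores these raw bytes on disk.
raw_bytes = japanese.encode("cp932")

# The receiver's CP950 (Traditional Chinese) machine reinterprets the
# very same bytes under its own code page, producing mojibake.
garbled = raw_bytes.decode("cp950", errors="replace")

print(garbled)  # a scrambled mess, not the original greeting
```

The bytes themselves never change; only the code page used to interpret them differs, which is exactly why cross-locale file exchanges broke in the ANSI era.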

To handle different encoding needs, Windows doesn’t rely on just one type of code page — there are actually two:

ACP (ANSI Code Page): Used for most applications and system settings, such as file operations or managing environment variables. Our research primarily focuses on this type of code page, as it significantly impacts the scenarios we’ll examine.

OEMCP (Original Equipment Manufacturer Code Page): Mainly used for device communication, such as reading or writing to the console.

To check which ACP (ANSI code page) you’re using, consider these methods:

Using PowerShell:

```
powershell.exe [Console]::OutputEncoding.WindowsCodePage
```

From the Registry:

```
reg query HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage /v ACP
```

Additionally, you might also have heard of chcp. However, be aware that chcp displays the OEMCP rather than the ACP, which is the focus of our research here.
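If Python happens to be installed, you can query the same information there as well; a small sketch, assuming a standard CPython build (on Windows, locale.getpreferredencoding() is derived from the ACP and typically reports a name like "cp1252"; on other platforms it usually reports UTF-8):

```python
import locale

# Ask the C runtime which encoding it considers the process default.
# On Windows this reflects the ANSI code page (ACP), e.g. "cp1252"
# on a Western European system.
acp_codec = locale.getpreferredencoding(False)
print(acp_codec)
```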

The Unicode Era: UTF-16

To address the limitations of code pages, Windows transitioned to Unicode in the mid-1990s. Unlike code pages, Unicode could represent characters from nearly all languages in a single standard.

Initially, Windows used UCS-2 for Unicode but soon upgraded to UTF-16, which uses 16 bits for most characters and expands to 32 bits for rarer ones (e.g. emojis, ancient scripts). Windows also switched to wide characters for core APIs like file systems, system information, and text processing.
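The 16-bit versus 32-bit split is easy to see with Python’s utf-16-le codec (the same little-endian layout Windows uses internally): characters in the Basic Multilingual Plane take one 16-bit code unit, while rarer characters need a surrogate pair.

```python
# A BMP character occupies a single 16-bit code unit in UTF-16.
bmp_char = "A"    # U+0041, inside the Basic Multilingual Plane

# An emoji sits outside the BMP and needs a surrogate pair.
emoji = "🙂"      # U+1F642

print(len(bmp_char.encode("utf-16-le")))  # 2
print(len(emoji.encode("utf-16-le")))     # 4
```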

Now you might be wondering: hey, what about the most popular Unicode encoding nowadays: UTF-8? Well, it’s already there, but still in a sort of beta phase. For most languages, the UTF-8 feature sadly isn’t enabled by default.


The Dual Era of Encoding

Even though Unicode became the backbone of Windows, Windows still needs to do what it always does: stay backward compatible. It still needs to support the old ANSI code pages. To achieve this, Windows implemented two different versions of its APIs:

ANSI APIs: A Windows code page version with the letter “A” postfix used to indicate “ANSI”. For example, GetEnvironmentVariableA function.

Unicode APIs: A Unicode version with the letter “W” postfix used to indicate “wide (character)”. For example, GetEnvironmentVariableW function.

This approach allows developers to easily obtain their desired data format by simply switching between the A-postfix and W-postfix APIs.

It sounds perfect – But wait, so how can a wide character UTF-16 string also be in the ANSI format? Aren’t they fundamentally different?

To illustrate this, let’s explore an example. Imagine we’re on an English (Windows-1252 code page) system with an environment variable ENV=Hello stored in the system. The data is internally stored as UTF-16 (wide character format), but we can retrieve it using both Unicode and ANSI APIs:

Unicode API: GetEnvironmentVariableW(L"ENV") ⭢ L"Hello" (Hex: 4800 6500 6C00 6C00 6F00 in UTF-16LE).

ANSI API: GetEnvironmentVariableA("ENV") ⭢ RtlUnicodeStringToAnsiString ⭢ "Hello" (Hex: 48 65 6C 6C 6F in ANSI).

For the Unicode API, there’s no problem—Unicode in, Unicode out, with no conversion needed. For the ANSI API, Windows applies an implicit conversion by calling RtlUnicodeStringToAnsiString (or sometimes WideCharToMultiByte) to convert the original Unicode string to an ANSI string. Since "Hello" consists only of ASCII characters, everything works perfectly and as expected.

But what happens if the environment variable contains a more complex string, like √π⁷≤∞, with a lot of non-ASCII characters?

Unicode API: GetEnvironmentVariableW(L"ENV") ⭢ L"√π⁷≤∞" (Hex: 1a22 c003 7720 6422 1e22 in UTF-16LE).

The Unicode API correctly returns the original string, as expected.

Now, what happens with the ANSI API? Are you able to guess the result?

ANSI API: GetEnvironmentVariableA("ENV") ⭢ RtlUnicodeStringToAnsiString ⭢ "vp7=8" (Hex: 76 70 37 3D 38 in ANSI) 🤯

Yep, the output is vp7=8. A strange result, right? I guess you can’t even figure out the connection between the original characters and their character codes!

This bizarre transformation is what’s known as “Best-Fit” behavior. As a result, the original string √π⁷≤∞ transforms into a nonsensical "vp7=8". This behavior highlights the pitfalls of relying on ANSI APIs when handling non-ASCII characters.
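To make the conversion concrete, here is a toy re-implementation in Python. The mapping table below is hypothetical and covers only the five characters from this example; the real Best-Fit tables live inside Windows’ code page data, behind conversions like WideCharToMultiByte.

```python
# Hypothetical mini "Best-Fit" table for Windows-1252, covering only
# the characters from the √π⁷≤∞ example in the text.
BEST_FIT_1252 = {
    "√": "v",  # U+221A SQUARE ROOT           -> 'v'
    "π": "p",  # U+03C0 GREEK SMALL LETTER PI -> 'p'
    "⁷": "7",  # U+2077 SUPERSCRIPT SEVEN     -> '7'
    "≤": "=",  # U+2264 LESS-THAN OR EQUAL TO -> '='
    "∞": "8",  # U+221E INFINITY              -> '8'
}

def best_fit(text: str) -> str:
    """Replace each character with its best-fit stand-in, if one exists."""
    return "".join(BEST_FIT_1252.get(ch, ch) for ch in text)

print(best_fit("√π⁷≤∞"))  # vp7=8
```

ASCII characters pass through untouched, which is exactly why the behavior goes unnoticed until a non-ASCII character sneaks into a filename, argument, or environment variable.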

And actually, it’s not just when using Windows APIs directly — this behavior also occurs with the non-wide-character versions of CRT (C runtime) functions like getenv. Surprisingly, it even applies when you receive arguments or environment variables through a seemingly straightforward non-wide-character main function like:

```c
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char* argv[], char* envp[]) {
    printf("test_env=%s\n", getenv("test_env"));
    for (int i = 0; i < argc; ++i)
        printf("argv[%d]=%s\n", i, argv[i]);
}
```

The same Best-Fit behavior applies to both the arguments and the environment variables. Here’s what happens when we run this code:

[Screenshot: the program’s output, with Best-Fit-converted arguments and environment variable]

This happens because, during compilation, the compiler inserts several functions and links the CRT DLLs for you, which internally rely on ANSI Windows APIs. As a result, the same Best-Fit behavior is triggered implicitly.

We keep talking about Best-Fit, but how does this quirky behavior actually work in the end?

It was the Best of Fit

In Windows, “Best-Fit” character conversion is a way the operating system handles situations where it needs to convert characters from UTF-16 to ANSI, but the exact character doesn’t exist in the target code page.

For instance, the ∞ (U+221E) symbol isn’t part of the Windows-1252 code page, so Microsoft decided to map it to the “closest” character: 8. Uh, okay. Yeah, I guess they kinda look similar, but
