feat(string): add UTF-8 string conversion and validation functions#2528
feat(string): add UTF-8 string conversion and validation functions#2528bobtista wants to merge 4 commits intoTheSuperHackers:mainfrom
Conversation
|
| Filename | Overview |
|---|---|
| Core/Libraries/Source/WWVegas/WWLib/utf8.cpp | New UTF-8 conversion and validation utility; overlong encoding checks are present, but Utf8_Validate does not reject UTF-16 surrogate halves and Utf8_Num_Bytes uses single-line if bodies contrary to the codebase style rule. |
| Core/Libraries/Source/WWVegas/WWLib/utf8.h | New public header; uses #pragma once correctly, API is well-documented, naming follows Snake_Case convention, dest/src argument order matches memcpy. |
| Core/GameEngine/Source/Common/System/AsciiString.cpp | translate() correctly uses Utf16Le_To_Utf8_Len + ensureUniqueBufferOfSize(len+1) + Utf16Le_To_Utf8(buf, len+1, …); the conversion function guarantees a null terminator since written < destLen. The null-terminator placement fix in ensureUniqueBufferOfSize (moved outside the strToCopy guard) is a correct companion fix. |
| Core/GameEngine/Source/Common/System/UnicodeString.cpp | Symmetric to AsciiString; translate() correctly allocates len+1 wchar_t elements and relies on Utf8_To_Utf16Le to null-terminate. No issues found. |
| Core/GameEngine/Source/GameNetwork/GameSpy/Thread/ThreadUtils.cpp | Correctly migrated to WWLib helpers; std::wstring/std::string manage their own null terminator at data()[size()], so passing ret.resize(len) and then calling the conversion with destLen=len is safe even though the helper skips writing its own null terminator when written==destLen. |
| Core/Libraries/Source/WWVegas/WWLib/CMakeLists.txt | utf8.cpp/utf8.h added to the unconditional WWLIB_SRC block; the developer confirmed in a prior review thread that the #error non-Windows guard is intentional and cross-platform support is a deferred TODO, so no action needed here. |
Prompt To Fix All With AI
This is a comment left during a code review.
Path: Core/Libraries/Source/WWVegas/WWLib/utf8.cpp
Line: 32-38
Comment:
**Single-line `if` bodies violate codebase style rule**
Each `if` body should appear on a separate line from its condition to allow precise debugger breakpoint placement.
```suggestion
size_t Utf8_Num_Bytes(char lead)
{
if ((lead & 0x80) == 0x00)
return 1;
if ((lead & 0xE0) == 0xC0)
return 2;
if ((lead & 0xF0) == 0xE0)
return 3;
if ((lead & 0xF8) == 0xF0)
return 4;
return 0;
}
```
**Rule Used:** Always place if/else/for/while statement bodies on... ([source](https://app.greptile.com/review/custom-context?memory=16b9b669-b823-49be-ba5b-2bd30ff3ba6d))
**Learnt From**
[TheSuperHackers/GeneralsGameCode#2067](https://github.com/TheSuperHackers/GeneralsGameCode/pull/2067#discussion_r2706274626)
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: Core/Libraries/Source/WWVegas/WWLib/utf8.cpp
Line: 83-93
Comment:
**`Utf8_Validate` accepts UTF-16 surrogate codepoints**
RFC 3629 §3 explicitly forbids encoding surrogate pairs (U+D800–U+DFFF) in UTF-8, but the current validator has no check for them. A 3-byte sequence with lead byte `0xED` and second byte in `[0xA0, 0xBF]` (e.g. `\xED\xA0\x80` = U+D800) passes all existing checks — it's not caught by the E0 overlong guard since the lead is ED, not E0. Any downstream code that relies on `Utf8_Validate` to screen player names or other network input could receive lone-surrogate UTF-8 and encounter undefined behaviour in Win32 `MultiByteToWideChar`.
Add a guard for the surrogate block after the existing 3-byte checks:
```cpp
// Reject UTF-16 surrogate halves (U+D800–U+DFFF)
if (bytes == 3 && s[i] == 0xED && s[i + 1] >= 0xA0)
return false;
```
How can I resolve this? If you propose a fix, please make it concise.Reviews (10): Last reviewed commit: "refactor(utf8): rename to Utf16Le_To_Utf..." | Re-trigger Greptile
|
| DEBUG_LOG(("ParseAsciiStringToGameInfo - slotValue name is empty, quitting")); | ||
| break; | ||
| } | ||
| // TheSuperHackers @fix bobtista 02/04/2026 Validate UTF-8 encoding before processing player name |
There was a problem hiding this comment.
This appears to be beyond the scope of this change. It is not describes in the title. Perhaps is a separate change?
xezon
left a comment
There was a problem hiding this comment.
Get_Utf8_Size should not include the null terminator in its size.
| if (dest_size == 0) | ||
| return; | ||
| return false; | ||
| int result = MultiByteToWideChar(CP_UTF8, 0, src, -1, dest, (int)dest_size); |
There was a problem hiding this comment.
What happens if dest_size does not have enough room for a null terminator?
There was a problem hiding this comment.
The doc says "Does not write a null terminator" - should we add more comments? Change the functions to always null-terminate? What do you want here?
There was a problem hiding this comment.
Maybe make it behave like strncpy? Writes null if there is room, otherwise not.
The only issue with the current interface then is that we will not know if it wrote the null terminator. Maybe it should return size_t instead, returning the number of characters it writes or would like to write? MultiByteToWideChar also does that.
I suggest to think this through and design the function interface in a way that it can be conveniently be used for fixed size strings (std::string, AsciiString) and large throwaway buffers (char arr[512]).
The behavior definitely needs to be documented.
| return result != 0; | ||
| } | ||
|
|
||
| bool Utf8_To_Unicode(wchar_t* dest, const char* src, size_t srcLen, size_t dest_size) |
There was a problem hiding this comment.
Is the naming choice for src len and dest size deliberate?
There was a problem hiding this comment.
What I meant is one says len and the other says size. Is this intended? I would expect to use len or size consistently, unless one refers to number of characters and the other to number of bytes.
|
The diff now shows unrelated changes. |
39d7229 to
40393b8
Compare
Try again cleaned up the commits and force pushed |
| } | ||
| ensureUniqueBufferOfSize((Int)size + 1, false, nullptr, nullptr); | ||
| char* buf = peek(); | ||
| if (!Unicode_To_Utf8(buf, src, srcLen, size)) |
There was a problem hiding this comment.
So is this translating UTF16LE from windows into UTF8 that is then stored within AsciiString?
If so this may help with the paths issue with usernames and paths not using Latin characters, but file handling functions will need updating to use unicode variants instead of Ascii.
|
I wonder if we should also add a flag to state that the Ascii string is holding a UTF8 string? I guess all normal ascii characters will display properly, it's just extended character sets that will look garbled. |
…ming, null-terminate when room
| utf8.cpp | ||
| utf8.h |
There was a problem hiding this comment.
utf8.cpp unconditionally listed outside the if(WIN32) block
utf8.cpp's entire implementation is wrapped in #ifdef _WIN32 … #else #error "Not implemented" #endif, and the callers (AsciiString.cpp, UnicodeString.cpp, ThreadUtils.cpp) now unconditionally include utf8.h and call its functions. Any non-Windows build will hit the #error immediately. Other Win32-specific sources (registry.cpp, verchk.cpp, etc.) are correctly placed inside the if(WIN32) block below — utf8.cpp should follow the same pattern.
# Remove from unconditional WWLIB_SRC and add to the existing if(WIN32) block:
if(WIN32)
list(APPEND WWLIB_SRC
...
utf8.cpp
utf8.h
...
)
endif()A proper cross-platform implementation (or #ifdef guards in the callers) would be needed before removing the #error.
Prompt To Fix With AI
This is a comment left during a code review.
Path: Core/Libraries/Source/WWVegas/WWLib/CMakeLists.txt
Line: 136-137
Comment:
**`utf8.cpp` unconditionally listed outside the `if(WIN32)` block**
`utf8.cpp`'s entire implementation is wrapped in `#ifdef _WIN32 … #else #error "Not implemented" #endif`, and the callers (`AsciiString.cpp`, `UnicodeString.cpp`, `ThreadUtils.cpp`) now unconditionally include `utf8.h` and call its functions. Any non-Windows build will hit the `#error` immediately. Other Win32-specific sources (`registry.cpp`, `verchk.cpp`, etc.) are correctly placed inside the `if(WIN32)` block below — `utf8.cpp` should follow the same pattern.
```cmake
# Remove from unconditional WWLIB_SRC and add to the existing if(WIN32) block:
if(WIN32)
list(APPEND WWLIB_SRC
...
utf8.cpp
utf8.h
...
)
endif()
```
A proper cross-platform implementation (or `#ifdef` guards in the callers) would be needed before removing the `#error`.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
I don't think we need to change this - cross-platform implementation is a todo and the #error is intentional as a placeholder
|
Slight nitpick on function naming, shouldn't it be Utf16 rather than Wchar as Wchar is only Utf16 on windows yet we will likely want the conversion functions on other platforms as well for at least csf parsing if nothing else. |
Yeah, but it's consistent with the other naming as it is for now. How about we keep the naming and using a uint16_t/char16_t type internally rather than wchar_t when we make non windows paths? Or would you rather we rename to something like Utf16_To_Utf8? |
If anything it would be Utf16Le_To_Utf8, windows uses the little endian utf16 format. Not sure if Utf16Be is used much anywhere but worth being concise with it. |
Adds UTF-8 string handling to WWLib and plumbs it through the codebase, replacing the GameSpy-specific Win32 wrappers with a shared implementation.
Picks up the work proposed in #2045 by @slurmlord, with API adjustments per the review from @xezon.
New:
WWLib/utf8.h/utf8.cppUtf8_Num_Bytes(char lead)— byte count of a UTF-8 character from its lead byteUtf8_Trailing_Invalid_Bytes(const char* str, size_t length)— count of invalid trailing bytes due to an incomplete multi-byte sequenceUtf8_Validate(const char* str)/Utf8_Validate(const char* str, size_t length)— returns true if the string is valid UTF-8 per RFC 3629 (rejects overlong encodings and codepoints above U+10FFFF)Utf16Le_To_Utf8_Len(const wchar_t* src, size_t srcLen)/Utf8_To_Utf16Le_Len(const char* src, size_t srcLen)— required output size, not counting null terminatorUtf16Le_To_Utf8(char* dest, size_t destLen, const wchar_t* src, size_t srcLen)Utf8_To_Utf16Le(wchar_t* dest, size_t destLen, const char* src, size_t srcLen)Naming follows the
Snake_Caseconvention used in WWVegas. The conversion functions return the number of units required: if the return is<= destLenthe conversion was written (with a null terminator if room remains); if> destLenthe buffer was too small and the return value tells the caller how much to allocate;0indicates a conversion failure. Implementation is Windows-only and treatswchar_tas UTF-16LE, wrapping Win32WideCharToMultiByte/MultiByteToWideChar.AsciiString::translate/UnicodeString::translateReplaces the broken implementations that only worked for 7-bit ASCII (marked
@todosince the original code) with proper UTF-8 conversion using the new WWLib functions.ThreadUtils.cppReplaces raw Win32 API calls in
MultiByteToWideCharSingleLineandWideCharStringToMultiBytewith the new WWLib functions, usingstd::string::resize/std::wstring::resizeto avoid duplicate allocation.