In an ideal world, the SDK I'm writing would be C++14-compliant rather than require C++17, but there are so many nice things added in C++17 that are hard to give up - one of which is std::filesystem for handling paths in a platform-agnostic fashion.

So to deal with this, we use ghc::filesystem - a C++14-compliant backport of the std::filesystem namespace. It's really quite nice: we no longer have to worry about a whole bunch of string parsing, dealing with platform-specific path separators, or even things like checking whether a file exists.

filesystem::path tmpPath("/some/path");
tmpPath /= "somefile.txt";

std::cout << tmpPath.c_str(); // prints /some/path/somefile.txt
if (filesystem::exists(tmpPath))
{
  //do something to read in the file here
}

Everything had been working so far - until we started wrapping the SDK in an Unreal Engine 4 plugin for our partners who want to access the SDK via Blueprint.

Tests work just fine in the editor, and in a bunch of our standalone builds. But when we delivered a pre-release alpha build to one of our partners, we started to get some really strange behaviour when they tried to run it for the first time:

LogSDK: [11:04:00:967ms][Trace][File] subpath <redacted>\SDKName\S-1-5-21-1508652904-4122140339-2986375464-1001 skipped because it exists

Hang on, though, this is the very first run! How can the folder already exist?

So we ask the client to create the missing folder structure and try again, and sure enough, after manually creating the folders, everything works. So why does the SDK think that the folder exists when it clearly doesn't?

Digging into the implementation of ghc::filesystem::exists, we learn that exists creates a file_status object containing attributes about a directory or file - its type, size and so on - and then immediately checks whether its type is not_found:

GHC_INLINE bool exists(file_status s) noexcept
{
    return status_known(s) && s.type() != file_type::not_found;
}

GHC_INLINE bool exists(const path& p)
{
    return exists(status(p));
}

So we needed to dig a little further into the implementation of status. This function eventually calls filesystem::detail::status_ex to construct the file_status object. Of course, the calls that actually inspect the file's attributes are platform-specific, but on Windows and derivatives the relevant part of this function looks like the following:

WIN32_FILE_ATTRIBUTE_DATA attr;
if (!::GetFileAttributesExW(p.wstring().c_str(), GetFileExInfoStandard, &attr)) {
    ec = detail::make_system_error();
}
// later on
if (ec) {
    if (detail::is_not_found_error(ec)) {
        return file_status(file_type::not_found);
    }
    return file_status(file_type::none);
}

return detail::status_from_INFO(p, &attr, ec, sz, lwt);

So, basically, the implementation calls GetFileAttributesExW and, if that fails, stores the value of GetLastError into the std::error_code ec inside detail::make_system_error. Further on in the function, if there was an error and that error indicates the file or directory could not be found, it creates a file_status with the type set to file_type::not_found; for any other error, it sets the type to file_type::none.
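
For reference, detail::make_system_error is essentially a thin helper that captures the calling thread's last error into a std::error_code - conceptually something like this sketch (the exact ghc::filesystem implementation may differ slightly):

GHC_INLINE std::error_code make_system_error(uint32_t err = 0)
{
    // If no explicit error value was supplied, read the thread's last error.
    return std::error_code(err ? static_cast<int>(err) : static_cast<int>(::GetLastError()),
                           std::system_category());
}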

To better inspect this, we inserted some additional logging into the filesystem header so we could see the program flow without needing to remotely debug the client application.

However, the logs we got as a result seemed to indicate that GetFileAttributesExW was failing on a directory that didn't exist (as it should), but with an error code of 0! Because the error code was clear (0), status_from_INFO was being run on an invalid set of attributes, and the SDK was incorrectly skipping the creation of folders in a filepath - leading to errors later on when trying to read a file in a folder that didn't exist:


LogSDK: [15:15:10:866ms][Trace][File] status_ex !::GetFileAttributesExW fails with error 0

LogSDK: [15:15:10:866ms][Trace][File] status_ex reading status from info

LogSDK: [15:15:10:866ms][Trace][File] subpath <redacted>\S-1-5-21-1508652904-4122140339-2986375464-1001 skipped because it exists

LogSDK: [15:15:12:250ms][Error][File] Read from <redacted>\S-1-5-21-1508652904-4122140339-2986375464-1001\user.json failed, error = 6

The documentation for GetFileAttributesExW states that:

  • It returns 0 on failure or non-zero on success
  • It calls SetLastError internally, so you can check a more specific error code with GetLastError if it fails

So, if GetFileAttributesExW returns 0, GetLastError should give us a non-zero value, not 0! What gives?
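
For reference, the contract the documentation describes - and the one ghc::filesystem relies on - looks roughly like this minimal sketch (the non-existent path here is purely illustrative):

#include <windows.h>
#include <iostream>

void checkMissingPath()
{
    WIN32_FILE_ATTRIBUTE_DATA attr;
    // Per the documentation: a zero return means failure, and GetLastError()
    // should then hold a non-zero code such as ERROR_PATH_NOT_FOUND (3).
    if (!::GetFileAttributesExW(L"C:\\does\\not\\exist", GetFileExInfoStandard, &attr))
    {
        std::cout << "GetFileAttributesExW failed, GetLastError() = " << ::GetLastError() << "\n";
    }
}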

At this point I thought I'd better start tracing through GetFileAttributesExW to see what was going on. Here's the relevant part of that implementation:

00007FFD55DC7191  | call qword ptr ds:[<&RtlDosPathNameToRelativeNtPathName_U_WithStatus>]  
                                    
00007FFD55DC71E5  | call qword ptr ds:[<&ZwQueryFullAttributesFile>]                        

00007FFD55DC7279  | call <kernelbase.BaseSetLastNTError>                                    

Single-stepping through this, I could see that, if the file didn't exist, BaseSetLastNTError was in fact correctly calling SetLastError with the expected non-zero error code.

So something else was clearing the value before we could put it into our std::error_code. I kept tracing through the code after the return from GetFileAttributesExW, trying to identify what function was mutating the value of GetLastError (for those of you using VS to debug something similar, you can use $err as a watch expression).

Sure enough, there was a function that set the value back to 0 (ERROR_SUCCESS).

00007FF68FC0DE64  | call <void __cdecl std::basic_string<wchar_t>::_Tidy_deallocate(void) __ptr64>

The deallocator for a string changing the value of GetLastError? Whaat?

Stepping into this function shows a call to operator delete:

00007FF68FA29082  | call <void __cdecl operator delete(void * __ptr64,unsigned __int64)>

then into FMemory::Free:

00007FF6906671E3  | call <static void __cdecl FMemory::Free(void * __ptr64)>

Further inside Free we see references to the current global allocator instance that UE4 is using, and the call to TlsGetValue:

00007FF690728209  | mov ecx,dword ptr ds:[<public: static unsigned int FMallocBinned2::Binned2TlsSlot>]

00007FF690728217  | call qword ptr ds:[<&TlsGetValueStub>]

Astute observers will notice this in the documentation for TlsGetValue:

If the function fails, the return value is zero. To get extended error information, call GetLastError.

The data stored in a TLS slot can have a value of 0 because it still has its initial value or because the thread called the TlsSetValue function with 0. Therefore, if the return value is 0, you must check whether GetLastError returns ERROR_SUCCESS before determining that the function has failed. If GetLastError returns ERROR_SUCCESS, then the function has succeeded and the data stored in the TLS slot is 0. Otherwise, the function has failed.

Functions that return indications of failure call SetLastError when they fail. They generally do not call SetLastError when they succeed. The TlsGetValue function is an exception to this general rule. The TlsGetValue function calls SetLastError to clear a thread's last error when it succeeds. That allows checking for the error-free retrieval of zero values.
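
In other words, any code path that calls TlsGetValue successfully will reset the thread's last error to 0. As a rough sketch (this is hypothetical, not FMallocBinned2's actual implementation), an allocator free path that looks up a per-thread structure via TLS ends up doing exactly that:

#include <windows.h>

// Hypothetical sketch of a TLS-backed free path - not UE4's real code.
void HypotheticalBinnedFree(void* ptr, DWORD tlsSlotIndex)
{
    // TlsGetValue succeeds here, so (per the documentation above) it calls
    // SetLastError(ERROR_SUCCESS) as a side effect - wiping out whatever error
    // the caller's previous Win32 call had recorded.
    void* perThreadCache = ::TlsGetValue(tlsSlotIndex);

    if (perThreadCache != nullptr)
    {
        // ... return the block to the per-thread free list ...
        return;
    }

    // Fall back to a plain heap free when no per-thread cache exists.
    ::HeapFree(::GetProcessHeap(), 0, ptr);
}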

So, the UE4 binned memory allocator clobbers the result of GetLastError during deallocations. But why are we getting a deallocation here? Recall the line invoking GetFileAttributesExW:

if (!::GetFileAttributesExW(p.wstring().c_str(), GetFileExInfoStandard, &attr)) {

The call to wstring() creates a temporary object when it converts the path's internal representation into a UTF-16 wide string suitable for passing into Win32 functions.

With thanks to cppreference:

All temporary objects are destroyed as the last step in evaluating the full-expression that (lexically) contains the point where they were created...

In this instance, the full-expression ends at the closing ) of the if condition.

So the temporary created by wstring() gets destroyed at that ), i.e. after the function has been evaluated but before any of the statements inside the if have executed. Its deallocation then clobbers the GetLastError value, causing the value stored in the error code to incorrectly be 0 (indicating no error).
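
Annotated, the sequence of events inside that single full-expression looks something like this:

// 1. p.wstring() constructs a temporary std::wstring.
// 2. GetFileAttributesExW fails and calls SetLastError with the real error code.
// 3. At the closing ')', the temporary is destroyed; under UE4 the deallocation
//    goes through FMemory::Free -> TlsGetValue, which resets the last error to 0.
if (!::GetFileAttributesExW(p.wstring().c_str(), GetFileExInfoStandard, &attr)) {
    // 4. make_system_error() reads GetLastError() and sees 0 instead of the real error.
    ec = detail::make_system_error();
}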

What's particularly pernicious about this type of bug, though, is that line-by-line source debugging won't show it unless you 'step into' every call. If execution is paused at this if statement and you 'step over', the debugger will step over both the call to GetFileAttributesExW and the call to the destructor/deallocator for the string at the same time, making it look like GetFileAttributesExW is setting the value of GetLastError to 0.

Why does this happen in the build on the client's machine, though, and not in our own testing?

This is because Unreal supports a number of different allocators and selects a default based on factors such as the build configuration, whether you are building game binaries or the editor, and so on. Take a look at WindowsPlatformMemory.cpp if you want to see the actual logic for selecting the default allocator.

So how do we work around this?

The simplest thing would be to have the UE4 binned allocator push and pop the value of GetLastError such that it isn't modified by the allocator, but

  • that's the sort of change that would take a while to get upstreamed, and our client isn't using a fork of the engine at the moment
  • it doesn't protect the SDK from use in other situations where the allocator has been modified to exhibit the same behaviour

As a result, in practice, the simplest change that can be made on the SDK side is to prolong the lifetime of the temporary by making it no longer a temporary - store the result of wstring() in a local variable and then call c_str() on that variable in the call. The resulting deallocation occurs at the end of the enclosing scope rather than just after the evaluation of the function:

auto tmpString = p.wstring();
if (!::GetFileAttributesExW(tmpString.c_str(), GetFileExInfoStandard, &attr)) {
    ec = detail::make_system_error();
}

This has the advantage of simplicity, but it is also quite tedious: everywhere the filesystem namespace (and indeed the SDK as a whole) uses the value of GetLastError in this way, changes have to be made to ensure that nothing is deallocated before GetLastError can be called.

Also, what about functions that take multiple string parameters? It gets rather unwieldy to have to 'lift' all of those temporaries into the enclosing scope.
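
For example, a call with two path arguments (MoveFileExW here is purely an illustration, not necessarily something the SDK calls, and from/to are hypothetical paths) needs every wide-string temporary lifted into the enclosing scope:

// Both temporaries have to be lifted so that neither deallocation can
// clobber GetLastError before we read it.
auto fromString = from.wstring();
auto toString = to.wstring();
if (!::MoveFileExW(fromString.c_str(), toString.c_str(), MOVEFILE_REPLACE_EXISTING)) {
    ec = detail::make_system_error();
}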

Alternatively, the temporaries can be made to outlive the point where we read GetLastError by wrapping calls to the Win32 APIs in another function: the temporaries created in the caller's full-expression aren't destroyed until that enclosing function returns, so inside the wrapper we can safely capture the value of GetLastError:

template <typename Fn, Fn fn, typename... Args>
typename std::result_of<Fn(Args...)>::type wrapper(std::error_code& ec, Args... args)
{
    typename std::result_of<Fn(Args...)>::type RetVal = fn(std::forward<Args>(args)...);
    ec = make_system_error(GetLastError());
    return RetVal;
}
#define PROTECT_LASTERROR(FUNC) ghc::filesystem::detail::wrapper<decltype(&FUNC), &FUNC>

The wrapper function template deduces the arguments of the wrapped function into parameter pack Args, and uses std::result_of as a helper to retrieve the return value type (so we can store it while we check GetLastError()).

wrapper takes an error_code by reference as the first parameter, so we can store the GetLastError value into it and have it visible in the calling scope. Finally, the PROTECT_LASTERROR macro simplifies use of the wrapper itself:

std::error_code ec;
if (PROTECT_LASTERROR(GetFileAttributesExW)(ec, detail::fromUtf8<std::wstring>(p.u8string()).c_str(), GetFileExInfoStandard, &attr))
{
    ec.clear();
    // ...
}
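
For clarity, that call is just the macro expanding to the fully spelled-out template instantiation:

// What PROTECT_LASTERROR(GetFileAttributesExW)(...) expands to:
ghc::filesystem::detail::wrapper<decltype(&GetFileAttributesExW), &GetFileAttributesExW>(
    ec, detail::fromUtf8<std::wstring>(p.u8string()).c_str(), GetFileExInfoStandard, &attr);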

There is a caveat to this approach: because the wrapper is agnostic to the success or failure of the wrapped function, it always stores GetLastError() in ec. This means we need to explicitly re-clear the value in the enclosing scope if the wrapped function succeeded, because most functions don't clear GetLastError on success and we don't want to leave a stale error value in ec.

However, even with this limitation, it eliminates the need to manually promote temporaries to named variables in the enclosing scope, and it makes it clear that we're doing something here specifically to guard the value of GetLastError.