jamjury's blog

Memory-mapped stdout

By jamjury, 7 months ago, in English

In my last blog post I wrote about mapping stdin into memory. This time we're going to be mapping stdout.

First, we need a handle to stdout. That's easy enough with the Win32 API function GetStdHandle:

HANDLE Stdout = GetStdHandle(STD_OUTPUT_HANDLE);

Alternatively, we could use the NT API NtCurrentTeb directly, which GetStdHandle uses under the hood, but that's a little more verbose:

HANDLE Stdout = NtCurrentTeb()->ProcessEnvironmentBlock->ProcessParameters->StdOutputHandle;

Next we need to map our file into memory. Windows (or any other platform) doesn't support write-only pages, so we have to map into read-write pages. Let's try to follow what we did for stdin (CreateFileMapping):

HANDLE FileMapping = CreateFileMapping(
    Stdout,          // [in] File
    NULL,            // [in] FileMappingAttributes [optional]
    PAGE_READWRITE,  // [in] Protect
    0,    /* one  */ // [in] MaximumSizeHigh
    4096, /* page */ // [in] MaximumSizeLow
    NULL             // [in] Name [optional]
);

Unfortunately, we get "Access is denied", because stdout is not readable. What can we do about it? We can reopen the file with read access!

First, let's get the name of the output file. We can do it with Win32 API GetFinalPathNameByHandleW. We can't get VOLUME_NAME_DOS or VOLUME_NAME_GUID, so let's use VOLUME_NAME_NT:

WCHAR FileName[MAX_PATH];
DWORD ret = GetFinalPathNameByHandleW(Stdout, FileName, MAX_PATH, VOLUME_NAME_NT);

Again, we could instead directly call the NT API that GetFinalPathNameByHandleW uses under the hood, NtQueryObject:

struct { UNICODE_STRING Name; WCHAR Buffer[MAX_PATH]; } FileNameBuffer;
NtQueryObject(Stdout, ObjectNameInformation, &FileNameBuffer, sizeof(FileNameBuffer), NULL);

As a result we get something like this: \Device\ImDisk1\54345f194c63c344b90cbfcd5944cfd3\run-a6c9a90c30e45e273a5ca250be6b2efb\run\output.fd0138e687.txt

This means that CodeForces uses ImDisk to create a RAM disk that stores the output files, which speeds up writing. Nice!

Now we need to open this file. The Win32 API function for this is CreateFile, which, despite its name, also opens existing files:

HANDLE File = CreateFileW(
    FileName,                           // [in] lpFileName
    GENERIC_READ | GENERIC_WRITE,       // [in] dwDesiredAccess
    FILE_SHARE_READ | FILE_SHARE_WRITE, // [in] dwShareMode
    NULL,                               // [in] lpSecurityAttributes [optional]
    OPEN_EXISTING,                      // [in] dwCreationDisposition
    0,                                  // [in] dwFlagsAndAttributes
    NULL                                // [in] hTemplateFile [optional]
);

But if we do it just like that, we get "The system cannot find the path specified." What we have to do instead is replace "\Device" at the beginning of FileName with "\\?" (in a C string literal you'd escape the backslashes: "\\\\?"). Alternatively, we can use the underlying NT API NtOpenFile, which accepts NT object names directly:

HANDLE File;
OBJECT_ATTRIBUTES ObjectAttributes;
IO_STATUS_BLOCK IoStatusBlock;

InitializeObjectAttributes(&ObjectAttributes, &FileNameBuffer.Name, 0, NULL, NULL);
NtOpenFile(
   &File,                              // [out] FileHandle
   FILE_READ_DATA | FILE_WRITE_DATA,   // [in]  DesiredAccess
   &ObjectAttributes,                  // [in]  ObjectAttributes
   &IoStatusBlock,                     // [out] IoStatusBlock
   FILE_SHARE_READ | FILE_SHARE_WRITE, // [in]  ShareAccess
   0                                   // [in]  OpenOptions
);

If we open the file without FILE_SHARE_READ | FILE_SHARE_WRITE, we get STATUS_SHARING_VIOLATION.
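For the CreateFileW route, the prefix replacement described above is plain string manipulation. Here is a minimal sketch, assuming the name returned by GetFinalPathNameByHandleW starts with "\Device\". The helper name nt_path_to_win32 is hypothetical and not part of any Windows API:

```cpp
#include <string>
#include <string_view>

// Hypothetical helper (not a Windows API): turns an NT object name such as
// \Device\ImDisk1\dir\file.txt into \\?\ImDisk1\dir\file.txt, following the
// "\Device" -> "\\?" replacement described in the post.
std::wstring nt_path_to_win32(std::wstring_view nt_path) {
    constexpr std::wstring_view device_prefix = L"\\Device\\";
    constexpr std::wstring_view win32_prefix = L"\\\\?\\";

    // Leave names that do not start with \Device\ untouched.
    if (nt_path.substr(0, device_prefix.size()) != device_prefix) {
        return std::wstring(nt_path);
    }

    std::wstring result(win32_prefix);
    result.append(nt_path.substr(device_prefix.size()));
    return result;
}
```

The returned string is what would be passed as lpFileName to CreateFileW. Names without the \Device\ prefix come back unchanged.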

Now that we've opened the file for both reading and writing we can map it and write to it without any issues.

const char* msg = "Hello from memory-mapped stdout\n";
size_t len = strlen(msg);
HANDLE map = CreateFileMapping(File, NULL, PAGE_READWRITE, 0, len, NULL);
void *view = MapViewOfFile(map, FILE_MAP_WRITE, 0, 0, 0);
memcpy(view, msg, len);

Except we get a funny bug from the CodeForces runner.

Stack trace

com.codeforces.contester.exception.ProcessorException: Unexpected error.
com.codeforces.contester.processor.impl.CustomTestSubmitProcessor.internalProcess(CustomTestSubmitProcessor.java:73)
com.codeforces.contester.processor.impl.CustomTestSubmitProcessor.process(CustomTestSubmitProcessor.java:46)
com.codeforces.contester.processor.impl.CustomTestSubmitProcessor$$$$EnhancerByGuice$$$$37943460.GUICE$TRAMPOLINE(<generated>)
com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:74)
com.codeforces.contester.ioc.ContesterModule$ExceptionHandlingInterceptor.invoke(ContesterModule.java:103)
com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:75)
com.google.inject.internal.InterceptorStackCallback.invoke(InterceptorStackCallback.java:55)
com.codeforces.contester.processor.impl.CustomTestSubmitProcessor$$$$EnhancerByGuice$$$$37943460.process(<generated>)
com.codeforces.contester.processor.impl.CustomTestSubmitProcessor$$$$EnhancerByGuice$$$$37943460.GUICE$TRAMPOLINE(<generated>)
com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:74)
com.codeforces.contester.ioc.ContesterModule$ExceptionHandlingInterceptor.invoke(ContesterModule.java:103)
com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:75)
com.google.inject.internal.InterceptorStackCallback.invoke(InterceptorStackCallback.java:55)
com.codeforces.contester.processor.impl.CustomTestSubmitProcessor$$$$EnhancerByGuice$$$$37943460.process(<generated>)
com.codeforces.contester.processor.RunnableFactory$ProcessorRunnable.run(RunnableFactory.java:102)
com.codeforces.contester.ContesterRunnable.lambda$internalRun$3(ContesterRunnable.java:293)
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
java.base/java.lang.Thread.run(Thread.java:833)

This also tells us that the CodeForces runner is written in Java, lol.

After some digging around, we find that the CodeForces runner really doesn't like the string "PAGE_READWRITE" appearing anywhere in the code. So we just replace the macro by hand with its value, 4.

A-a-and we finally get...

Hello from memory-mapped stdout

That's it!

Next time we'll use some other NT APIs to not just map a statically sized output file, but also extend it while it's mapped. For this we'll need MEM_RESERVE, which is only present on Windows. We'll also pack it all into a neat class, like we did last time with input. See you there!

P.S.

Note that the above only works in the CodeForces runner. It won't work if you run it from cmd with e.g. main > output.txt, because cmd opens stdout in exclusive mode, so opening the file a second time fails with "The process cannot access the file because it is being used by another process." What we can do is first close stdout in the parent process by duplicating it with DUPLICATE_CLOSE_SOURCE, and then everything starts to work again.

PROCESS_BASIC_INFORMATION pbi;
HANDLE ParentProcess;
ULONG_PTR ParentProcessId;
HANDLE DupedFile;

NtQueryInformationProcess(GetCurrentProcess(), ProcessBasicInformation, &pbi, sizeof(pbi), NULL);
ParentProcessId = pbi.InheritedFromUniqueProcessId;
ParentProcess = OpenProcess(PROCESS_DUP_HANDLE, FALSE, ParentProcessId);
DuplicateHandle(ParentProcess, Stdout, GetCurrentProcess(), &DupedFile, 0, FALSE, DUPLICATE_CLOSE_SOURCE);


memory manipulation, input-output


Memory-mapped i/o

By jamjury, 23 months ago, in English

A memory-mapped input file is the fastest input method I know of. I used it for the last ICPC Challenge.

Unfortunately, it is not possible (without exploiting Windows kernel bugs) to also map the output file into memory on CF, because the handle lacks GENERIC_READ, which is required for a FILE_MAP_WRITE mapping. So the best we can do for output is to write the whole buffer at once (but that doesn't really speed things up).

#include <Windows.h> // GetStdHandle, GetFileSize, CreateFileMapping, MapViewOfFile, UnmapViewOfFile, WriteFile, CloseHandle
#include <charconv> // from_chars, to_chars
#include <cstring> // memcpy
#include <cstdlib> // malloc, free
#include <cstddef> // size_t

// #define FREE_IO_ON_EXIT
 
struct input {
    LPVOID view;
    const char *first, *last;
 
    input() noexcept {
        HANDLE input_handle = GetStdHandle(STD_INPUT_HANDLE);
        DWORD input_size = GetFileSize(
            input_handle, // [in]            HANDLE  hFile,
            NULL          // [out, optional] LPDWORD lpFileSizeHigh
        );
        HANDLE mapping_object = CreateFileMapping(
            input_handle,  // [in]           HANDLE                hFile
            NULL,          // [in, optional] LPSECURITY_ATTRIBUTES lpFileMappingAttributes
            PAGE_READONLY, // [in]           DWORD                 flProtect
            0, /* whole */ // [in]           DWORD                 dwMaximumSizeHigh
            0, /*  file */ // [in]           DWORD                 dwMaximumSizeLow
            NULL           // [in, optional] LPCSTR                lpName
        );
        view = MapViewOfFile(
            mapping_object,  // [in] HANDLE hFileMappingObject
            FILE_MAP_READ,   // [in] DWORD  dwDesiredAccess
            0,               // [in] DWORD  dwFileOffsetHigh
            0,               // [in] DWORD  dwFileOffsetLow
            0 /*whole file*/ // [in] SIZE_T dwNumberOfBytesToMap
        );

        first = (char*) view;
        last = first + input_size;
        #ifdef FREE_IO_ON_EXIT
            CloseHandle(input_handle);
            CloseHandle(mapping_object);
        #endif
    }
 
    int take_int() noexcept {
        int result;
        first = std::from_chars(first, last, result).ptr + 1;
        return result;
    }
 
    double take_double() noexcept {
        double result;
        first = std::from_chars(first, last, result).ptr + 1;
        return result;
    }
 
    #ifdef FREE_IO_ON_EXIT
        ~input() noexcept {
            UnmapViewOfFile(view);
        }
    #endif
} in;

struct output {
    constexpr static std::size_t buf_size = 32*1024*1024; // 32MB
    char *buf, *first, *last;
 
    output() noexcept
        : buf((char*) std::malloc(buf_size))
        , first(buf)
        , last(buf + buf_size)
    {}
 
    void put_int(int value) noexcept {
        first = std::to_chars(first, last, value).ptr;
    }
 
    template <std::size_t size>
    void put_str(const char (&str)[size]) noexcept {
        std::memcpy(first, str, size - 1);
        first += size - 1;
    }
 
    void put_char(char c) noexcept {
        *first++ = c;
    }
 
    ~output() noexcept {
        HANDLE output_handle = GetStdHandle(STD_OUTPUT_HANDLE);
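        DWORD written; // lpNumberOfBytesWritten may be NULL only when lpOverlapped is non-NULL
        WriteFile(output_handle, buf, (DWORD)(first - buf), &written, NULL);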
        #ifdef FREE_IO_ON_EXIT
            CloseHandle(output_handle);
            std::free(buf);
        #endif
    }
} out;

int main() {
    out.put_int(in.take_int());
}
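Note that take_int and take_double advance by exactly one character after each number (.ptr + 1), so they rely on tokens being separated by a single character. A more defensive variant, sketched here as an assumption rather than as part of the original implementation, skips any run of whitespace (for example "\r\n") before each token. It works on any in-memory buffer, including a mapped view. The helper name parse_ints is hypothetical:

```cpp
#include <charconv>
#include <string_view>
#include <system_error>
#include <vector>

// Hypothetical helper: parses every whitespace-separated int in `text`.
// Stops at the first token that is not a valid int.
std::vector<int> parse_ints(std::string_view text) {
    std::vector<int> values;
    const char* first = text.data();
    const char* last = first + text.size();

    while (first != last) {
        // Skip any run of separators instead of assuming exactly one.
        while (first != last &&
               (*first == ' ' || *first == '\n' || *first == '\r' || *first == '\t')) {
            ++first;
        }
        if (first == last) break;

        int value = 0;
        const auto [ptr, ec] = std::from_chars(first, last, value);
        if (ec != std::errc{}) break; // malformed or out-of-range token
        values.push_back(value);
        first = ptr;
    }
    return values;
}
```

The extra loop costs a comparison or two per token, so the single-character skip in the code above remains the faster choice when the input format is guaranteed.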

The benchmark (webarchive), run on my machine with gcc 13.2.0 (MinGW-W64), compiled with -std=c++20 -O2:

int, scanf         4.01   3.93   4.05
int, cin           9.31   0.64   9.21
int, mmap in       0.06   0.06   0.06

int, printf        2.25   2.27   2.22
int, cout          0.39   0.40   0.40
int, mmap out      0.39   0.38   0.39

*mmap in/out is this implementation


input-output, performance, c++, memory manipulation


Huawei half-precision floating-point range

By jamjury, 23 months ago, in English

In a problem from the currently running Huawei contest (Accuracy-Preserving Summation Algorithm), part of the task is to choose between the IEEE-754 binary64, binary32 and binary16 floating-point formats for summing numbers.

Apparently, the people in charge of the contest don't know that both Intel and AMD have supported fp16 since 2011-2012 (AMD Bulldozer and Intel Ivy Bridge), and that it's available in GCC 12+ and Clang 15+ as _Float16 and, since C++23, as std::float16_t.

So for the checker they wrote their own implementation, which differs from IEEE in two places:

the exponent range is [-16; +16] instead of [-15; +15];
during conversion from fp64, the least significant bits are just thrown away instead of being rounded.
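To make the second difference concrete, here is a small sketch that reduces a double to the 10 stored mantissa bits of fp16 in both ways: by clearing the low 42 bits, as the checker does, and by rounding to nearest with ties to even, which is the default IEEE-754 rounding mode. The function names are hypothetical, and the sketch deliberately ignores the exponent range, subnormals, infinities and NaNs:

```cpp
#include <cstdint>
#include <cstring>

// Keep the top 10 mantissa bits of a double and drop the rest (truncation).
double truncate_mantissa(double x) {
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits &= ~((std::uint64_t{1} << 42) - 1);
    std::memcpy(&x, &bits, sizeof bits);
    return x;
}

// Keep the top 10 mantissa bits, rounding to nearest with ties to even.
// A carry out of the mantissa propagates into the exponent field, which is
// exactly what should happen when rounding up to the next power of two.
double round_mantissa(double x) {
    const std::uint64_t low_mask = (std::uint64_t{1} << 42) - 1;
    const std::uint64_t half = std::uint64_t{1} << 41;

    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    const std::uint64_t low = bits & low_mask;
    bits &= ~low_mask;
    const bool kept_lsb_is_odd = ((bits >> 42) & 1) != 0;
    if (low > half || (low == half && kept_lsb_is_odd)) {
        bits += std::uint64_t{1} << 42;
    }
    std::memcpy(&x, &bits, sizeof bits);
    return x;
}
```

For 65520.0, truncation gives 65504 while rounding carries into the exponent and gives 65536, which is outside the binary16 range. That is why the IEEE overflow threshold sits at 65520 rather than 65536.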

Considering its importance in the problem, I decided to check which range of fp64 values doesn't overflow when converted to fp16, and how much the two differences above matter. So I wrote a little program that prints the ranges and decided to share the results here with anyone interested. It shows the ranges for "Huawei FP16", IEEE, and "Corrected Huawei" (without rounding, but with the correct exponent range).

Here are the results:

Huawei FP16
  MAX
    fp64: 131071.99999999999 (0x1.fffffffffffffp+16)
    fp16: 131008 (0x1.ffcp+16)
  OVERFLOW
    fp64: 131072 (0x1p+17)
    fp16: inf
  RANGE
    fp64: (-131072; 131072)
    fp16: [-131008; 131008]

IEEE-754 FP16
  MAX
    fp64: 65519.999999999993 (0x1.ffdffffffffffp+15)
    fp16: 65504 (0x1.ffcp+15)
  OVERFLOW
    fp64: 65520 (0x1.ffep+15)
    fp16: inf
  RANGE
    fp64: (-65520; 65520)
    fp16: [-65504; 65504]

Huawei FP16 with correct range
  MAX
    fp64: 65535.999999999993 (0x1.fffffffffffffp+15)
    fp16: 65504 (0x1.ffcp+15)
  OVERFLOW
    fp64: 65536 (0x1p+16)
    fp16: inf
  RANGE
    fp64: (-65536; 65536)
    fp16: [-65504; 65504]

https://godbolt.org/z/3s8GoKe8j

Source

#include <iostream> // cout
#include <cstdint>  // uint32_t, uint64_t
#include <cstring>  // memcpy
#include <cmath>    // nextafter

using namespace std;

//simulated fp16
class Float16 {
    static const uint32_t mantissaShift = 42;
    static const uint32_t expShiftMid   = 56;
    static const uint32_t expShiftOut   = 52;
    double dValue_;

public:
    Float16(double in) : dValue_(in) {
        uint64_t utmp;
        memcpy(&utmp, &dValue_, sizeof utmp);
        //zeroing mantissa bits starting from 11th (this is NOT rounding)
        utmp = utmp >> mantissaShift;
        utmp = utmp << mantissaShift;
        //setting masks for 5-bit exponent extraction out of 11-bit one
        const uint64_t maskExpMid = (63llu << expShiftMid);
        const uint64_t maskExpOut = (15llu << expShiftOut);
        const uint64_t maskExpLead = (1llu << 62);
        const uint64_t maskMantissaD = (1llu << 63) + maskExpLead + maskExpMid + maskExpOut;
        if (utmp & maskExpLead) {// checking leading bit, suspect overflow
            if (utmp & maskExpMid) { //Detected overflow if at least 1 bit is non-zero
                //Assign Inf with proper sign
                utmp = utmp | maskExpMid; //setting 1s in the middle 6 bits of the exponent
                utmp = utmp & maskMantissaD; //zeroing mantissa regardless of original values to prevent NaN
                utmp = utmp | maskExpOut; //setting 1s in the last 4 bits of exponent
            }
        } else { //checking small numbers according to exponent range
            if ((utmp & maskExpMid) != maskExpMid) { //Detected underflow if at least 1 bit is 0
                utmp = 0;
            }
        }
        memcpy(&dValue_, &utmp, sizeof utmp);
    }

    explicit operator double() { return dValue_; }
};


class CorrectFloat16 {
    static const uint32_t mantissaShift = 42;
    static const uint32_t expShiftMid   = 56;
    static const uint32_t expShiftOut   = 52;
    double dValue_;

public:
    CorrectFloat16(double in) : dValue_(in) {
        uint64_t utmp;
        memcpy(&utmp, &dValue_, sizeof utmp);
        utmp = utmp >> mantissaShift;
        utmp = utmp << mantissaShift;
        const uint64_t maskExpMid = (63llu << expShiftMid);
        const uint64_t maskExpOut = (15llu << expShiftOut);
        const uint64_t maskExpLead = (1llu << 62);
        const uint64_t maskMantissaD = (1llu << 63) + maskExpLead + maskExpMid + maskExpOut;
        if (utmp & maskExpLead) {
            if (utmp & maskExpMid || (utmp & maskExpOut) == maskExpOut) {                   // <- Changed here
                utmp = utmp | maskExpMid;
                utmp = utmp & maskMantissaD;
                utmp = utmp | maskExpOut;
            }
        } else {
            if ((utmp & maskExpMid) != maskExpMid) {
                utmp = 0;
            }
        }
        memcpy(&dValue_, &utmp, sizeof utmp);
    }

    explicit operator double() { return dValue_; }
};


#if __GNUC__ >= 13 && __cplusplus >= 202100L
    #include <stdfloat> // float16_t
#else
    typedef _Float16 float16_t; // GCC >= 12 || Clang >= 15
#endif

int main() {
    cout.precision(17);

    double SFP16_MAX = 0x1p17 - 0x1p-36; // 131071.99999999999;
    double SFP16_INF = nextafter(SFP16_MAX, INFINITY);
    cout << "Huawei FP16\n";
    cout << "  MAX\n";
    cout << "    fp64: " << defaultfloat << SFP16_MAX << " (" << hexfloat << SFP16_MAX << ")\n"
         << "    fp16: " << defaultfloat << double(Float16(SFP16_MAX)) << " (" << hexfloat << double(Float16(SFP16_MAX)) << ")\n";
    cout << "  OVERFLOW\n";
    cout << "    fp64: " << defaultfloat << SFP16_INF << " (" << hexfloat << SFP16_INF << ")\n"
         << "    fp16: " << defaultfloat << double(Float16(SFP16_INF)) << "\n";
    cout << "  RANGE\n";
    cout << "    fp64: (" << -SFP16_INF << "; " << SFP16_INF << ")\n"
         << "    fp16: [" << double(Float16(-SFP16_MAX)) << "; " << double(Float16(SFP16_MAX)) << "]\n";

    cout << "\n";

    double FP16_MAX = 0x1p16 - 0x1p-37 - 0x1p4; // 65519.99999999999;
    double FP16_INF = nextafter(FP16_MAX, INFINITY);
    cout << "IEEE-754 FP16\n";
    cout << "  MAX\n";
    cout << "    fp64: " << defaultfloat << FP16_MAX << " (" << hexfloat << FP16_MAX << ")\n"
         << "    fp16: " << defaultfloat << double(float16_t(FP16_MAX)) << " (" << hexfloat << double(float16_t(FP16_MAX)) << ")\n";
    cout << "  OVERFLOW\n";
    cout << "    fp64: " << defaultfloat << FP16_INF << " (" << hexfloat << FP16_INF << ")\n"
         << "    fp16: " << defaultfloat << double(float16_t(FP16_INF)) << "\n";
    cout << "  RANGE\n";
    cout << "    fp64: (" << -FP16_INF << "; " << FP16_INF << ")\n"
         << "    fp16: [" << double(float16_t(-FP16_MAX)) << "; " << double(float16_t(FP16_MAX)) << "]\n";

    cout << "\n";

    double CSFP16_MAX = 0x1p16 - 0x1p-37; // 65535.99999999999;
    double CSFP16_INF = nextafter(CSFP16_MAX, INFINITY);
    cout << "Huawei FP16 with correct range\n";
    cout << "  MAX\n";
    cout << "    fp64: " << defaultfloat << CSFP16_MAX << " (" << hexfloat << CSFP16_MAX << ")\n"
         << "    fp16: " << defaultfloat << double(CorrectFloat16(CSFP16_MAX)) << " (" << hexfloat << double(CorrectFloat16(CSFP16_MAX)) << ")\n";
    cout << "  OVERFLOW\n";
    cout << "    fp64: " << defaultfloat << CSFP16_INF << " (" << hexfloat << CSFP16_INF << ")\n"
         << "    fp16: " << defaultfloat << double(CorrectFloat16(CSFP16_INF)) << "\n";
    cout << "  RANGE\n";
    cout << "    fp64: (" << -CSFP16_INF << "; " << CSFP16_INF << ")\n"
         << "    fp16: [" << double(CorrectFloat16(-CSFP16_MAX)) << "; " << double(CorrectFloat16(CSFP16_MAX)) << "]\n";
}
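The overflow checks in the two simulated classes can be restated on the 11-bit exponent field of the double, which may make the thresholds in the results easier to see. In this sketch (helper names are hypothetical) the masks 0x400, 0x3F0 and 0xF correspond to maskExpLead, maskExpMid and maskExpOut shifted down by 52 bits:

```cpp
#include <cstdint>
#include <cstring>

// Biased 11-bit exponent field of a double (bias 1023).
unsigned exponent_field(double x) {
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    return static_cast<unsigned>((bits >> 52) & 0x7FF);
}

// Overflow condition of Float16: lead bit set and any middle bit set.
bool huawei_overflows(double x) {
    const unsigned e = exponent_field(x);
    return (e & 0x400) != 0 && (e & 0x3F0) != 0;
}

// Overflow condition of CorrectFloat16: additionally treats an all-ones
// low nibble (together with the lead bit) as overflow.
bool corrected_overflows(double x) {
    const unsigned e = exponent_field(x);
    return (e & 0x400) != 0 && ((e & 0x3F0) != 0 || (e & 0xF) == 0xF);
}
```

The first biased exponent that sets both the lead bit and a middle bit is 1040 (0x410), which corresponds to 2^17 = 131072. The corrected check already fires at 1039 (0x40F), which corresponds to 2^16 = 65536. Both match the OVERFLOW values printed above.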

I think it's kinda sloppy that the problem statement is inconsistent with the checker, but at least they've shared the checker's code.


icpcchallenge, huawei, floating number
