Skip to content

[C++][Parquet] Saturate ApplicationVersion components instead of atoi UB#50013

Open
rootvector2 wants to merge 1 commit into
apache:mainfrom
rootvector2:parquet-version-atoi-overflow
Open

[C++][Parquet] Saturate ApplicationVersion components instead of atoi UB#50013
rootvector2 wants to merge 1 commit into
apache:mainfrom
rootvector2:parquet-version-atoi-overflow

Conversation

@rootvector2
Copy link
Copy Markdown

What

When reading a Parquet file, ApplicationVersion::ApplicationVersion(const std::string& created_by) parses the writer-supplied created_by metadata string into three integer fields (version.major, version.minor, version.patch). The current implementation in cpp/src/parquet/metadata.cc does this with std::atoi:

application_version_.version.major = atoi(version_major_string.c_str());  // line 1444
application_version_.version.minor = atoi(version_minor_string.c_str());  // line 1469
application_version_.version.patch = atoi(version_patch_string.c_str());  // line 1487

The substrings here are already filtered to be digit-only, but their length is not bounded. The created_by field is fully attacker-controlled in any untrusted Parquet file, so an input like

parquet-mr version 99999999999999999999.0.0 (build abcd)

reaches atoi with a value far outside the range of int. Per the C++ standard, std::atoi has undefined behavior when the result cannot be represented as an int ([c.strtoint]). In practice the value is unspecified, and the negative or otherwise garbage value that comes out then drives VersionLt / HasCorrectStatistics, controlling whether the reader chooses to trust the column statistics block of the file.

How to observe

A small reproducer in the same C++17 mode as Arrow:

#include <cstdio>
#include <cstdlib>
#include <string>
int main() {
    std::string s = "99999999999999999999";
    std::printf("atoi -> %d\n", std::atoi(s.c_str())); // UB; observed -1 here
}

UBSan flags the overflow inside the underlying strtol. The existing tests in metadata_test.cc only cover small inputs, which is why this has not been caught.

The fix

Replace the three atoi call sites with a small ParseUnsignedVersionComponent helper inside the anonymous namespace in metadata.cc. The helper uses std::strtoul (which is defined to set errno = ERANGE and return ULONG_MAX on overflow rather than invoking UB) and saturates to std::numeric_limits<int>::max() when the parsed value would not fit in int. The fix is local to the callee – callers (the constructor of ApplicationVersion, called from FileMetaData's thrift deserializer) are unchanged.

A new ApplicationVersion.VersionComponentOverflow test exercises:

  • very long all-digit components (saturate to INT_MAX),
  • exactly INT_MAX (representable),
  • INT_MAX + 1 (saturates).

Build/test evidence

Built parquet_objlib and parquet-internals-test locally; all 18 ApplicationVersion.* tests pass, including the new overflow case:

[----------] 18 tests from ApplicationVersion (0 ms total)
[  PASSED  ] 18 tests.

The version components in a Parquet file's `created_by` string are
attacker-controlled. std::atoi exhibits undefined behavior when the
parsed value overflows int (see [c.strtoint]). Replace the three
atoi calls in ApplicationVersionParser with a saturating helper that
uses std::strtoul and clamps to INT_MAX when the parsed value would
not fit in int. Add a regression test exercising overflow inputs.
@rootvector2 rootvector2 requested a review from wgtmac as a code owner May 21, 2026 12:36
@github-actions
Copy link
Copy Markdown

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant