Vince's CSV Parser
CSVParserCore currently owns BOM detection through ParserChunkOptions::scan_bom. That works, but it means BOM handling is still part of the lower-level byte parser instead of the source/window orchestration layer. MmapParser and StreamParser now share CSVParserDriverBase::utf8_bom(), but the actual scan/strip behavior still happens inside the parser core used by serial and speculative parsing.
The next architectural cleanup is to make BOM handling a window-boundary concern: detect a Unicode BOM once, reject unsupported encodings early, strip a UTF-8 BOM before parsing, and feed BOM-free byte views into both serial and speculative parsers.
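As a rough illustration of the "detect once, reject early, strip UTF-8" step, the sketch below classifies the leading bytes of the first window. Every name in it (BomKind, BomScan, detect_bom, strip_utf8_bom) is hypothetical and not part of the library, and it assumes that UTF-16 and UTF-32 are the unsupported encodings to reject.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string_view>

// Hypothetical names throughout; none of these exist in the library.
enum class BomKind { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

struct BomScan {
    BomKind kind;
    std::size_t length;  // BOM size in bytes (0 when no BOM is present)
};

inline bool has_prefix(std::string_view s, std::string_view prefix) {
    return s.size() >= prefix.size() && s.compare(0, prefix.size(), prefix) == 0;
}

// Classify the BOM, if any, at the start of the first source window.
inline BomScan detect_bom(std::string_view head) {
    using namespace std::string_view_literals;
    // UTF-32 LE is tested before UTF-16 LE because both begin with FF FE.
    // (FF FE 00 00 is inherently ambiguous; this sketch treats it as UTF-32 LE.)
    if (has_prefix(head, "\xFF\xFE\x00\x00"sv)) return {BomKind::Utf32LE, 4};
    if (has_prefix(head, "\x00\x00\xFE\xFF"sv)) return {BomKind::Utf32BE, 4};
    if (has_prefix(head, "\xEF\xBB\xBF"sv))     return {BomKind::Utf8, 3};
    if (has_prefix(head, "\xFF\xFE"sv))         return {BomKind::Utf16LE, 2};
    if (has_prefix(head, "\xFE\xFF"sv))         return {BomKind::Utf16BE, 2};
    return {BomKind::None, 0};
}

// Return a BOM-free view. Assumption: only UTF-8 input is supported,
// so any UTF-16/UTF-32 BOM is rejected before parsing starts.
inline std::string_view strip_utf8_bom(std::string_view head) {
    const BomScan scan = detect_bom(head);
    if (scan.kind != BomKind::None && scan.kind != BomKind::Utf8) {
        throw std::runtime_error("unsupported encoding: non-UTF-8 BOM");
    }
    assert(scan.length <= head.size());
    return head.substr(scan.length);
}
```

Checking the four-byte UTF-32 patterns first is a deliberate ordering choice: testing FF FE alone would misreport UTF-32 LE input as UTF-16 LE.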
- Keep MmapParser and StreamParser behavior identical.
- Preserve CSVReader::utf8_bom() reporting behavior.
- Do not make CSVReader::iterator multi-pass or cache source chunks.

Move BOM scan state into the common parse orchestration path, likely CSVParseOrchestrator, or a small helper owned by it.
For the first source window only:
- Pass chunk.substr(skip) to the serial or speculative parser.
- Increase CSVParseWindowResult::complete_prefix_length by the skipped byte count so source adapters still advance by source-byte offsets.

After that first scan, all parser-core invocations should receive ParserChunkOptions(..., false) or an equivalent path that disables core-level BOM scanning.
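The first-window handling can be sketched as a small helper that strips the BOM once and then restores the skipped bytes in the reported prefix length. This is only a sketch under stated assumptions: WindowResult stands in for CSVParseWindowResult, BomWindowFilter and parse_complete_lines are hypothetical, and the first window is assumed to hold at least three bytes whenever the source does.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for CSVParseWindowResult; only the field
// discussed in the text is modeled.
struct WindowResult {
    std::size_t complete_prefix_length = 0;  // bytes fully consumed
};

// Hypothetical orchestrator-owned helper: strips a UTF-8 BOM from the
// first window only, then reports lengths in original source bytes.
class BomWindowFilter {
public:
    template <typename ParseFn>
    WindowResult parse_window(std::string_view chunk, ParseFn&& parse) {
        std::size_t skip = 0;
        if (!first_window_done_) {
            first_window_done_ = true;
            constexpr std::string_view kUtf8Bom("\xEF\xBB\xBF", 3);
            if (chunk.substr(0, kUtf8Bom.size()) == kUtf8Bom) {
                skip = kUtf8Bom.size();
                utf8_bom_ = true;
            }
        }
        assert(skip <= chunk.size());

        // The parser core only ever sees a BOM-free view.
        WindowResult result = parse(chunk.substr(skip));

        // Add the skipped bytes back so the source adapter advances by
        // source-byte offsets, not by offsets into the stripped view.
        result.complete_prefix_length += skip;
        return result;
    }

    bool utf8_bom() const { return utf8_bom_; }

private:
    bool first_window_done_ = false;
    bool utf8_bom_ = false;
};

// Toy stand-in for a parser core: consumes through the last newline.
inline WindowResult parse_complete_lines(std::string_view view) {
    const std::size_t last_newline = view.rfind('\n');
    return {last_newline == std::string_view::npos ? 0 : last_newline + 1};
}
```

With this shape, a source adapter can keep computing its remainder as chunk.substr(result.complete_prefix_length) and never needs to know whether a BOM was removed.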
The key invariant is that source adapters speak in original source-byte offsets, while parser cores may see a BOM-free view.
- The base_offset passed into parsing should still describe the original source.
- Offsets stored in RawCSVData must remain correct relative to the backing chunk view they reference.
- The complete_prefix_length returned to source adapters must include any stripped BOM bytes; otherwise mmap/stream remainder handling will re-read or retain the wrong prefix.

Add focused tests for both constructor paths using Catch2 SECTIONs:
- CSVReader::utf8_bom() is true after reading UTF-8 BOM input.
- Stripped BOM bytes are accounted for correctly in stream_pos_, mmap_pos, or leftover_.

Remove the scan_bom plumbing only after tests prove it is no longer needed by serial or speculative paths. Update include/internal/ARCHITECTURE.md if ownership of BOM handling moves from CSVParserCore to the orchestrator.