# Vince's CSV Parser
This document follows one field from source bytes to user-facing CSVField. It is intentionally a narrative debugging guide, not a component inventory. Use it when a parsed value is wrong and you need to know where the bytes could have changed shape.
A CSV field is usually not copied while parsing.
- A source adapter (MmapParser or StreamParser) provides a byte window.
- CSVParseOrchestrator chooses serial parsing or speculative parallel parsing.
- CSVParserCore walks bytes and records field boundaries in RawCSVFieldList.
- CSVRow stores a shared pointer to RawCSVData.
- CSVRow slices either the original bytes or parser-realized quoted-field storage when the user asks for a field.
- CSVField performs scalar classification/conversion only when the user asks for it.

Speculative parsing changes how rows become safe to release. It does not change the public CSVRow / CSVField ownership model.
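The ownership model can be sketched in a few lines. This is an invented, simplified mirror (BackingData, MiniRow, and parse_one_row are not the library's types): a row stores only (start, length) offsets plus a shared pointer that keeps the immutable byte buffer alive.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Invented mirror of the ownership model: parsed bytes live in one shared
// immutable buffer; a row stores offsets plus a shared_ptr keeping it alive.
struct BackingData {
    std::string data;  // immutable source window
};

struct MiniRow {
    std::shared_ptr<BackingData> owner;             // keeps bytes alive
    std::vector<std::pair<size_t, size_t>> fields;  // (start, length)

    std::string_view field(size_t i) const {
        auto [start, len] = fields[i];
        return std::string_view(owner->data).substr(start, len);  // no copy
    }
};

// Toy boundary detection for one unquoted record: record offsets only,
// never build a std::string per field.
inline MiniRow parse_one_row(std::shared_ptr<BackingData> src) {
    MiniRow row{std::move(src), {}};
    const std::string& d = row.owner->data;
    size_t start = 0;
    for (size_t i = 0; i <= d.size(); ++i) {
        if (i == d.size() || d[i] == ',') {
            row.fields.emplace_back(start, i - start);
            start = i + 1;
        }
    }
    return row;
}
```

Note that dropping the caller's last handle to the buffer does not invalidate a row's field views; the row's own shared pointer keeps the backing bytes alive.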
The source-adapter layer lives in include/internal/parser/ and csv::internals::parser. The speculative-only helpers live in include/internal/speculative/ and csv::internals::speculative.
There are two source paths:
- MmapParser: memory-maps the source; the mapped bytes are owned through RawCSVData so CSVRow field slices do not dangle.
- StreamParser: used by the std::istream constructors and by the filename constructor on Emscripten. It reads into std::string windows, stores trailing partial-record bytes in leftover_, and prepends them to the next window.

Both paths feed byte windows into the same orchestrator/parser core. Bugs may still exist in only one path because source ownership, window construction, and remainder handling differ.
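The stream-side remainder mechanics can be illustrated with a toy splitter. Assumptions: split_window and its newline-only record detection are invented for illustration; the real parser tracks quote state, so a newline inside a quoted field would not end a record there.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <utility>

struct WindowSplit {
    std::string parseable;  // complete records, ends at a record terminator
    std::string leftover;   // trailing partial record, carried forward
};

// Combine the previous window's leftover with fresh bytes, hand the parser
// everything up to the last '\n', and keep the rest for the next read.
// (A real implementation must also respect quote state: a '\n' inside a
// quoted field does not terminate a record.)
inline WindowSplit split_window(std::string leftover, std::string_view fresh) {
    std::string window = std::move(leftover);
    window.append(fresh);
    size_t last_nl = window.rfind('\n');
    if (last_nl == std::string::npos)
        return {std::string(), std::move(window)};  // no complete record yet
    return {window.substr(0, last_nl + 1), window.substr(last_nl + 1)};
}
```

The invariant to check when debugging this path: bytes handed to the parser plus bytes held in the leftover always equal the bytes read so far, with no byte parsed twice.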
CSVReader does not parse bytes directly. It asks CSVReadScheduler to run a read cycle.
- With CSVFormat::threading(false), the scheduler runs the same read work synchronously on the caller thread.
- When built with CSV_ENABLE_THREADS=0, the scheduler is always synchronous.

Runtime threading opt-out also disables speculative parallel parsing for that reader. The row and field ownership model is otherwise unchanged.
Serial parsing is the simplest path:
CSVParserCore owns the DFA state for this path: delimiter transitions, quote state, and record termination.
For each field, the parser records chunk-local metadata (a start offset, a length, and an is_realized flag) instead of building a general std::string per value.
The start and length point into RawCSVData::data, which views the immutable backing source window. For fields that contain doubled quotes, the parser writes the unescaped bytes into RawCSVData::quote_arena and sets is_realized; for those fields, start and length refer to the arena instead. Fields without doubled quotes keep using the raw backing bytes.
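A minimal sketch of that realization rule, assuming FieldMeta, realize_quoted, and slice are invented names and that raw_field is the field body with the outer quotes already stripped:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

struct FieldMeta {
    size_t start;
    size_t length;
    bool is_realized;  // true => offsets refer to the arena, not raw bytes
};

// Unescape doubled quotes ("" -> ") once into a shared arena and return
// arena-relative metadata.
inline FieldMeta realize_quoted(std::string_view raw_field, std::string& quote_arena) {
    size_t start = quote_arena.size();
    for (size_t i = 0; i < raw_field.size(); ++i) {
        quote_arena.push_back(raw_field[i]);
        if (raw_field[i] == '"' && i + 1 < raw_field.size() && raw_field[i + 1] == '"')
            ++i;  // skip the second quote of a doubled pair
    }
    return {start, quote_arena.size() - start, true};
}

// Field access picks the buffer based on is_realized.
inline std::string_view slice(const FieldMeta& m, std::string_view raw, std::string_view arena) {
    return (m.is_realized ? arena : raw).substr(m.start, m.length);
}
```

The point of the arena is that unescaping happens exactly once, at parse time, and every later access is still a zero-copy slice.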
Speculative parsing is used only when threading is compiled in, runtime threading is enabled, speculative parsing is requested, and the source is large enough.
The high-level flow: the source is split into chunks, each chunk is parsed by a worker under a guessed starting state, and a validator releases rows only once those guesses are confirmed.
The key idea is that worker chunks need a guessed initial DFA state. The speculator asks one question:
Does this chunk probably start inside a quoted field?
The scanner uses lightweight quote/newline evidence and ambiguity heuristics to choose an initial ParserDFAState. Worker parsers then parse chunks independently.
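The parity core of that question can be shown in isolation. In this sketch (guess_initial_state is a hypothetical name), an odd count of quote characters before the chunk suggests it starts inside a quoted field; the real scanner also weighs newline evidence and ambiguity heuristics.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

enum class ParserDFAState { UNQUOTED, IN_QUOTED_FIELD };

// Quote-parity heuristic: an odd number of quote characters in the bytes
// preceding a chunk suggests the chunk begins inside a quoted field.
// Doubled quotes ("") contribute an even count, so they cancel out.
inline ParserDFAState guess_initial_state(std::string_view preceding_bytes) {
    size_t quotes = 0;
    for (char c : preceding_bytes)
        if (c == '"') ++quotes;
    return (quotes % 2 == 1) ? ParserDFAState::IN_QUOTED_FIELD
                             : ParserDFAState::UNQUOTED;
}
```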
The validator is the safety gate. It releases rows only after checking that the previous chunk's ending state matches the next chunk's expected starting state. If speculation was wrong, the validator repairs by reparsing the affected bytes with the correct state.
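The release gate reduces to a state comparison. A hedged sketch with invented names (ChunkOutcome, can_release):

```cpp
#include <cassert>

enum class DFAState { UNQUOTED, IN_QUOTED_FIELD };

struct ChunkOutcome {
    DFAState assumed_start;  // what the speculator guessed for this chunk
    DFAState actual_end;     // what parsing this chunk actually produced
};

// Rows from `cur` are safe to release only if the previous chunk really
// ended in the state the speculator assumed when parsing `cur`; otherwise
// the affected bytes must be reparsed with the correct state.
inline bool can_release(const ChunkOutcome& prev, const ChunkOutcome& cur) {
    return prev.actual_end == cur.assumed_start;
}
```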
After validation, rows are ordinary CSVRow objects. Consumer-side field access does not know whether the row came from serial parsing or speculative parsing.
One low-level ownership detail matters for debugging: each speculative chunk carries an owner shared pointer for the bytes it parsed. When the validator repairs by concatenating fragments, the repaired bytes receive their own owner. Either way, released CSVRow objects keep their backing bytes alive through RawCSVData.
Chunk boundaries can split a CSV record. This is especially common when quoted fields contain embedded newlines.
Speculative parsing represents edge pieces explicitly:
- prefix_fragment: bytes before the first record boundary in the chunk
- complete_rows: records fully contained in the chunk
- suffix_fragment: bytes after the last record boundary

The validator owns fragment stitching. That is where split rows become complete rows before being released to RowCollection.
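A toy model of the three pieces, assuming newline-delimited records with no quoting (ChunkPieces, split_chunk, and stitch are invented names):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

struct ChunkPieces {
    std::string prefix_fragment;             // partial record at chunk start
    std::vector<std::string> complete_rows;  // records fully inside the chunk
    std::string suffix_fragment;             // partial record at chunk end
};

// Classify a chunk's bytes into the three pieces, treating '\n' as the
// record terminator and ignoring quoting.
inline ChunkPieces split_chunk(std::string_view bytes) {
    ChunkPieces out;
    size_t first_nl = bytes.find('\n');
    if (first_nl == std::string_view::npos) {
        out.prefix_fragment = std::string(bytes);  // no complete record at all
        return out;
    }
    out.prefix_fragment = std::string(bytes.substr(0, first_nl));
    size_t pos = first_nl + 1;
    for (size_t nl; (nl = bytes.find('\n', pos)) != std::string_view::npos; pos = nl + 1)
        out.complete_rows.emplace_back(bytes.substr(pos, nl - pos));
    out.suffix_fragment = std::string(bytes.substr(pos));
    return out;
}

// The record split across the boundary is the left chunk's suffix joined
// with the right chunk's prefix.
inline std::string stitch(const ChunkPieces& left, const ChunkPieces& right) {
    return left.suffix_fragment + right.prefix_fragment;
}
```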
Serial parsing handles the same semantic problem through remainder/backtracking:
- MmapParser adjusts the next mmap offset.
- StreamParser stores leftover bytes and prepends them to the next read.

Parsed rows are pushed into RowCollection.
- In thread-enabled builds, the queue is a ThreadSafeDeque<CSVRow>.
- With CSV_ENABLE_THREADS=0, it is a SingleThreadDeque<CSVRow>.

CSVFormat::threading(false) changes scheduling, not the queue type. In a thread-enabled build, the reader still owns a ThreadSafeDeque<CSVRow>, but the read cycle runs synchronously and no background producer races the consumer. Swapping to SingleThreadDeque for this runtime opt-out would be a micro-optimization, not a correctness requirement.
Both queues satisfy the parser queue concept: push rows, append row batches, pop rows, drain rows, and expose wait/notify hooks. Diagnostic helpers such as ThreadSafeDeque::inspect() are intentionally not part of the shared queue contract.
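A minimal single-threaded queue satisfying that contract might look like this (SingleThreadQueueSketch is an invented name; the wait/notify hooks degrade to no-ops when there is no background producer):

```cpp
#include <cassert>
#include <deque>
#include <utility>
#include <vector>

// Invented sketch of the queue concept: push rows, append batches, pop,
// drain, and expose wait/notify hooks that do nothing without threads.
template <typename Row>
class SingleThreadQueueSketch {
public:
    void push(Row r) { q_.push_back(std::move(r)); }

    void append(std::vector<Row> batch) {
        for (Row& r : batch) q_.push_back(std::move(r));
    }

    bool pop(Row& out) {
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop_front();
        return true;
    }

    std::vector<Row> drain() {
        std::vector<Row> all(q_.begin(), q_.end());
        q_.clear();
        return all;
    }

    void wait() {}    // nothing to wait for without a producer thread
    void notify() {}  // nothing to wake up

private:
    std::deque<Row> q_;
};
```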
CSVRow is a lightweight view over shared RawCSVData.
When the user asks for a field:
the row looks up the field metadata, slices either RawCSVData::data or RawCSVData::quote_arena, applies trim if configured, and returns a CSVField.
CSVRow / CSVField split the remaining work this way:
- Field values are exposed as non-owning string_view slices.
- Doubled-quote unescaping is materialized into RawCSVData::quote_arena by CSVParserCore.
- CSVRow creates the field view.
- Type classification and conversion happen lazily in CSVField::type(), get<T>(), and try_get<T>().

This keeps parser throughput focused on boundary detection instead of string construction.
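The lazy, non-throwing side of that split can be sketched with std::from_chars (try_get_int is an invented helper, not the library's API): nothing is parsed until the caller asks, and failure is reported as an empty optional instead of an exception.

```cpp
#include <cassert>
#include <charconv>
#include <optional>
#include <string_view>

// Convert a field view to an integer only on demand; reject partial
// matches ("4x") and empty input instead of throwing.
inline std::optional<long long> try_get_int(std::string_view field) {
    long long value = 0;
    const char* first = field.data();
    const char* last = field.data() + field.size();
    auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc() || ptr != last)
        return std::nullopt;  // not a clean integer, or trailing junk
    return value;
}
```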
Expected copies:
- CSVField::get<std::string>() returns an owning string.
- Doubled-quote unescaping writes the unescaped bytes once into RawCSVData::quote_arena.

Expected non-copies:

- CSVRow copies are cheap shared ownership transfers.

Check these first when field data looks wrong:
- RawCSVData backing ownership lifetime.
- RawCSVData::quote_arena.
- CSVRow.
- CSVField.
- The CSVFormat::threading(false) path, which should change scheduling only, not row contents.

If the field boundary is wrong, start in CSVParserCore or the speculative fragment/validator path.
If the boundary is right but the string value is wrong, start in CSVRow, CSVField, or RawCSVData ownership/materialization.
If only large files fail, suspect chunk boundaries.
If only one constructor fails, suspect MmapParser or StreamParser.