Vince's CSV Parser
High Performance ETL

This page covers the highest-leverage ETL-oriented APIs in csv-parser.

The library supports both low-level batch control and a higher-level chunked parallel helper. Which one you choose mainly depends on whether you need to mutate a batch before analysis.

Two Common ETL Shapes

1. Read chunk -> build a batch <tt>DataFrame</tt> -> edit -> analyze

This is the most flexible path.

Use it when you want to:

  • keep memory bounded
  • apply sparse edits to a batch before analysis
  • perform row-wise ETL tasks such as null-ish value coercion
  • run group_by(), column(), or column_parallel_apply() on just the rows in the current chunk
std::istringstream input(
    "id,name,title,responsibilities\n"
    "1,Ada,,Makes sure the framework can fit one more config file into your repo\n"
    "2,Brendan,n/a,Makes sure customers spend as much server compute as possible\n"
    "3,Casey,Platform SRE,NULL\n"
    "4,Drew,NULL,Reminds everyone that a build is not done until analytics can over-explain it\n"
    "5,Emery,Frontend Cloud Liaison,\n"
    "6,Fin,na,Writes dashboards that imply the outage was actually a growth event\n"
    "7,Gale,Deployment Therapist,n/a\n"
    "8,Harper,none,Convinces functions to run longer in the name of customer love\n"
    "9,Indy,Preview Environment Curator,none\n"
    "10,Jules,Performance Marketing Engineer,NA\n"
);
CSVReader reader(input);
std::vector<CSVRow> rows;
REQUIRE(reader.read_chunk(rows, 10));
DataFrame<> batch(std::move(rows));

// You can edit the DataFrame before processing
batch[0]["title"] = "Developer Experience Engineer";
batch[2]["responsibilities"] = "Keeps preview deployments alive through sheer caffeine density";
batch[7]["responsibilities"] = "";

auto is_nullish = [](std::string value) {
    std::transform(value.begin(), value.end(), value.begin(), [](unsigned char ch) {
        return static_cast<char>(std::tolower(ch));
    });
    return value.empty() || value == "null" || value == "n/a" || value == "na" || value == "none";
};
auto normalize_nullish = [&is_nullish](const std::string& value) {
    return is_nullish(value) ? std::string() : value;
};

const DataFrame<>& read_only_batch = batch;
const size_t title_index = batch.index_of("title");
const size_t responsibilities_index = batch.index_of("responsibilities");
const std::vector<size_t> selected_columns = { title_index, responsibilities_index };
std::vector<std::vector<std::string>> coalesced(selected_columns.size());
DataFrameExecutor executor(2);

// Perform null-coalescing on selected columns in parallel
batch.column_parallel_apply(executor, selected_columns,
    [&read_only_batch, &normalize_nullish, &coalesced, title_index](DataFrame<>::column_type column) {
        auto& resolved_values = coalesced[column.index() == title_index ? 0 : 1];
        resolved_values.reserve(column.size());
        for (size_t row_index = 0; row_index < column.size(); ++row_index) {
            resolved_values.push_back(normalize_nullish(
                read_only_batch.at(row_index)[column.name()].get<std::string>()
            ));
        }
    }
);

const std::vector<std::string> expected_titles = {
    "Developer Experience Engineer",
    "",
    "Platform SRE",
    "",
    "Frontend Cloud Liaison",
    "",
    "Deployment Therapist",
    "",
    "Preview Environment Curator",
    "Performance Marketing Engineer"
};
const std::vector<std::string> expected_responsibilities = {
    "Makes sure the framework can fit one more config file into your repo",
    "Makes sure customers spend as much server compute as possible",
    "Keeps preview deployments alive through sheer caffeine density",
    "Reminds everyone that a build is not done until analytics can over-explain it",
    "",
    "Writes dashboards that imply the outage was actually a growth event",
    "",
    "",
    "",
    ""
};

REQUIRE(coalesced.size() == selected_columns.size());
REQUIRE(coalesced[0] == expected_titles);
REQUIRE(coalesced[1] == expected_responsibilities);
REQUIRE(batch[0]["title"].get<std::string>() == "Developer Experience Engineer");
REQUIRE(batch[2]["responsibilities"].get<std::string>() == "Keeps preview deployments alive through sheer caffeine density");
REQUIRE(batch[7]["responsibilities"].get<std::string>().empty());

The example above shows a very common ETL shape:

  • read a bounded chunk from an in-memory or file-backed source
  • apply sparse overlay edits to normalize selected cells
  • normalize null-ish values to empty strings
  • schedule work only for the columns that matter
  • keep the result in explicit worker-owned state

This pattern works well because DataFrame is a short-lived batch bridge:

  • CSVReader streams rows into a caller-owned std::vector<CSVRow>
  • DataFrame wraps that batch
  • sparse overlay edits are applied only where needed
  • parallel analysis runs against the edited view
  • the batch is discarded before the next chunk

That design avoids many of the lifetime and aliasing problems that show up when parallelism is layered onto a long-lived mutable table.
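The loop shape behind this design can be sketched with only the standard library. Here, plain std::getline stands in for CSVReader::read_chunk and the per-chunk vector stands in for DataFrame; count_rows_chunked is an illustrative name, not a csv-parser API:

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Bounded-chunk loop: read up to chunk_size rows, process the batch,
// then discard it before reading the next chunk.
std::size_t count_rows_chunked(std::istream& input, std::size_t chunk_size) {
    std::size_t total_rows = 0;
    while (true) {
        // 1. read a bounded chunk into caller-owned storage
        std::vector<std::string> chunk;
        std::string line;
        while (chunk.size() < chunk_size && std::getline(input, line)) {
            chunk.push_back(line);
        }
        if (chunk.empty()) break;

        // 2-4. materialize, optionally edit, analyze the batch
        total_rows += chunk.size();

        // `chunk` is destroyed here, so memory stays bounded
    }
    return total_rows;
}
```

Because each batch dies at the end of its loop iteration, nothing parallel ever outlives the data it was scheduled against.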

2. Use <tt>chunk_parallel_apply()</tt> directly

This is the common-case helper when you do not need to mutate each batch before processing.

chunk_parallel_apply():

  • reads from CSVReader in bounded chunks
  • wraps each chunk in a temporary DataFrame
  • runs a per-column callback using DataFrameExecutor
  • can either accumulate results in one explicit state object per column or let you manage output storage externally via DataFrameColumn::index()
  • can target all columns or just a selected subset by column index

This is usually the shortest path for:

  • schema inference
  • column summaries
  • frequency counts
  • per-column aggregation passes
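The "one state object per column" idea can be sketched without csv-parser at all. The frequency-count pass below runs one std::thread per column, each writing only to its own map; the Column alias and column_frequencies name are illustrative, not library API:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <thread>
#include <vector>

using Column = std::vector<std::string>;

// One worker and one state object per column, so workers never need
// to synchronize with each other.
std::vector<std::map<std::string, int>>
column_frequencies(const std::vector<Column>& columns) {
    std::vector<std::map<std::string, int>> counts(columns.size());
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < columns.size(); ++i) {
        // each worker touches only counts[i], its own per-column state
        workers.emplace_back([&columns, &counts, i] {
            for (const auto& value : columns[i]) ++counts[i][value];
        });
    }
    for (auto& w : workers) w.join();
    return counts;
}
```

The important property is that `counts` is pre-sized before any worker starts, so no worker ever triggers a reallocation that another worker could observe.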

Choosing Between <tt>read_chunk()</tt> and <tt>chunk_parallel_apply()</tt>

Use read_chunk() when:

  • you need to edit the batch before analysis
  • you want to filter or promote only some rows into a DataFrame
  • you need custom batch-level orchestration

Use chunk_parallel_apply() when:

  • the chunk can be treated as read-only
  • you want the simplest path to chunked parallel column processing
  • one state object per column is a natural fit for the workload

Thread-Safety Notes

DataFrameExecutor is designed around batch-scoped DataFrame objects, which keeps the parallel story much cleaner than a long-lived shared table.

Safe patterns:

  • reading from the provided column view
  • reading from explicit references to batch state captured by the caller
  • updating only the worker's own per-column state object
  • applying sparse-overlay cell edits through DataFrameRow / DataFrameCell when workers are updating row data in place

Unsafe patterns:

  • structural mutation during parallel work, such as erase()
  • relying on conflicting concurrent writes to the same cells unless last-write-wins behavior is acceptable

For simple cell updates, the row-local overlay lock is there to make rare collisions boring rather than catastrophic. For structural changes, stay on the caller thread.
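That "boring rather than catastrophic" property can be illustrated with a plain std::mutex. csv-parser's overlay lock is internal; this Cell type is a stand-in for the idea, not library API:

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <thread>

// A cell whose writes are serialized by a lock: concurrent set() calls
// race on *which* value wins, but the stored string is never torn.
struct Cell {
    std::mutex lock;
    std::string value;

    void set(const std::string& v) {
        std::lock_guard<std::mutex> guard(lock);
        value = v;  // atomic with respect to other set() calls
    }
};
```

Two workers writing the same cell leave it holding one writer's value intact (last-write-wins), which is exactly the acceptable-collision case described above; structural changes like erase() have no such guarantee and belong on the caller thread.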

Why This Matters

Many ETL workflows fundamentally look like this:

  1. read a chunk
  2. materialize usable row objects
  3. perform a small number of transformations or edits
  4. aggregate or emit

csv-parser is optimized for exactly that shape. It is not just a parser; it also provides the batch-bridge and parallel column-processing pieces that would otherwise need to be hand-rolled on top of a lower-level CSV reader.