Vince's CSV Parser
High Performance ETL

This page covers the highest-leverage ETL-oriented APIs in csv-parser.

The library supports both low-level batch control and a higher-level chunked parallel helper. Which one you choose mainly depends on whether you need to mutate a batch before analysis.

Two Common ETL Shapes

1. Read chunk -> build a batch <tt>DataFrame</tt> -> edit -> analyze

This is the most flexible path.

Use it when you want to:

  • keep memory bounded
  • apply sparse edits to a batch before analysis
  • perform row-wise ETL tasks such as null-ish value coercion
  • run group_by(), column(), or column_parallel_apply() on just the rows in the current chunk
std::istringstream input(
    "id,name,title,responsibilities\n"
    "1,Ada,,Makes sure the framework can fit one more config file into your repo\n"
    "2,Brendan,n/a,Makes sure customers spend as much server compute as possible\n"
    "3,Casey,Platform SRE,NULL\n"
    "4,Drew,NULL,Reminds everyone that a build is not done until analytics can over-explain it\n"
    "5,Emery,Frontend Cloud Liaison,\n"
    "6,Fin,na,Writes dashboards that imply the outage was actually a growth event\n"
    "7,Gale,Deployment Therapist,n/a\n"
    "8,Harper,none,Convinces functions to run longer in the name of customer love\n"
    "9,Indy,Preview Environment Curator,none\n"
    "10,Jules,Performance Marketing Engineer,NA\n"
);
CSVReader reader(input);
std::vector<CSVRow> rows;
REQUIRE(reader.read_chunk(rows, 10));
DataFrame<> batch(std::move(rows));

// You can edit the DataFrame before processing
batch[0]["title"] = "Developer Experience Engineer";
batch[2]["responsibilities"] = "Keeps preview deployments alive through sheer caffeine density";
batch[7]["responsibilities"] = "";

auto is_nullish = [](std::string value) {
    std::transform(value.begin(), value.end(), value.begin(), [](unsigned char ch) {
        return static_cast<char>(std::tolower(ch));
    });
    return value.empty() || value == "null" || value == "n/a" || value == "na" || value == "none";
};
auto normalize_nullish = [&is_nullish](const std::string& value) {
    return is_nullish(value) ? std::string() : value;
};

const DataFrame<>& read_only_batch = batch;
const size_t title_index = batch.index_of("title");
const size_t responsibilities_index = batch.index_of("responsibilities");
const std::vector<size_t> selected_columns = { title_index, responsibilities_index };
std::vector<std::vector<std::string>> coalesced(selected_columns.size());
DataFrameExecutor executor(2);

// Perform null-coalescing on selected columns in parallel
batch.column_parallel_apply(executor, selected_columns,
    [&read_only_batch, &normalize_nullish, &coalesced, title_index](DataFrame<>::column_type column) {
        auto& resolved_values = coalesced[column.index() == title_index ? 0 : 1];
        resolved_values.reserve(column.size());
        for (size_t row_index = 0; row_index < column.size(); ++row_index) {
            resolved_values.push_back(normalize_nullish(
                read_only_batch.at(row_index)[column.name()].get<std::string>()
            ));
        }
    }
);

const std::vector<std::string> expected_titles = {
    "Developer Experience Engineer",
    "",
    "Platform SRE",
    "",
    "Frontend Cloud Liaison",
    "",
    "Deployment Therapist",
    "",
    "Preview Environment Curator",
    "Performance Marketing Engineer"
};
const std::vector<std::string> expected_responsibilities = {
    "Makes sure the framework can fit one more config file into your repo",
    "Makes sure customers spend as much server compute as possible",
    "Keeps preview deployments alive through sheer caffeine density",
    "Reminds everyone that a build is not done until analytics can over-explain it",
    "",
    "Writes dashboards that imply the outage was actually a growth event",
    "",
    "",
    "",
    ""
};

REQUIRE(coalesced.size() == selected_columns.size());
REQUIRE(coalesced[0] == expected_titles);
REQUIRE(coalesced[1] == expected_responsibilities);
REQUIRE(batch[0]["title"].get<std::string>() == "Developer Experience Engineer");
REQUIRE(batch[2]["responsibilities"].get<std::string>() == "Keeps preview deployments alive through sheer caffeine density");
REQUIRE(batch[7]["responsibilities"].get<std::string>().empty());

The example above shows a very common ETL shape:

  • read a bounded chunk from an in-memory or file-backed source
  • apply sparse overlay edits to normalize selected cells
  • normalize null-ish values to empty strings
  • schedule work only for the columns that matter
  • keep the result in explicit worker-owned state

This pattern works well because DataFrame is a short-lived batch bridge:

  • CSVReader streams rows into a caller-owned std::vector<CSVRow>
  • DataFrame wraps that batch
  • sparse overlay edits are applied only where needed
  • parallel analysis runs against the edited view
  • the batch is discarded before the next chunk

That design avoids many of the lifetime and aliasing problems that show up when parallelism is layered onto a long-lived mutable table.
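The loop shape behind this design can be sketched with only the standard library. Here, plain std::getline stands in for CSVReader::read_chunk and the per-chunk vector stands in for DataFrame; count_rows_chunked is an illustrative name, not a csv-parser API:

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Bounded-chunk loop: read up to chunk_size rows, process the batch,
// then discard it before reading the next chunk.
std::size_t count_rows_chunked(std::istream& input, std::size_t chunk_size) {
    std::size_t total_rows = 0;
    while (true) {
        // 1. read a bounded chunk into caller-owned storage
        std::vector<std::string> chunk;
        std::string line;
        while (chunk.size() < chunk_size && std::getline(input, line)) {
            chunk.push_back(line);
        }
        if (chunk.empty()) break;

        // 2-4. materialize, optionally edit, analyze the batch
        total_rows += chunk.size();

        // `chunk` is destroyed here, so memory stays bounded
    }
    return total_rows;
}
```

Because each batch dies at the end of its loop iteration, nothing parallel ever outlives the data it was scheduled against.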

2. Use <tt>chunk_parallel_apply()</tt> directly

This is the common-case helper when you do not need to mutate each batch before processing.

chunk_parallel_apply():

  • reads from CSVReader in bounded chunks
  • wraps each chunk in a temporary DataFrame
  • runs a per-column callback using DataFrameExecutor
  • can either accumulate results in one explicit state object per column or let you manage output storage externally via DataFrameColumn::index()
  • can target all columns or just a selected subset by column index

This is usually the shortest path for:

  • schema inference
  • column summaries
  • frequency counts
  • per-column aggregation passes
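The "one state object per column" idea can be sketched without csv-parser at all. The frequency-count pass below runs one std::thread per column, each writing only to its own map; the Column alias and column_frequencies name are illustrative, not library API:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <thread>
#include <vector>

using Column = std::vector<std::string>;

// One worker and one state object per column, so workers never need
// to synchronize with each other.
std::vector<std::map<std::string, int>>
column_frequencies(const std::vector<Column>& columns) {
    std::vector<std::map<std::string, int>> counts(columns.size());
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < columns.size(); ++i) {
        // each worker touches only counts[i], its own per-column state
        workers.emplace_back([&columns, &counts, i] {
            for (const auto& value : columns[i]) ++counts[i][value];
        });
    }
    for (auto& w : workers) w.join();
    return counts;
}
```

The important property is that `counts` is pre-sized before any worker starts, so no worker ever triggers a reallocation that another worker could observe.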

Choosing Between <tt>read_chunk()</tt> and <tt>chunk_parallel_apply()</tt>

Use read_chunk() when:

  • you need to edit the batch before analysis
  • you want to filter or promote only some rows into a DataFrame
  • you need custom batch-level orchestration

Use chunk_parallel_apply() when:

  • the chunk can be treated as read-only
  • you want the simplest path to chunked parallel column processing
  • one state object per column is a natural fit for the workload

Thread-Safety Notes

DataFrameExecutor is designed around batch-scoped DataFrame objects, which keeps the parallel story much cleaner than a long-lived shared table.

Safe patterns:

  • reading from the provided column view
  • reading from explicit references to batch state captured by the caller
  • updating only the worker's own per-column state object
  • applying sparse-overlay cell edits through DataFrameRow / DataFrameCell when workers are updating row data in place

Unsafe patterns:

  • structural mutation during parallel work, such as erase()
  • relying on conflicting concurrent writes to the same cells unless last-write-wins behavior is acceptable

For simple cell updates, the row-local overlay lock is there to make rare collisions boring rather than catastrophic. For structural changes, stay on the caller thread.
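That "boring rather than catastrophic" property can be illustrated with a plain std::mutex. csv-parser's overlay lock is internal; this Cell type is a stand-in for the idea, not library API:

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <thread>

// A cell whose writes are serialized by a lock: concurrent set() calls
// race on *which* value wins, but the stored string is never torn.
struct Cell {
    std::mutex lock;
    std::string value;

    void set(const std::string& v) {
        std::lock_guard<std::mutex> guard(lock);
        value = v;  // atomic with respect to other set() calls
    }
};
```

Two workers writing the same cell leave it holding one writer's value intact (last-write-wins), which is exactly the acceptable-collision case described above; structural changes like erase() have no such guarantee and belong on the caller thread.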

Why This Matters

Many ETL workflows fundamentally look like this:

  1. read a chunk
  2. materialize usable row objects
  3. perform a small number of transformations or edits
  4. aggregate or emit

csv-parser is optimized for exactly that shape. It is not just a parser; it also provides the batch-bridge and parallel column-processing pieces that would otherwise need to be hand-rolled on top of a lower-level CSV reader.