The library supports both low-level batch control and a higher-level chunked parallel helper. Which one to choose comes down mainly to whether you need to mutate a batch before analysis.
The low-level batch path is the most flexible: read a chunk, mutate it in place on the caller thread, then hand a read-only view to the parallel workers.
// Sample batch with mixed null sentinels: "", "NULL", "n/a", "na", "NA", "none".
std::istringstream input(
    "id,name,title,responsibilities\n"
    "1,Ada,,Makes sure the framework can fit one more config file into your repo\n"
    "2,Brendan,n/a,Makes sure customers spend as much server compute as possible\n"
    "3,Casey,Platform SRE,NULL\n"
    "4,Drew,NULL,Reminds everyone that a build is not done until analytics can over-explain it\n"
    "5,Emery,Frontend Cloud Liaison,\n"
    "6,Fin,na,Writes dashboards that imply the outage was actually a growth event\n"
    "7,Gale,Deployment Therapist,n/a\n"
    "8,Harper,none,Convinces functions to run longer in the name of customer love\n"
    "9,Indy,Preview Environment Curator,none\n"
    "10,Jules,Performance Marketing Engineer,NA\n"
);
CSVReader reader(input);
std::vector<CSVRow> rows;
REQUIRE(reader.read_chunk(rows, 10));
DataFrame<> batch(std::move(rows));

// Mutate the batch on the caller thread, before any parallel work starts.
batch[0]["title"] = "Developer Experience Engineer";
batch[2]["responsibilities"] = "Keeps preview deployments alive through sheer caffeine density";
batch[7]["responsibilities"] = "";
// Case-insensitive check for the null sentinels present in the sample data.
auto is_nullish = [](std::string value) {
    std::transform(value.begin(), value.end(), value.begin(), [](unsigned char ch) {
        return static_cast<char>(std::tolower(ch));
    });
    return value.empty() || value == "null" || value == "n/a" || value == "na" || value == "none";
};
auto normalize_nullish = [&is_nullish](const std::string& value) {
    return is_nullish(value) ? std::string() : value;
};
// Workers see only a const view, and each selected column writes into its
// own preallocated slot, so no two tasks ever touch the same memory.
const DataFrame<>& read_only_batch = batch;
const size_t title_index = batch.index_of("title");
const size_t responsibilities_index = batch.index_of("responsibilities");
const std::vector<size_t> selected_columns = { title_index, responsibilities_index };
std::vector<std::vector<std::string>> coalesced(selected_columns.size());

DataFrameExecutor executor(2);
batch.column_parallel_apply(executor, selected_columns,
    [&read_only_batch, &normalize_nullish, &coalesced, title_index](DataFrame<>::column_type column) {
        auto& resolved_values = coalesced[column.index() == title_index ? 0 : 1];
        resolved_values.reserve(column.size());
        for (size_t row_index = 0; row_index < column.size(); ++row_index) {
            resolved_values.push_back(normalize_nullish(
                read_only_batch.at(row_index)[column.name()].get<std::string>()
            ));
        }
    }
);
const std::vector<std::string> expected_titles = {
    "Developer Experience Engineer",
    "",
    "Platform SRE",
    "",
    "Frontend Cloud Liaison",
    "",
    "Deployment Therapist",
    "",
    "Preview Environment Curator",
    "Performance Marketing Engineer"
};
const std::vector<std::string> expected_responsibilities = {
    "Makes sure the framework can fit one more config file into your repo",
    "Makes sure customers spend as much server compute as possible",
    "Keeps preview deployments alive through sheer caffeine density",
    "Reminds everyone that a build is not done until analytics can over-explain it",
    "",
    "Writes dashboards that imply the outage was actually a growth event",
    "",
    "",
    "",
    ""
};
REQUIRE(coalesced.size() == selected_columns.size());
REQUIRE(coalesced[0] == expected_titles);
REQUIRE(coalesced[1] == expected_responsibilities);
REQUIRE(batch[0]["title"].get<std::string>() == "Developer Experience Engineer");
REQUIRE(batch[2]["responsibilities"].get<std::string>() == "Keeps preview deployments alive through sheer caffeine density");
REQUIRE(batch[7]["responsibilities"].get<std::string>().empty());
That design avoids many of the lifetime and aliasing problems that show up when parallelism is layered onto a long-lived mutable table.
The chunked parallel helper is the common-case path when you do not need to mutate each batch before processing.
For simple cell updates, the row-local overlay lock exists to make rare collisions boring rather than catastrophic. For structural changes, stay on the caller thread.