Vince's CSV Parser
csv_tuning is a small benchmark helper for finding a good chunk size and speculative parser worker count for your machine and dataset.
It is useful when you want to pick these parameters empirically for a specific machine and dataset rather than guessing. csv_tuning repeatedly parses one input file with a matrix of chunk sizes and worker counts.
The output is CSV so it can be pasted into a spreadsheet, plotted, or compared between machines.
Example output columns include:
chunk_bytes, requested_threads, parser_threads, seconds, MiB_per_s, rows, columns, spec_chunks, ambiguous, and repairs.

Useful options:
--threads 0 means "choose automatically", matching CSVFormat::speculative_parallel_threads(0).
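The "choose automatically" rule can be sketched as resolving 0 to the hardware thread count. This is a minimal illustration, not the library's implementation; the helper name resolve_worker_count is hypothetical.

```cpp
#include <algorithm>
#include <thread>

// Resolve a requested worker count: 0 means "choose automatically",
// which this sketch maps to the hardware thread count, falling back
// to 1 when it cannot be detected. resolve_worker_count is a
// hypothetical helper, not part of csv_tuning or the library.
unsigned resolve_worker_count(unsigned requested) {
    if (requested != 0) return requested;
    return std::max(1u, std::thread::hardware_concurrency());
}
```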
For most workloads, start with the defaults.
Then look for the smallest configuration that reaches near-peak throughput. Using every hardware thread is not always best for medium-sized files; fewer workers may reduce scheduling overhead and leave more CPU for downstream ETL work.
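The selection rule above can be sketched as: from the measured rows, take the cheapest configuration whose throughput is within some fraction of the observed peak. The Measurement struct and the 95% cutoff below are illustrative assumptions, not part of csv_tuning's output or API.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One measured row from a csv_tuning run. Field names mirror the
// output columns; the struct itself is illustrative, not library API.
struct Measurement {
    std::size_t chunk_bytes;
    unsigned parser_threads;
    double mib_per_s;
};

// Return the cheapest configuration (fewest threads, then smallest
// chunk) whose throughput is within `fraction` of the peak. The 0.95
// default is an arbitrary cutoff chosen for illustration.
Measurement smallest_near_peak(std::vector<Measurement> rows,
                               double fraction = 0.95) {
    double peak = 0.0;
    for (const auto& r : rows) peak = std::max(peak, r.mib_per_s);

    // Sort so the cheapest configurations are considered first.
    std::sort(rows.begin(), rows.end(),
              [](const Measurement& a, const Measurement& b) {
                  if (a.parser_threads != b.parser_threads)
                      return a.parser_threads < b.parser_threads;
                  return a.chunk_bytes < b.chunk_bytes;
              });

    for (const auto& r : rows)
        if (r.mib_per_s >= fraction * peak) return r;
    return rows.back();  // only reached if rows is non-empty but oddly scaled
}
```

With a rule like this, a 2-thread run at 920 MiB/s beats an 8-thread run at 950 MiB/s, because it clears 95% of the peak at a quarter of the workers.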
The library default enables speculative parsing at 50 MB when runtime threading is enabled. You can lower or raise that threshold.
For tiny CSVs, use threading(false) or set a higher threshold. For bulk ETL on large files, lower thresholds and explicit worker counts may be worth testing.
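The threshold rule described above can be sketched as a predicate: speculative parsing engages only when threading is enabled and the input reaches the threshold size. The function name use_speculative is hypothetical, and interpreting the documented 50 MB as binary megabytes is an assumption.

```cpp
#include <cstddef>

// Default threshold at which speculative parsing kicks in, mirroring
// the documented 50 MB library default (byte interpretation assumed).
constexpr std::size_t kDefaultSpeculativeThreshold = 50ull * 1024 * 1024;

// Hypothetical predicate for the documented rule: speculative parsing
// is used only when runtime threading is enabled and the file is at
// least the threshold size.
bool use_speculative(std::size_t file_bytes,
                     bool threading_enabled,
                     std::size_t threshold = kDefaultSpeculativeThreshold) {
    return threading_enabled && file_bytes >= threshold;
}
```

Under this rule, raising the threshold (or disabling threading) keeps tiny CSVs on the single-threaded path, while lowering it pushes more of a bulk ETL workload through the speculative parser.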