Benchmarks
The Python benchmark helpers live under python/benchmarks/.
Compare reader throughput:
python python/benchmarks/compare_readers.py path/to/input.csv
Compare the streaming EDA example against pandas with the pyarrow CSV engine:
python python/benchmarks/compare_eda.py path/to/input.csv
Compare a streaming fastpycsv filter against pyarrow on a Used Cars-style
workload that keeps only rows where region == "el paso":
python python/benchmarks/compare_filter.py path/to/vehicles.csv
Compare selected-column CSV-to-NumPy materialization against pyarrow:
python python/benchmarks/compare_numpy_materialization.py path/to/vehicles.csv
Compare Python object materialization against pyarrow and Polars:
python python/benchmarks/compare_python_materialization.py path/to/vehicles.csv
This compares full CSV and first+last-column subset materialization for
row-oriented list[dict] outputs and column-oriented dict[str, list] outputs.
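The two output shapes differ only in orientation; a minimal illustration with made-up data:

```python
# Row-oriented output: one dict per CSV record.
rows = [
    {"id": "1", "price": "5000"},
    {"id": "2", "price": "7500"},
]

# Column-oriented output: one list per CSV column.
columns = {
    "id": ["1", "2"],
    "price": ["5000", "7500"],
}

# The shapes carry the same data; producing one or the other directly
# from the CSV bytes is what the benchmark measures.
as_columns = {key: [row[key] for row in rows] for key in rows[0]}
```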
The NumPy materialization benchmark includes pyarrow’s direct CSV reader,
pyarrow’s Dataset/Scanner path, pyarrow’s streaming CSV reader, and Polars lazy
scanning. The Dataset row is included because it is pyarrow’s closest API for
parallel predicate/projection work, but it is also more configuration-sensitive
than pyarrow.csv.read_csv(): without explicit column types, pyarrow Dataset
may infer a narrow integer type from early rows and later fail on wider values.
The benchmark pins obvious numeric columns such as price, year, and
odometer to avoid measuring that crash path instead of CSV/filter throughput.
The helpers run workloads in fresh Python processes and report wall time, throughput, and peak resident or working-set memory.
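A stdlib-only sketch of that measurement pattern (Unix-specific; on Windows the helpers would have to read working-set memory differently, and note that ru_maxrss is kilobytes on Linux but bytes on macOS):

```python
import resource
import subprocess
import sys
import time

# Run the workload in a fresh interpreter so import costs and allocator
# state from earlier runs do not pollute the measurement.
start = time.perf_counter()
subprocess.run(
    [sys.executable, "-c", "print('workload goes here')"],
    check=True,
)
wall = time.perf_counter() - start

# Peak resident memory across finished child processes (Unix only).
peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
```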
Run the benchmarks with the same Python version that built the fastpycsv
extension; for example, a cp310 (CPython 3.10) extension will not import
under Python 3.14.
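To see which ABI the current interpreter expects, you can print its SOABI tag; a compatible extension carries a matching tag in its filename (the filename shown in the comment is hypothetical):

```python
import sysconfig

# e.g. "cpython-310-x86_64-linux-gnu" on CPython 3.10. An extension
# built for that interpreter would be named something like
# _fastpycsv.cpython-310-x86_64-linux-gnu.so, and the "310" must match.
soabi = sysconfig.get_config_var("SOABI")
print(soabi)
```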