API Reference¶
This page documents the stable Python-facing API. reader(), read_numpy(),
read_numpy_batches(), and write_csv() are the primary surface area. The
lower-level extension objects exist for users who need direct access to parser
concepts, but ordinary ETL code should not start there.
Primary API¶
| Function | Use it when |
|---|---|
| reader() | You want fast lazy row iteration and Pythonic row filtering. |
| read_numpy() | You want selected CSV columns as eager NumPy arrays. |
| read_numpy_batches() | You want selected CSV columns as bounded-memory NumPy batches. |
| write_csv() | You want to stream lazy rows or Python iterables back to CSV. |
fastpycsv.reader(csvfile, dialect="excel", **fmtparams)¶
Returns an iterator over lazy row objects.
By default, reader() consumes the first row as column names. Use
reader.fieldnames or reader.get_col_names() to retrieve those names before
or during iteration.
Pass consume_header=False for raw stdlib-style row iteration where the first
input row is emitted as data. Pass fieldnames=[...] to attach explicit column
names without consuming the first input row.
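For example, with a vehicles.csv whose first row is a header, the three header modes look like this (a sketch; headerless.csv is a hypothetical file without a header row):

import fastpycsv

# Default: the first input row becomes the column names.
reader = fastpycsv.reader("vehicles.csv")
print(reader.fieldnames)  # e.g. ['id', 'price', 'region', ...]

# Stdlib-style: the first input row is emitted as data.
for row in fastpycsv.reader("vehicles.csv", consume_header=False):
    print(list(row))  # first iteration yields the header fields
    break

# Explicit names: nothing is consumed from the input as a header.
for row in fastpycsv.reader("headerless.csv", fieldnames=["id", "price"]):
    print(row["price"])
    break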
Supported formatting options:
- delimiter: one-character delimiter. Defaults to ",".
- quotechar: one-character quote character, or None to disable quoting.
- doublequote: must be True.
- skipinitialspace: trim leading spaces after delimiters.
- strict: throw on variable-width rows.
- cast: return Python scalar values instead of strings.
- typed: alias for cast.
- consume_header: consume the first row as column names. Defaults to True.
- fieldnames: explicit column names. When provided, the first row is not consumed.
- batch_size: advanced performance hint for filtered or projected exports. Most users can ignore it.
Only the default excel dialect is currently supported. Unsupported dialect
features fail fast instead of silently diverging from stdlib behavior.
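As a sketch of how the format keywords compose (the semicolon-delimited orders.csv is hypothetical):

# Parse a semicolon-delimited file, trim spaces after delimiters,
# raise on ragged rows, and return typed Python scalars.
for row in fastpycsv.reader(
    "orders.csv",
    delimiter=";",
    skipinitialspace=True,
    strict=True,
    cast=True,
):
    consume(row)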
batch_size is not the same thing as .chunks(size). For ordinary lazy
iteration, reader() yields one row at a time no matter what batch_size is:
for row in fastpycsv.reader("vehicles.csv", batch_size=100_000):
    consume(row)  # still one row at a time
Use .chunks(size) on reader.lists(), reader.tuples(), or reader.dicts()
when you want Python lists of rows. batch_size only controls how many rows
fastpycsv asks the native parser to process at once while doing filtered or
projected materialization, such as .filter(...).dicts(...).all(). The default
is a good starting point; tune it only after benchmarking a large filtered
export.
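Where batch_size does apply is a filtered or projected materialization such as the one below; a sketch, with an illustrative rather than recommended value:

predicate = fastpycsv.less("price", "15000")

# batch_size shapes how many rows the native parser handles per internal
# round-trip while materializing the filtered, projected result.
rows = (
    fastpycsv.reader("vehicles.csv", batch_size=200_000)
    .filter(predicate)
    .dicts(["id", "price"])
    .all()
)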
Use reader.filter(predicate) to apply native row filtering before lazy or
materialized rows are emitted:
predicate = fastpycsv.all_of(
    fastpycsv.equal("region", "el paso", case_sensitive=False),
    fastpycsv.less("price", "15000"),
)
for row in fastpycsv.reader("vehicles.csv").filter(predicate):
    consume(row)

batches = (
    fastpycsv.reader("vehicles.csv")
    .filter(predicate)
    .dicts(["id", "price"])
    .chunks(50_000)
)
Predicates are created with fastpycsv.equal(), less(), less_equal(),
greater(), greater_equal(), and all_of(). Predicate column names are
validated once against the reader’s column names before iteration continues.
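A sketch of a range-style predicate built from the comparison constructors (the year and odometer bounds are illustrative):

# Keep rows where 2015 <= year <= 2020 and odometer < 100000.
predicate = fastpycsv.all_of(
    fastpycsv.greater_equal("year", "2015"),
    fastpycsv.less_equal("year", "2020"),
    fastpycsv.less("odometer", "100000"),
)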
Calling filter() more than once combines predicates with all_of() by
default. Pass append=False to replace the current predicate:
reader = reader.filter(new_predicate, append=False)
Lazy Row Objects¶
Rows returned by reader() support:
- integer indexing: row[0]
- column-name indexing when headers are available: row["name"]
- iteration: list(row)
- len(row)
- row.as_list(columns=None); pass column names to materialize a subset
- row.as_tuple(columns=None); pass column names to materialize a subset
- row.as_dict(columns=None); pass column names to materialize a subset
- typed access helpers: get_str, get_int, get_float, get_bool
- row.type(index) for native scalar classification
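A sketch of the lazy row surface, assuming vehicles.csv has id and price columns:

for row in fastpycsv.reader("vehicles.csv"):
    first = row[0]        # integer indexing
    price = row["price"]  # column-name indexing via the header
    fields = list(row)    # iteration over field values
    assert len(fields) == len(row)
    subset = row.as_dict(["id", "price"])  # materialize a named subset
    break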
Materialized Row Iterators¶
Use these when you want plain Python row objects but still want bounded-memory streaming:

- reader.lists(columns=None)
- reader.tuples(columns=None)
- reader.dicts(columns=None)

Each materialized iterator supports:

- normal iteration, yielding one materialized row at a time
- .chunks(size), yielding list batches of materialized rows
- .all(), consuming the remaining rows into one Python list
reader.lists(columns=None)¶
Use this when the downstream API expects mutable row lists.
import fastpycsv
for row in fastpycsv.reader("vehicles.csv").lists(["id", "price"]):
    assert isinstance(row, list)
    send_to_api(row)
reader.tuples(columns=None)¶
Use this when the downstream API expects fixed-shape rows.
rows = fastpycsv.reader("vehicles.csv").tuples(["id", "year"]).all()
# [('1', '2021'), ('2', '2020'), ...]
reader.dicts(columns=None)¶
Use this when the downstream API wants named fields per row.
rows = fastpycsv.reader("vehicles.csv").dicts(["id", "price"]).all()
# [{'id': '1', 'price': '9000'}, {'id': '2', 'price': '12000'}, ...]
.chunks(size)¶
Use chunks when you want plain Python objects but need bounded peak memory.
for batch in fastpycsv.reader("vehicles.csv").dicts(["id", "price"]).chunks(50_000):
    assert isinstance(batch, list)
    bulk_insert(batch)
Filtering Before Materialization¶
Native predicates compose with materialized row iterators, so filtering can stay in C++ before rows become Python objects.
predicate = fastpycsv.all_of(
    fastpycsv.equal("region", "el paso", case_sensitive=False),
    fastpycsv.less("price", 10_000),
)
for batch in (
    fastpycsv.reader("vehicles.csv")
    .filter(predicate)
    .dicts(["id", "price", "region"])
    .chunks(10_000)
):
    send_to_api(batch)
Column Selection¶
Column selection controls both output order and shape.
reader = fastpycsv.reader("vehicles.csv")
reader.lists(["price", "id"]).all()
# [['9000', '1'], ['12000', '2'], ...]
reader.tuples(["price", "id"]).all()
# [('9000', '1'), ('12000', '2'), ...]
reader.dicts(["price", "id"]).all()
# [{'price': '9000', 'id': '1'}, {'price': '12000', 'id': '2'}, ...]
Bulk Convenience Methods¶
The older convenience methods reader.to_lists(columns=None),
reader.to_tuples(columns=None), and reader.to_dicts(columns=None) are
equivalent to calling .all() on the corresponding materialized iterator.
reader = fastpycsv.reader("vehicles.csv")
reader.to_lists(["id", "price"])
# reader.lists(["id", "price"]).all()
reader.to_tuples(["id", "price"])
# reader.tuples(["id", "price"]).all()
reader.to_dicts(["id", "price"])
# reader.dicts(["id", "price"]).all()
fastpycsv.read_numpy(path, columns=None, *, cast=True, predicate=None, **fmtparams)¶
Parses selected columns into NumPy arrays keyed by column name.
Use this when the target is pandas, NumPy, or another column-oriented consumer:
arrays = fastpycsv.read_numpy("vehicles.csv", columns=["price", "year", "odometer"])
predicate may be a native fastpycsv predicate such as equal(), less(), or
all_of(). With cast=True, fastpycsv classifies scalar fields and maps them to
NumPy-friendly column types.
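A sketch combining native filtering with eager column extraction; the pandas conversion at the end is one common consumer, not part of fastpycsv:

import pandas as pd

predicate = fastpycsv.less("price", "15000")
arrays = fastpycsv.read_numpy(
    "vehicles.csv",
    columns=["price", "year"],
    predicate=predicate,
    cast=True,
)
df = pd.DataFrame(arrays)  # dict of column name -> NumPy array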
read_numpy() accepts the same CSV format keywords as reader(), including
delimiter, quotechar, skipinitialspace, strict, consume_header, and
fieldnames.
See NumPy and pandas for dtype behavior and batching details.
fastpycsv.read_numpy_batches(path, columns=None, *, predicate=None, cast=True, batch_size=50000, schema="sample", **fmtparams)¶
Streams dictionaries of NumPy arrays. This is the bounded-memory version of
read_numpy().
for arrays in fastpycsv.read_numpy_batches("vehicles.csv", columns=["price", "year"]):
    consume(arrays)
schema controls dtype inference:
"sample": infer once from the first bounded batch, then stream once."global": pre-scan the file for stable full-file dtypes."batch": infer each emitted batch independently.
read_numpy_batches() accepts the same CSV format keywords as reader() and
read_numpy().
For both NumPy readers, path and columns may be positional. Filtering,
casting, batching, schema, and CSV format options are keyword-only.
fastpycsv.write_csv(csvfile, rows, **options)¶
Writes CSV rows to a path-like destination or to a text file-like object with a
write() method.
rows may contain lazy fastpycsv rows, dictionaries, lists, tuples, or other
Python iterables. Fields are stringified before writing; None becomes an empty
field.
Supported options:
- fieldnames: optional output column names. When writing dictionaries, this controls output order and selection. When omitted for dictionary rows, names are inferred from the first row.
- write_header: write fieldnames as the first output row. Defaults to True.
- quote_minimal: quote only fields that require escaping. Defaults to True.
Example:
reader = fastpycsv.reader("vehicles.csv")
fastpycsv.write_csv(
    "subset.csv",
    (row for row in reader if row["region"] == "el paso"),
    fieldnames=["id", "price", "region"],
)

with open("subset.csv", "w", newline="", encoding="utf-8") as out:
    fastpycsv.write_csv(out, [["id", "price"], [1, 9000]], write_header=False)
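A sketch of dictionary rows, where fieldnames selects and orders the output columns and None becomes an empty field (out.csv is illustrative):

rows = [
    {"id": 1, "price": 9000, "region": "el paso"},
    {"id": 2, "price": None, "region": "austin"},
]
fastpycsv.write_csv("out.csv", rows, fieldnames=["id", "price"])
# out.csv:
# id,price
# 1,9000
# 2,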
Low-Level Types¶
Most users do not need these. They are exposed for compatibility and specialized inspection:
- fastpycsv.Reader
- fastpycsv.Row
- fastpycsv.Field
- fastpycsv.DataType
- fastpycsv.get_file_info()
- fastpycsv.csv_data_types()
- fastpycsv.parse_no_header()
These are useful when you want direct access to underlying parser concepts that
still have a clear Python use. C++ configuration machinery such as CSVFormat
and VariableColumnPolicy is used internally by the facade, but is intentionally
kept out of the stable Python surface.