# API Reference

This page documents the stable Python-facing API. `reader()`, `read_numpy()`, `read_numpy_batches()`, and `write_csv()` are the primary surface area. The lower-level extension objects exist for users who need direct access to parser concepts, but ordinary ETL code should not start there.

## Primary API

| Function | Use it when |
| --- | --- |
| `fastpycsv.reader()` | You want fast lazy row iteration and Pythonic row filtering. |
| `fastpycsv.read_numpy()` | You want selected CSV columns as eager NumPy arrays. |
| `fastpycsv.read_numpy_batches()` | You want selected CSV columns as bounded-memory NumPy batches. |
| `fastpycsv.write_csv()` | You want to stream lazy rows or Python iterables back to CSV. |

## `fastpycsv.reader(csvfile, dialect="excel", **fmtparams)`

Returns an iterator over lazy row objects.

By default, `reader()` consumes the first row as column names. Use `reader.fieldnames` or `reader.get_col_names()` to retrieve those names before or during iteration. Pass `consume_header=False` for raw stdlib-style row iteration where the first input row is emitted as data. Pass `fieldnames=[...]` to attach explicit column names without consuming the first input row.

Supported formatting options:

- `delimiter`: one-character delimiter. Defaults to `","`.
- `quotechar`: one-character quote character, or `None` to disable quoting.
- `doublequote`: must be `True`.
- `skipinitialspace`: trim leading spaces after delimiters.
- `strict`: raise on variable-width rows.
- `cast`: return Python scalar values instead of strings.
- `typed`: alias for `cast`.
- `consume_header`: consume the first row as column names. Defaults to `True`.
- `fieldnames`: explicit column names. When provided, the first row is not consumed.
- `batch_size`: advanced performance hint for filtered or projected exports. Most users can ignore it.

Only the default `excel` dialect is currently supported. Unsupported dialect features fail fast instead of silently diverging from stdlib behavior.

`batch_size` is not the same thing as `.chunks(size)`. For ordinary lazy iteration, `reader()` yields one row at a time no matter what `batch_size` is:

```python
for row in fastpycsv.reader("vehicles.csv", batch_size=100_000):
    consume(row)  # still one row at a time
```

Use `.chunks(size)` on `reader.lists()`, `reader.tuples()`, or `reader.dicts()` when you want Python lists of rows. `batch_size` only controls how many rows fastpycsv asks the native parser to process at once while doing filtered or projected materialization, such as `.filter(...).dicts(...).all()`. The default is a good starting point; tune it only after benchmarking a large filtered export.

Use `reader.filter(predicate)` to apply native row filtering before lazy or materialized rows are emitted:

```python
predicate = fastpycsv.all_of(
    fastpycsv.equal("region", "el paso", case_sensitive=False),
    fastpycsv.less("price", "15000"),
)

for row in fastpycsv.reader("vehicles.csv").filter(predicate):
    consume(row)

batches = fastpycsv.reader("vehicles.csv").filter(predicate).dicts(["id", "price"]).chunks(50_000)
```

Predicates are created with `fastpycsv.equal()`, `less()`, `less_equal()`, `greater()`, `greater_equal()`, and `all_of()`. Predicate column names are validated once against the reader's column names before iteration proceeds. Calling `filter()` more than once combines predicates with `all_of()` by default.
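As a minimal sketch (assuming the same `vehicles.csv` columns used above), the two `filter()` calls below behave like a single combined `all_of(...)` predicate:

```python
import fastpycsv

# Two filter() calls; by default the second predicate is AND-ed onto the
# first via all_of(), so rows must match both region and price.
reader = fastpycsv.reader("vehicles.csv")
reader = reader.filter(fastpycsv.equal("region", "el paso", case_sensitive=False))
reader = reader.filter(fastpycsv.less("price", "15000"))

for row in reader:
    consume(row)  # only rows passing both predicates
```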
Pass `append=False` to replace the current predicate:

```python
reader = reader.filter(new_predicate, append=False)
```

## Lazy Row Objects

Rows returned by `reader()` support:

- integer indexing: `row[0]`
- column-name indexing when headers are available: `row["name"]`
- iteration: `list(row)`
- `len(row)`
- `row.as_list(columns=None)`; pass column names to materialize a subset
- `row.as_tuple(columns=None)`; pass column names to materialize a subset
- `row.as_dict(columns=None)`; pass column names to materialize a subset
- typed access helpers: `get_str`, `get_int`, `get_float`, `get_bool`
- `row.type(index)` for native scalar classification

## Materialized Row Iterators

Use these when you want plain Python row objects but still want bounded-memory streaming. Each materialized iterator supports:

- normal iteration, yielding one materialized row at a time
- `.chunks(size)`, yielding `list` batches of materialized rows
- `.all()`, consuming the remaining rows into one Python `list`

### `reader.lists(columns=None)`

Use this when the downstream API expects mutable row lists.

```python
import fastpycsv

for row in fastpycsv.reader("vehicles.csv").lists(["id", "price"]):
    assert isinstance(row, list)
    send_to_api(row)
```

### `reader.tuples(columns=None)`

Use this when the downstream API expects fixed-shape rows.

```python
rows = fastpycsv.reader("vehicles.csv").tuples(["id", "year"]).all()
# [('1', '2021'), ('2', '2020'), ...]
```

### `reader.dicts(columns=None)`

Use this when the downstream API wants named fields per row.

```python
rows = fastpycsv.reader("vehicles.csv").dicts(["id", "price"]).all()
# [{'id': '1', 'price': '9000'}, {'id': '2', 'price': '12000'}, ...]
```

### `.chunks(size)`

Use chunks when you want plain Python objects but need bounded peak memory.

```python
for batch in fastpycsv.reader("vehicles.csv").dicts(["id", "price"]).chunks(50_000):
    assert isinstance(batch, list)
    bulk_insert(batch)
```

### Filtering Before Materialization

Native predicates compose with materialized row iterators, so filtering can stay in C++ before rows become Python objects.

```python
predicate = fastpycsv.all_of(
    fastpycsv.equal("region", "el paso", case_sensitive=False),
    fastpycsv.less("price", 10_000),
)

for batch in (
    fastpycsv.reader("vehicles.csv")
    .filter(predicate)
    .dicts(["id", "price", "region"])
    .chunks(10_000)
):
    send_to_api(batch)
```

### Column Selection

Column selection controls both output order and shape.

```python
reader = fastpycsv.reader("vehicles.csv")

reader.lists(["price", "id"]).all()
# [['9000', '1'], ['12000', '2'], ...]

reader.tuples(["price", "id"]).all()
# [('9000', '1'), ('12000', '2'), ...]

reader.dicts(["price", "id"]).all()
# [{'price': '9000', 'id': '1'}, {'price': '12000', 'id': '2'}, ...]
```

### Bulk Convenience Methods

The older convenience methods `reader.to_lists(columns=None)`, `reader.to_tuples(columns=None)`, and `reader.to_dicts(columns=None)` are equivalent to calling `.all()` on the corresponding materialized iterator.

```python
reader = fastpycsv.reader("vehicles.csv")

reader.to_lists(["id", "price"])   # reader.lists(["id", "price"]).all()
reader.to_tuples(["id", "price"])  # reader.tuples(["id", "price"]).all()
reader.to_dicts(["id", "price"])   # reader.dicts(["id", "price"]).all()
```
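Because dict rows carry their column names, the bulk forms also drop straight into column-oriented consumers. A minimal sketch, assuming pandas is installed and the `vehicles.csv` columns used above (for eager columnar loads, `read_numpy()` below is usually the better fit):

```python
import pandas as pd

import fastpycsv

# One dict per row; pandas builds columns 'id' and 'price' from the keys.
rows = fastpycsv.reader("vehicles.csv").to_dicts(["id", "price"])
df = pd.DataFrame(rows)
```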
## `fastpycsv.read_numpy(path, columns=None, *, cast=True, predicate=None, **fmtparams)`

Parses selected columns into NumPy arrays keyed by column name. Use this when the target is pandas, NumPy, or another column-oriented consumer:

```python
arrays = fastpycsv.read_numpy("vehicles.csv", columns=["price", "year", "odometer"])
```

`predicate` may be a native fastpycsv predicate such as `equal()`, `less()`, or `all_of()`. With `cast=True`, fastpycsv classifies scalar fields and maps them to NumPy-friendly column types.

`read_numpy()` accepts the same CSV format keywords as `reader()`, including `delimiter`, `quotechar`, `skipinitialspace`, `strict`, `consume_header`, and `fieldnames`. See [NumPy and pandas](numpy.md) for dtype behavior and batching details.

## `fastpycsv.read_numpy_batches(path, columns=None, *, predicate=None, cast=True, batch_size=50000, schema="sample", **fmtparams)`

Streams dictionaries of NumPy arrays. This is the bounded-memory version of `read_numpy()`.

```python
for arrays in fastpycsv.read_numpy_batches("vehicles.csv", columns=["price", "year"]):
    consume(arrays)
```

`schema` controls dtype inference:

- `"sample"`: infer dtypes once from the first bounded batch, then stream the file in a single pass.
- `"global"`: pre-scan the file for stable full-file dtypes.
- `"batch"`: infer dtypes for each emitted batch independently.

`read_numpy_batches()` accepts the same CSV format keywords as `reader()` and `read_numpy()`.

For both NumPy readers, `path` and `columns` may be positional. Filtering, casting, batching, schema, and CSV format options are keyword-only.

## `fastpycsv.write_csv(csvfile, rows, **options)`

Writes CSV rows to a path-like output file or to a text file-like object with a `write()` method. `rows` may contain lazy `fastpycsv` rows, dictionaries, lists, tuples, or other Python iterables. Fields are stringified before writing; `None` becomes an empty field.

Supported options:

- `fieldnames`: optional output column names. When writing dictionaries, this controls output order and selection. When omitted for dictionary rows, names are inferred from the first row.
- `write_header`: write `fieldnames` as the first output row. Defaults to `True`.
- `quote_minimal`: quote only fields that require escaping. Defaults to `True`.

Example:

```python
reader = fastpycsv.reader("vehicles.csv")
fastpycsv.write_csv(
    "subset.csv",
    (row for row in reader if row["region"] == "el paso"),
    fieldnames=["id", "price", "region"],
)

with open("subset.csv", "w", newline="", encoding="utf-8") as out:
    fastpycsv.write_csv(out, [["id", "price"], [1, 9000]], write_header=False)
```

## Low-Level Types

Most users do not need these. They are exposed for compatibility and specialized inspection:

- `fastpycsv.Reader`
- `fastpycsv.Row`
- `fastpycsv.Field`
- `fastpycsv.DataType`
- `fastpycsv.get_file_info()`
- `fastpycsv.csv_data_types()`
- `fastpycsv.parse_no_header()`

These are useful when you want direct access to underlying parser concepts that still have a clear Python use. C++ configuration machinery such as `CSVFormat` and `VariableColumnPolicy` is used internally by the facade, but is intentionally kept out of the stable Python surface.
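As a heavily hedged sketch of such specialized inspection: the exact signatures and return shapes of `get_file_info()` and `csv_data_types()` are not specified on this page, so the call shapes below are assumptions to verify against your installed version, not documented guarantees.

```python
import fastpycsv

# Assumed call shapes: both helpers are treated here as taking a file path
# and returning printable summaries. Verify before relying on either.
print(fastpycsv.get_file_info("vehicles.csv"))
print(fastpycsv.csv_data_types("vehicles.csv"))
```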