dataframely.testing.typing module

class dataframely.testing.typing.MyImportedBaseSchema[source]

Bases: Schema

Methods

cast(df, /)

Cast a data frame to match the schema.

column_names()

The column names of this schema.

columns()

The column definitions of this schema.

create_empty(*[, lazy])

Create an empty data or lazy frame from this schema.

create_empty_if_none(df, *[, lazy])

Impute None input with an empty, schema-compliant lazy or eager data frame or return the input as lazy or eager frame.

filter(df, /, *[, cast])

Filter the data frame by the rules of this schema.

is_valid(df, /, *[, cast])

Utility method to check whether validate() raises an exception.

matches(other)

Check whether this schema semantically matches another schema.

polars_schema()

Obtain the polars schema for this schema.

primary_keys()

The primary key columns in this schema (possibly empty).

pyarrow_schema()

Obtain the pyarrow schema for this schema.

read_delta(source, *[, validation])

Read a Delta Lake table into a typed data frame with this schema.

read_parquet(source, *[, validation])

Read a parquet file into a typed data frame with this schema.

sample([num_rows, overrides, generator])

Create a random data frame with a predefined number of rows.

scan_delta(source, *[, validation])

Lazily read a Delta Lake table into a typed data frame with this schema.

scan_parquet(source, *[, validation])

Lazily read a parquet file into a typed data frame with this schema.

serialize()

Serialize this schema to a JSON string.

sink_parquet(lf, /, file, **kwargs)

Stream a typed lazy frame with this schema to a parquet file.

sql_schema(dialect)

Obtain the SQL schema for a particular dialect for this schema.

validate(df, /, *[, cast])

Validate that a data frame satisfies the schema.

write_delta(df, /, target, **kwargs)

Write a typed data frame with this schema to a Delta Lake table.

write_parquet(df, /, file, **kwargs)

Write a typed data frame with this schema to a parquet file.

a = Int64(nullable=True)
classmethod cast(df: DataFrame | LazyFrame, /) DataFrame[Self] | LazyFrame[Self][source]

Cast a data frame to match the schema.

This method removes superfluous columns and casts all schema columns to the correct dtypes. However, it does not introspect the data frame contents.

Hence, this method should be used with care and validate() should generally be preferred. Only use this method if df is known with certainty to adhere to the schema.

Returns:

The input data frame, wrapped in a generic version of the input’s data frame type to reflect schema adherence.

Note:

If you only require a generic data frame for the type checker, consider using typing.cast() instead of this method.

Attention:

For lazy frames, casting is not performed eagerly. This prevents collecting the lazy frame’s schema but also means that a call to collect() further down the line might fail because of the cast and/or missing columns.

classmethod column_names() list[str][source]

The column names of this schema.

classmethod columns() dict[str, Column][source]

The column definitions of this schema.

classmethod create_empty(*, lazy: bool = False) DataFrame[Self] | LazyFrame[Self][source]

Create an empty data or lazy frame from this schema.

Args:

lazy: Whether to create a lazy data frame. If True, returns a lazy frame with this Schema. Otherwise, returns an eager frame.

Returns:

An instance of polars.DataFrame or polars.LazyFrame with this schema’s defined columns and their data types.

classmethod create_empty_if_none(df: DataFrame[Self] | LazyFrame[Self] | None, *, lazy: bool = False) DataFrame[Self] | LazyFrame[Self][source]

Impute None input with an empty, schema-compliant lazy or eager data frame or return the input as lazy or eager frame.

Args:

df: The data frame to check for None. If it is not None, it is returned as lazy or eager frame. Otherwise, a schema-compliant data or lazy frame with no rows is returned.

lazy: Whether to return a lazy data frame. If True, returns a lazy frame with this Schema. Otherwise, returns an eager frame.

Returns:

The given data frame df as lazy or eager frame, if it is not None. An instance of polars.DataFrame or polars.LazyFrame with this schema’s defined columns and their data types, but no rows, otherwise.

classmethod filter(df: DataFrame | LazyFrame, /, *, cast: bool = False) tuple[DataFrame[Self], FailureInfo[Self]][source]

Filter the data frame by the rules of this schema.

This method can be thought of as a “soft alternative” to validate(). While validate() raises an exception when a row does not adhere to the rules defined in the schema, this method simply filters out these rows and succeeds.

Args:

df: The data frame to filter for valid rows. The data frame is collected within this method, regardless of whether a DataFrame or LazyFrame is passed.

cast: Whether columns with a wrong data type in the input data frame are cast to the schema’s defined data type if possible. Rows for which the cast fails for any column are filtered out.

Returns:

A tuple of the validated rows in the input data frame (potentially empty) and a simple dataclass carrying information about the rows of the data frame which could not be validated successfully. Just like in polars’ native filter(), the order of rows in the returned data frame is maintained.

Raises:

ValidationError: If the columns of the input data frame are invalid. This happens only if the data frame misses a column defined in the schema or a column has an invalid dtype while cast is set to False.

Note:

This method preserves the ordering of the input data frame.
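The difference between validate() and filter() can be illustrated without polars. The following is a minimal sketch over plain Python dicts and is not dataframely's implementation: the names RULES, validate_rows, and filter_rows are hypothetical, and the per-rule failure counts only stand in for the kind of information that FailureInfo carries.

```python
from collections import Counter

# Hypothetical rules: each maps a rule name to a per-row predicate.
RULES = {
    "a_non_negative": lambda row: row["a"] is None or row["a"] >= 0,
    "b_not_null": lambda row: row["b"] is not None,
}


def validate_rows(rows):
    """Strict variant: raise as soon as any row violates any rule."""
    for index, row in enumerate(rows):
        for name, rule in RULES.items():
            if not rule(row):
                raise ValueError(f"row {index} violates rule {name!r}")
    return rows


def filter_rows(rows):
    """Soft variant: keep valid rows in order and count failures per rule."""
    valid, failures = [], Counter()
    for row in rows:
        failed = [name for name, rule in RULES.items() if not rule(row)]
        if failed:
            failures.update(failed)
        else:
            valid.append(row)
    return valid, dict(failures)


rows = [
    {"a": 1, "b": 1.0},
    {"a": -1, "b": 2.0},
    {"a": 3, "b": None},
    {"a": 4, "b": 0.5},
]
valid_rows, failure_counts = filter_rows(rows)
```

In this sketch, filter_rows succeeds on input that makes validate_rows raise, and the surviving rows keep their original relative order.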

classmethod is_valid(df: DataFrame | LazyFrame, /, *, cast: bool = False) bool[source]

Utility method to check whether validate() raises an exception.

Args:

df: The data frame to check for validity.

allow_extra_columns: Whether to allow the data frame to contain columns that are not defined in the schema.

cast: Whether columns with a wrong data type in the input data frame are cast to the schema’s defined data type before running validation. If set to False, a wrong data type will result in a return value of False.

Returns:

Whether the provided dataframe can be validated with this schema.
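The relationship between is_valid() and validate() amounts to turning an exception into a boolean. The sketch below shows that pattern in plain Python. The ValidationError class and the validate function here are local, hypothetical stand-ins and not the library's own objects.

```python
class ValidationError(Exception):
    """Hypothetical stand-in for the library's validation error."""


def validate(values):
    """Stand-in validator: require every value to be an int or None."""
    for value in values:
        if value is not None and not isinstance(value, int):
            raise ValidationError(f"unexpected value: {value!r}")
    return values


def is_valid(values):
    """Return True if validate() succeeds and False if it raises."""
    try:
        validate(values)
    except ValidationError:
        return False
    return True


ok = is_valid([1, None, 3])
bad = is_valid([1, "x"])
```

Only the validation error is caught in this sketch, so unrelated failures would still propagate to the caller.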

classmethod matches(other: type[Schema]) bool[source]

Check whether this schema semantically matches another schema.

This method checks whether the schemas have the same columns (with the same data types and constraints) as well as the same rules.

Args:

other: The schema to compare with.

Returns:

Whether the schemas are semantically equal.

classmethod polars_schema() Schema[source]

Obtain the polars schema for this schema.

Returns:

A polars schema that mirrors the schema defined by this class.

classmethod primary_keys() list[str][source]

The primary key columns in this schema (possibly empty).

classmethod pyarrow_schema() pa.Schema[source]

Obtain the pyarrow schema for this schema.

Returns:

A pyarrow schema that mirrors the schema defined by this class.

classmethod read_delta(source: str | Path | deltalake.DeltaTable, *, validation: Validation = 'warn', **kwargs: Any) DataFrame[Self][source]

Read a Delta Lake table into a typed data frame with this schema.

Compared to polars.read_delta(), this method checks the table’s metadata and runs validation if necessary to ensure that the data matches this schema.

Args:

source: Path or DeltaTable object from which to read the data.

validation: The strategy for running validation when reading the data:

  • "allow": The method tries to read the table’s metadata. If the stored schema matches this schema, the data frame is read without validation. If the stored schema mismatches this schema or no schema information can be found in the metadata, this method automatically runs validate() with cast=True.

  • "warn": The method behaves similarly to "allow". However, it prints a warning if validation is necessary.

  • "forbid": The method never runs validation automatically and only returns if the schema stored in the table’s metadata matches this schema.

  • "skip": The method never runs validation and simply reads the table, entrusting the user that the schema is valid. _Use this option carefully and consider replacing it with polars.read_delta() to convey the purpose better_.

kwargs: Additional keyword arguments passed directly to polars.read_delta().

Returns:

The data frame with this schema.

Raises:

ValidationRequiredError: If no schema information can be read from the source and validation is set to "forbid".

Attention:

Schema metadata is stored as custom commit metadata. Only the schema information from the last commit is used, so any table modifications that are not through dataframely will result in losing the metadata.

Be aware that appending to an existing table via mode="append" may result in violation of group constraints that dataframely cannot catch without re-validating. Only use appends if you are certain that they do not break your schema.

This method suffers from the same limitations as serialize().
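The four validation strategies described above can be summarized as a small decision function. This is a sketch of the documented behavior only, not the library's code: plan_read and its return values "trust" and "validate" are hypothetical names, and ValidationRequiredError is redefined locally as a stand-in.

```python
import warnings

STRATEGIES = ("allow", "warn", "forbid", "skip")


class ValidationRequiredError(Exception):
    """Local stand-in for the error named in the text above."""


def plan_read(validation, stored_schema_matches):
    """Decide whether a read may trust the data or must validate it.

    `stored_schema_matches` is False both when the stored schema differs
    and when no schema information is present in the metadata.
    """
    if validation not in STRATEGIES:
        raise ValueError(f"unknown validation strategy: {validation!r}")
    if validation == "skip" or stored_schema_matches:
        return "trust"  # read without running validation
    if validation == "forbid":
        raise ValidationRequiredError("stored schema is missing or mismatched")
    if validation == "warn":
        warnings.warn("validation is required for this source")
    return "validate"  # corresponds to running validate() with cast=True
```

For example, plan_read("allow", False) returns "validate", while plan_read("forbid", False) raises the stand-in error.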

classmethod read_parquet(source: str | Path | IO[bytes] | bytes | list[str] | list[Path] | list[IO[bytes]] | list[bytes], *, validation: Literal['allow', 'forbid', 'warn', 'skip'] = 'warn', **kwargs: Any) DataFrame[Self][source]

Read a parquet file into a typed data frame with this schema.

Compared to polars.read_parquet(), this method checks the parquet file’s metadata and runs validation if necessary to ensure that the data matches this schema.

Args:

source: Path, directory, or file-like object from which to read the data.

validation: The strategy for running validation when reading the data:

  • "allow": The method tries to read the parquet file’s metadata. If the stored schema matches this schema, the data frame is read without validation. If the stored schema mismatches this schema or no schema information can be found in the metadata, this method automatically runs validate() with cast=True.

  • "warn": The method behaves similarly to "allow". However, it prints a warning if validation is necessary.

  • "forbid": The method never runs validation automatically and only returns if the schema stored in the parquet file’s metadata matches this schema.

  • "skip": The method never runs validation and simply reads the parquet file, entrusting the user that the schema is valid. _Use this option carefully and consider replacing it with polars.read_parquet() to convey the purpose better_.

kwargs: Additional keyword arguments passed directly to polars.read_parquet().

Returns:

The data frame with this schema.

Raises:

ValidationRequiredError: If no schema information can be read from the source and validation is set to "forbid".

Attention:

Be aware that this method suffers from the same limitations as serialize().

classmethod sample(num_rows: int | None = None, *, overrides: Mapping[str, Iterable[Any]] | Sequence[Mapping[str, Any]] | None = None, generator: Generator | None = None) DataFrame[Self][source]

Create a random data frame with a predefined number of rows.

Generally, this method should only be used for testing. Also, if you want to generate _realistic_ test data, you will need to implement custom sampling logic (by making use of the Generator class).

In order to allow for sampling random data frames in the presence of custom rules and primary key constraints, this method performs fuzzy sampling: it samples in a loop until it finds a data frame of length num_rows that adheres to the schema. The maximum number of sampling rounds is configured via max_sampling_iterations in the Config class. If this setting is fixed to 1, sampling is only reliable for schemas without custom rules and without primary key constraints.

Args:

num_rows: The (optional) number of rows to sample for creating the random data frame. Must be provided (only) if no overrides are provided. If this is None, the number of rows in the data frame is determined by the length of the values in overrides.

overrides: Fixed values for a subset of the columns of the sampled data frame. Just like when initializing a polars.DataFrame, overrides may either be provided as “column-” or “row-layout”, i.e. via a mapping or a list of mappings, respectively. The number of rows in the result data frame is equal to the length of the values in overrides. If both overrides and num_rows are provided, the length of the values in overrides must be equal to num_rows. The order of the items is guaranteed to match the ordering in the returned data frame. When providing values for a column, no sampling is performed for that column.

generator: The (seeded) generator to use for sampling data. If None, a generator with random seed is automatically created.

Returns:

A data frame valid under the current schema with a number of rows that matches the length of the values in overrides or num_rows.

Raises:

ValueError: If num_rows is not equal to the length of the values in overrides.

ValueError: If no valid data frame can be found in the configured maximum number of iterations.

Attention:

Be aware that, due to sampling in a loop, the runtime of this method can be significant for complex schemas. Consider passing a seeded generator and evaluate whether the runtime impact in the tests is bearable. Alternatively, it can be beneficial to provide custom column overrides for columns associated with complex validation rules.
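The sampling loop described above can be sketched in plain Python. This is an illustration of the general rejection-style idea under stated assumptions, not dataframely's algorithm: sample_rows, its parameters, the row layout, and the default iteration limit are all hypothetical.

```python
import random


def sample_rows(num_rows, is_valid_row, max_iterations=10_000, seed=None):
    """Sample until `num_rows` rows with unique keys pass `is_valid_row`."""
    rng = random.Random(seed)
    accepted, seen_keys = [], set()
    for _round in range(max_iterations):
        missing = num_rows - len(accepted)
        if missing == 0:
            return accepted
        # Draw one candidate per missing row in this round.
        for _ in range(missing):
            row = {"key": rng.randrange(100), "value": rng.uniform(-1.0, 1.0)}
            # Reject rows that break the primary key or the custom rule.
            if row["key"] not in seen_keys and is_valid_row(row):
                seen_keys.add(row["key"])
                accepted.append(row)
    if len(accepted) == num_rows:
        return accepted
    raise ValueError("no valid sample found within max_iterations")


# A seeded run with a hypothetical custom rule: values must be non-negative.
rows = sample_rows(5, lambda row: row["value"] >= 0.0, seed=42)
```

With max_iterations=1 this sketch only succeeds if every candidate in the single round is accepted, which mirrors why a limit of 1 is only reliable without custom rules and primary key constraints.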

classmethod scan_delta(source: str | Path | deltalake.DeltaTable, *, validation: Validation = 'warn', **kwargs: Any) LazyFrame[Self][source]

Lazily read a Delta Lake table into a typed data frame with this schema.

Compared to polars.scan_delta(), this method checks the table’s metadata and runs validation if necessary to ensure that the data matches this schema.

Args:

source: Path or DeltaTable object from which to read the data.

validation: The strategy for running validation when reading the data:

  • "allow": The method tries to read the table’s metadata. If the stored schema matches this schema, the data frame is read without validation. If the stored schema mismatches this schema or no schema information can be found in the metadata, this method automatically runs validate() with cast=True.

  • "warn": The method behaves similarly to "allow". However, it prints a warning if validation is necessary.

  • "forbid": The method never runs validation automatically and only returns if the schema stored in the table’s metadata matches this schema.

  • "skip": The method never runs validation and simply reads the table, entrusting the user that the schema is valid. _Use this option carefully and consider replacing it with polars.scan_delta() to convey the purpose better_.

kwargs: Additional keyword arguments passed directly to polars.scan_delta().

Returns:

The lazy data frame with this schema.

Raises:

ValidationRequiredError: If no schema information can be read from the source and validation is set to "forbid".

Attention:

Schema metadata is stored as custom commit metadata. Only the schema information from the last commit is used, so any table modifications that are not through dataframely will result in losing the metadata.

Be aware that appending to an existing table via mode="append" may result in violation of group constraints that dataframely cannot catch without re-validating. Only use appends if you are certain that they do not break your schema.

This method suffers from the same limitations as serialize().

classmethod scan_parquet(source: str | Path | IO[bytes] | bytes | list[str] | list[Path] | list[IO[bytes]] | list[bytes], *, validation: Literal['allow', 'forbid', 'warn', 'skip'] = 'warn', **kwargs: Any) LazyFrame[Self][source]

Lazily read a parquet file into a typed data frame with this schema.

Compared to polars.scan_parquet(), this method checks the parquet file’s metadata and runs validation if necessary to ensure that the data matches this schema.

Args:

source: Path, directory, or file-like object from which to read the data.

validation: The strategy for running validation when reading the data:

  • "allow": The method tries to read the parquet file’s metadata. If the stored schema matches this schema, the data frame is read without validation. If the stored schema mismatches this schema or no schema information can be found in the metadata, this method automatically runs validate() with cast=True.

  • "warn": The method behaves similarly to "allow". However, it prints a warning if validation is necessary.

  • "forbid": The method never runs validation automatically and only returns if the schema stored in the parquet file’s metadata matches this schema.

  • "skip": The method never runs validation and simply reads the parquet file, entrusting the user that the schema is valid. _Use this option carefully and consider replacing it with polars.scan_parquet() to convey the purpose better_.

kwargs: Additional keyword arguments passed directly to polars.scan_parquet().

Returns:

The lazy data frame with this schema.

Raises:

ValidationRequiredError: If no schema information can be read from the source and validation is set to "forbid".

Note:

Due to current limitations in dataframely, this method actually reads the parquet file into memory if validation is "warn" or "allow" and validation is required.

Attention:

Be aware that this method suffers from the same limitations as serialize().

classmethod serialize() str[source]

Serialize this schema to a JSON string.

Returns:

The serialized schema.

Note:

Serialization within dataframely itself will remain backwards-compatible at least within a major version. Until further notice, it will also be backwards-compatible across major versions.

Attention:

Serialization of polars expressions is not guaranteed to be stable across versions of polars. This affects schemas that define custom rules or columns with custom checks: a schema serialized with one version of polars may not be deserializable with another version of polars.

Attention:

This functionality is considered unstable. It may be changed at any time without it being considered a breaking change.

Raises:

TypeError: If any column contains metadata that is not JSON-serializable.

ValueError: If any column is not a “native” dataframely column type but a custom subclass.
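The TypeError case can be illustrated with the standard json module. The sketch below assumes a made-up dictionary layout for column descriptions. It does not reproduce dataframely's actual serialization format, and serialize_columns is a hypothetical helper.

```python
import json


def serialize_columns(columns):
    """Dump a hypothetical column description to a JSON string."""
    return json.dumps({"columns": columns}, sort_keys=True)


# Metadata made of strings, numbers, lists, and dicts serializes fine.
good = serialize_columns(
    {"a": {"dtype": "Int64", "nullable": True, "metadata": {"unit": "kg"}}}
)

# A set is not JSON-serializable, so json.dumps raises TypeError.
failure = None
try:
    serialize_columns({"a": {"dtype": "Int64", "metadata": {"tags": {1, 2}}}})
except TypeError as error:
    failure = str(error)
```

Keeping column metadata restricted to JSON-compatible values avoids this class of error.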

classmethod sink_parquet(lf: LazyFrame[Self], /, file: str | Path | IO[bytes] | PartitioningScheme, **kwargs: Any) None[source]

Stream a typed lazy frame with this schema to a parquet file.

This method automatically adds a serialization of this schema to the parquet file as metadata. This metadata can be leveraged by read_parquet() and scan_parquet() for more efficient reading, or by external tools.

Args:

lf: The lazy frame to write to the parquet file.

file: The file path, writable file-like object, or partitioning scheme to which to write the parquet file.

kwargs: Additional keyword arguments passed directly to polars.write_parquet(). metadata may only be provided if it is a dictionary.

Attention:

Be aware that this method suffers from the same limitations as serialize().

classmethod sql_schema(dialect: sa.Dialect) list[sa.Column][source]

Obtain the SQL schema for a particular dialect for this schema.

Args:

dialect: The dialect for which to obtain the SQL schema. Note that column datatypes may differ across dialects.

Returns:

A list of sqlalchemy columns that can be used to create a table with the schema as defined by this class.

classmethod validate(df: DataFrame | LazyFrame, /, *, cast: bool = False) DataFrame[Self][source]

Validate that a data frame satisfies the schema.

Args:

df: The data frame to validate.

cast: Whether columns with a wrong data type in the input data frame are cast to the schema’s defined data type if possible.

Returns:

The (collected) input data frame, wrapped in a generic version of the input’s data frame type to reflect schema adherence. The data frame is guaranteed to maintain its order.

Raises:

ValidationError: If the input data frame does not satisfy the schema definition.

Note:

This method _always_ collects the input data frame in order to raise potential validation errors.

classmethod write_delta(df: DataFrame[Self], /, target: str | Path | deltalake.DeltaTable, **kwargs: Any) None[source]

Write a typed data frame with this schema to a Delta Lake table.

This method automatically adds a serialization of this schema to the Delta Lake table as metadata. The metadata can be leveraged by read_delta() and scan_delta() for efficient reading or by external tools.

Args:

df: The data frame to write to the Delta Lake table.

target: The path or DeltaTable object to which to write the data.

kwargs: Additional keyword arguments passed directly to polars.write_delta().

Attention:

This method suffers from the same limitations as serialize().

Schema metadata is stored as custom commit metadata. Only the schema information from the last commit is used, so any table modifications that are not through dataframely will result in losing the metadata.

Be aware that appending to an existing table via mode="append" may result in violation of group constraints that dataframely cannot catch without re-validating. Only use appends if you are certain that they do not break your schema.

classmethod write_parquet(df: DataFrame[Self], /, file: str | Path | IO[bytes], **kwargs: Any) None[source]

Write a typed data frame with this schema to a parquet file.

This method automatically adds a serialization of this schema to the parquet file as metadata. This metadata can be leveraged by read_parquet() and scan_parquet() for more efficient reading, or by external tools.

Args:

df: The data frame to write to the parquet file.

file: The file path or writable file-like object to which to write the parquet file. This should be a path to a directory if writing a partitioned dataset.

kwargs: Additional keyword arguments passed directly to polars.write_parquet(). metadata may only be provided if it is a dictionary.

Attention:

Be aware that this method suffers from the same limitations as serialize().

class dataframely.testing.typing.MyImportedSchema[source]

Bases: MyImportedBaseSchema

Methods

cast(df, /)

Cast a data frame to match the schema.

column_names()

The column names of this schema.

columns()

The column definitions of this schema.

create_empty(*[, lazy])

Create an empty data or lazy frame from this schema.

create_empty_if_none(df, *[, lazy])

Impute None input with an empty, schema-compliant lazy or eager data frame or return the input as lazy or eager frame.

filter(df, /, *[, cast])

Filter the data frame by the rules of this schema.

is_valid(df, /, *[, cast])

Utility method to check whether validate() raises an exception.

matches(other)

Check whether this schema semantically matches another schema.

polars_schema()

Obtain the polars schema for this schema.

primary_keys()

The primary key columns in this schema (possibly empty).

pyarrow_schema()

Obtain the pyarrow schema for this schema.

read_delta(source, *[, validation])

Read a Delta Lake table into a typed data frame with this schema.

read_parquet(source, *[, validation])

Read a parquet file into a typed data frame with this schema.

sample([num_rows, overrides, generator])

Create a random data frame with a predefined number of rows.

scan_delta(source, *[, validation])

Lazily read a Delta Lake table into a typed data frame with this schema.

scan_parquet(source, *[, validation])

Lazily read a parquet file into a typed data frame with this schema.

serialize()

Serialize this schema to a JSON string.

sink_parquet(lf, /, file, **kwargs)

Stream a typed lazy frame with this schema to a parquet file.

sql_schema(dialect)

Obtain the SQL schema for a particular dialect for this schema.

validate(df, /, *[, cast])

Validate that a data frame satisfies the schema.

write_delta(df, /, target, **kwargs)

Write a typed data frame with this schema to a Delta Lake table.

write_parquet(df, /, file, **kwargs)

Write a typed data frame with this schema to a parquet file.

a = Int64(nullable=True)
b = Float32(nullable=True)
c = Enum(categories=['a', 'b', 'c'], nullable=True)
classmethod cast(df: DataFrame | LazyFrame, /) DataFrame[Self] | LazyFrame[Self][source]

Cast a data frame to match the schema.

This method removes superfluous columns and casts all schema columns to the correct dtypes. However, it does not introspect the data frame contents.

Hence, this method should be used with care and validate() should generally be preferred. Only use this method if df is known with certainty to adhere to the schema.

Returns:

The input data frame, wrapped in a generic version of the input’s data frame type to reflect schema adherence.

Note:

If you only require a generic data frame for the type checker, consider using typing.cast() instead of this method.

Attention:

For lazy frames, casting is not performed eagerly. This prevents collecting the lazy frame’s schema but also means that a call to collect() further down the line might fail because of the cast and/or missing columns.

classmethod column_names() list[str][source]

The column names of this schema.

classmethod columns() dict[str, Column][source]

The column definitions of this schema.

classmethod create_empty(*, lazy: bool = False) DataFrame[Self] | LazyFrame[Self][source]

Create an empty data or lazy frame from this schema.

Args:

lazy: Whether to create a lazy data frame. If True, returns a lazy frame with this Schema. Otherwise, returns an eager frame.

Returns:

An instance of polars.DataFrame or polars.LazyFrame with this schema’s defined columns and their data types.

classmethod create_empty_if_none(df: DataFrame[Self] | LazyFrame[Self] | None, *, lazy: bool = False) DataFrame[Self] | LazyFrame[Self][source]

Impute None input with an empty, schema-compliant lazy or eager data frame or return the input as lazy or eager frame.

Args:

df: The data frame to check for None. If it is not None, it is returned as lazy or eager frame. Otherwise, a schema-compliant data or lazy frame with no rows is returned.

lazy: Whether to return a lazy data frame. If True, returns a lazy frame with this Schema. Otherwise, returns an eager frame.

Returns:

The given data frame df as lazy or eager frame, if it is not None. An instance of polars.DataFrame or polars.LazyFrame with this schema’s defined columns and their data types, but no rows, otherwise.

d = Struct(inner={'a': Int64(nullable=True), 'b': Struct(inner={'c': Enum(categories=['a', 'b'], nullable=True)}, nullable=True)}, nullable=True)
e = List(inner=Struct(inner={'a': Int64(nullable=True)}, nullable=True), nullable=True)
f = Datetime(nullable=True)
classmethod filter(df: DataFrame | LazyFrame, /, *, cast: bool = False) tuple[DataFrame[Self], FailureInfo[Self]][source]

Filter the data frame by the rules of this schema.

This method can be thought of as a “soft alternative” to validate(). While validate() raises an exception when a row does not adhere to the rules defined in the schema, this method simply filters out these rows and succeeds.

Args:

df: The data frame to filter for valid rows. The data frame is collected within this method, regardless of whether a DataFrame or LazyFrame is passed.

cast: Whether columns with a wrong data type in the input data frame are cast to the schema’s defined data type if possible. Rows for which the cast fails for any column are filtered out.

Returns:

A tuple of the validated rows in the input data frame (potentially empty) and a simple dataclass carrying information about the rows of the data frame which could not be validated successfully. Just like in polars’ native filter(), the order of rows in the returned data frame is maintained.

Raises:

ValidationError: If the columns of the input data frame are invalid. This happens only if the data frame misses a column defined in the schema or a column has an invalid dtype while cast is set to False.

Note:

This method preserves the ordering of the input data frame.

g = Date(nullable=True)
h = Any()
classmethod is_valid(df: DataFrame | LazyFrame, /, *, cast: bool = False) bool[source]

Utility method to check whether validate() raises an exception.

Args:

df: The data frame to check for validity.

allow_extra_columns: Whether to allow the data frame to contain columns that are not defined in the schema.

cast: Whether columns with a wrong data type in the input data frame are cast to the schema’s defined data type before running validation. If set to False, a wrong data type will result in a return value of False.

Returns:

Whether the provided dataframe can be validated with this schema.

classmethod matches(other: type[Schema]) bool[source]

Check whether this schema semantically matches another schema.

This method checks whether the schemas have the same columns (with the same data types and constraints) as well as the same rules.

Args:

other: The schema to compare with.

Returns:

Whether the schemas are semantically equal.

classmethod polars_schema() Schema[source]

Obtain the polars schema for this schema.

Returns:

A polars schema that mirrors the schema defined by this class.

classmethod primary_keys() list[str][source]

The primary key columns in this schema (possibly empty).

classmethod pyarrow_schema() pa.Schema[source]

Obtain the pyarrow schema for this schema.

Returns:

A pyarrow schema that mirrors the schema defined by this class.

classmethod read_delta(source: str | Path | deltalake.DeltaTable, *, validation: Validation = 'warn', **kwargs: Any) DataFrame[Self][source]

Read a Delta Lake table into a typed data frame with this schema.

Compared to polars.read_delta(), this method checks the table’s metadata and runs validation if necessary to ensure that the data matches this schema.

Args:

source: Path or DeltaTable object from which to read the data.

validation: The strategy for running validation when reading the data:

  • "allow": The method tries to read the table’s metadata. If the stored schema matches this schema, the data frame is read without validation. If the stored schema mismatches this schema or no schema information can be found in the metadata, this method automatically runs validate() with cast=True.

  • "warn": The method behaves similarly to "allow". However, it prints a warning if validation is necessary.

  • "forbid": The method never runs validation automatically and only returns if the schema stored in the table’s metadata matches this schema.

  • "skip": The method never runs validation and simply reads the table, entrusting the user that the schema is valid. _Use this option carefully and consider replacing it with polars.read_delta() to convey the purpose better_.

kwargs: Additional keyword arguments passed directly to polars.read_delta().

Returns:

The data frame with this schema.

Raises:

ValidationRequiredError: If no schema information can be read from the source and validation is set to "forbid".

Attention:

Schema metadata is stored as custom commit metadata. Only the schema information from the last commit is used, so any table modifications that are not through dataframely will result in losing the metadata.

Be aware that appending to an existing table via mode="append" may result in violation of group constraints that dataframely cannot catch without re-validating. Only use appends if you are certain that they do not break your schema.

This method suffers from the same limitations as serialize().

classmethod read_parquet(source: str | Path | IO[bytes] | bytes | list[str] | list[Path] | list[IO[bytes]] | list[bytes], *, validation: Literal['allow', 'forbid', 'warn', 'skip'] = 'warn', **kwargs: Any) DataFrame[Self][source]

Read a parquet file into a typed data frame with this schema.

Compared to polars.read_parquet(), this method checks the parquet file’s metadata and runs validation if necessary to ensure that the data matches this schema.

Args:

source: Path, directory, or file-like object from which to read the data.

validation: The strategy for running validation when reading the data:

  • "allow": The method tries to read the parquet file’s metadata. If the stored schema matches this schema, the data frame is read without validation. If the stored schema does not match this schema or no schema information can be found in the metadata, this method automatically runs validate() with cast=True.

  • "warn": The method behaves like "allow". However, it prints a warning if validation is necessary.

  • "forbid": The method never runs validation automatically and only returns if the schema stored in the parquet file’s metadata matches this schema.

  • "skip": The method never runs validation and simply reads the parquet file, trusting the user that the schema is valid. _Use this option carefully and consider replacing it with polars.read_parquet() to convey the purpose better_.

kwargs: Additional keyword arguments passed directly to polars.read_parquet().

Returns:

The data frame with this schema.

Raises:

ValidationRequiredError: If no schema information can be read from the source and validation is set to "forbid".

Attention:

Be aware that this method suffers from the same limitations as serialize().
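The four validation strategies described above can be summarised as a small decision function. The following is a plain-Python sketch of that logic only, not dataframely's implementation: plan_read is a hypothetical helper, the ValidationRequiredError class is a local stand-in for the error named above, and simple equality stands in for the schema comparison. Raising on a schema mismatch under "forbid" is an assumption derived from the wording "only returns if the schema ... matches".

```python
import warnings


class ValidationRequiredError(Exception):
    """Local stand-in for the error named in the documentation above."""


def plan_read(validation, stored_schema, expected_schema):
    """Return "read" or "validate" for one of the four strategies.

    stored_schema is the serialized schema found in the metadata, or
    None if the metadata carries no schema information.
    """
    if validation not in ("allow", "warn", "forbid", "skip"):
        raise ValueError(f"unknown validation strategy: {validation!r}")
    if validation == "skip":
        # Never validate, regardless of what the metadata says.
        return "read"
    if stored_schema is not None and stored_schema == expected_schema:
        # The metadata shows that the data was written with this schema.
        return "read"
    if validation == "forbid":
        # Assumption: a mismatch is handled like missing metadata.
        raise ValidationRequiredError("stored schema is missing or differs")
    if validation == "warn":
        warnings.warn("validation is required for this source")
    # "allow" and "warn" both fall back to validation (with cast=True).
    return "validate"
```

With this sketch, plan_read("allow", None, expected) yields "validate", whereas plan_read("skip", None, expected) yields "read" without any check.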

classmethod sample(num_rows: int | None = None, *, overrides: Mapping[str, Iterable[Any]] | Sequence[Mapping[str, Any]] | None = None, generator: Generator | None = None) → DataFrame[Self]

Create a random data frame with a predefined number of rows.

Generally, this method should only be used for testing. If you want to generate _realistic_ test data, you will need to implement your own sampling logic (by making use of the Generator class).

To allow for sampling random data frames in the presence of custom rules and primary key constraints, this method performs fuzzy sampling: it samples in a loop until it finds a data frame of length num_rows that adheres to the schema. The maximum number of sampling rounds is configured via max_sampling_iterations in the Config class. If this setting is fixed to 1, sampling is only reliable for schemas without custom rules and without primary key constraints.

Args:

num_rows: The (optional) number of rows to sample for creating the random data frame. Must be provided (only) if no overrides are provided. If this is None, the number of rows in the data frame is determined by the length of the values in overrides.

overrides: Fixed values for a subset of the columns of the sampled data frame. Just like when initializing a polars.DataFrame, overrides may either be provided as “column-” or “row-layout”, i.e. via a mapping or a list of mappings, respectively. The number of rows in the result data frame is equal to the length of the values in overrides. If both overrides and num_rows are provided, the length of the values in overrides must be equal to num_rows. The order of the items is guaranteed to match the ordering in the returned data frame. When providing values for a column, no sampling is performed for that column.

generator: The (seeded) generator to use for sampling data. If None, a generator with a random seed is automatically created.

Returns:

A data frame valid under the current schema with a number of rows that matches the length of the values in overrides or num_rows.

Raises:

ValueError: If num_rows is not equal to the length of the values in overrides.

ValueError: If no valid data frame can be found in the configured maximum number of iterations.
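The interplay between num_rows and the two overrides layouts can be illustrated without any data frame library. The sketch below uses a hypothetical helper, resolve_num_rows, that mirrors the rules stated above. Rejecting a call with neither argument, and rejecting columns of unequal length, are assumptions of this sketch rather than documented behaviour.

```python
from collections.abc import Mapping


def resolve_num_rows(num_rows=None, overrides=None):
    """Return the number of rows implied by num_rows and overrides."""
    if overrides is None:
        if num_rows is None:
            # Assumption: this sketch rejects the call; the behaviour
            # of dataframely in this case is not stated above.
            raise ValueError("provide num_rows or overrides")
        return num_rows
    if isinstance(overrides, Mapping):
        # Column layout: {"a": [1, 2], "b": ["x", "y"]}
        lengths = {len(list(values)) for values in overrides.values()}
        if len(lengths) != 1:
            raise ValueError("override columns must share one length")
        override_rows = lengths.pop()
    else:
        # Row layout: [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]
        override_rows = len(overrides)
    if num_rows is not None and num_rows != override_rows:
        raise ValueError("num_rows does not match the length of overrides")
    return override_rows
```

Both resolve_num_rows(overrides={"a": [1, 2]}) and resolve_num_rows(overrides=[{"a": 1}, {"a": 2}]) return 2, since the two layouts describe the same two rows.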

Attention:

Be aware that, due to sampling in a loop, the runtime of this method can be significant for complex schemas. Consider passing a seeded generator and evaluating whether the runtime impact on your tests is acceptable. Alternatively, it can be beneficial to provide custom column overrides for columns associated with complex validation rules.
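The fuzzy sampling loop described above can be pictured as rejection sampling with an iteration cap. The following is a minimal standard-library sketch under stated assumptions, not dataframely's algorithm: sample_rows, the "id" primary key column, the "value" column and the max_iterations parameter are all hypothetical, and rows are plain dictionaries rather than a data frame.

```python
import random


def sample_rows(num_rows, rule, *, max_iterations=100, seed=None):
    """Collect num_rows rows with a unique "id" that all satisfy rule."""
    rng = random.Random(seed)  # a seeded generator makes runs repeatable
    accepted = {}  # primary key -> row; dict keys enforce uniqueness
    for _round in range(max_iterations):
        # Draw one batch of candidate rows per sampling round.
        for _candidate in range(num_rows):
            row = {"id": rng.randrange(1000), "value": rng.random()}
            # Keep only rows that pass the rule and have an unused key.
            if rule(row) and row["id"] not in accepted:
                accepted[row["id"]] = row
        if len(accepted) >= num_rows:
            return list(accepted.values())[:num_rows]
    raise ValueError("no valid rows found within max_iterations")


rows = sample_rows(5, lambda row: row["value"] > 0.5, seed=0)
```

A stricter rule rejects more candidates per round, so more rounds are needed. This is why a cap of one round is only reliable when nothing can be rejected.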

classmethod scan_delta(source: str | Path | deltalake.DeltaTable, *, validation: Validation = 'warn', **kwargs: Any) → LazyFrame[Self]

Lazily read a Delta Lake table into a typed data frame with this schema.

Compared to polars.scan_delta(), this method checks the table’s metadata and runs validation if necessary to ensure that the data matches this schema.

Args:

source: Path or DeltaTable object from which to read the data.

validation: The strategy for running validation when reading the data:

  • "allow": The method tries to read the table’s metadata. If the stored schema matches this schema, the data frame is read without validation. If the stored schema does not match this schema or no schema information can be found in the metadata, this method automatically runs validate() with cast=True.

  • "warn": The method behaves like "allow". However, it prints a warning if validation is necessary.

  • "forbid": The method never runs validation automatically and only returns if the schema stored in the table’s metadata matches this schema.

  • "skip": The method never runs validation and simply reads the table, trusting the user that the schema is valid. _Use this option carefully and consider replacing it with polars.scan_delta() to convey the purpose better_.

kwargs: Additional keyword arguments passed directly to polars.scan_delta().

Returns:

The lazy data frame with this schema.

Raises:

ValidationRequiredError: If no schema information can be read from the source and validation is set to "forbid".

Attention:

Schema metadata is stored as custom commit metadata. Only the schema information from the last commit is used, so any table modification that is not made through dataframely results in the metadata being lost.

Be aware that appending to an existing table via mode="append" may violate group constraints in a way that dataframely cannot catch without re-validating. Only use appends if you are certain that they do not break your schema.

This method suffers from the same limitations as serialize().

classmethod scan_parquet(source: str | Path | IO[bytes] | bytes | list[str] | list[Path] | list[IO[bytes]] | list[bytes], *, validation: Literal['allow', 'forbid', 'warn', 'skip'] = 'warn', **kwargs: Any) → LazyFrame[Self]

Lazily read a parquet file into a typed data frame with this schema.

Compared to polars.scan_parquet(), this method checks the parquet file’s metadata and runs validation if necessary to ensure that the data matches this schema.

Args:

source: Path, directory, or file-like object from which to read the data.

validation: The strategy for running validation when reading the data:

  • "allow": The method tries to read the parquet file’s metadata. If the stored schema matches this schema, the data frame is read without validation. If the stored schema does not match this schema or no schema information can be found in the metadata, this method automatically runs validate() with cast=True.

  • "warn": The method behaves like "allow". However, it prints a warning if validation is necessary.

  • "forbid": The method never runs validation automatically and only returns if the schema stored in the parquet file’s metadata matches this schema.

  • "skip": The method never runs validation and simply reads the parquet file, trusting the user that the schema is valid. _Use this option carefully and consider replacing it with polars.scan_parquet() to convey the purpose better_.

kwargs: Additional keyword arguments passed directly to polars.scan_parquet().

Returns:

The lazy data frame with this schema.

Raises:

ValidationRequiredError: If no schema information can be read from the source and validation is set to "forbid".

Note:

Due to current limitations in dataframely, this method actually reads the parquet file into memory if validation is "warn" or "allow" and validation is required.

Attention:

Be aware that this method suffers from the same limitations as serialize().

classmethod serialize() → str

Serialize this schema to a JSON string.

Returns:

The serialized schema.

Note:

Serialization within dataframely itself will remain backwards-compatible at least within a major version. Until further notice, it will also be backwards-compatible across major versions.

Attention:

Serialization of polars expressions is not guaranteed to be stable across versions of polars. This affects schemas that define custom rules or columns with custom checks: a schema serialized with one version of polars may not be deserializable with another version of polars.

Attention:

This functionality is considered unstable. It may be changed at any time without it being considered a breaking change.

Raises:

TypeError: If any column contains metadata that is not JSON-serializable.

ValueError: If any column is not a “native” dataframely column type but a custom subclass.
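The TypeError case above follows from how JSON encoding works in Python: json.dumps raises TypeError for values it cannot encode, such as a set. The sketch below shows this with a hypothetical serialize_columns helper and a made-up payload layout. It does not reproduce dataframely's actual serialization format.

```python
import json


def serialize_columns(columns):
    """Serialize a mapping of column name -> settings to a JSON string."""
    # sort_keys makes the output independent of dict insertion order.
    return json.dumps({"columns": columns}, sort_keys=True)


# Plain numbers, booleans and strings serialize without problems.
serialized = serialize_columns(
    {"some_decimal": {"precision": 12, "scale": 8, "nullable": True}}
)

# A set inside the column metadata has no JSON representation.
try:
    serialize_columns({"x": {"metadata": {"tags": {"a", "b"}}}})
except TypeError as exc:
    error_message = str(exc)
```

Keeping column metadata limited to strings, numbers, booleans, None, lists and dictionaries avoids this error.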

classmethod sink_parquet(lf: LazyFrame[Self], /, file: str | Path | IO[bytes] | PartitioningScheme, **kwargs: Any) → None

Stream a typed lazy frame with this schema to a parquet file.

This method automatically adds a serialization of this schema to the parquet file as metadata. This metadata can be leveraged by read_parquet() and scan_parquet() for more efficient reading, or by external tools.

Args:

lf: The lazy frame to write to the parquet file.

file: The file path, writable file-like object, or partitioning scheme to which to write the parquet file.

kwargs: Additional keyword arguments passed directly to polars.LazyFrame.sink_parquet(). metadata may only be provided if it is a dictionary.

Attention:

Be aware that this method suffers from the same limitations as serialize().

some_decimal = Decimal(precision=12, scale=8, nullable=True)

classmethod sql_schema(dialect: sa.Dialect) → list[sa.Column]

Obtain the SQL schema for a particular dialect for this schema.

Args:

dialect: The dialect for which to obtain the SQL schema. Note that column datatypes may differ across dialects.

Returns:

A list of sqlalchemy columns that can be used to create a table with the schema as defined by this class.

classmethod validate(df: DataFrame | LazyFrame, /, *, cast: bool = False) → DataFrame[Self]

Validate that a data frame satisfies the schema.

Args:

df: The data frame to validate.

cast: Whether columns with a wrong data type in the input data frame are cast to the schema’s defined data type if possible.

Returns:

The (collected) input data frame, wrapped in a generic version of the input’s data frame type to reflect schema adherence. The data frame is guaranteed to maintain its order.

Raises:

ValidationError: If the input data frame does not satisfy the schema definition.

Note:

This method _always_ collects the input data frame in order to raise potential validation errors.
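The effect of the cast flag can be illustrated with a toy validator over plain Python rows. This is only a sketch of the idea: validate_rows is a hypothetical helper, the ValidationError class is a local stand-in, and the checks run per value, whereas a data frame library works on whole columns.

```python
class ValidationError(Exception):
    """Local stand-in for the error named in the documentation above."""


def validate_rows(rows, schema, *, cast=False):
    """Check each row against a mapping of column name -> Python type."""
    validated = []
    for index, row in enumerate(rows):  # input order is preserved
        checked = {}
        for name, expected in schema.items():
            if name not in row:
                raise ValidationError(f"row {index}: missing column {name!r}")
            value = row[name]
            if not isinstance(value, expected):
                if not cast:
                    raise ValidationError(f"row {index}: wrong type for {name!r}")
                try:
                    # With cast=True, attempt a conversion first.
                    value = expected(value)
                except (TypeError, ValueError) as exc:
                    raise ValidationError(f"row {index}: cannot cast {name!r}") from exc
            checked[name] = value
        validated.append(checked)
    return validated
```

Here validate_rows([{"a": "1"}], {"a": int}) raises ValidationError, while the same call with cast=True returns [{"a": 1}].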

classmethod write_delta(df: DataFrame[Self], /, target: str | Path | deltalake.DeltaTable, **kwargs: Any) → None

Write a typed data frame with this schema to a Delta Lake table.

This method automatically adds a serialization of this schema to the Delta Lake table as metadata. The metadata can be leveraged by read_delta() and scan_delta() for efficient reading or by external tools.

Args:

df: The data frame to write to the Delta Lake table.

target: The path or DeltaTable object to which to write the data.

kwargs: Additional keyword arguments passed directly to polars.DataFrame.write_delta().

Attention:

This method suffers from the same limitations as serialize().

Schema metadata is stored as custom commit metadata. Only the schema information from the last commit is used, so any table modification that is not made through dataframely results in the metadata being lost.

Be aware that appending to an existing table via mode="append" may violate group constraints in a way that dataframely cannot catch without re-validating. Only use appends if you are certain that they do not break your schema.
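The "last commit only" behaviour described above can be pictured with a toy commit log. The sketch below models commits as plain dictionaries. The SCHEMA_KEY name, the commit layout and the stored_schema helper are hypothetical and do not reflect how Delta Lake or dataframely actually store this information.

```python
SCHEMA_KEY = "schema"  # hypothetical metadata key, for illustration only


def stored_schema(commits):
    """Return the schema recorded in the most recent commit, if any."""
    if not commits:
        return None
    # Only the last commit is consulted; older commits are ignored.
    return commits[-1].get("metadata", {}).get(SCHEMA_KEY)


# A write that records the serialized schema in its commit metadata.
history = [{"operation": "WRITE", "metadata": {SCHEMA_KEY: '{"columns": {}}'}}]
before = stored_schema(history)

# A later write by another tool adds a commit without the schema entry.
history.append({"operation": "WRITE", "metadata": {}})
after = stored_schema(history)
```

After the second commit the lookup returns None, even though an earlier commit still carries the schema. A read with validation set to "forbid" would then fail.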

classmethod write_parquet(df: DataFrame[Self], /, file: str | Path | IO[bytes], **kwargs: Any) → None

Write a typed data frame with this schema to a parquet file.

This method automatically adds a serialization of this schema to the parquet file as metadata. This metadata can be leveraged by read_parquet() and scan_parquet() for more efficient reading, or by external tools.

Args:

df: The data frame to write to the parquet file.

file: The file path or writable file-like object to which to write the parquet file. This should be a path to a directory if writing a partitioned dataset.

kwargs: Additional keyword arguments passed directly to polars.DataFrame.write_parquet(). metadata may only be provided if it is a dictionary.

Attention:

Be aware that this method suffers from the same limitations as serialize().
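The restriction that metadata must be a dictionary is easier to see when the serialized schema has to be merged into user-supplied metadata. The following sketch shows one way such a merge could work. The build_metadata helper, the SCHEMA_KEY name, the TypeError and the rule that the schema entry wins over a user entry with the same key are all assumptions, not documented dataframely behaviour.

```python
SCHEMA_KEY = "schema"  # hypothetical metadata key, for illustration only


def build_metadata(serialized_schema, user_metadata=None):
    """Combine user-supplied file metadata with the serialized schema."""
    if user_metadata is None:
        user_metadata = {}
    if not isinstance(user_metadata, dict):
        # Other metadata forms cannot be merged key by key.
        raise TypeError("metadata must be a dictionary")
    # Assumption: the schema entry takes precedence over a user entry
    # stored under the same key.
    return {**user_metadata, SCHEMA_KEY: serialized_schema}


metadata = build_metadata('{"columns": {}}', {"author": "me"})
```

The resulting dictionary keeps the "author" entry and adds the schema entry next to it.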