dataframely.failure module

class dataframely.failure.FailureInfo(lf: LazyFrame, rule_columns: list[str], schema: type[S])

Bases: Generic[S]

A container carrying information about rows failing validation in Schema.filter().

Methods

`cooccurrence_counts`()	The number of validation failures for each set of rules that fail together.
`counts`()	The number of validation failures for each individual rule.
`invalid`()	The rows of the original data frame that failed validation.
`read_delta`(source, **kwargs)	Read a delta lake table with the failure info.
`read_parquet`(source, **kwargs)	Read a parquet file with the failure info.
`scan_delta`(source, **kwargs)	Lazily read a delta lake table with the failure info.
`scan_parquet`(source, **kwargs)	Lazily read a parquet file with the failure info.
`sink_parquet`(file, **kwargs)	Stream the failure info to a parquet file.
`write_delta`(target, **kwargs)	Write the failure info to a delta lake table.
`write_parquet`(file, **kwargs)	Write the failure info to a parquet file.

cooccurrence_counts() → dict[frozenset[str], int]

The number of validation failures for each set of rules that fail together.

In contrast to counts(), this method shows which rules fail together, which helps to tell whether a rule often fails because another rule fails.

Returns: A mapping from each set of co-occurring rule failures to the count of such failures.
Attention: This method should primarily be used for debugging, as it is much slower than counts().
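
The difference between the two views can be sketched in plain Python. This is only an illustration of the semantics described above, not dataframely's implementation: the rule names and the per-row failure sets are made up, and the sketch assumes that co-occurrence is determined per invalid row.

```python
from collections import Counter

# Hypothetical input: for each invalid row, the set of rules it violated.
# The rule names are illustrative and do not come from dataframely.
failed_rules_per_row = [
    {"rule_a"},
    {"rule_a", "rule_b"},
    {"rule_a", "rule_b"},
    {"rule_c"},
]

# Analogue of counts(): one tally per individual rule.
counts = Counter(rule for rules in failed_rules_per_row for rule in rules)

# Analogue of cooccurrence_counts(): one tally per distinct set of rules
# that failed together.
cooccurrence = Counter(frozenset(rules) for rules in failed_rules_per_row)

# Share of "rule_b" failures that coincide with a "rule_a" failure.
together = sum(
    n for rules, n in cooccurrence.items() if {"rule_a", "rule_b"} <= rules
)
share = together / counts["rule_b"]
```

In this made-up data every failure of `rule_b` coincides with a failure of `rule_a`, which is the kind of pattern the co-occurrence view exposes and the per-rule counts alone do not.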

counts() → dict[str, int]

The number of validation failures for each individual rule.

Returns: A mapping from rule name to counts. If a rule’s failure count is 0, it is not included here.
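
Because rules without failures are absent from the mapping, direct indexing can raise a KeyError. A minimal sketch of a defensive lookup follows, using a hand-written dictionary in place of a real counts() result; the rule names are hypothetical.

```python
# Stand-in for the return value of counts(); rule names are illustrative.
counts = {"rule_a": 3, "rule_b": 2}

# Rules we want to report on, including one that never failed.
rules_of_interest = ["rule_a", "rule_b", "rule_c"]

# Rules with zero failures are missing from the mapping, so default to 0.
report = {rule: counts.get(rule, 0) for rule in rules_of_interest}
```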

invalid() → DataFrame

The rows of the original data frame that failed validation.

classmethod read_delta(source: str | Path | deltalake.DeltaTable, **kwargs: Any) → FailureInfo[Schema]

Read a delta lake table with the failure info.

Args:

source: Path or DeltaTable from which to read the data.
kwargs: Additional keyword arguments passed directly to polars.read_delta().

Returns: The failure info object.
Raises: ValueError: If no appropriate metadata can be found.
Attention: Be aware that this method suffers from the same limitations as Schema.serialize().

classmethod read_parquet(source: str | Path | IO[bytes], **kwargs: Any) → FailureInfo[Schema]

Read a parquet file with the failure info.

Args:

source: Path, directory, or file-like object from which to read the data.
kwargs: Additional keyword arguments passed directly to polars.read_parquet().

Returns: The failure info object.
Raises: ValueError: If no appropriate metadata can be found.
Attention: Be aware that this method suffers from the same limitations as Schema.serialize().

classmethod scan_delta(source: str | Path | deltalake.DeltaTable, **kwargs: Any) → FailureInfo[Schema]

Lazily read a delta lake table with the failure info.

Args:

source: Path or DeltaTable from which to read the data.
kwargs: Additional keyword arguments passed directly to polars.scan_delta().

Returns: The failure info object.
Raises: ValueError: If no appropriate metadata can be found.
Attention: Be aware that this method suffers from the same limitations as Schema.serialize().

classmethod scan_parquet(source: str | Path | IO[bytes], **kwargs: Any) → FailureInfo[Schema]

Lazily read a parquet file with the failure info.

Args:

source: Path, directory, or file-like object from which to read the data.
kwargs: Additional keyword arguments passed directly to polars.scan_parquet().

Returns: The failure info object.
Raises: ValueError: If no appropriate metadata can be found.
Attention: Be aware that this method suffers from the same limitations as Schema.serialize().

sink_parquet(file: str | Path | IO[bytes] | PartitioningScheme, **kwargs: Any) → None

Stream the failure info to a parquet file.

Args:

file: The file path or writable file-like object to which to write the parquet file. This should be a path to a directory if writing a partitioned dataset.
kwargs: Additional keyword arguments passed directly to polars.LazyFrame.sink_parquet(). metadata may only be provided if it is a dictionary.

Attention: Be aware that this method suffers from the same limitations as Schema.serialize().

write_delta(target: str | Path | deltalake.DeltaTable, **kwargs: Any) → None

Write the failure info to a delta lake table.

Args:

target: The file path or DeltaTable to which to write the delta lake data.
kwargs: Additional keyword arguments passed directly to polars.DataFrame.write_delta().

Attention: Be aware that this method suffers from the same limitations as Schema.serialize().

write_parquet(file: str | Path | IO[bytes], **kwargs: Any) → None

Write the failure info to a parquet file.

Args:

file: The file path or writable file-like object to which to write the parquet file. This should be a path to a directory if writing a partitioned dataset.
kwargs: Additional keyword arguments passed directly to polars.DataFrame.write_parquet(). metadata may only be provided if it is a dictionary.

Attention: Be aware that this method suffers from the same limitations as Schema.serialize().