dataframely.failure module
- class dataframely.failure.FailureInfo(lf: LazyFrame, rule_columns: list[str], schema: type[S])
Bases: Generic[S]
A container carrying information about rows failing validation in Schema.filter().
Methods
cooccurrence_counts(): The number of validation failures per co-occurring rule validation failure.
counts(): The number of validation failures for each individual rule.
invalid(): The rows of the original data frame that failed validation.
read_delta(source, **kwargs): Read a delta lake table with the failure info.
read_parquet(source, **kwargs): Read a parquet file with the failure info.
scan_delta(source, **kwargs): Lazily read a delta lake table with the failure info.
scan_parquet(source, **kwargs): Lazily read a parquet file with the failure info.
sink_parquet(file, **kwargs): Stream the failure info to a parquet file.
write_delta(target, **kwargs): Write the failure info to a delta lake table.
write_parquet(file, **kwargs): Write the failure info to a parquet file.
- cooccurrence_counts() -> dict[frozenset[str], int]
The number of validation failures per co-occurring rule validation failure.
In contrast to counts(), this method provides additional information on whether a rule often fails because of another rule failing.
- Returns:
A mapping from (1) each set of co-occurring rule validation failures to (2) the count of such failures.
- Attention:
This method should primarily be used for debugging as it is much slower than counts().
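The difference between the two tallies can be illustrated without the library. The following is a minimal plain-Python sketch, not dataframely's implementation: the rule names and the `failed_rules_per_row` data are hypothetical, and each entry stands for the set of rules one invalid row failed.

```python
from collections import Counter

# Hypothetical input: for each invalid row, the set of rules that row failed.
failed_rules_per_row = [
    {"primary_key"},
    {"a|min", "b|nullability"},
    {"a|min", "b|nullability"},
    {"a|min"},
]

# Analogue of counts(): one tally per individual rule.
# Rules that never fail simply do not appear in the result.
per_rule = Counter(rule for rules in failed_rules_per_row for rule in rules)

# Analogue of cooccurrence_counts(): one tally per distinct set of rules
# that failed together on the same row.
per_combination = Counter(frozenset(rules) for rules in failed_rules_per_row)

print(dict(per_rule))
print(dict(per_combination))
```

In this made-up data, the per-rule tally only shows that `b|nullability` failed twice, while the per-combination tally shows that it never failed on its own and always appeared together with `a|min`.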
- counts() -> dict[str, int]
The number of validation failures for each individual rule.
- Returns:
A mapping from rule name to counts. If a rule’s failure count is 0, it is not included here.
- classmethod read_delta(source: str | Path | deltalake.DeltaTable, **kwargs: Any) -> FailureInfo[Schema]
Read a delta lake table with the failure info.
- Args:
- source: Path or DeltaTable from which to read the data.
- kwargs: Additional keyword arguments passed directly to polars.read_delta().
- Returns:
The failure info object.
- Raises:
ValueError: If no appropriate metadata can be found.
- Attention:
Be aware that this method suffers from the same limitations as Schema.serialize().
- classmethod read_parquet(source: str | Path | IO[bytes], **kwargs: Any) -> FailureInfo[Schema]
Read a parquet file with the failure info.
- Args:
- source: Path, directory, or file-like object from which to read the data.
- kwargs: Additional keyword arguments passed directly to polars.read_parquet().
- Returns:
The failure info object.
- Raises:
ValueError: If no appropriate metadata can be found.
- Attention:
Be aware that this method suffers from the same limitations as Schema.serialize().
- classmethod scan_delta(source: str | Path | deltalake.DeltaTable, **kwargs: Any) -> FailureInfo[Schema]
Lazily read a delta lake table with the failure info.
- Args:
- source: Path or DeltaTable from which to read the data.
- kwargs: Additional keyword arguments passed directly to polars.scan_delta().
- Returns:
The failure info object.
- Raises:
ValueError: If no appropriate metadata can be found.
- Attention:
Be aware that this method suffers from the same limitations as Schema.serialize().
- classmethod scan_parquet(source: str | Path | IO[bytes], **kwargs: Any) -> FailureInfo[Schema]
Lazily read a parquet file with the failure info.
- Args:
- source: Path, directory, or file-like object from which to read the data.
- kwargs: Additional keyword arguments passed directly to polars.scan_parquet().
- Returns:
The failure info object.
- Raises:
ValueError: If no appropriate metadata can be found.
- Attention:
Be aware that this method suffers from the same limitations as Schema.serialize().
- schema: type[S]
The schema used to create the input data frame.
- sink_parquet(file: str | Path | IO[bytes] | PartitioningScheme, **kwargs: Any) -> None
Stream the failure info to a parquet file.
- Args:
- file: The file path or writable file-like object to which to write the parquet file. This should be a path to a directory if writing a partitioned dataset.
- kwargs: Additional keyword arguments passed directly to polars.sink_parquet(). metadata may only be provided if it is a dictionary.
- Attention:
Be aware that this method suffers from the same limitations as Schema.serialize().
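One plausible reason for the dictionary-only restriction on metadata is that the failure info has to store its own metadata in the same file, which the read methods later look for. The sketch below illustrates that idea in plain Python under stated assumptions: `build_metadata`, the `FAILURE_INFO_KEY` name, the JSON encoding, and the raised `ValueError` are all hypothetical and are not taken from dataframely.

```python
import json
from typing import Any

# Hypothetical key name; the key dataframely actually uses is not documented here.
FAILURE_INFO_KEY = "failure_info"


def build_metadata(user_metadata: Any, failure_info: dict[str, Any]) -> dict[str, str]:
    """Merge user-supplied metadata with the library's own entry.

    Assumption: only a plain dictionary can be merged key by key, so any
    other kind of value is rejected (the exception type is illustrative).
    """
    if user_metadata is None:
        user_metadata = {}
    if not isinstance(user_metadata, dict):
        raise ValueError("metadata must be a dictionary")
    # The library's entry is added last so it cannot be overwritten by the user.
    return {**user_metadata, FAILURE_INFO_KEY: json.dumps(failure_info)}


merged = build_metadata({"owner": "data-team"}, {"rule_columns": ["a|min"]})
print(sorted(merged))
```

With this approach the user's keys and the library's entry end up side by side in the file metadata, which would not be possible if the caller passed something other than a dictionary.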
- write_delta(target: str | Path | deltalake.DeltaTable, **kwargs: Any) -> None
Write the failure info to a delta lake table.
- Args:
- target: The file path or DeltaTable to which to write the delta lake data.
- kwargs: Additional keyword arguments passed directly to polars.write_delta().
- Attention:
Be aware that this method suffers from the same limitations as Schema.serialize().
- write_parquet(file: str | Path | IO[bytes], **kwargs: Any) -> None
Write the failure info to a parquet file.
- Args:
- file: The file path or writable file-like object to which to write the parquet file. This should be a path to a directory if writing a partitioned dataset.
- kwargs: Additional keyword arguments passed directly to polars.write_parquet(). metadata may only be provided if it is a dictionary.
- Attention:
Be aware that this method suffers from the same limitations as Schema.serialize().