dataframely.failure module

class dataframely.failure.FailureInfo(lf: LazyFrame, rule_columns: list[str], schema: type[S])

Bases: Generic[S]

A container carrying information about rows failing validation in Schema.filter().

Methods

cooccurrence_counts()

The number of validation failures for each set of co-occurring rule failures.

counts()

The number of validation failures for each individual rule.

invalid()

The rows of the original data frame that failed validation.

read_delta(source, **kwargs)

Read a delta lake table with the failure info.

read_parquet(source, **kwargs)

Read a parquet file with the failure info.

scan_delta(source, **kwargs)

Lazily read a delta lake table with the failure info.

scan_parquet(source, **kwargs)

Lazily read a parquet file with the failure info.

sink_parquet(file, **kwargs)

Stream the failure info to a parquet file.

write_delta(target, **kwargs)

Write the failure info to a delta lake table.

write_parquet(file, **kwargs)

Write the failure info to a parquet file.

cooccurrence_counts() → dict[frozenset[str], int]

The number of validation failures for each set of co-occurring rule failures.

In contrast to counts(), this method provides additional information on whether a rule often fails because of another rule failing.

Returns:

A mapping from (1) each set of co-occurring rule validation failures to (2) the count of such failures.

Attention:

This method should primarily be used for debugging as it is much slower than counts().
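The difference between the two kinds of counts can be sketched with the standard library alone. This is an illustration of the idea, not the library's implementation: the rule names and the per-row failure sets below are made up, and the sketch assumes that each invalid row is counted under the exact set of rules it fails.

```python
from collections import Counter

# Hypothetical input: for each invalid row, the set of rule names that failed.
failed_rules_per_row = [
    {"price_positive"},
    {"price_positive", "discount_below_price"},
    {"price_positive", "discount_below_price"},
    {"name_not_empty"},
]

# Per-rule counts: a rule is counted once for every row on which it fails.
per_rule = Counter(rule for rules in failed_rules_per_row for rule in rules)

# Co-occurrence counts: every row contributes once to the combination of
# rules it fails. frozenset is hashable, so it can serve as a dict key.
per_combination = Counter(frozenset(rules) for rules in failed_rules_per_row)

print(dict(per_rule))
print(dict(per_combination))
```

In this made-up data, discount_below_price never fails on its own, only together with price_positive. The per-rule counts cannot reveal that, whereas the co-occurrence counts can.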

counts() → dict[str, int]

The number of validation failures for each individual rule.

Returns:

A mapping from rule name to counts. If a rule’s failure count is 0, it is not included here.

invalid() → DataFrame

The rows of the original data frame that failed validation.

classmethod read_delta(source: str | Path | deltalake.DeltaTable, **kwargs: Any) → FailureInfo[Schema]

Read a delta lake table with the failure info.

Args:

source: Path or DeltaTable from which to read the data.

kwargs: Additional keyword arguments passed directly to polars.read_delta().

Returns:

The failure info object.

Raises:

ValueError: If no appropriate metadata can be found.

Attention:

Be aware that this method suffers from the same limitations as Schema.serialize().

classmethod read_parquet(source: str | Path | IO[bytes], **kwargs: Any) → FailureInfo[Schema]

Read a parquet file with the failure info.

Args:

source: Path, directory, or file-like object from which to read the data.

kwargs: Additional keyword arguments passed directly to polars.read_parquet().

Returns:

The failure info object.

Raises:

ValueError: If no appropriate metadata can be found.

Attention:

Be aware that this method suffers from the same limitations as Schema.serialize().

classmethod scan_delta(source: str | Path | deltalake.DeltaTable, **kwargs: Any) → FailureInfo[Schema]

Lazily read a delta lake table with the failure info.

Args:

source: Path or DeltaTable from which to read the data.

kwargs: Additional keyword arguments passed directly to polars.scan_delta().

Returns:

The failure info object.

Raises:

ValueError: If no appropriate metadata can be found.

Attention:

Be aware that this method suffers from the same limitations as Schema.serialize().

classmethod scan_parquet(source: str | Path | IO[bytes], **kwargs: Any) → FailureInfo[Schema]

Lazily read a parquet file with the failure info.

Args:

source: Path, directory, or file-like object from which to read the data.

kwargs: Additional keyword arguments passed directly to polars.scan_parquet().

Returns:

The failure info object.

Raises:

ValueError: If no appropriate metadata can be found.

Attention:

Be aware that this method suffers from the same limitations as Schema.serialize().

schema: type[S]

The schema used to create the input data frame.

sink_parquet(file: str | Path | IO[bytes] | PartitioningScheme, **kwargs: Any) → None

Stream the failure info to a parquet file.

Args:

file: The file path or writable file-like object to which to write the parquet file. This should be a path to a directory if writing a partitioned dataset.

kwargs: Additional keyword arguments passed directly to polars.LazyFrame.sink_parquet(). The metadata argument may only be provided if it is a dictionary.

Attention:

Be aware that this method suffers from the same limitations as Schema.serialize().

write_delta(target: str | Path | deltalake.DeltaTable, **kwargs: Any) → None

Write the failure info to a delta lake table.

Args:

target: The file path or DeltaTable to which to write the delta lake data.

kwargs: Additional keyword arguments passed directly to polars.DataFrame.write_delta().

Attention:

Be aware that this method suffers from the same limitations as Schema.serialize().

write_parquet(file: str | Path | IO[bytes], **kwargs: Any) → None

Write the failure info to a parquet file.

Args:

file: The file path or writable file-like object to which to write the parquet file. This should be a path to a directory if writing a partitioned dataset.

kwargs: Additional keyword arguments passed directly to polars.DataFrame.write_parquet(). The metadata argument may only be provided if it is a dictionary.

Attention:

Be aware that this method suffers from the same limitations as Schema.serialize().