Migrating from v1 to v2#

Dataframely v2 introduces several improvements and some breaking changes to streamline the API.

Improvements#

Lazy Validation#

Dataframely v2 finally implements lazy validation and filtering using a custom polars plugin. This allows Schema.validate() and Schema.filter() to be used within lazy computation graphs instead of forcing a collect. More details can be found in the dedicated guide.

Lazy `scan` operations#

With lazy validation, all scan_* methods (e.g. Schema.scan_parquet(), Collection.scan_delta(), …) are now truly lazy, even if validation is necessary. Previously, this required collecting the input and running validation eagerly.

S3 Support in all I/O functions#

Dataframely v2 now properly supports S3 for all I/O functions (i.e. write_*, sink_*, read_*, scan_*).

Breaking Changes#

Columns are non-nullable by default#

In dataframely v1, specifying a column without setting the nullable property caused the column to be nullable and a warning was emitted. In dataframely v2, this changes: no warning is emitted anymore and nullable defaults to False. This mirrors the typical expectation that a column is not nullable (because null values often indicate issues) – nullability now becomes opt-in.

Primary key columns may not be nullable#

While dataframely v1 merely emitted a warning, dataframely v2 now raises an exception if a primary key column is designated as nullable. This aligns dataframely with, for example, SQL, where primary key columns may not be nullable.

Schema rules are now defined as classmethods#

In order to allow schema rules to access information about the schema and, especially, information of a schema’s subclasses, schema rules must now be specified as classmethods. This means:

class MySchema(dy.Schema):
    ...

    @dy.rule()
    def my_rule() -> pl.Expr:
        ...

turns into

class MySchema(dy.Schema):
    ...

    @dy.rule()
    def my_rule(cls) -> pl.Expr:
        ...

Within the schema rule, cls can now be used to access columns or other information from the schema. Specifically,

class MySchema(dy.Schema):
    a = dy.Integer()
    b = dy.Integer()

    @dy.rule()
    def my_rule() -> pl.Expr:
        return MySchema.a.col >= MySchema.b.col

can now be written as

class MySchema(dy.Schema):
    a = dy.Integer()
    b = dy.Integer()

    @dy.rule()
    def my_rule(cls) -> pl.Expr:
        return cls.a.col >= cls.b.col

To migrate your existing code without changing behavior, simply add the cls argument to the signature of your rules. If you are using ruff, you will need to add the following to your pyproject.toml for ruff to recognize @dy.rule as a decorator that turns a method into a classmethod:

[tool.ruff.lint.pep8-naming]
classmethod-decorators = ["dataframely.rule"]

Predefined checks for floats are updated#

For floating point types (Float, Float32, Float64), the allow_inf_nan option has been split into allow_inf and allow_nan so that the two can be set independently. Note that the defaults remain the same, i.e., if allow_inf_nan wasn’t set before, nothing changes.

Schema conversion functions are renamed#

The methods that allow converting a dataframely Schema into a schema of another package have been renamed to better align with the naming scheme of conversion functions in other packages:

sql_schema → to_sqlalchemy_columns
pyarrow_schema → to_pyarrow_schema
polars_schema → to_polars_schema
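For larger code bases, the renames above can be applied mechanically. The following is a rough sketch of a text-based rewrite using only the standard library. `migrate_source` is a hypothetical helper, not part of dataframely. It matches method calls by name only, so it may also touch unrelated methods that happen to share a name, and the resulting diff should be reviewed.

```python
import re

# Old v1 method name -> new v2 method name, taken from the list above.
RENAMES = {
    "sql_schema": "to_sqlalchemy_columns",
    "pyarrow_schema": "to_pyarrow_schema",
    "polars_schema": "to_polars_schema",
}

# Match only method calls such as `.polars_schema(`. Already migrated calls
# like `.to_polars_schema(` do not match because the dot must directly
# precede the old name.
_OLD_CALL = re.compile(r"\.(" + "|".join(map(re.escape, RENAMES)) + r")\(")


def migrate_source(source: str) -> str:
    """Return `source` with v1 conversion method calls renamed to v2."""
    return _OLD_CALL.sub(lambda m: "." + RENAMES[m.group(1)] + "(", source)


old = "schema = MySchema.polars_schema()\ncols = MySchema.sql_schema(dialect)\n"
new = migrate_source(old)
```

Running the helper a second time on its own output leaves the text unchanged, so it is safe to apply repeatedly during an incremental migration.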

Utility functions for collection filters are renamed and safer#

For writing collection filters, dataframely exposes two utility functions to express the 1:1 and 1:{1,N} relationships between members. These have been renamed as follows:

filter_relationship_one_to_one → require_relationship_one_to_one
filter_relationship_one_to_at_least_one → require_relationship_one_to_at_least_one

Additionally, their behavior changes: the functions now behave correctly even if primary key constraints are not enforced on the schemas. Previously, the validation result could contain duplicated input rows in that case.

If the relationships are already enforced by primary key constraints on the schemas[1], you can still specify drop_duplicates=False. This returns to the previous behavior and allows for considerable performance improvements.

Collection metadata cannot be read from `schema.json` anymore#

Prior to dataframely v1.8.0, collection metadata was serialized as a schema.json file when calling write_parquet or sink_parquet on a collection. Since dataframely v1.8.0, the metadata is stored in the individual members’ parquet metadata instead.

While dataframely v1 still supported reading the metadata from collections written with a version of dataframely prior to v1.8.0, dataframely v2 removes this support. If you still have data written with a version of dataframely earlier than v1.8.0, and, thus, still have schema.json files, you can migrate your data by reading it and writing it back to disk with any version of dataframely >=1.8.0,<2.
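To find out whether any stored collections are affected, the storage location can be searched for leftover schema.json files. The sketch below uses only the standard library and assumes that collections live on a local file system and that schema.json sits directly inside each collection directory. `find_legacy_collections` is a hypothetical helper name.

```python
from pathlib import Path


def find_legacy_collections(root: Path) -> list[Path]:
    """Return directories below `root` that still contain a schema.json file."""
    return sorted(p.parent for p in root.rglob("schema.json") if p.is_file())
```

Each directory returned this way is a candidate for being read and written back with a dataframely version >=1.8.0,<2 before upgrading to v2.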

The mypy plugin is removed entirely#

The mypy plugin in dataframely v1 had two purposes:

Ensure that a method with @dy.rule decorator is recognized as a rule
Turn non-specific return types into ones with enriched type information (e.g. dict → TypedDict)

Unfortunately, the latter was error-prone as it yielded many false positives and generally made working with these types less ergonomic. We therefore actively removed this part. With @dy.rule being applied to classmethods, the need for a custom mypy plugin is eliminated entirely. As a result, dataframely.mypy has been removed.

If you have used the mypy plugin before, you can remove the following from your pyproject.toml:

[tool.mypy]
plugins = ["dataframely.mypy"]