Migrating from v1 to v2#
Dataframely v2 introduces several improvements and some breaking changes to streamline the API.
Improvements#
Lazy Validation#
Dataframely v2 finally implements lazy validation and filtering using a custom polars plugin. This allows
Schema.validate() and Schema.filter() to be used within lazy computation graphs instead of forcing a
collect. More details can be found in the dedicated guide.
Lazy scan operations#
With lazy validation, all scan_* methods (e.g. Schema.scan_parquet(), Collection.scan_delta(), …) are
now truly lazy, even if validation is necessary. Previously, this required collecting the input and running validation
eagerly.
S3 Support in all I/O functions#
Dataframely v2 now properly supports S3 for all I/O functions (i.e. write_*, sink_*, read_*, scan_*).
Breaking Changes#
Columns are non-nullable by default#
In dataframely v1, specifying a column without setting the nullable property caused the column to be nullable and a
warning was emitted. In dataframely v2, this changes: no warning is emitted anymore and nullable defaults to False.
This mirrors the typical expectation that a column is not nullable (because null values often indicate issues) –
nullability now becomes opt-in.
Primary key columns may not be nullable#
While dataframely v1 merely emitted a warning, dataframely v2 now raises an exception if a primary key is designated as nullable. This aligns dataframely, for example, with SQL where primary key columns may not be nullable.
Schema rules are now defined as classmethods#
In order to allow schema rules to access information about the schema and, especially, information of a schema’s subclasses, schema rules must now be specified as classmethods. This means:
class MySchema(dy.Schema):
    ...

    @dy.rule()
    def my_rule() -> pl.Expr:
        ...
turns into
class MySchema(dy.Schema):
    ...

    @dy.rule()
    def my_rule(cls) -> pl.Expr:
        ...
Within the schema rule, cls can now be used to access columns or other information from the schema. Specifically,
class MySchema(dy.Schema):
    a = dy.Integer()
    b = dy.Integer()

    @dy.rule()
    def my_rule() -> pl.Expr:
        return MySchema.a.col >= MySchema.b.col
can now be written as
class MySchema(dy.Schema):
    a = dy.Integer()
    b = dy.Integer()

    @dy.rule()
    def my_rule(cls) -> pl.Expr:
        return cls.a.col >= cls.b.col
To migrate your existing code without changing behavior, simply add the cls argument to the signature of your rules.
If you are using ruff, you will need to add the following to your pyproject.toml for
ruff to recognize @dy.rule as a decorator that turns a method into a classmethod:
[tool.ruff.lint.pep8-naming]
classmethod-decorators = ["dataframely.rule"]
Predefined checks for floats are updated#
For floating point types (Float, Float32, Float64),
the allow_inf_nan option has been split into allow_inf and allow_nan, so that the two can be set
independently. Note that the defaults remain the same, i.e., if allow_inf_nan wasn’t set before, nothing changes.
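If column definitions are built from keyword dictionaries (for example, loaded from configuration), the split can be handled with a small shim. The helper below is hypothetical and not part of dataframely. It assumes that the old flag governed infinities and NaN together, so both new flags inherit its value.

```python
from typing import Any


def split_allow_inf_nan(kwargs: dict[str, Any]) -> dict[str, Any]:
    """Translate v1-style float column keyword arguments to the v2 names."""
    migrated = dict(kwargs)  # copy so the caller's dict is not mutated
    if "allow_inf_nan" in migrated:
        value = migrated.pop("allow_inf_nan")
        # Assumption: the old flag covered inf and NaN together,
        # so both new flags inherit its value.
        migrated["allow_inf"] = value
        migrated["allow_nan"] = value
    return migrated


old_kwargs = {"nullable": True, "allow_inf_nan": True}
print(split_allow_inf_nan(old_kwargs))
```

The result can then be passed on to a float column, e.g. `dy.Float64(**split_allow_inf_nan(old_kwargs))`. Dictionaries that never set `allow_inf_nan` pass through unchanged, matching the note that defaults remain the same.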
Schema conversion functions are renamed#
The methods that allow converting a dataframely Schema into a schema of another package have been
renamed to better align with the naming scheme of conversion functions in other packages:
sql_schema → to_sqlalchemy_columns
pyarrow_schema → to_pyarrow_schema
polars_schema → to_polars_schema
Utility functions for collection filters are renamed and safer#
For writing collection filters, dataframely exposes two utility functions to express the 1:1 and 1:{1,N}
relationships between members. These have been renamed as follows:
filter_relationship_one_to_one → require_relationship_one_to_one
filter_relationship_one_to_at_least_one → require_relationship_one_to_at_least_one
Additionally, their behavior changes: the functions now behave correctly even if primary key constraints are not enforced on the schemas. Previously, the validation result could contain duplicated input rows.
If the relationships are already enforced by primary key constraints on the schemas[1], you can still specify
drop_duplicates=False. This returns to the previous behavior and allows for considerable performance improvements.
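To locate usages of the renamed schema conversion methods and collection filter utilities in a codebase, a plain text scan is often enough. The following sketch uses only the standard library. The function names `find_old_names` and `scan_tree` are hypothetical, and because the scan is purely textual it may also flag unrelated identifiers that happen to share an old name.

```python
import re
from pathlib import Path

# Old name -> new name, taken from the renames listed in this guide.
RENAMES = {
    "sql_schema": "to_sqlalchemy_columns",
    "pyarrow_schema": "to_pyarrow_schema",
    "polars_schema": "to_polars_schema",
    "filter_relationship_one_to_one": "require_relationship_one_to_one",
    "filter_relationship_one_to_at_least_one": "require_relationship_one_to_at_least_one",
}

# Word boundaries keep e.g. "to_polars_schema" from matching "polars_schema".
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def find_old_names(source: str) -> list[tuple[int, str, str]]:
    """Return (line number, old name, new name) for each v1 name in `source`."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            old = match.group(1)
            hits.append((lineno, old, RENAMES[old]))
    return hits


def scan_tree(root: Path) -> dict[Path, list[tuple[int, str, str]]]:
    """Scan every .py file below `root` and collect the hits per file."""
    report = {}
    for path in sorted(root.rglob("*.py")):
        hits = find_old_names(path.read_text(encoding="utf-8"))
        if hits:
            report[path] = hits
    return report
```

Calling `scan_tree(Path("src"))` returns a mapping from file to hits that can be worked through manually. The sketch deliberately only reports and does not rewrite files, since a textual match cannot tell a dataframely call from a same-named local variable.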
Collection metadata cannot be read from schema.json anymore#
Prior to dataframely v1.8.0, collection metadata was serialized as a schema.json file when calling
write_parquet or sink_parquet on a collection. Since dataframely v1.8.0, the metadata is stored in the
individual members’ parquet metadata.
While dataframely v1 still supported reading the metadata from collections written with a version of dataframely prior
to v1.8.0, dataframely v2 removes this support. If you still have data written with a version of dataframely earlier
than v1.8.0, and, thus, still have schema.json files, you can migrate your data by reading it and writing it back to
disk with any version of dataframely >=1.8.0,<2.
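Before upgrading, it can help to find out which stored collections are affected. The sketch below walks a local directory tree and reports every directory that still contains a schema.json file. The function name is hypothetical, and the sketch assumes the data lives on a local filesystem and that no unrelated files named schema.json exist below the root.

```python
from pathlib import Path


def find_legacy_collections(root: Path) -> list[Path]:
    """Return directories below `root` that still contain a schema.json file."""
    # A set removes duplicates; sorting gives a stable, readable report.
    return sorted({p.parent for p in root.rglob("schema.json") if p.is_file()})
```

Each reported directory is a candidate for the read-and-write-back migration described above, using a dataframely version >=1.8.0,<2.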
The mypy plugin is removed entirely#
The mypy plugin in dataframely v1 had two purposes:
1. Ensure that a method with @dy.rule decorator is recognized as a rule
2. Turn non-specific return types into ones with enriched type information (e.g. dict → TypedDict)
Unfortunately, the latter was error-prone as it yielded many false positives and generally made working with these
types less ergonomic. We therefore actively removed this part. With @dy.rule being applied to classmethods, the need
for a custom mypy plugin is eliminated entirely. As a result, dataframely.mypy has been removed.
If you have used the mypy plugin before, you can remove the following from your pyproject.toml:
[tool.mypy]
plugins = ["dataframely.mypy"]