dataframely package
- class dataframely.Any(*, check: Callable[[Expr], Expr] | Sequence[Callable[[Expr], Expr]] | Mapping[str, Callable[[Expr], Expr]] | None = None, alias: str | None = None, metadata: dict[str, Any] | None = None)
Bases: Column
A column with arbitrary type.
As a column with arbitrary type is commonly mapped to the Null type (this is the default in polars and pyarrow for empty columns), dataframely also requires this column to be nullable. Hence, it cannot be used as a primary key.
- Attributes:
col: Obtain a Polars column expression for the column.
dtype: The polars dtype equivalent of this column definition’s data type.
name: Get the name of the column in a schema.
pyarrow_dtype: The pyarrow dtype equivalent of this column data type.
Methods
as_dict(expr): Turn the column definition into a dictionary.
from_dict(data): Read the column definition from a dictionary.
matches(other, expr): Check whether this column semantically matches another column.
pyarrow_field(name): Obtain the pyarrow field of this column definition.
sample(generator[, n]): Sample random elements adhering to the constraints of this column.
sqlalchemy_column(name, dialect): Obtain the SQL column specification of this column definition.
sqlalchemy_dtype(dialect): The sqlalchemy dtype equivalent of this column data type.
validate_dtype(dtype): Validate if the polars data type satisfies the column definition.
validation_rules(expr): A set of rules evaluating whether a data frame column satisfies the column's constraints.
- as_dict(expr: Expr) -> dict[str, Any]
Turn the column definition into a dictionary.
If the column definition references other column definitions, they will be turned into dictionaries recursively.
- Args:
expr: An expression referencing the column to turn into a dictionary. This is required to properly encode custom checks.
- Returns:
The column definition as dictionary.
- Note:
This method stores custom checks as expressions rather than callables to allow for serialization.
- Note:
Do NOT use the returned object to evaluate semantic equality of two columns. It may yield different results than matches().
- Attention:
This method is only intended for internal use.
- property col: Expr
Obtain a Polars column expression for the column.
- property dtype: DataType
The polars dtype equivalent of this column definition’s data type.
This is primarily used for creating empty data frames with an appropriate schema. Thus, it should describe the default dtype equivalent if this data type encompasses multiple underlying data types.
- classmethod from_dict(data: dict[str, Any]) -> Self
Read the column definition from a dictionary.
- Args:
data: The dictionary that was created via as_dict().
- Returns:
The column definition read from the dictionary.
- Attention:
This method is only intended for internal use.
- matches(other: Column, expr: Expr) -> bool
Check whether this column semantically matches another column.
- Args:
other: The column to compare with.
expr: An expression referencing the column. This is required to properly evaluate the equivalence of custom checks.
- Returns:
Whether the columns are semantically equal.
- property name: str
Get the name of the column in a schema.
- property pyarrow_dtype: pa.DataType
The pyarrow dtype equivalent of this column data type.
- pyarrow_field(name: str) -> pa.Field
Obtain the pyarrow field of this column definition.
- Args:
name: The name of the column.
- Returns:
The pyarrow field definition.
- sample(generator: Generator, n: int = 1) -> Series
Sample random elements adhering to the constraints of this column.
- Args:
generator: The generator to use for sampling elements.
n: The number of elements to sample.
- Returns:
A series with the predefined number of elements. All elements are guaranteed to adhere to the column’s constraints.
- Raises:
ValueError: If this column has a custom check. In this case, random values cannot be guaranteed to adhere to the column’s constraints while providing any guarantees on the computational complexity.
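To see why an arbitrary custom check rules out sampling at a bounded cost, consider a plain-Python rejection sampler. This is only an illustrative sketch and not dataframely's sampling code: the helper `sample_with_check`, its `max_attempts` budget, and the integer value range are assumptions made for this example.

```python
import random


def sample_with_check(rng, check, n, max_attempts=1000):
    """Rejection sampling with an explicit attempt budget (hypothetical helper)."""
    accepted = []
    for _ in range(max_attempts):
        candidate = rng.randint(0, 99)  # draw from the unconstrained value range
        if check(candidate):
            accepted.append(candidate)
            if len(accepted) == n:
                return accepted
    # An arbitrary predicate may accept almost nothing, so a cut-off is needed.
    raise ValueError("custom check not satisfied within the attempt budget")


rng = random.Random(0)
evens = sample_with_check(rng, lambda v: v % 2 == 0, n=5)
```

A check that accepts half of all candidates finishes quickly, whereas a check such as `lambda v: False` exhausts the budget. Without knowing the predicate in advance, no useful bound on the number of draws can be given.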
- sqlalchemy_column(name: str, dialect: sa.Dialect) -> sa.Column
Obtain the SQL column specification of this column definition.
- Args:
name: The name of the column.
dialect: The SQL dialect for which to generate the column specification.
- Returns:
The column as specified in sqlalchemy.
- sqlalchemy_dtype(dialect: sa.Dialect) -> sa_TypeEngine
The sqlalchemy dtype equivalent of this column data type.
- validate_dtype(dtype: DataType | DataTypeClass) -> bool
Validate if the polars data type satisfies the column definition.
- Args:
dtype: The dtype to validate.
- Returns:
Whether the dtype is valid.
- validation_rules(expr: Expr) -> dict[str, Expr]
A set of rules evaluating whether a data frame column satisfies the column’s constraints.
- Args:
expr: An expression referencing the column of the data frame, i.e. an expression created by calling polars.col().
- Returns:
A mapping from validation rule names to expressions that provide exactly one boolean value per column item indicating whether validation with respect to the rule is successful. A value of False indicates invalid data, i.e. unsuccessful validation.
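The shape of the return value of validation_rules() can be illustrated without polars. The sketch below is a plain-Python stand-in, not the library's implementation: rules are ordinary functions instead of expressions, and the rule names `nullability` and `min` as well as the helpers `evaluate_rules` and `valid_mask` are hypothetical.

```python
from typing import Callable

# Each rule maps the column values to one boolean per item (False = invalid).
Rule = Callable[[list], list]


def evaluate_rules(values: list, rules: dict) -> dict:
    """Evaluate every named rule and enforce one boolean per column item."""
    results = {}
    for name, rule in rules.items():
        flags = rule(values)
        if len(flags) != len(values):
            raise ValueError(f"rule {name!r} must return one boolean per item")
        results[name] = flags
    return results


def valid_mask(results: dict, n: int) -> list:
    """An item counts as valid only if every rule reports True for it."""
    return [all(flags[i] for flags in results.values()) for i in range(n)]


values = [3, None, -1, 7]
rules = {
    "nullability": lambda vs: [v is not None for v in vs],
    "min": lambda vs: [v is None or v >= 0 for v in vs],
}
results = evaluate_rules(values, rules)
mask = valid_mask(results, len(values))
```

Keeping the results keyed by rule name makes it possible to report which rule rejected a given item, rather than only whether the item was rejected.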
- class dataframely.Array(inner: Column, shape: int | tuple[int, ...], *, nullable: bool = True, primary_key: Literal[False] = False, check: Callable[[Expr], Expr] | Sequence[Callable[[Expr], Expr]] | Mapping[str, Callable[[Expr], Expr]] | None = None, alias: str | None = None, metadata: dict[str, Any] | None = None)
Bases: Column
A fixed-shape array column.
- Attributes:
col: Obtain a Polars column expression for the column.
dtype: The polars dtype equivalent of this column definition’s data type.
name: Get the name of the column in a schema.
pyarrow_dtype: The pyarrow dtype equivalent of this column data type.
Methods
as_dict(expr): Turn the column definition into a dictionary.
from_dict(data): Read the column definition from a dictionary.
matches(other, expr): Check whether this column semantically matches another column.
pyarrow_field(name): Obtain the pyarrow field of this column definition.
sample(generator[, n]): Sample random elements adhering to the constraints of this column.
sqlalchemy_column(name, dialect): Obtain the SQL column specification of this column definition.
sqlalchemy_dtype(dialect): The sqlalchemy dtype equivalent of this column data type.
validate_dtype(dtype): Validate if the polars data type satisfies the column definition.
validation_rules(expr): A set of rules evaluating whether a data frame column satisfies the column's constraints.
- as_dict(expr: Expr) -> dict[str, Any]
Turn the column definition into a dictionary.
If the column definition references other column definitions, they will be turned into dictionaries recursively.
- Args:
expr: An expression referencing the column to turn into a dictionary. This is required to properly encode custom checks.
- Returns:
The column definition as dictionary.
- Note:
This method stores custom checks as expressions rather than callables to allow for serialization.
- Note:
Do NOT use the returned object to evaluate semantic equality of two columns. It may yield different results than matches().
- Attention:
This method is only intended for internal use.
- property col: Expr
Obtain a Polars column expression for the column.
- property dtype: DataType
The polars dtype equivalent of this column definition’s data type.
This is primarily used for creating empty data frames with an appropriate schema. Thus, it should describe the default dtype equivalent if this data type encompasses multiple underlying data types.
- classmethod from_dict(data: dict[str, Any]) -> Self
Read the column definition from a dictionary.
- Args:
data: The dictionary that was created via as_dict().
- Returns:
The column definition read from the dictionary.
- Attention:
This method is only intended for internal use.
- matches(other: Column, expr: Expr) -> bool
Check whether this column semantically matches another column.
- Args:
other: The column to compare with.
expr: An expression referencing the column. This is required to properly evaluate the equivalence of custom checks.
- Returns:
Whether the columns are semantically equal.
- property name: str
Get the name of the column in a schema.
- property pyarrow_dtype: pa.DataType
The pyarrow dtype equivalent of this column data type.
- pyarrow_field(name: str) -> pa.Field
Obtain the pyarrow field of this column definition.
- Args:
name: The name of the column.
- Returns:
The pyarrow field definition.
- sample(generator: Generator, n: int = 1) -> Series
Sample random elements adhering to the constraints of this column.
- Args:
generator: The generator to use for sampling elements.
n: The number of elements to sample.
- Returns:
A series with the predefined number of elements. All elements are guaranteed to adhere to the column’s constraints.
- Raises:
ValueError: If this column has a custom check. In this case, random values cannot be guaranteed to adhere to the column’s constraints while providing any guarantees on the computational complexity.
- sqlalchemy_column(name: str, dialect: sa.Dialect) -> sa.Column
Obtain the SQL column specification of this column definition.
- Args:
name: The name of the column.
dialect: The SQL dialect for which to generate the column specification.
- Returns:
The column as specified in sqlalchemy.
- sqlalchemy_dtype(dialect: sa.Dialect) -> sa_TypeEngine
The sqlalchemy dtype equivalent of this column data type.
- validate_dtype(dtype: DataType | DataTypeClass) -> bool
Validate if the polars data type satisfies the column definition.
- Args:
dtype: The dtype to validate.
- Returns:
Whether the dtype is valid.
- validation_rules(expr: Expr) -> dict[str, Expr]
A set of rules evaluating whether a data frame column satisfies the column’s constraints.
- Args:
expr: An expression referencing the column of the data frame, i.e. an expression created by calling polars.col().
- Returns:
A mapping from validation rule names to expressions that provide exactly one boolean value per column item indicating whether validation with respect to the rule is successful. A value of False indicates invalid data, i.e. unsuccessful validation.
- class dataframely.Binary(*, nullable: bool | None = None, primary_key: bool = False, check: Callable[[Expr], Expr] | Sequence[Callable[[Expr], Expr]] | Mapping[str, Callable[[Expr], Expr]] | None = None, alias: str | None = None, metadata: dict[str, Any] | None = None)
Bases: Column
A column of binary values.
- Attributes:
col: Obtain a Polars column expression for the column.
dtype: The polars dtype equivalent of this column definition’s data type.
name: Get the name of the column in a schema.
pyarrow_dtype: The pyarrow dtype equivalent of this column data type.
Methods
as_dict(expr): Turn the column definition into a dictionary.
from_dict(data): Read the column definition from a dictionary.
matches(other, expr): Check whether this column semantically matches another column.
pyarrow_field(name): Obtain the pyarrow field of this column definition.
sample(generator[, n]): Sample random elements adhering to the constraints of this column.
sqlalchemy_column(name, dialect): Obtain the SQL column specification of this column definition.
sqlalchemy_dtype(dialect): The sqlalchemy dtype equivalent of this column data type.
validate_dtype(dtype): Validate if the polars data type satisfies the column definition.
validation_rules(expr): A set of rules evaluating whether a data frame column satisfies the column's constraints.
- as_dict(expr: Expr) -> dict[str, Any]
Turn the column definition into a dictionary.
If the column definition references other column definitions, they will be turned into dictionaries recursively.
- Args:
expr: An expression referencing the column to turn into a dictionary. This is required to properly encode custom checks.
- Returns:
The column definition as dictionary.
- Note:
This method stores custom checks as expressions rather than callables to allow for serialization.
- Note:
Do NOT use the returned object to evaluate semantic equality of two columns. It may yield different results than matches().
- Attention:
This method is only intended for internal use.
- property col: Expr
Obtain a Polars column expression for the column.
- property dtype: DataType
The polars dtype equivalent of this column definition’s data type.
This is primarily used for creating empty data frames with an appropriate schema. Thus, it should describe the default dtype equivalent if this data type encompasses multiple underlying data types.
- classmethod from_dict(data: dict[str, Any]) -> Self
Read the column definition from a dictionary.
- Args:
data: The dictionary that was created via as_dict().
- Returns:
The column definition read from the dictionary.
- Attention:
This method is only intended for internal use.
- matches(other: Column, expr: Expr) -> bool
Check whether this column semantically matches another column.
- Args:
other: The column to compare with.
expr: An expression referencing the column. This is required to properly evaluate the equivalence of custom checks.
- Returns:
Whether the columns are semantically equal.
- property name: str
Get the name of the column in a schema.
- property pyarrow_dtype: pa.DataType
The pyarrow dtype equivalent of this column data type.
- pyarrow_field(name: str) -> pa.Field
Obtain the pyarrow field of this column definition.
- Args:
name: The name of the column.
- Returns:
The pyarrow field definition.
- sample(generator: Generator, n: int = 1) -> Series
Sample random elements adhering to the constraints of this column.
- Args:
generator: The generator to use for sampling elements.
n: The number of elements to sample.
- Returns:
A series with the predefined number of elements. All elements are guaranteed to adhere to the column’s constraints.
- Raises:
ValueError: If this column has a custom check. In this case, random values cannot be guaranteed to adhere to the column’s constraints while providing any guarantees on the computational complexity.
- sqlalchemy_column(name: str, dialect: sa.Dialect) -> sa.Column
Obtain the SQL column specification of this column definition.
- Args:
name: The name of the column.
dialect: The SQL dialect for which to generate the column specification.
- Returns:
The column as specified in sqlalchemy.
- sqlalchemy_dtype(dialect: sa.Dialect) -> sa_TypeEngine
The sqlalchemy dtype equivalent of this column data type.
- validate_dtype(dtype: DataType | DataTypeClass) -> bool
Validate if the polars data type satisfies the column definition.
- Args:
dtype: The dtype to validate.
- Returns:
Whether the dtype is valid.
- validation_rules(expr: Expr) -> dict[str, Expr]
A set of rules evaluating whether a data frame column satisfies the column’s constraints.
- Args:
expr: An expression referencing the column of the data frame, i.e. an expression created by calling polars.col().
- Returns:
A mapping from validation rule names to expressions that provide exactly one boolean value per column item indicating whether validation with respect to the rule is successful. A value of False indicates invalid data, i.e. unsuccessful validation.
- class dataframely.Bool(*, nullable: bool | None = None, primary_key: bool = False, check: Callable[[Expr], Expr] | Sequence[Callable[[Expr], Expr]] | Mapping[str, Callable[[Expr], Expr]] | None = None, alias: str | None = None, metadata: dict[str, Any] | None = None)
Bases: Column
A column of booleans.
- Attributes:
col: Obtain a Polars column expression for the column.
dtype: The polars dtype equivalent of this column definition’s data type.
name: Get the name of the column in a schema.
pyarrow_dtype: The pyarrow dtype equivalent of this column data type.
Methods
as_dict(expr): Turn the column definition into a dictionary.
from_dict(data): Read the column definition from a dictionary.
matches(other, expr): Check whether this column semantically matches another column.
pyarrow_field(name): Obtain the pyarrow field of this column definition.
sample(generator[, n]): Sample random elements adhering to the constraints of this column.
sqlalchemy_column(name, dialect): Obtain the SQL column specification of this column definition.
sqlalchemy_dtype(dialect): The sqlalchemy dtype equivalent of this column data type.
validate_dtype(dtype): Validate if the polars data type satisfies the column definition.
validation_rules(expr): A set of rules evaluating whether a data frame column satisfies the column's constraints.
- as_dict(expr: Expr) -> dict[str, Any]
Turn the column definition into a dictionary.
If the column definition references other column definitions, they will be turned into dictionaries recursively.
- Args:
expr: An expression referencing the column to turn into a dictionary. This is required to properly encode custom checks.
- Returns:
The column definition as dictionary.
- Note:
This method stores custom checks as expressions rather than callables to allow for serialization.
- Note:
Do NOT use the returned object to evaluate semantic equality of two columns. It may yield different results than matches().
- Attention:
This method is only intended for internal use.
- property col: Expr
Obtain a Polars column expression for the column.
- property dtype: DataType
The polars dtype equivalent of this column definition’s data type.
This is primarily used for creating empty data frames with an appropriate schema. Thus, it should describe the default dtype equivalent if this data type encompasses multiple underlying data types.
- classmethod from_dict(data: dict[str, Any]) -> Self
Read the column definition from a dictionary.
- Args:
data: The dictionary that was created via as_dict().
- Returns:
The column definition read from the dictionary.
- Attention:
This method is only intended for internal use.
- matches(other: Column, expr: Expr) -> bool
Check whether this column semantically matches another column.
- Args:
other: The column to compare with.
expr: An expression referencing the column. This is required to properly evaluate the equivalence of custom checks.
- Returns:
Whether the columns are semantically equal.
- property name: str
Get the name of the column in a schema.
- property pyarrow_dtype: pa.DataType
The pyarrow dtype equivalent of this column data type.
- pyarrow_field(name: str) -> pa.Field
Obtain the pyarrow field of this column definition.
- Args:
name: The name of the column.
- Returns:
The pyarrow field definition.
- sample(generator: Generator, n: int = 1) -> Series
Sample random elements adhering to the constraints of this column.
- Args:
generator: The generator to use for sampling elements.
n: The number of elements to sample.
- Returns:
A series with the predefined number of elements. All elements are guaranteed to adhere to the column’s constraints.
- Raises:
ValueError: If this column has a custom check. In this case, random values cannot be guaranteed to adhere to the column’s constraints while providing any guarantees on the computational complexity.
- sqlalchemy_column(name: str, dialect: sa.Dialect) -> sa.Column
Obtain the SQL column specification of this column definition.
- Args:
name: The name of the column.
dialect: The SQL dialect for which to generate the column specification.
- Returns:
The column as specified in sqlalchemy.
- sqlalchemy_dtype(dialect: sa.Dialect) -> sa_TypeEngine
The sqlalchemy dtype equivalent of this column data type.
- validate_dtype(dtype: DataType | DataTypeClass) -> bool
Validate if the polars data type satisfies the column definition.
- Args:
dtype: The dtype to validate.
- Returns:
Whether the dtype is valid.
- validation_rules(expr: Expr) -> dict[str, Expr]
A set of rules evaluating whether a data frame column satisfies the column’s constraints.
- Args:
expr: An expression referencing the column of the data frame, i.e. an expression created by calling polars.col().
- Returns:
A mapping from validation rule names to expressions that provide exactly one boolean value per column item indicating whether validation with respect to the rule is successful. A value of False indicates invalid data, i.e. unsuccessful validation.
- class dataframely.Categorical(*, nullable: bool | None = None, primary_key: bool = False, check: Callable[[Expr], Expr] | Sequence[Callable[[Expr], Expr]] | Mapping[str, Callable[[Expr], Expr]] | None = None, alias: str | None = None, metadata: dict[str, Any] | None = None)
Bases: Column
A column of categorical (string) values.
- Attributes:
col: Obtain a Polars column expression for the column.
dtype: The polars dtype equivalent of this column definition’s data type.
name: Get the name of the column in a schema.
pyarrow_dtype: The pyarrow dtype equivalent of this column data type.
Methods
as_dict(expr): Turn the column definition into a dictionary.
from_dict(data): Read the column definition from a dictionary.
matches(other, expr): Check whether this column semantically matches another column.
pyarrow_field(name): Obtain the pyarrow field of this column definition.
sample(generator[, n]): Sample random elements adhering to the constraints of this column.
sqlalchemy_column(name, dialect): Obtain the SQL column specification of this column definition.
sqlalchemy_dtype(dialect): The sqlalchemy dtype equivalent of this column data type.
validate_dtype(dtype): Validate if the polars data type satisfies the column definition.
validation_rules(expr): A set of rules evaluating whether a data frame column satisfies the column's constraints.
- as_dict(expr: Expr) -> dict[str, Any]
Turn the column definition into a dictionary.
If the column definition references other column definitions, they will be turned into dictionaries recursively.
- Args:
expr: An expression referencing the column to turn into a dictionary. This is required to properly encode custom checks.
- Returns:
The column definition as dictionary.
- Note:
This method stores custom checks as expressions rather than callables to allow for serialization.
- Note:
Do NOT use the returned object to evaluate semantic equality of two columns. It may yield different results than matches().
- Attention:
This method is only intended for internal use.
- property col: Expr
Obtain a Polars column expression for the column.
- property dtype: DataType
The polars dtype equivalent of this column definition’s data type.
This is primarily used for creating empty data frames with an appropriate schema. Thus, it should describe the default dtype equivalent if this data type encompasses multiple underlying data types.
- classmethod from_dict(data: dict[str, Any]) -> Self
Read the column definition from a dictionary.
- Args:
data: The dictionary that was created via as_dict().
- Returns:
The column definition read from the dictionary.
- Attention:
This method is only intended for internal use.
- matches(other: Column, expr: Expr) -> bool
Check whether this column semantically matches another column.
- Args:
other: The column to compare with.
expr: An expression referencing the column. This is required to properly evaluate the equivalence of custom checks.
- Returns:
Whether the columns are semantically equal.
- property name: str
Get the name of the column in a schema.
- property pyarrow_dtype: pa.DataType
The pyarrow dtype equivalent of this column data type.
- pyarrow_field(name: str) -> pa.Field
Obtain the pyarrow field of this column definition.
- Args:
name: The name of the column.
- Returns:
The pyarrow field definition.
- sample(generator: Generator, n: int = 1) -> Series
Sample random elements adhering to the constraints of this column.
- Args:
generator: The generator to use for sampling elements.
n: The number of elements to sample.
- Returns:
A series with the predefined number of elements. All elements are guaranteed to adhere to the column’s constraints.
- Raises:
ValueError: If this column has a custom check. In this case, random values cannot be guaranteed to adhere to the column’s constraints while providing any guarantees on the computational complexity.
- sqlalchemy_column(name: str, dialect: sa.Dialect) -> sa.Column
Obtain the SQL column specification of this column definition.
- Args:
name: The name of the column.
dialect: The SQL dialect for which to generate the column specification.
- Returns:
The column as specified in sqlalchemy.
- sqlalchemy_dtype(dialect: sa.Dialect) -> sa_TypeEngine
The sqlalchemy dtype equivalent of this column data type.
- validate_dtype(dtype: DataType | DataTypeClass) -> bool
Validate if the polars data type satisfies the column definition.
- Args:
dtype: The dtype to validate.
- Returns:
Whether the dtype is valid.
- validation_rules(expr: Expr) -> dict[str, Expr]
A set of rules evaluating whether a data frame column satisfies the column’s constraints.
- Args:
expr: An expression referencing the column of the data frame, i.e. an expression created by calling polars.col().
- Returns:
A mapping from validation rule names to expressions that provide exactly one boolean value per column item indicating whether validation with respect to the rule is successful. A value of False indicates invalid data, i.e. unsuccessful validation.
- class dataframely.Collection
Bases: BaseCollection, ABC
Base class for all collections of data frames with a predefined schema.
A collection comprises a set of members which are collectively “consistent”, meaning that the collection ensures that invariants are upheld across members. This is different from dataframely schemas, which only ensure invariants within individual members.
In order to properly ensure that invariants hold across members, members must have a “common primary key”, i.e. there must be an overlap of at least one primary key column across all members. Consequently, a collection is typically used to represent “semantic objects” which cannot be represented in a single data frame due to 1-N relationships that are managed in separate data frames.
A collection must only have type annotations for LazyFrames with known schema:

    class MyCollection(dy.Collection):
        first_member: dy.LazyFrame[MyFirstSchema]
        second_member: dy.LazyFrame[MySecondSchema]

Besides, it may define filters (cf. filter()) and arbitrary methods.
- Note:
The dataframely mypy plugin ensures that the dictionaries passed to class methods contain exactly the required keys.
- Attention:
Do NOT use this class in combination with from __future__ import annotations as it requires the proper schema definitions to ensure that the collection is implemented correctly.
Methods
cast(data, /): Initialize a collection by casting all members into their correct schemas.
collect_all(): Collect all members of the collection.
common_primary_keys(): The primary keys shared by non-ignored members of the collection.
create_empty(): Create an empty collection without any data.
filter(data, /, *[, cast]): Filter the member data frames by their schemas and the collection's filters.
ignored_members(): The names of all members of the collection that are ignored in filters.
is_valid(data, /, *[, cast]): Utility method to check whether validate() raises an exception.
join(primary_keys[, how, maintain_order]): Filter the collection by joining onto a data frame containing entries for the common primary key columns whose respective rows should be kept or removed in the collection members.
matches(other): Check whether this collection semantically matches another.
member_schemas(): The schemas of all members of the collection.
members(): Information about the members of the collection.
non_ignored_members(): The names of all members of the collection that are not ignored in filters (default).
optional_members(): The names of all optional members of the collection.
read_delta(source, *[, validation]): Read all collection members from Delta Lake tables.
read_parquet(directory, *[, validation]): Read all collection members from parquet files in a directory.
required_members(): The names of all required members of the collection.
sample([num_rows, overrides, generator]): Create a random sample from the members of this collection.
scan_delta(source, *[, validation]): Lazily read all collection members from Delta Lake tables.
scan_parquet(directory, *[, validation]): Lazily read all collection members from parquet files in a directory.
serialize(): Serialize this collection to a JSON string.
sink_parquet(directory, **kwargs): Stream the members of this collection into parquet files in a directory.
to_dict(): Return a dictionary representation of this collection.
validate(data, /, *[, cast]): Validate that a set of data frames satisfies the collection's invariants.
write_delta(target, **kwargs): Write the members of this collection to Delta Lake tables.
write_parquet(directory, **kwargs): Write the members of this collection to parquet files in a directory.
- classmethod cast(data: Mapping[str, FrameType], /) -> Self
Initialize a collection by casting all members into their correct schemas.
This method calls cast() on every member, thus removing superfluous columns and casting to the correct dtypes for all input data frames.
You should typically use validate() or filter() to obtain instances of the collection as this method does not guarantee that the returned collection upholds any invariants. Nonetheless, it may be useful in instances where it is known that the provided data adheres to the collection’s invariants.
- Args:
data: The data for all members. The dictionary must contain exactly one entry per member with the name of the member as key.
- Returns:
The initialized collection.
- Raises:
ValueError: If an insufficient set of input data frames is provided, i.e. if any required member of this collection is missing in the input.
- Attention:
For lazy frames, casting is not performed eagerly. This prevents collecting the lazy frames’ schemas but also means that a call to collect() further down the line might fail because of the cast and/or missing columns.
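The difference between casting and validating can be shown with a small plain-Python analogue. This is a sketch under simplifying assumptions, not dataframely code: rows are dictionaries, the schema is a mapping from column name to a converter, and `cast_rows` is a hypothetical helper.

```python
def cast_rows(rows, schema):
    """Keep only the schema's columns and convert each value; no rules are checked."""
    return [
        {name: convert(row[name]) for name, convert in schema.items()}
        for row in rows
    ]


schema = {"id": int, "price": float}
raw = [
    {"id": "1", "price": "9.5", "note": "x"},
    {"id": "2", "price": "-3", "note": "y"},  # a negative price is not rejected
]
casted = cast_rows(raw, schema)
```

The superfluous `note` column is dropped and the values are converted, but the negative price passes through unchanged. Casting alone therefore says nothing about whether invariants hold, which is why validate() or filter() is the usual entry point.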
- collect_all() -> Self
Collect all members of the collection.
This method collects all members in parallel for maximum efficiency. It is particularly useful when filter() is called with lazy frame inputs.
- Returns:
The same collection with all members collected once.
- Note:
As all collection members are required to be lazy frames, the returned collection’s members are still “lazy”. However, they are “shallow-lazy”, meaning they are obtained by calling .collect().lazy().
- classmethod common_primary_keys() -> list[str]
The primary keys shared by non-ignored members of the collection.
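Conceptually, the common primary key is the intersection of the members' primary key columns. The following plain-Python sketch illustrates that idea only. The helper `shared_primary_keys`, the member names, and the sorted return order are assumptions for this example and are not taken from the library.

```python
from typing import Iterable


def shared_primary_keys(member_keys: dict, ignored: Iterable = ()) -> list:
    """Return the key columns present in every non-ignored member, sorted."""
    ignored = set(ignored)
    key_sets = [
        set(keys) for name, keys in member_keys.items() if name not in ignored
    ]
    if not key_sets:
        return []
    return sorted(set.intersection(*key_sets))


members = {
    "invoices": ["invoice_id"],
    "line_items": ["invoice_id", "position"],
    "audit_log": ["event_id"],
}
keys = shared_primary_keys(members, ignored={"audit_log"})
```

With `audit_log` ignored, the only shared key column is `invoice_id`. If it were not ignored, the intersection would be empty, which corresponds to the situation the class description rules out.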
- classmethod create_empty() -> Self
Create an empty collection without any data.
This method simply calls create_empty on all member schemas, including non-optional ones.
- Returns:
An instance of this collection.
- classmethod filter(data: Mapping[str, FrameType], /, *, cast: bool = False) tuple[Self, dict[str, FailureInfo]][source]¶
Filter the members data frame by their schemas and the collection’s filters.
- Args:
- data: The members of the collection which ought to be filtered. The
dictionary must contain exactly one entry per member with the name of the member as key, except for optional members which may be missing. All data frames passed here will be eagerly collected within the method, regardless of whether they are a
DataFrameorLazyFrame.- cast: Whether columns with a wrong data type in the member data frame are
cast to their schemas’ defined data types if possible.
- Returns:
A tuple of two items:
An instance of the collection which contains a subset of each of the input data frames with the rows which passed member-wise validation and were not filtered out by any of the collection’s filters. While collection members are always instances of
LazyFrame, the members of the returned collection are essentially eager as they are constructed by calling.lazy()on eager data frames. Just like in polars’ nativefilter(), the order of rows is maintained in all returned data frames.A mapping from member name to a
FailureInfoobject which provides details on why individual rows had been removed. Optional members are only included in this dictionary if they had been provided in the input.
- Raises:
- ValueError: If an insufficient set of input data frames is provided, i.e. if
any required member of this collection is missing in the input.
- ValidationError: If the columns of any of the input data frames are invalid.
This happens only if a data frame misses a column defined in its schema or a column has an invalid dtype while cast is set to False.
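The return contract described above (kept rows in their original order, plus per-member failure details) can be modelled in plain Python. This is only an illustrative sketch, not dataframely's implementation: the helper `filter_member`, the rule name `amount_non_negative`, and the use of simple failure counts in place of a `FailureInfo` object are all assumptions made for this example.

```python
from typing import Callable

Row = dict[str, object]
Rule = Callable[[Row], bool]


def filter_member(
    rows: list[Row], rules: dict[str, Rule]
) -> tuple[list[Row], dict[str, int]]:
    """Keep the rows passing every rule and count failures per rule name."""
    kept: list[Row] = []
    failure_counts = {name: 0 for name in rules}
    for row in rows:
        failed = [name for name, rule in rules.items() if not rule(row)]
        for name in failed:
            failure_counts[name] += 1
        if not failed:
            kept.append(row)  # appending in input order preserves row order
    return kept, failure_counts


# Hypothetical member data and a single hypothetical rule.
rows = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": -5.0},
    {"id": 3, "amount": 7.5},
]
rules = {"amount_non_negative": lambda row: row["amount"] >= 0}

kept, failures = filter_member(rows, rules)
```

Here `kept` holds the rows with ids 1 and 3, in input order, and `failures` records one violation of the hypothetical rule.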
- classmethod ignored_members() set[str][source]¶
The names of all members of the collection that are ignored in filters.
- classmethod is_valid(data: Mapping[str, FrameType], /, *, cast: bool = False) bool[source]¶
Utility method to check whether validate() raises an exception.
- Args:
- data: The members of the collection which ought to be validated. The dictionary must contain exactly one entry per member with the name of the member as key. The existence of all keys is checked via the dataframely mypy plugin.
- cast: Whether columns with a wrong data type in the member data frame are cast to their schemas’ defined data types if possible.
- Returns:
Whether the provided members satisfy the invariants of the collection.
- Raises:
- ValueError: If an insufficient set of input data frames is provided, i.e. if
any required member of this collection is missing in the input.
- join(primary_keys: LazyFrame, how: Literal['semi', 'anti'] = 'semi', maintain_order: Literal['none', 'left'] = 'none') Self[source]¶
Filter the collection by joining onto a data frame containing entries for the common primary key columns whose respective rows should be kept or removed in the collection members.
- Args:
- primary_keys: The data frame to join on. Must contain the common primary key
columns of the collection.
- how: The join strategy to use. Like in polars, semi will keep all rows
that can be found in primary_keys, anti will remove them.
maintain_order: The maintain_order option to use for the polars join.
- Returns:
The collection, with members potentially reduced in length.
- Raises:
- ValueError: If the collection contains any member that is annotated with
ignored_in_filters=True.
- Attention:
This method does not validate the resulting collection. Only use it if the resulting collection still satisfies the filters of the collection. The joins are not evaluated eagerly. Therefore, a downstream call to
collect()might fail, especially if primary_keys does not contain all columns for all common primary keys.
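The semi and anti strategies can be illustrated with a small pure-Python model. This is a sketch of the join semantics only, under the assumption that rows are plain dictionaries; the helper `join_on_primary_keys` and the column name `invoice_id` are hypothetical and not part of dataframely.

```python
def join_on_primary_keys(rows, primary_keys, key_columns, how="semi"):
    """Keep ("semi") or drop ("anti") the rows whose key appears in primary_keys."""
    if how not in ("semi", "anti"):
        raise ValueError(f"unsupported join strategy: {how!r}")
    # A missing key column raises KeyError here, loosely mirroring the note
    # that primary_keys must contain all common primary key columns.
    wanted = {tuple(pk[col] for col in key_columns) for pk in primary_keys}
    keep_matches = how == "semi"
    return [
        row
        for row in rows
        if (tuple(row[col] for col in key_columns) in wanted) == keep_matches
    ]


# Hypothetical member rows and primary keys to keep or remove.
invoices = [
    {"invoice_id": 1, "amount": 10.0},
    {"invoice_id": 2, "amount": 20.0},
    {"invoice_id": 3, "amount": 30.0},
]
keys = [{"invoice_id": 1}, {"invoice_id": 3}]

kept = join_on_primary_keys(invoices, keys, ["invoice_id"], how="semi")
removed = join_on_primary_keys(invoices, keys, ["invoice_id"], how="anti")
```

With `how="semi"` the rows for invoices 1 and 3 remain; with `how="anti"` only invoice 2 remains.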
- classmethod matches(other: type[Collection]) bool[source]¶
Check whether this collection semantically matches another.
- Args:
other: The collection to compare with.
- Returns:
Whether the two collections are semantically equal.
- Attention:
For custom filters, reliable comparison results are only guaranteed if the filter always returns a static polars expression. Otherwise, this function may falsely indicate a match.
- classmethod member_schemas() dict[str, type[Schema]][source]¶
The schemas of all members of the collection.
- classmethod members() dict[str, MemberInfo][source]¶
Information about the members of the collection.
- classmethod non_ignored_members() set[str][source]¶
The names of all members of the collection that are not ignored in filters (default).
- classmethod optional_members() set[str][source]¶
The names of all optional members of the collection.
- classmethod read_delta(source: str | Path | deltalake.DeltaTable, *, validation: Validation = 'warn', **kwargs: Any) Self[source]¶
Read all collection members from Delta Lake tables.
This method reads each member from a Delta Lake table at the provided source location. The source can be a path, URI, or an existing DeltaTable object. Optional members are only read if present.
- Args:
- source: The location or DeltaTable to read from.
- validation: The strategy for running validation when reading the data:
  - "allow": The method tries to read the schema data from the parquet files. If the stored collection schema matches this collection schema, the collection is read without validation. If the stored schema mismatches this schema, no metadata can be found in the parquets, or the files have conflicting metadata, this method automatically runs validate() with cast=True.
  - "warn": The method behaves similarly to "allow". However, it prints a warning if validation is necessary.
  - "forbid": The method never runs validation automatically and only returns if the metadata stores a collection schema that matches this collection.
  - "skip": The method never runs validation and simply reads the data, entrusting the user that the schema is valid. _Use this option carefully_.
- kwargs: Additional keyword arguments passed directly to polars.read_delta().
- Returns:
The initialized collection.
- Raises:
- ValidationRequiredError: If no collection schema can be read from the source and validation is set to "forbid".
- ValueError: If the provided source does not contain Delta tables for all required members.
- ValidationError: If the collection cannot be validated.
- Attention:
Schema metadata is stored as custom commit metadata. Only the schema information from the last commit is used, so any table modifications that are not through dataframely will result in losing the metadata.
Be aware that appending to an existing table via mode="append" may result in violation of group constraints that dataframely cannot catch without re-validating. Only use appends if you are certain that they do not break your schema.
Be aware that this method suffers from the same limitations as
serialize().
- classmethod read_parquet(directory: str | Path, *, validation: Literal['allow', 'forbid', 'warn', 'skip'] = 'warn', **kwargs: Any) Self[source]¶
Read all collection members from parquet files in a directory.
This method searches for files named <member>.parquet in the provided directory for all required and optional members of the collection.
- Args:
- directory: The directory where the Parquet files should be read from.
Parquet files may have been written with Hive partitioning.
- validation: The strategy for running validation when reading the data:
  - "allow": The method tries to read the schema data from the parquet files. If the stored collection schema matches this collection schema, the collection is read without validation. If the stored schema mismatches this schema, no metadata can be found in the parquets, or the files have conflicting metadata, this method automatically runs validate() with cast=True.
  - "warn": The method behaves similarly to "allow". However, it prints a warning if validation is necessary.
  - "forbid": The method never runs validation automatically and only returns if the metadata stores a collection schema that matches this collection.
  - "skip": The method never runs validation and simply reads the data, entrusting the user that the schema is valid. _Use this option carefully_.
- kwargs: Additional keyword arguments passed directly to
polars.read_parquet().
- Returns:
The initialized collection.
- Raises:
- ValidationRequiredError: If no collection schema can be read from the directory and validation is set to "forbid".
- ValueError: If the provided directory does not contain parquet files for all required members.
- ValidationError: If the collection cannot be validated.
- Note:
This method is backward compatible with older versions of dataframely in which the schema metadata was saved to schema.json files instead of being encoded into the parquet files.
- Attention:
Be aware that this method suffers from the same limitations as
serialize().
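The four validation strategies amount to a small decision table. The sketch below models that table in plain Python under stated assumptions: schemas are represented as opaque strings, `None` stands for missing or conflicting metadata, and `decide_read_action` and the local `ValidationRequired` exception are hypothetical stand-ins rather than dataframely's own functions or error types.

```python
import warnings
from typing import Optional


class ValidationRequired(Exception):
    """Hypothetical stand-in for the library's ValidationRequiredError."""


def decide_read_action(
    validation: str, stored_schema: Optional[str], expected_schema: str
) -> str:
    """Return the action a reader would take for the given strategy."""
    if validation not in ("allow", "warn", "forbid", "skip"):
        raise ValueError(f"unknown validation strategy: {validation!r}")
    if validation == "skip":
        # Trust the caller: never validate.
        return "read_without_validation"
    if stored_schema is not None and stored_schema == expected_schema:
        # Stored metadata matches, so validation is unnecessary.
        return "read_without_validation"
    if validation == "forbid":
        raise ValidationRequired("stored schema is missing or does not match")
    if validation == "warn":
        warnings.warn("validation is required and will be run")
    return "validate_with_cast"


matching = decide_read_action("allow", "schema-v1", "schema-v1")
missing = decide_read_action("allow", None, "schema-v1")
```

In this model, `matching` is `"read_without_validation"` and `missing` is `"validate_with_cast"`; the `"warn"` strategy takes the same path as `"allow"` but emits a warning first.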
- classmethod required_members() set[str][source]¶
The names of all required members of the collection.
- classmethod sample(num_rows: int | None = None, *, overrides: Sequence[Mapping[str, Any]] | None = None, generator: Generator | None = None) Self[source]¶
Create a random sample from the members of this collection.
Just like sampling for schemas, this method should only be used for testing. Contrary to sampling for schemas, the core difficulty when sampling related data frames is that they must share primary keys and individual members may have a different number of rows. For this reason, overrides passed to this function must be “row-oriented” (or “sample-oriented”).
- Args:
- num_rows: The number of rows to sample for each member. If this is set to None, the number of rows is inferred from the length of the overrides.
- overrides: The overrides to set values in member schemas. The overrides must be provided as a list of samples. The structure of the samples must be as follows:
    {
        "<primary_key_1>": <value>,
        "<primary_key_2>": <value>,
        "<member_with_common_primary_key>": {
            "<column_1>": <value>,
            ...
        },
        "<member_with_superkey_of_primary_key>": [
            {
                "<column_1>": <value>,
                ...
            }
        ],
        ...
    }
  Any member/value can be left out and will be sampled automatically. Note that overrides for columns of members that are annotated with inline_for_sampling=True can be supplied on the top-level instead of in a nested dictionary.
- generator: The (seeded) generator to use for sampling data. If None, a generator with random seed is automatically created.
- Returns:
A collection where all members (including optional ones) have been sampled according to the input parameters.
- Attention:
In case the collection has members with a common primary key, the _preprocess_sample method must return distinct primary key values for each sample. The default implementation does this on a best-effort basis but may cause primary key violations. Hence, it is recommended to override this method and ensure that all primary key columns are set.
- Raises:
- ValueError: If the _preprocess_sample() method does not return all common primary key columns for all samples.
- ValidationError: If the sampled members violate any of the collection filters. If the collection does not have filters, this error is never raised. To prevent validation errors, override the _preprocess_sample() method appropriately.
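To make the row-oriented override structure concrete, here is a small sketch of what such a list might look like. The member names `invoice` and `line_items` and the columns `invoice_id`, `amount`, and `quantity` are hypothetical; the snippet only builds and inspects the plain Python data and does not call dataframely.

```python
# One dictionary per sample. Anything left out would be sampled automatically.
overrides = [
    {
        "invoice_id": 1,                    # shared primary key value
        "invoice": {"amount": 10.0},        # member with the common primary key
        "line_items": [                     # member whose key is a superkey:
            {"quantity": 2},                # one dictionary per row
            {"quantity": 5},
        ],
    },
    {"invoice_id": 2, "line_items": [{"quantity": 1}]},
    {},  # fully sampled: no primary key or member values pinned
]

# With num_rows=None, the number of samples follows the length of the overrides.
num_samples = len(overrides)

# Number of explicitly pinned line_items rows per sample.
pinned_line_items = [len(sample.get("line_items", [])) for sample in overrides]
```

Each top-level dictionary describes one sample, which is why members with several rows per primary key are given as lists.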
- classmethod scan_delta(source: str | Path | deltalake.DeltaTable, *, validation: Validation = 'warn', **kwargs: Any) Self[source]¶
Lazily read all collection members from Delta Lake tables.
This method reads each member from a Delta Lake table at the provided source location. The source can be a path, URI, or an existing DeltaTable object. Optional members are only read if present.
- Args:
- source: The location or DeltaTable to read from.
- validation: The strategy for running validation when reading the data:
  - "allow": The method tries to read the schema data from the parquet files. If the stored collection schema matches this collection schema, the collection is read without validation. If the stored schema mismatches this schema, no metadata can be found in the parquets, or the files have conflicting metadata, this method automatically runs validate() with cast=True.
  - "warn": The method behaves similarly to "allow". However, it prints a warning if validation is necessary.
  - "forbid": The method never runs validation automatically and only returns if the metadata stores a collection schema that matches this collection.
  - "skip": The method never runs validation and simply reads the data, entrusting the user that the schema is valid. _Use this option carefully_.
- kwargs: Additional keyword arguments passed directly to polars.scan_delta().
- Returns:
The initialized collection.
- Raises:
- ValidationRequiredError: If no collection schema can be read from the source and validation is set to "forbid".
- ValueError: If the provided source does not contain Delta tables for all required members.
- Note:
Due to current limitations in dataframely, this method may read the Delta table into memory if validation is "warn" or "allow" and validation is required.
- Attention:
Schema metadata is stored as custom commit metadata. Only the schema information from the last commit is used, so any table modifications that are not through dataframely will result in losing the metadata.
Be aware that appending to an existing table via mode="append" may result in violation of group constraints that dataframely cannot catch without re-validating. Only use appends if you are certain that they do not break your schema.
Be aware that this method suffers from the same limitations as
serialize().
- classmethod scan_parquet(directory: str | Path, *, validation: Literal['allow', 'forbid', 'warn', 'skip'] = 'warn', **kwargs: Any) Self[source]¶
Lazily read all collection members from parquet files in a directory.
This method searches for files named <member>.parquet in the provided directory for all required and optional members of the collection.
- Args:
- directory: The directory where the Parquet files should be read from.
Parquet files may have been written with Hive partitioning.
- validation: The strategy for running validation when reading the data:
  - "allow": The method tries to read the schema data from the parquet files. If the stored collection schema matches this collection schema, the collection is read without validation. If the stored schema mismatches this schema, no metadata can be found in the parquets, or the files have conflicting metadata, this method automatically runs validate() with cast=True.
  - "warn": The method behaves similarly to "allow". However, it prints a warning if validation is necessary.
  - "forbid": The method never runs validation automatically and only returns if the metadata stores a collection schema that matches this collection.
  - "skip": The method never runs validation and simply reads the data, entrusting the user that the schema is valid. _Use this option carefully_.
- kwargs: Additional keyword arguments passed directly to
polars.scan_parquet()for all members.
- Returns:
The initialized collection.
- Raises:
- ValidationRequiredError: If no collection schema can be read from the directory and validation is set to "forbid".
- ValueError: If the provided directory does not contain parquet files for all required members.
- Note:
Due to current limitations in dataframely, this method actually reads the parquet file into memory if validation is "warn" or "allow" and validation is required.
- Note:
This method is backward compatible with older versions of dataframely in which the schema metadata was saved to schema.json files instead of being encoded into the parquet files.
- Attention:
Be aware that this method suffers from the same limitations as
serialize().
- classmethod serialize() str[source]¶
Serialize this collection to a JSON string.
This method does NOT serialize any data frames, but only the _structure_ of the collection, similar to Schema.serialize().
- Returns:
The serialized collection.
- Note:
Serialization within dataframely itself will remain backwards-compatible at least within a major version. Until further notice, it will also be backwards-compatible across major versions.
- Attention:
Serialization of polars expressions and lazy frames is not guaranteed to be stable across versions of polars. This affects collections with filters or members that define custom rules or columns with custom checks: a collection serialized with one version of polars may not be deserializable with another version of polars.
- Attention:
This functionality is considered unstable. It may be changed at any time without it being considered a breaking change.
- Raises:
- TypeError: If a column of any member contains metadata that is not
JSON-serializable.
- ValueError: If a column of any member is not a “native” dataframely column
type but a custom subclass.
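Since non-serializable column metadata leads to a TypeError, it can help to check metadata with the standard library before serializing. The following is a minimal sketch using only `json`; the helper `is_json_serializable` and the metadata keys are hypothetical, and the check only approximates whatever encoding dataframely performs internally.

```python
import json


def is_json_serializable(metadata: dict) -> bool:
    """Return True if json.dumps accepts the metadata, False on TypeError."""
    try:
        json.dumps(metadata)
    except TypeError:
        # json.dumps raises TypeError for unsupported types such as sets.
        return False
    return True


# Strings, numbers, lists, and nested dicts are fine.
ok = is_json_serializable({"unit": "EUR", "tags": ["finance", "daily"]})

# A set is not a JSON type, so this metadata would be rejected.
bad = is_json_serializable({"tags": {"finance", "daily"}})
```

Converting offending values up front (for example, sets to sorted lists) avoids the error.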
- sink_parquet(directory: str | Path, **kwargs: Any) None[source]¶
Stream the members of this collection into parquet files in a directory.
This method writes one parquet file per member into the provided directory. Each parquet file is named <member>.parquet. No file is written for optional members which are not provided in the current collection.
- Args:
- directory: The directory where the Parquet files should be written to. If
the directory does not exist, it is created automatically, including all of its parents.
- kwargs: Additional keyword arguments passed directly to polars.sink_parquet() of all members. metadata may only be provided if it is a dictionary.
- Attention:
This method suffers from the same limitations as
Schema.serialize().
- classmethod validate(data: Mapping[str, FrameType], /, *, cast: bool = False) Self[source]¶
Validate that a set of data frames satisfy the collection’s invariants.
- Args:
- data: The members of the collection which ought to be validated. The
dictionary must contain exactly one entry per member with the name of the member as key.
- cast: Whether columns with a wrong data type in the member data frame are
cast to their schemas’ defined data types if possible.
- Raises:
- ValueError: If an insufficient set of input data frames is provided, i.e. if
any required member of this collection is missing in the input.
- ValidationError: If any of the input data frames does not satisfy its schema
definition or the filters on this collection result in the removal of at least one row across any of the input data frames.
- Returns:
An instance of the collection. All members of the collection are guaranteed to be valid with respect to their respective schemas and the filters on this collection did not remove rows from any member. The input order of each member is maintained.
- write_delta(target: str | Path | deltalake.DeltaTable, **kwargs: Any) None[source]¶
Write the members of this collection to Delta Lake tables.
This method writes each member to a Delta Lake table at the provided target location. The target can be a path, URI, or an existing DeltaTable object. No table is written for optional members which are not provided in the current collection.
- Args:
- target: The location or DeltaTable where the data should be written.
If the location does not exist, it is created automatically, including all of its parents.
- kwargs: Additional keyword arguments passed directly to polars.write_delta().
- Attention:
Schema metadata is stored as custom commit metadata. Only the schema information from the last commit is used, so any table modifications that are not through dataframely will result in losing the metadata.
Be aware that appending to an existing table via mode="append" may result in violation of group constraints that dataframely cannot catch without re-validating. Only use appends if you are certain that they do not break your schema.
This method suffers from the same limitations as
Schema.serialize().
- write_parquet(directory: str | Path, **kwargs: Any) None[source]¶
Write the members of this collection to parquet files in a directory.
This method writes one parquet file per member into the provided directory. Each parquet file is named <member>.parquet. No file is written for optional members which are not provided in the current collection.
- Args:
- directory: The directory where the Parquet files should be written to. If
the directory does not exist, it is created automatically, including all of its parents.
- kwargs: Additional keyword arguments passed directly to polars.write_parquet() of all members. metadata may only be provided if it is a dictionary.
- Attention:
This method suffers from the same limitations as
Schema.serialize().
- class dataframely.CollectionMember(*, ignored_in_filters: bool = False, inline_for_sampling: bool = False)[source]¶
Bases: object
An annotation class that configures different behavior for a collection member.
- Members:
- ignored_in_filters: Indicates that a member should be ignored in the
@dy.filtermethods of a collection. This also affects the computation of the shared primary key in the collection.
- Example:
    from typing import Annotated

    import dataframely as dy
    import polars as pl


    class MyCollection(dy.Collection):
        a: dy.LazyFrame[MySchema1]
        b: dy.LazyFrame[MySchema2]

        ignored_member: Annotated[
            dy.LazyFrame[MySchema3],
            dy.CollectionMember(ignored_in_filters=True),
        ]

        @dy.filter
        def my_filter(self) -> pl.LazyFrame:
            # Joining two lazy frames yields a lazy frame.
            return self.a.join(self.b, on="shared_key")
- ignored_in_filters: bool = False¶
Whether the member should be ignored in the filter method.
- inline_for_sampling: bool = False¶
Whether the member’s non-primary key columns should be inlined for sampling. This means that value overrides are supplied on the top-level rather than in a subkey with the member’s name. Only valid if the member’s primary key matches the collection’s common primary key. Two members that share common column names may not both be inlined for sampling.
- class dataframely.Column(*, nullable: bool | None = None, primary_key: bool = False, check: Callable[[Expr], Expr] | Sequence[Callable[[Expr], Expr]] | Mapping[str, Callable[[Expr], Expr]] | None = None, alias: str | None = None, metadata: dict[str, Any] | None = None)[source]¶
Bases: ABC
Abstract base class for data frame column definitions.
This class is merely supposed to be used in Schema definitions.
- Attributes:
col: Obtain a Polars column expression for the column.
dtype: The polars dtype equivalent of this column definition’s data type.
name: Get the name of the column in a schema.
pyarrow_dtype: The pyarrow dtype equivalent of this column data type.
Methods
as_dict(expr): Turn the column definition into a dictionary.
from_dict(data): Read the column definition from a dictionary.
matches(other, expr): Check whether this column semantically matches another column.
pyarrow_field(name): Obtain the pyarrow field of this column definition.
sample(generator[, n]): Sample random elements adhering to the constraints of this column.
sqlalchemy_column(name, dialect): Obtain the SQL column specification of this column definition.
sqlalchemy_dtype(dialect): The sqlalchemy dtype equivalent of this column data type.
validate_dtype(dtype): Validate if the polars data type satisfies the column definition.
validation_rules(expr): A set of rules evaluating whether a data frame column satisfies the column's constraints.
- as_dict(expr: Expr) dict[str, Any][source]¶
Turn the column definition into a dictionary.
If the column definition references other column definitions, they will be turned into dictionaries recursively.
- Args:
- expr: An expression referencing the column to turn into a dictionary. This
is required to properly encode custom checks.
- Returns:
The column definition as dictionary.
- Note:
This method stores custom checks as expressions rather than callables to allow for serialization.
- Note:
Do NOT use the returned object to evaluate semantic equality of two columns. It may yield different results than matches().
- Attention:
This method is only intended for internal use.
- property col: Expr¶
Obtain a Polars column expression for the column.
- abstract property dtype: DataType¶
The polars dtype equivalent of this column definition’s data type.
This is primarily used for creating empty data frames with an appropriate schema. Thus, it should describe the default dtype equivalent if this data type encompasses multiple underlying data types.
- classmethod from_dict(data: dict[str, Any]) Self[source]¶
Read the column definition from a dictionary.
- Args:
data: The dictionary that was created via as_dict().
- Returns:
The column definition read from the dictionary.
- Attention:
This method is only intended for internal use.
- matches(other: Column, expr: Expr) bool[source]¶
Check whether this column semantically matches another column.
- Args:
- other: The column to compare with.
- expr: An expression referencing the column. This is required to properly evaluate the equivalence of custom checks.
- Returns:
Whether the columns are semantically equal.
- property name: str¶
Get the name of the column in a schema.
- abstract property pyarrow_dtype: pa.DataType¶
The pyarrow dtype equivalent of this column data type.
- pyarrow_field(name: str) pa.Field[source]¶
Obtain the pyarrow field of this column definition.
- Args:
name: The name of the column.
- Returns:
The pyarrow field definition.
- sample(generator: Generator, n: int = 1) Series[source]¶
Sample random elements adhering to the constraints of this column.
- Args:
- generator: The generator to use for sampling elements.
- n: The number of elements to sample.
- Returns:
A series with the predefined number of elements. All elements are guaranteed to adhere to the column’s constraints.
- Raises:
- ValueError: If this column has a custom check. In this case, random values
cannot be guaranteed to adhere to the column’s constraints while providing any guarantees on the computational complexity.
- sqlalchemy_column(name: str, dialect: sa.Dialect) sa.Column[source]¶
Obtain the SQL column specification of this column definition.
- Args:
- name: The name of the column.
- dialect: The SQL dialect for which to generate the column specification.
- Returns:
The column as specified in
sqlalchemy.
- abstractmethod sqlalchemy_dtype(dialect: sa.Dialect) sa_TypeEngine[source]¶
The sqlalchemy dtype equivalent of this column data type.
- validate_dtype(dtype: DataType | DataTypeClass) bool[source]¶
Validate if the polars data type satisfies the column definition.
- Args:
dtype: The dtype to validate.
- Returns:
Whether the dtype is valid.
- validation_rules(expr: Expr) dict[str, Expr][source]¶
A set of rules evaluating whether a data frame column satisfies the column’s constraints.
- Args:
- expr: An expression referencing the column of the data frame, i.e. an
expression created by calling
polars.col().
- Returns:
A mapping from validation rule names to expressions that provide exactly one boolean value per column item indicating whether validation with respect to the rule is successful. A value of False indicates invalid data, i.e. unsuccessful validation.
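The shape of this mapping can be modelled without polars: each rule name maps to one boolean per column item, and an item is valid only if every rule yields True for it. The rule names `nullability` and `max` and the plain-Python predicates below are assumptions for illustration; the real method returns polars expressions, whose null handling differs from this simplified model.

```python
# Hypothetical column values, including a null.
values = [5, None, 12]

# Each rule maps one item to True (valid) or False (invalid).
rules = {
    "nullability": lambda v: v is not None,
    "max": lambda v: v is None or v <= 10,
}

# One boolean per item and rule, analogous to evaluating each rule expression.
results = {name: [rule(v) for v in values] for name, rule in rules.items()}

# An item passes validation only if no rule marks it as False.
valid = [all(results[name][i] for name in results) for i in range(len(values))]
```

In this sketch the second item fails the hypothetical `nullability` rule and the third fails `max`, so only the first item is valid.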
- class dataframely.Config(**options: Unpack[Options])[source]¶
Bases: ContextDecorator
An object to track global configuration for operations in dataframely.
Methods
__call__(func): Call self as a function.
Restore the defaults of the configuration.
set_max_sampling_iterations(iterations): Set the maximum number of sampling iterations to use on Schema.sample().
- static set_max_sampling_iterations(iterations: int) None[source]¶
Set the maximum number of sampling iterations to use on Schema.sample().
- class dataframely.DataFrame(data: FrameInitTypes | None = None, schema: SchemaDefinition | None = None, *, schema_overrides: SchemaDict | None = None, strict: bool = True, orient: Orientation | None = None, infer_schema_length: int | None = 100, nan_to_null: bool = False)[source]¶
Bases: DataFrame, Generic[S]
Generic wrapper around a polars.DataFrame to attach schema information.
This class is merely used for the type system and never actually instantiated. This means that it won’t exist at runtime and isinstance(PolarsDataFrame, <var>) will always fail. Accordingly, users should not try to create instances of this class.
- Attributes:
- clear
- clone
columns: Get or set column names.
dtypes: Get the column data types.
flags: Get flags that are set on the columns of this DataFrame.
height: Get the number of rows.
- lazy
plot: Create a plot namespace.
- rechunk
schema: Get an ordered mapping of column names to their data type.
- set_sorted
shape: Get the shape of the DataFrame.
- shrink_to_fit
style: Create a Great Table for styling.
width: Get the number of columns.
Methods
Approximate count of unique values.
bottom_k(k, *, by[, reverse])Return the k smallest rows.
cast(dtypes, *[, strict])Cast DataFrame column(s) to the specified dtype(s).
Get an ordered mapping of column names to their data type.
corr(**kwargs)Return pairwise Pearson product-moment correlation coefficients between columns.
count()Return the number of non-null elements for each column.
describe([percentiles, interpolation])Summary statistics for a DataFrame.
deserialize(source, *[, format])Read a serialized DataFrame from a file.
drop(*columns[, strict])Remove columns from the dataframe.
drop_in_place(name)Drop a single column in-place and return the dropped column.
drop_nans([subset])Drop all rows that contain one or more NaN values.
drop_nulls([subset])Drop all rows that contain one or more null values.
equals(other, *[, null_equal])Check whether the DataFrame is equal to another DataFrame.
estimated_size([unit])Return an estimation of the total (heap) allocated size of the DataFrame.
explode(columns, *more_columns)Explode the dataframe to long format by exploding the given columns.
extend(other)Extend the memory backed by this DataFrame with the values from other.
fill_nan(value)Fill floating point NaN values by an Expression evaluation.
fill_null([value, strategy, limit, ...])Fill null values using the specified value or strategy.
filter(*predicates, **constraints)Filter rows, retaining those that match the given predicate expression(s).
fold(operation)Apply a horizontal reduction on a DataFrame.
gather_every(n[, offset])Take every nth row in the DataFrame and return as a new DataFrame.
Get a single column by name.
get_column_index(name)Find the index of a column by name.
Get the DataFrame as a List of Series.
glimpse()Return a dense preview of the DataFrame.
group_by(*by[, maintain_order])Start a group by operation.
group_by_dynamic(index_column, *, every[, ...])Group based on a time value (or index value of type Int32, Int64).
hash_rows([seed, seed_1, seed_2, seed_3])Hash and combine the rows in this DataFrame.
head([n])Get the first n rows.
hstack(columns, *[, in_place])Return a new DataFrame grown horizontally by stacking multiple Series to it.
insert_column(index, column)Insert a Series (or expression) at a certain column index.
Interpolate intermediate values.
Get a mask of all duplicated rows in this DataFrame.
is_empty()Returns True if the DataFrame contains no rows.
Get a mask of all unique rows in this DataFrame.
item([row, column])Return the DataFrame as a scalar, or return the element at the given row/column.
Returns an iterator over the columns of this DataFrame.
Returns an iterator over the DataFrame of rows of python-native values.
iter_slices([n_rows])Returns a non-copying iterator of slices over the underlying DataFrame.
join(other[, on, how, left_on, right_on, ...])Join in SQL-like fashion.
join_asof(other, *[, left_on, right_on, on, ...])Perform an asof join.
join_where(other, *predicates[, suffix])Perform a join based on one or multiple (in)equality predicates.
limit([n])Get the first n rows.
map_columns(column_names, function, *args, ...)Apply eager functions to columns of a DataFrame.
map_rows(function[, return_dtype, ...])Apply a custom/user-defined function (UDF) over the rows of the DataFrame.
match_to_schema(schema, *[, ...])Match or evolve the schema of a LazyFrame into a specific schema.
max()Aggregate the columns of this DataFrame to their maximum value.
Get the maximum value horizontally across columns.
mean()Aggregate the columns of this DataFrame to their mean value.
mean_horizontal(*[, ignore_nulls])Take the mean of all values horizontally across columns.
median()Aggregate the columns of this DataFrame to their median value.
melt([id_vars, value_vars, variable_name, ...])Unpivot a DataFrame from wide to long format.
merge_sorted(other, key)Take two sorted DataFrames and merge them by the sorted key.
min()Aggregate the columns of this DataFrame to their minimum value.
Get the minimum value horizontally across columns.
n_chunks()Get number of chunks used by the ChunkedArrays of this DataFrame.
n_unique([subset])Return the number of unique rows, or the number of unique row-subsets.
Create a new DataFrame that shows the null counts per column.
Group by the given columns and return the groups as separate dataframes.
pipe(function, *args, **kwargs)Offers a structured way to apply a sequence of user-defined functions (UDFs).
pivot(on, *[, index, values, ...])Create a spreadsheet-style pivot table as a DataFrame.
product()Aggregate the columns of this DataFrame to their product values.
quantile(quantile[, interpolation])Aggregate the columns of this DataFrame to their quantile value.
remove(*predicates, **constraints)Remove rows, dropping those that match the given predicate expression(s).
rename(mapping, *[, strict])Rename column names.
replace_column(index, column)Replace a column at an index location.
reverse()Reverse the DataFrame.
rolling(index_column, *, period[, offset, ...])Create rolling groups based on a temporal or integer column.
row()Get the values of a single row, either by index or by predicate.
rows()Returns all data in the DataFrame as a list of rows of python-native values.
rows_by_key(key, *[, named, ...])Returns all data as a dictionary of python-native values keyed by some column.
sample([n, fraction, with_replacement, ...])Sample from this DataFrame.
select(*exprs, **named_exprs)Select columns from this DataFrame.
select_seq(*exprs, **named_exprs)Select columns from this DataFrame.
serialize([file, format])Serialize this DataFrame to a file or string in JSON format.
shift([n, fill_value])Shift values by the given number of indices.
slice(offset[, length])Get a slice of this DataFrame.
sort(by, *more_by[, descending, nulls_last, ...])Sort the dataframe by the given columns.
sql(query, *[, table_name])Execute a SQL query against the DataFrame.
std([ddof])Aggregate the columns of this DataFrame to their standard deviation value.
sum()Aggregate the columns of this DataFrame to their sum value.
sum_horizontal(*[, ignore_nulls])Sum all values horizontally across columns.
tail([n])Get the last n rows.
to_arrow(*[, compat_level])Collect the underlying arrow arrays in an Arrow Table.
to_dict()Convert DataFrame to a dictionary mapping column name to values.
to_dicts()Convert every row to a dictionary of Python-native values.
to_dummies([columns, separator, drop_first, ...])Convert categorical variables into dummy/indicator variables.
to_init_repr([n])Convert DataFrame to instantiable string representation.
to_jax()Convert DataFrame to a Jax Array, or dict of Jax Arrays.
to_numpy(*[, order, writable, allow_copy, ...])Convert this DataFrame to a NumPy ndarray.
to_pandas(*[, use_pyarrow_extension_array])Convert this DataFrame to a pandas DataFrame.
to_series([index])Select column as Series at index location.
to_struct([name])Convert a DataFrame to a Series of type Struct.
to_torch()Convert DataFrame to a PyTorch Tensor, Dataset, or dict of Tensors.
top_k(k, *, by[, reverse])Return the k largest rows.
transpose(*[, include_header, header_name, ...])Transpose a DataFrame over the diagonal.
unique([subset, keep, maintain_order])Drop duplicate rows from this dataframe.
unnest(columns, *more_columns)Decompose struct columns into separate columns for each of their fields.
unpivot([on, index, variable_name, value_name])Unpivot a DataFrame from wide to long format.
unstack(*, step[, how, columns, fill_values])Unstack a long table to a wide form without doing an aggregation.
update(other[, on, how, left_on, right_on, ...])Update the values in this DataFrame with the values in other.
upsample(time_column, *, every[, group_by, ...])Upsample a DataFrame at a regular frequency.
var([ddof])Aggregate the columns of this DataFrame to their variance value.
vstack(other, *[, in_place])Grow this DataFrame vertically by stacking a DataFrame to it.
with_columns(*exprs, **named_exprs)Add columns to this DataFrame.
with_columns_seq(*exprs, **named_exprs)Add columns to this DataFrame.
with_row_count([name, offset])Add a column at index 0 that counts the rows.
with_row_index([name, offset])Add a row index as the first column in the DataFrame.
write_avro(file[, compression, name])Write to Apache Avro file.
write_clipboard(*[, separator])Copy DataFrame in csv format to the system clipboard with write_csv.
write_csv([file, ...])Write to comma-separated values (CSV) file.
write_database(table_name, connection, *[, ...])Write the data in a Polars DataFrame to a database.
write_delta(target, *[, mode, ...])Write DataFrame as delta table.
write_excel([workbook, worksheet, position, ...])Write frame data to a table in an Excel workbook/worksheet.
write_iceberg(target, mode)Write DataFrame to an Iceberg table.
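write_ipc(file, *[, compression, ...])Write to Arrow IPC binary stream or Feather file.
write_ipc_stream(file, *[, compression, ...])Write to Arrow IPC record batch stream.
write_json([file])Serialize to JSON representation.
write_ndjson([file])Serialize to newline delimited JSON representation.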
write_parquet(file, *[, compression, ...])Write to Apache Parquet file.
- approx_n_unique() DataFrame[source]¶
Approximate count of unique values.
Deprecated since version 0.20.11: Use the select(pl.all().approx_n_unique()) method instead.
This is done using the HyperLogLog++ algorithm for cardinality estimation.
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> df.approx_n_unique()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞═════╪═════╡
│ 4   ┆ 2   │
└─────┴─────┘
- bottom_k(k: int, *, by: IntoExpr | Iterable[IntoExpr], reverse: bool | Sequence[bool] = False) DataFrame[source]¶
Return the k smallest rows.
Non-null elements are always preferred over null elements, regardless of the value of reverse. The output is not guaranteed to be in any particular order; call sort() after this function if you wish the output to be sorted.
Changed in version 1.0.0: The descending parameter was renamed reverse.
- Parameters:
- k
Number of rows to return.
- by
Column(s) used to determine the bottom rows. Accepts expression input. Strings are parsed as column names.
- reverse
Consider the k largest elements of the by column(s) (instead of the k smallest). This can be specified per column by passing a sequence of booleans.
Examples
>>> df = pl.DataFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [2, 1, 1, 3, 2, 1],
...     }
... )
Get the rows which contain the 4 smallest values in column b.
>>> df.bottom_k(4, by="b")
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ b   ┆ 1   │
│ a   ┆ 1   │
│ c   ┆ 1   │
│ a   ┆ 2   │
└─────┴─────┘
Get the rows which contain the 4 smallest values when sorting on column a and b.
>>> df.bottom_k(4, by=["a", "b"])
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a   ┆ 1   │
│ a   ┆ 2   │
│ b   ┆ 1   │
│ b   ┆ 2   │
└─────┴─────┘
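The rule that non-null elements are preferred regardless of reverse can be modelled in plain Python. This is a toy sketch, not how Polars implements bottom_k: the helper name bottom_k_rows is hypothetical, rows are assumed to be dicts, null is represented by None, and only a single by column is handled.

```python
import heapq

def bottom_k_rows(rows, k, by, reverse=False):
    """Toy model of bottom_k: rows are dicts, `by` is one column name."""
    non_null = [r for r in rows if r[by] is not None]
    nulls = [r for r in rows if r[by] is None]
    # Non-null rows are always preferred, whatever `reverse` says.
    pick = heapq.nlargest if reverse else heapq.nsmallest
    chosen = pick(k, non_null, key=lambda r: r[by])
    # Fall back to null rows only when there are too few non-null rows.
    return chosen + nulls[: k - len(chosen)]

rows = [{"a": "a", "b": 2}, {"a": "b", "b": None}, {"a": "c", "b": 1}]
print([r["b"] for r in bottom_k_rows(rows, 2, by="b")])
print([r["b"] for r in bottom_k_rows(rows, 3, by="b", reverse=True)])
```

Unlike this sketch, the real method makes no promise about the order of the returned rows.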
- cast(dtypes: Mapping[ColumnNameOrSelector | PolarsDataType, PolarsDataType | PythonDataType] | PolarsDataType, *, strict: bool = True) DataFrame[source]¶
Cast DataFrame column(s) to the specified dtype(s).
- Parameters:
- dtypes
Mapping of column names (or selector) to dtypes, or a single dtype to which all columns will be cast.
- strict
Raise if cast is invalid on rows after predicates are pushed down. If False, invalid casts will produce null values.
Examples
>>> from datetime import date >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6.0, 7.0, 8.0], ... "ham": [date(2020, 1, 2), date(2021, 3, 4), date(2022, 5, 6)], ... } ... )
Cast specific frame columns to the specified dtypes:
>>> df.cast({"foo": pl.Float32, "bar": pl.UInt8}) shape: (3, 3) ┌─────┬─────┬────────────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ f32 ┆ u8 ┆ date │ ╞═════╪═════╪════════════╡ │ 1.0 ┆ 6 ┆ 2020-01-02 │ │ 2.0 ┆ 7 ┆ 2021-03-04 │ │ 3.0 ┆ 8 ┆ 2022-05-06 │ └─────┴─────┴────────────┘
Cast all frame columns matching one dtype (or dtype group) to another dtype:
>>> df.cast({pl.Date: pl.Datetime}) shape: (3, 3) ┌─────┬─────┬─────────────────────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ datetime[μs] │ ╞═════╪═════╪═════════════════════╡ │ 1 ┆ 6.0 ┆ 2020-01-02 00:00:00 │ │ 2 ┆ 7.0 ┆ 2021-03-04 00:00:00 │ │ 3 ┆ 8.0 ┆ 2022-05-06 00:00:00 │ └─────┴─────┴─────────────────────┘
Use selectors to define the columns being cast:
>>> import polars.selectors as cs >>> df.cast({cs.numeric(): pl.UInt32, cs.temporal(): pl.String}) shape: (3, 3) ┌─────┬─────┬────────────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ u32 ┆ u32 ┆ str │ ╞═════╪═════╪════════════╡ │ 1 ┆ 6 ┆ 2020-01-02 │ │ 2 ┆ 7 ┆ 2021-03-04 │ │ 3 ┆ 8 ┆ 2022-05-06 │ └─────┴─────┴────────────┘
Cast all frame columns to the specified dtype:
>>> df.cast(pl.String).to_dict(as_series=False) {'foo': ['1', '2', '3'], 'bar': ['6.0', '7.0', '8.0'], 'ham': ['2020-01-02', '2021-03-04', '2022-05-06']}
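The effect of the strict flag can be pictured with a small plain-Python model. The helper cast_values below is hypothetical and operates on a list rather than a Polars column; it only illustrates the documented contract that an invalid cast raises when strict is True and yields null (None here) when strict is False.

```python
def cast_values(values, to_type, strict=True):
    """Toy model of strict vs. non-strict casting for one column."""
    out = []
    for v in values:
        if v is None:
            out.append(None)  # nulls stay null either way
            continue
        try:
            out.append(to_type(v))
        except (TypeError, ValueError):
            if strict:
                raise  # strict=True: an invalid cast is an error
            out.append(None)  # strict=False: invalid casts become null
    return out

print(cast_values(["1", "2", "x"], int, strict=False))
```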
- clear = None¶
- clone = None¶
- collect_schema() Schema[source]¶
Get an ordered mapping of column names to their data type.
This is an alias for the schema property.
Notes
This method is included to facilitate writing code that is generic for both DataFrame and LazyFrame.
Examples
Determine the schema.
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.collect_schema()
Schema({'foo': Int64, 'bar': Float64, 'ham': String})
Access various properties of the schema using the Schema object.
>>> schema = df.collect_schema()
>>> schema["bar"]
Float64
>>> schema.names()
['foo', 'bar', 'ham']
>>> schema.dtypes()
[Int64, Float64, String]
>>> schema.len()
3
- property columns: list[str]¶
Get or set column names.
- Returns:
- list of str
A list containing the name of each column in order.
Examples
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.columns
['foo', 'bar', 'ham']
Set column names:
>>> df.columns = ["apple", "banana", "orange"]
>>> df
shape: (3, 3)
┌───────┬────────┬────────┐
│ apple ┆ banana ┆ orange │
│ ---   ┆ ---    ┆ ---    │
│ i64   ┆ i64    ┆ str    │
╞═══════╪════════╪════════╡
│ 1     ┆ 6      ┆ a      │
│ 2     ┆ 7      ┆ b      │
│ 3     ┆ 8      ┆ c      │
└───────┴────────┴────────┘
- corr(**kwargs: Any) DataFrame[source]¶
Return pairwise Pearson product-moment correlation coefficients between columns.
See numpy corrcoef for more information: https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html
- Parameters:
- **kwargs
Keyword arguments are passed to numpy corrcoef.
Notes
This functionality requires numpy to be installed.
Examples
>>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [3, 2, 1], "ham": [7, 8, 9]}) >>> df.corr() shape: (3, 3) ┌──────┬──────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ f64 │ ╞══════╪══════╪══════╡ │ 1.0 ┆ -1.0 ┆ 1.0 │ │ -1.0 ┆ 1.0 ┆ -1.0 │ │ 1.0 ┆ -1.0 ┆ 1.0 │ └──────┴──────┴──────┘
- count() DataFrame[source]¶
Return the number of non-null elements for each column.
Examples
>>> df = pl.DataFrame( ... {"a": [1, 2, 3, 4], "b": [1, 2, 1, None], "c": [None, None, None, None]} ... ) >>> df.count() shape: (1, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ u32 ┆ u32 ┆ u32 │ ╞═════╪═════╪═════╡ │ 4 ┆ 3 ┆ 0 │ └─────┴─────┴─────┘
- describe(percentiles: Sequence[float] | float | None = (0.25, 0.5, 0.75), *, interpolation: QuantileMethod = 'nearest') DataFrame[source]¶
Summary statistics for a DataFrame.
- Parameters:
- percentiles
One or more percentiles to include in the summary statistics. All values must be in the range [0, 1].
- interpolation{‘nearest’, ‘higher’, ‘lower’, ‘midpoint’, ‘linear’, ‘equiprobable’}
Interpolation method used when calculating percentiles.
Warning
We do not guarantee the output of describe to be stable. It will show statistics that we deem informative, and may be updated in the future. Using describe programmatically (versus interactive exploration) is not recommended for this reason.
Notes
The median is included by default as the 50% percentile.
Examples
>>> from datetime import date, time >>> df = pl.DataFrame( ... { ... "float": [1.0, 2.8, 3.0], ... "int": [40, 50, None], ... "bool": [True, False, True], ... "str": ["zz", "xx", "yy"], ... "date": [date(2020, 1, 1), date(2021, 7, 5), date(2022, 12, 31)], ... "time": [time(10, 20, 30), time(14, 45, 50), time(23, 15, 10)], ... } ... )
Show default frame statistics:
>>> df.describe() shape: (9, 7) ┌────────────┬──────────┬──────────┬──────────┬──────┬─────────────────────┬──────────┐ │ statistic ┆ float ┆ int ┆ bool ┆ str ┆ date ┆ time │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ f64 ┆ f64 ┆ f64 ┆ str ┆ str ┆ str │ ╞════════════╪══════════╪══════════╪══════════╪══════╪═════════════════════╪══════════╡ │ count ┆ 3.0 ┆ 2.0 ┆ 3.0 ┆ 3 ┆ 3 ┆ 3 │ │ null_count ┆ 0.0 ┆ 1.0 ┆ 0.0 ┆ 0 ┆ 0 ┆ 0 │ │ mean ┆ 2.266667 ┆ 45.0 ┆ 0.666667 ┆ null ┆ 2021-07-02 16:00:00 ┆ 16:07:10 │ │ std ┆ 1.101514 ┆ 7.071068 ┆ null ┆ null ┆ null ┆ null │ │ min ┆ 1.0 ┆ 40.0 ┆ 0.0 ┆ xx ┆ 2020-01-01 ┆ 10:20:30 │ │ 25% ┆ 2.8 ┆ 40.0 ┆ null ┆ null ┆ 2021-07-05 ┆ 14:45:50 │ │ 50% ┆ 2.8 ┆ 50.0 ┆ null ┆ null ┆ 2021-07-05 ┆ 14:45:50 │ │ 75% ┆ 3.0 ┆ 50.0 ┆ null ┆ null ┆ 2022-12-31 ┆ 23:15:10 │ │ max ┆ 3.0 ┆ 50.0 ┆ 1.0 ┆ zz ┆ 2022-12-31 ┆ 23:15:10 │ └────────────┴──────────┴──────────┴──────────┴──────┴─────────────────────┴──────────┘
Customize which percentiles are displayed, applying linear interpolation:
>>> with pl.Config(tbl_rows=12): ... df.describe( ... percentiles=[0.1, 0.3, 0.5, 0.7, 0.9], ... interpolation="linear", ... ) shape: (11, 7) ┌────────────┬──────────┬──────────┬──────────┬──────┬─────────────────────┬──────────┐ │ statistic ┆ float ┆ int ┆ bool ┆ str ┆ date ┆ time │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ f64 ┆ f64 ┆ f64 ┆ str ┆ str ┆ str │ ╞════════════╪══════════╪══════════╪══════════╪══════╪═════════════════════╪══════════╡ │ count ┆ 3.0 ┆ 2.0 ┆ 3.0 ┆ 3 ┆ 3 ┆ 3 │ │ null_count ┆ 0.0 ┆ 1.0 ┆ 0.0 ┆ 0 ┆ 0 ┆ 0 │ │ mean ┆ 2.266667 ┆ 45.0 ┆ 0.666667 ┆ null ┆ 2021-07-02 16:00:00 ┆ 16:07:10 │ │ std ┆ 1.101514 ┆ 7.071068 ┆ null ┆ null ┆ null ┆ null │ │ min ┆ 1.0 ┆ 40.0 ┆ 0.0 ┆ xx ┆ 2020-01-01 ┆ 10:20:30 │ │ 10% ┆ 1.36 ┆ 41.0 ┆ null ┆ null ┆ 2020-04-20 ┆ 11:13:34 │ │ 30% ┆ 2.08 ┆ 43.0 ┆ null ┆ null ┆ 2020-11-26 ┆ 12:59:42 │ │ 50% ┆ 2.8 ┆ 45.0 ┆ null ┆ null ┆ 2021-07-05 ┆ 14:45:50 │ │ 70% ┆ 2.88 ┆ 47.0 ┆ null ┆ null ┆ 2022-02-07 ┆ 18:09:34 │ │ 90% ┆ 2.96 ┆ 49.0 ┆ null ┆ null ┆ 2022-09-13 ┆ 21:33:18 │ │ max ┆ 3.0 ┆ 50.0 ┆ 1.0 ┆ zz ┆ 2022-12-31 ┆ 23:15:10 │ └────────────┴──────────┴──────────┴──────────┴──────┴─────────────────────┴──────────┘
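The percentile rows produced with interpolation="linear" can be checked by hand. The sketch below is a plain-Python version of linear interpolation between the two nearest ranks, with nulls ignored; the helper name linear_quantile is hypothetical, and this is a worked illustration rather than Polars' implementation. Applied to the float column [1.0, 2.8, 3.0] it reproduces the 10% to 90% values shown in the table above.

```python
import math

def linear_quantile(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    data = sorted(v for v in values if v is not None)
    pos = q * (len(data) - 1)  # fractional rank in [0, n - 1]
    lo, hi = math.floor(pos), math.ceil(pos)
    return data[lo] + (pos - lo) * (data[hi] - data[lo])

floats = [1.0, 2.8, 3.0]
for q in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"{q:.0%}", round(linear_quantile(floats, q), 2))
```

For example, the 10% quantile sits at rank 0.1 × 2 = 0.2, so it is 1.0 + 0.2 × (2.8 − 1.0) = 1.36.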
- classmethod deserialize(source: str | Path | IOBase, *, format: SerializationFormat = 'binary') DataFrame[source]¶
Read a serialized DataFrame from a file.
- Parameters:
- source
Path to a file or a file-like object (by file-like object, we refer to objects that have a read() method, such as a file handler (e.g. via builtin open function) or BytesIO).
- format
The format with which the DataFrame was serialized. Options:
“binary”: Deserialize from binary format (bytes). This is the default.
“json”: Deserialize from JSON format (string).
Notes
Serialization is not stable across Polars versions: a DataFrame serialized in one Polars version may not be deserializable in another Polars version.
Examples
>>> import io >>> df = pl.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]}) >>> bytes = df.serialize() >>> pl.DataFrame.deserialize(io.BytesIO(bytes)) shape: (3, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ f64 │ ╞═════╪═════╡ │ 1 ┆ 4.0 │ │ 2 ┆ 5.0 │ │ 3 ┆ 6.0 │ └─────┴─────┘
- drop(*columns: ColumnNameOrSelector | Iterable[ColumnNameOrSelector], strict: bool = True) DataFrame[source]¶
Remove columns from the dataframe.
- Parameters:
- *columns
Names of the columns that should be removed from the dataframe. Accepts column selector input.
- strict
Validate that all column names exist in the current schema, and throw an exception if any do not.
Examples
Drop a single column by passing the name of that column.
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6.0, 7.0, 8.0], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.drop("ham") shape: (3, 2) ┌─────┬─────┐ │ foo ┆ bar │ │ --- ┆ --- │ │ i64 ┆ f64 │ ╞═════╪═════╡ │ 1 ┆ 6.0 │ │ 2 ┆ 7.0 │ │ 3 ┆ 8.0 │ └─────┴─────┘
Drop multiple columns by passing a list of column names.
>>> df.drop(["bar", "ham"]) shape: (3, 1) ┌─────┐ │ foo │ │ --- │ │ i64 │ ╞═════╡ │ 1 │ │ 2 │ │ 3 │ └─────┘
Drop multiple columns by passing a selector.
>>> import polars.selectors as cs >>> df.drop(cs.numeric()) shape: (3, 1) ┌─────┐ │ ham │ │ --- │ │ str │ ╞═════╡ │ a │ │ b │ │ c │ └─────┘
Use positional arguments to drop multiple columns.
>>> df.drop("foo", "ham") shape: (3, 1) ┌─────┐ │ bar │ │ --- │ │ f64 │ ╞═════╡ │ 6.0 │ │ 7.0 │ │ 8.0 │ └─────┘
- drop_in_place(name: str) Series[source]¶
Drop a single column in-place and return the dropped column.
- Parameters:
- name
Name of the column to drop.
- Returns:
- Series
The dropped column.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.drop_in_place("ham") shape: (3,) Series: 'ham' [str] [ "a" "b" "c" ]
- drop_nans(subset: ColumnNameOrSelector | Collection[ColumnNameOrSelector] | None = None) DataFrame[source]¶
Drop all rows that contain one or more NaN values.
The original order of the remaining rows is preserved.
- Parameters:
- subset
Column name(s) for which NaN values are considered; if set to None (default), use all columns (note that only floating-point columns can contain NaNs).
Notes
A NaN value is not the same as a null value. To drop null values, use drop_nulls().
Examples
>>> df = pl.DataFrame( ... { ... "foo": [-20.5, float("nan"), 80.0], ... "bar": [float("nan"), 110.0, 25.5], ... "ham": ["xxx", "yyy", None], ... } ... )
The default behavior of this method is to drop rows where any single value in the row is NaN:
>>> df.drop_nans() shape: (1, 3) ┌──────┬──────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ str │ ╞══════╪══════╪══════╡ │ 80.0 ┆ 25.5 ┆ null │ └──────┴──────┴──────┘
This behaviour can be constrained to consider only a subset of columns, as defined by name, or with a selector. For example, dropping rows only if there is a NaN in the “bar” column:
>>> df.drop_nans(subset=["bar"]) shape: (2, 3) ┌──────┬───────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ str │ ╞══════╪═══════╪══════╡ │ NaN ┆ 110.0 ┆ yyy │ │ 80.0 ┆ 25.5 ┆ null │ └──────┴───────┴──────┘
Dropping a row only if all values are NaN requires a different formulation:
>>> df = pl.DataFrame( ... { ... "a": [float("nan"), float("nan"), float("nan"), float("nan")], ... "b": [10.0, 2.5, float("nan"), 5.25], ... "c": [65.75, float("nan"), float("nan"), 10.5], ... } ... ) >>> df.filter(~pl.all_horizontal(pl.all().is_nan())) shape: (3, 3) ┌─────┬──────┬───────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ f64 │ ╞═════╪══════╪═══════╡ │ NaN ┆ 10.0 ┆ 65.75 │ │ NaN ┆ 2.5 ┆ NaN │ │ NaN ┆ 5.25 ┆ 10.5 │ └─────┴──────┴───────┘
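The difference between NaN and null, and between the "any" and "all" formulations, can be sketched in plain Python. The helpers is_nan and drop_nan_rows and the how parameter are hypothetical names used only for this illustration; rows are tuples and null is modelled as None.

```python
import math

def is_nan(v):
    # None (null) is not NaN; only floats can be NaN.
    return isinstance(v, float) and math.isnan(v)

def drop_nan_rows(rows, how="any"):
    """Toy model: drop rows where any (or all) values are NaN."""
    test = any if how == "any" else all
    return [r for r in rows if not test(is_nan(v) for v in r)]

nan = float("nan")
rows = [(-20.5, nan, "xxx"), (nan, 110.0, "yyy"), (80.0, 25.5, None)]
print(drop_nan_rows(rows))                  # row with None is kept
print(len(drop_nan_rows(rows, how="all")))  # no row is entirely NaN
```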
- drop_nulls(subset: ColumnNameOrSelector | Collection[ColumnNameOrSelector] | None = None) DataFrame[source]¶
Drop all rows that contain one or more null values.
The original order of the remaining rows is preserved.
- Parameters:
- subset
Column name(s) for which null values are considered. If set to None (default), use all columns.
Notes
A null value is not the same as a NaN value. To drop NaN values, use drop_nans().
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, None, 8], ... "ham": ["a", "b", None], ... } ... )
The default behavior of this method is to drop rows where any single value of the row is null.
>>> df.drop_nulls() shape: (1, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ └─────┴─────┴─────┘
This behaviour can be constrained to consider only a subset of columns, as defined by name or with a selector. For example, dropping rows if there is a null in any of the integer columns:
>>> import polars.selectors as cs >>> df.drop_nulls(subset=cs.integer()) shape: (2, 3) ┌─────┬─────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪══════╡ │ 1 ┆ 6 ┆ a │ │ 3 ┆ 8 ┆ null │ └─────┴─────┴──────┘
Below are some additional examples that show how to drop null values based on other conditions.
>>> df = pl.DataFrame( ... { ... "a": [None, None, None, None], ... "b": [1, 2, None, 1], ... "c": [1, None, None, 1], ... } ... ) >>> df shape: (4, 3) ┌──────┬──────┬──────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ null ┆ i64 ┆ i64 │ ╞══════╪══════╪══════╡ │ null ┆ 1 ┆ 1 │ │ null ┆ 2 ┆ null │ │ null ┆ null ┆ null │ │ null ┆ 1 ┆ 1 │ └──────┴──────┴──────┘
Drop a row only if all values are null:
>>> df.filter(~pl.all_horizontal(pl.all().is_null())) shape: (3, 3) ┌──────┬─────┬──────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ null ┆ i64 ┆ i64 │ ╞══════╪═════╪══════╡ │ null ┆ 1 ┆ 1 │ │ null ┆ 2 ┆ null │ │ null ┆ 1 ┆ 1 │ └──────┴─────┴──────┘
Drop a column if all values are null:
>>> df[[s.name for s in df if not (s.null_count() == df.height)]] shape: (4, 2) ┌──────┬──────┐ │ b ┆ c │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞══════╪══════╡ │ 1 ┆ 1 │ │ 2 ┆ null │ │ null ┆ null │ │ 1 ┆ 1 │ └──────┴──────┘
- property dtypes: list[DataType]¶
Get the column data types.
The data types can also be found in column headers when printing the DataFrame.
- Returns:
- list of DataType
A list containing the data type of each column in order.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6.0, 7.0, 8.0], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.dtypes [Int64, Float64, String] >>> df shape: (3, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6.0 ┆ a │ │ 2 ┆ 7.0 ┆ b │ │ 3 ┆ 8.0 ┆ c │ └─────┴─────┴─────┘
- equals(other: DataFrame, *, null_equal: bool = True) bool[source]¶
Check whether the DataFrame is equal to another DataFrame.
- Parameters:
- other
DataFrame to compare with.
- null_equal
Consider null values as equal.
See also
polars.testing.assert_frame_equal
Examples
>>> df1 = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df2 = pl.DataFrame(
...     {
...         "foo": [3, 2, 1],
...         "bar": [8.0, 7.0, 6.0],
...         "ham": ["c", "b", "a"],
...     }
... )
>>> df1.equals(df1)
True
>>> df1.equals(df2)
False
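A plain-Python model can make the role of null_equal concrete. The function frames_equal is a hypothetical stand-in that represents a frame as a dict of column name to list of values. It assumes that column order and row order both matter and it ignores dtypes, so it is a simplification rather than the library's comparison logic.

```python
def frames_equal(left, right, null_equal=True):
    """Toy model: frames are dicts mapping column name -> list of values."""
    if list(left) != list(right):
        return False  # column names and their order must match
    for name in left:
        a_col, b_col = left[name], right[name]
        if len(a_col) != len(b_col):
            return False
        for a, b in zip(a_col, b_col):
            if a is None or b is None:
                # Two nulls match only when null_equal is True.
                if not (a is None and b is None and null_equal):
                    return False
            elif a != b:
                return False
    return True

with_null = {"x": [1, None]}
print(frames_equal(with_null, with_null))
print(frames_equal(with_null, with_null, null_equal=False))
```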
- estimated_size(unit: SizeUnit = 'b') int | float[source]¶
Return an estimation of the total (heap) allocated size of the DataFrame.
Estimated size is given in the specified unit (bytes by default).
This estimation is the sum of the sizes of its buffers and validity bitmaps, including those of nested arrays. Multiple arrays may share buffers and bitmaps, so the size of two arrays is not necessarily the sum of the sizes computed by this function. In particular, the size of a StructArray is an upper bound.
When an array is sliced, its allocated size remains constant because the buffer is unchanged. However, this function will yield a smaller number, because it returns the visible size of the buffer, not its total capacity.
FFI buffers are included in this estimation.
- Parameters:
- unit{‘b’, ‘kb’, ‘mb’, ‘gb’, ‘tb’}
Scale the returned size to the given unit.
Notes
For data with Object dtype, the estimated size only reports the pointer size, which is a huge underestimation.
Examples
>>> df = pl.DataFrame(
...     {
...         "x": list(reversed(range(1_000_000))),
...         "y": [v / 1000 for v in range(1_000_000)],
...         "z": [str(v) for v in range(1_000_000)],
...     },
...     schema=[("x", pl.UInt32), ("y", pl.Float64), ("z", pl.String)],
... )
>>> df.estimated_size()
17888890
>>> df.estimated_size("mb")
17.0601749420166
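The two numbers in the example are consistent with binary units, where each step divides by 1024 rather than 1000. That factor is inferred from the example output, not stated in the text, so treat it as an assumption. The table UNIT_FACTORS and the helper scale are hypothetical names.

```python
size_bytes = 17_888_890  # value reported by estimated_size() in the example

# Assumption: each unit step divides by 1024, not 1000.
UNIT_FACTORS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}

def scale(size, unit="b"):
    """Scale a byte count to the requested unit."""
    return size / UNIT_FACTORS[unit]

print(scale(size_bytes, "mb"))
```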
- explode(columns: ColumnNameOrSelector | Iterable[ColumnNameOrSelector], *more_columns: ColumnNameOrSelector) DataFrame[source]¶
Explode the dataframe to long format by exploding the given columns.
- Parameters:
- columns
Column names, expressions, or a selector defining them. The underlying columns being exploded must be of the List or Array data type.
- *more_columns
Additional names of columns to explode, specified as positional arguments.
- Returns:
- DataFrame
Examples
>>> df = pl.DataFrame( ... { ... "letters": ["a", "a", "b", "c"], ... "numbers": [[1], [2, 3], [4, 5], [6, 7, 8]], ... } ... ) >>> df shape: (4, 2) ┌─────────┬───────────┐ │ letters ┆ numbers │ │ --- ┆ --- │ │ str ┆ list[i64] │ ╞═════════╪═══════════╡ │ a ┆ [1] │ │ a ┆ [2, 3] │ │ b ┆ [4, 5] │ │ c ┆ [6, 7, 8] │ └─────────┴───────────┘ >>> df.explode("numbers") shape: (8, 2) ┌─────────┬─────────┐ │ letters ┆ numbers │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════════╪═════════╡ │ a ┆ 1 │ │ a ┆ 2 │ │ a ┆ 3 │ │ b ┆ 4 │ │ b ┆ 5 │ │ c ┆ 6 │ │ c ┆ 7 │ │ c ┆ 8 │ └─────────┴─────────┘
- extend(other: DataFrame) DataFrame[source]¶
Extend the memory backed by this DataFrame with the values from other.
Different from vstack which adds the chunks from other to the chunks of this DataFrame, extend appends the data from other to the underlying memory locations and thus may cause a reallocation.
If this does not cause a reallocation, the resulting data structure will not have any extra chunks and thus will yield faster queries.
Prefer extend over vstack when you want to do a query after a single append. For instance, during online operations where you add n rows and rerun a query.
Prefer vstack over extend when you want to append many times before doing a query. For instance, when you read in multiple files and want to store them in a single DataFrame. In the latter case, finish the sequence of vstack operations with a rechunk.
- Parameters:
- other
DataFrame to vertically add.
Warning
This method modifies the dataframe in-place. The dataframe is returned for convenience only.
Examples
>>> df1 = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) >>> df2 = pl.DataFrame({"foo": [10, 20, 30], "bar": [40, 50, 60]}) >>> df1.extend(df2) shape: (6, 2) ┌─────┬─────┐ │ foo ┆ bar │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 4 │ │ 2 ┆ 5 │ │ 3 ┆ 6 │ │ 10 ┆ 40 │ │ 20 ┆ 50 │ │ 30 ┆ 60 │ └─────┴─────┘
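The trade-off between extend and vstack described above comes down to how many chunks back a column. The class ToyColumn below is a deliberately simplified, hypothetical model in which a column is a list of Python lists. It does not reflect Polars' real memory layout; it only shows why vstack grows the chunk count while extend does not.

```python
class ToyColumn:
    """Toy model of a chunked column; not Polars' actual memory layout."""

    def __init__(self, values):
        self.chunks = [list(values)]

    def vstack(self, other):
        # Cheap: keep the other column's data as additional chunks.
        self.chunks.extend(list(c) for c in other.chunks)

    def extend(self, other):
        # Copy into the last chunk; the chunk count does not grow.
        for chunk in other.chunks:
            self.chunks[-1].extend(chunk)

    def rechunk(self):
        # Merge all chunks into one contiguous chunk.
        self.chunks = [[v for chunk in self.chunks for v in chunk]]

stacked = ToyColumn([1, 2, 3])
stacked.vstack(ToyColumn([10, 20, 30]))
extended = ToyColumn([1, 2, 3])
extended.extend(ToyColumn([10, 20, 30]))
print(len(stacked.chunks), len(extended.chunks))
```

In this model, a final rechunk after many vstack calls leaves the same single-chunk layout that extend produces directly.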
- fill_nan(value: Expr | int | float | None) DataFrame[source]¶
Fill floating-point NaN values with the given value or expression.
- Parameters:
- value
Value used to fill NaN values.
- Returns:
- DataFrame
DataFrame with NaN values replaced by the given value.
Notes
A NaN value is not the same as a null value. To fill null values, use fill_null().
Examples
>>> df = pl.DataFrame( ... { ... "a": [1.5, 2, float("nan"), 4], ... "b": [0.5, 4, float("nan"), 13], ... } ... ) >>> df.fill_nan(99) shape: (4, 2) ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ f64 ┆ f64 │ ╞══════╪══════╡ │ 1.5 ┆ 0.5 │ │ 2.0 ┆ 4.0 │ │ 99.0 ┆ 99.0 │ │ 4.0 ┆ 13.0 │ └──────┴──────┘
- fill_null(value: Any | Expr | None = None, strategy: FillNullStrategy | None = None, limit: int | None = None, *, matches_supertype: bool = True) DataFrame[source]¶
Fill null values using the specified value or strategy.
- Parameters:
- value
Value used to fill null values.
- strategy{None, ‘forward’, ‘backward’, ‘min’, ‘max’, ‘mean’, ‘zero’, ‘one’}
Strategy used to fill null values.
- limit
Number of consecutive null values to fill when using the ‘forward’ or ‘backward’ strategy.
- matches_supertype
Fill all columns whose data type matches a supertype of the fill value.
- Returns:
- DataFrame
DataFrame with null values replaced according to the given value or strategy.
Notes
A null value is not the same as a NaN value. To fill NaN values, use fill_nan().
Examples
>>> df = pl.DataFrame( ... { ... "a": [1, 2, None, 4], ... "b": [0.5, 4, None, 13], ... } ... ) >>> df.fill_null(99) shape: (4, 2) ┌─────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ f64 │ ╞═════╪══════╡ │ 1 ┆ 0.5 │ │ 2 ┆ 4.0 │ │ 99 ┆ 99.0 │ │ 4 ┆ 13.0 │ └─────┴──────┘ >>> df.fill_null(strategy="forward") shape: (4, 2) ┌─────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ f64 │ ╞═════╪══════╡ │ 1 ┆ 0.5 │ │ 2 ┆ 4.0 │ │ 2 ┆ 4.0 │ │ 4 ┆ 13.0 │ └─────┴──────┘
>>> df.fill_null(strategy="max") shape: (4, 2) ┌─────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ f64 │ ╞═════╪══════╡ │ 1 ┆ 0.5 │ │ 2 ┆ 4.0 │ │ 4 ┆ 13.0 │ │ 4 ┆ 13.0 │ └─────┴──────┘
>>> df.fill_null(strategy="zero") shape: (4, 2) ┌─────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ f64 │ ╞═════╪══════╡ │ 1 ┆ 0.5 │ │ 2 ┆ 4.0 │ │ 0 ┆ 0.0 │ │ 4 ┆ 13.0 │ └─────┴──────┘
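None of the examples above exercises the limit parameter. A plain-Python sketch of the forward strategy with a limit is given below. The helper forward_fill is hypothetical and works on a list with None as null; it is meant to illustrate the documented meaning of limit (the number of consecutive nulls to fill), not the library's code.

```python
def forward_fill(values, limit=None):
    """Toy model of strategy='forward' with an optional limit."""
    out, last, run = [], None, 0
    for v in values:
        if v is not None:
            last, run = v, 0
            out.append(v)
        elif last is not None and (limit is None or run < limit):
            run += 1  # one more consecutive null filled
            out.append(last)
        else:
            run += 1
            out.append(None)  # nothing to carry forward, or limit reached
    return out

print(forward_fill([1, 2, None, 4]))
print(forward_fill([1, None, None, 4], limit=1))
```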
- filter(*predicates: IntoExprColumn | Iterable[IntoExprColumn] | bool | list[bool] | np.ndarray[Any, Any], **constraints: Any) DataFrame[source]¶
Filter rows, retaining those that match the given predicate expression(s).
The original order of the remaining rows is preserved.
Only rows where the predicate resolves as True are retained; when the predicate result is False (or null), the row is discarded.
- Parameters:
- predicates
Expression(s) that evaluate to a boolean Series.
- constraints
Column filters; use name = value to filter columns by the supplied value. Each constraint will behave the same as pl.col(name).eq(value), and be implicitly joined with the other filter conditions using &.
Notes
If you are transitioning from Pandas, and performing filter operations based on the comparison of two or more columns, please note that in Polars any comparison involving null values will result in a null result, not boolean True or False. As a result, these rows will not be retained. Ensure that null values are handled appropriately to avoid unexpected behaviour (see examples below).
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3, None, 4, None, 0], ... "bar": [6, 7, 8, None, None, 9, 0], ... "ham": ["a", "b", "c", None, "d", "e", "f"], ... } ... )
Filter rows matching a condition:
>>> df.filter(pl.col("foo") > 1) shape: (3, 3) ┌─────┬──────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪══════╪═════╡ │ 2 ┆ 7 ┆ b │ │ 3 ┆ 8 ┆ c │ │ 4 ┆ null ┆ d │ └─────┴──────┴─────┘
Filter on multiple conditions, combined with and/or operators:
>>> df.filter( ... (pl.col("foo") < 3) & (pl.col("ham") == "a"), ... ) shape: (1, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ └─────┴─────┴─────┘
>>> df.filter( ... (pl.col("foo") == 1) | (pl.col("ham") == "c"), ... ) shape: (2, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ │ 3 ┆ 8 ┆ c │ └─────┴─────┴─────┘
Provide multiple filters using *args syntax:
>>> df.filter( ... pl.col("foo") <= 2, ... ~pl.col("ham").is_in(["b", "c"]), ... ) shape: (2, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ │ 0 ┆ 0 ┆ f │ └─────┴─────┴─────┘
Provide multiple filters using **kwargs syntax:
>>> df.filter(foo=2, ham="b") shape: (1, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 2 ┆ 7 ┆ b │ └─────┴─────┴─────┘
Filter by comparing two columns against each other:
>>> df.filter( ... pl.col("foo") == pl.col("bar"), ... ) shape: (1, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 0 ┆ 0 ┆ f │ └─────┴─────┴─────┘
>>> df.filter( ... pl.col("foo") != pl.col("bar"), ... ) shape: (3, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ │ 2 ┆ 7 ┆ b │ │ 3 ┆ 8 ┆ c │ └─────┴─────┴─────┘
Notice how the rows containing null values are filtered out. To get the same behavior as pandas, use:
>>> df.filter( ... pl.col("foo").ne_missing(pl.col("bar")), ... ) shape: (5, 3) ┌──────┬──────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞══════╪══════╪═════╡ │ 1 ┆ 6 ┆ a │ │ 2 ┆ 7 ┆ b │ │ 3 ┆ 8 ┆ c │ │ 4 ┆ null ┆ d │ │ null ┆ 9 ┆ e │ └──────┴──────┴─────┘
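The null-handling rule behind the last two examples can be modelled with three-valued logic in plain Python. The functions ne and ne_missing below are toy stand-ins for the column comparisons, with None playing the role of null. They use the same foo and bar values as the example frame and yield the same row counts (3 and 5).

```python
def ne(a, b):
    """Three-valued !=: any comparison involving null yields null (None)."""
    if a is None or b is None:
        return None
    return a != b

def ne_missing(a, b):
    """!= that treats null as an ordinary value, so the result is never null."""
    return a != b

foo = [1, 2, 3, None, 4, None, 0]
bar = [6, 7, 8, None, None, 9, 0]
pairs = list(zip(foo, bar))

# filter keeps a row only when the predicate is exactly True.
kept = [p for p in pairs if ne(*p) is True]
kept_missing = [p for p in pairs if ne_missing(*p) is True]
print(len(kept), len(kept_missing))
```

The row where both values are null is dropped in both cases: ne yields null for it, and ne_missing yields False because the two nulls compare as equal.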
- property flags: dict[str, dict[str, bool]]¶
Get flags that are set on the columns of this DataFrame.
- Returns:
- dict
Mapping from column names to column flags.
- fold(operation: Callable[[Series, Series], Series]) Series[source]¶
Apply a horizontal reduction on a DataFrame.
This can be used to effectively determine aggregations on a row level, and can be applied to any DataType that can be supercast (cast to a similar parent type).
Examples of the supercast rules when applying an arithmetic operation to two DataTypes:
Int8 + String = String
Float32 + Int64 = Float32
Float32 + Float64 = Float64
- Parameters:
- operation
function that takes two Series and returns a Series.
Examples
A horizontal sum operation:
>>> df = pl.DataFrame( ... { ... "a": [2, 1, 3], ... "b": [1, 2, 3], ... "c": [1.0, 2.0, 3.0], ... } ... ) >>> df.fold(lambda s1, s2: s1 + s2) shape: (3,) Series: 'a' [f64] [ 4.0 5.0 9.0 ]
A horizontal minimum operation:
>>> df = pl.DataFrame({"a": [2, 1, 3], "b": [1, 2, 3], "c": [1.0, 2.0, 3.0]}) >>> df.fold(lambda s1, s2: s1.zip_with(s1 < s2, s2)) shape: (3,) Series: 'a' [f64] [ 1.0 1.0 3.0 ]
A horizontal string concatenation:
>>> df = pl.DataFrame( ... { ... "a": ["foo", "bar", None], ... "b": [1, 2, 3], ... "c": [1.0, 2.0, 3.0], ... } ... ) >>> df.fold(lambda s1, s2: s1 + s2) shape: (3,) Series: 'a' [str] [ "foo11.0" "bar22.0" null ]
A horizontal boolean or, similar to a row-wise .any():
>>> df = pl.DataFrame( ... { ... "a": [False, False, True], ... "b": [False, True, False], ... } ... ) >>> df.fold(lambda s1, s2: s1 | s2) shape: (3,) Series: 'a' [bool] [ false true true ]
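Conceptually, a horizontal fold is a left-to-right reduction over the columns. The sketch below models this with functools.reduce on plain lists. The names fold_columns and add are hypothetical, and no supercasting is modelled beyond what Python's own int and float arithmetic does.

```python
from functools import reduce

def fold_columns(columns, operation):
    """Toy model of fold: reduce the columns left to right with `operation`."""
    return reduce(operation, columns)

def add(s1, s2):
    # Element-wise addition of two equally long columns.
    return [x + y for x, y in zip(s1, s2)]

columns = [[2, 1, 3], [1, 2, 3], [1.0, 2.0, 3.0]]
print(fold_columns(columns, add))
```

With the values from the horizontal sum example, the reduction first adds columns a and b, then adds c to that intermediate result.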
- gather_every(n: int, offset: int = 0) DataFrame[source]¶
Take every nth row in the DataFrame and return as a new DataFrame.
- Parameters:
- n
Gather every n-th row.
- offset
Starting index.
Examples
>>> s = pl.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})
>>> s.gather_every(2)
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 5   │
│ 3   ┆ 7   │
└─────┴─────┘
>>> s.gather_every(2, offset=1)
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 6   │
│ 4   ┆ 8   │
└─────┴─────┘
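For intuition, the row selection corresponds to Python's extended slice `rows[offset::n]`. The sketch below models rows as a list of tuples; the function here is a hypothetical stand-in written for illustration, not the Polars method itself.

```python
def gather_every(rows, n, offset=0):
    """Keep every n-th row, starting at index `offset`."""
    return rows[offset::n]


# Rows of the frame used in the examples above, as (a, b) tuples.
rows = [(1, 5), (2, 6), (3, 7), (4, 8)]
every_second = gather_every(rows, 2)
every_second_from_1 = gather_every(rows, 2, offset=1)
print(every_second)
print(every_second_from_1)
```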
- get_column(name: str, *, default: Any | NoDefault = <no_default>) Series | Any[source]¶
Get a single column by name.
- Parameters:
- name
String name of the column to retrieve.
- default
Value to return if the column does not exist; if not explicitly set and the column is not present a ColumnNotFoundError exception is raised.
- Returns:
- Series (or arbitrary default value, if specified).
Examples
>>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
>>> df.get_column("foo")
shape: (3,)
Series: 'foo' [i64]
[
    1
    2
    3
]
Missing column handling; can optionally provide an arbitrary default value to the method (otherwise a ColumnNotFoundError exception is raised).
>>> df.get_column("baz", default=pl.Series("baz", ["?", "?", "?"]))
shape: (3,)
Series: 'baz' [str]
[
    "?"
    "?"
    "?"
]
>>> res = df.get_column("baz", default=None)
>>> res is None
True
- get_column_index(name: str) int[source]¶
Find the index of a column by name.
- Parameters:
- name
Name of the column to find.
Examples
>>> df = pl.DataFrame(
...     {"foo": [1, 2, 3], "bar": [6, 7, 8], "ham": ["a", "b", "c"]}
... )
>>> df.get_column_index("ham")
2
>>> df.get_column_index("sandwich")
ColumnNotFoundError: sandwich
- get_columns() list[Series][source]¶
Get the DataFrame as a List of Series.
Examples
>>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) >>> df.get_columns() [shape: (3,) Series: 'foo' [i64] [ 1 2 3 ], shape: (3,) Series: 'bar' [i64] [ 4 5 6 ]]
>>> df = pl.DataFrame( ... { ... "a": [1, 2, 3, 4], ... "b": [0.5, 4, 10, 13], ... "c": [True, True, False, True], ... } ... ) >>> df.get_columns() [shape: (4,) Series: 'a' [i64] [ 1 2 3 4 ], shape: (4,) Series: 'b' [f64] [ 0.5 4.0 10.0 13.0 ], shape: (4,) Series: 'c' [bool] [ true true false true ]]
- glimpse(*, max_items_per_column: int = 10, max_colname_length: int = 50, return_as_string: bool = False) str | None[source]¶
Return a dense preview of the DataFrame.
The formatting shows one line per column so that wide dataframes display cleanly. Each line shows the column name, the data type, and the first few values.
- Parameters:
- max_items_per_column
Maximum number of items to show per column.
- max_colname_length
Maximum length of the displayed column names; values that exceed this value are truncated with a trailing ellipsis.
- return_as_string
If True, return the preview as a string instead of printing to stdout.
Examples
>>> from datetime import date >>> df = pl.DataFrame( ... { ... "a": [1.0, 2.8, 3.0], ... "b": [4, 5, None], ... "c": [True, False, True], ... "d": [None, "b", "c"], ... "e": ["usd", "eur", None], ... "f": [date(2020, 1, 1), date(2021, 1, 2), date(2022, 1, 1)], ... } ... ) >>> df.glimpse() Rows: 3 Columns: 6 $ a <f64> 1.0, 2.8, 3.0 $ b <i64> 4, 5, None $ c <bool> True, False, True $ d <str> None, 'b', 'c' $ e <str> 'usd', 'eur', None $ f <date> 2020-01-01, 2021-01-02, 2022-01-01
- group_by(*by: IntoExpr | Iterable[IntoExpr], maintain_order: bool = False, **named_by: IntoExpr) GroupBy[source]¶
Start a group by operation.
- Parameters:
- *by
Column(s) to group by. Accepts expression input. Strings are parsed as column names.
- maintain_order
Ensure that the order of the groups is consistent with the input data. This is slower than a default group by. Setting this to True blocks the possibility to run on the streaming engine.
Note
Within each group, the order of rows is always preserved, regardless of this argument.
- **named_by
Additional columns to group by, specified as keyword arguments. The columns will be renamed to the keyword used.
- Returns:
- GroupBy
Object which can be used to perform aggregations.
Examples
Group by one column and call agg to compute the grouped sum of another column.
>>> df = pl.DataFrame( ... { ... "a": ["a", "b", "a", "b", "c"], ... "b": [1, 2, 1, 3, 3], ... "c": [5, 4, 3, 2, 1], ... } ... ) >>> df.group_by("a").agg(pl.col("b").sum()) shape: (3, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ a ┆ 2 │ │ b ┆ 5 │ │ c ┆ 3 │ └─────┴─────┘
Set maintain_order=True to ensure the order of the groups is consistent with the input.
>>> df.group_by("a", maintain_order=True).agg(pl.col("c")) shape: (3, 2) ┌─────┬───────────┐ │ a ┆ c │ │ --- ┆ --- │ │ str ┆ list[i64] │ ╞═════╪═══════════╡ │ a ┆ [5, 3] │ │ b ┆ [4, 2] │ │ c ┆ [1] │ └─────┴───────────┘
Group by multiple columns by passing a list of column names.
>>> df.group_by(["a", "b"]).agg(pl.max("c")) shape: (4, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ a ┆ 1 ┆ 5 │ │ b ┆ 2 ┆ 4 │ │ b ┆ 3 ┆ 2 │ │ c ┆ 3 ┆ 1 │ └─────┴─────┴─────┘
Or use positional arguments to group by multiple columns in the same way. Expressions are also accepted.
>>> df.group_by("a", pl.col("b") // 2).agg(pl.col("c").mean()) shape: (3, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ f64 │ ╞═════╪═════╪═════╡ │ a ┆ 0 ┆ 4.0 │ │ b ┆ 1 ┆ 3.0 │ │ c ┆ 1 ┆ 1.0 │ └─────┴─────┴─────┘
The GroupBy object returned by this method is iterable, returning the name and data of each group.
>>> for name, data in df.group_by("a"): ... print(name) ... print(data) ('a',) shape: (2, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ a ┆ 1 ┆ 5 │ │ a ┆ 1 ┆ 3 │ └─────┴─────┴─────┘ ('b',) shape: (2, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ b ┆ 2 ┆ 4 │ │ b ┆ 3 ┆ 2 │ └─────┴─────┴─────┘ ('c',) shape: (1, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ c ┆ 3 ┆ 1 │ └─────┴─────┴─────┘
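The ordering guarantees can be illustrated with a plain-Python model of grouping. This is only a sketch of the semantics of `maintain_order=True` (groups in order of first appearance, rows within a group in input order); it says nothing about how Polars implements grouping, and `group_by_key` is a hypothetical helper name.

```python
def group_by_key(rows, key):
    """Group row dictionaries by `key`, keeping first-appearance order of groups."""
    groups = {}  # dicts preserve insertion order
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    return groups


# Same data as the group_by examples above.
rows = [
    {"a": "a", "b": 1, "c": 5},
    {"a": "b", "b": 2, "c": 4},
    {"a": "a", "b": 1, "c": 3},
    {"a": "b", "b": 3, "c": 2},
    {"a": "c", "b": 3, "c": 1},
]
groups = group_by_key(rows, "a")
b_sums = {name: sum(r["b"] for r in data) for name, data in groups.items()}
c_lists = {name: [r["c"] for r in data] for name, data in groups.items()}
print(b_sums)
print(c_lists)
```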
- group_by_dynamic(index_column: IntoExpr, *, every: str | timedelta, period: str | timedelta | None = None, offset: str | timedelta | None = None, include_boundaries: bool = False, closed: ClosedInterval = 'left', label: Label = 'left', group_by: IntoExpr | Iterable[IntoExpr] | None = None, start_by: StartBy = 'window') DynamicGroupBy[source]¶
Group based on a time value (or index value of type Int32, Int64).
Time windows are calculated and rows are assigned to windows. Unlike a normal group by, a row can be a member of multiple groups. By default, the windows look like:
[start, start + period)
[start + every, start + every + period)
[start + 2*every, start + 2*every + period)
…
where start is determined by start_by, offset, every, and the earliest datapoint. See the start_by argument description for details.
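The window layout above can be sketched in plain Python for an integer index. This is a simplified model under stated assumptions: `start` is passed in directly rather than derived from start_by and offset, windows are left-closed (the default), and windows without members are skipped. `dynamic_windows` is a hypothetical helper, not part of Polars.

```python
def dynamic_windows(index, start, every, period=None):
    """Assign sorted integer index values to windows [lower, lower + period)."""
    if not index:
        return []
    if period is None:
        period = every  # windows tile the index without overlap
    windows = []
    lower = start
    while lower <= index[-1]:
        upper = lower + period
        members = [v for v in index if lower <= v < upper]
        if members:
            windows.append(((lower, upper), members))
        lower += every
    return windows


# period > every, so consecutive windows overlap and share rows.
windows = dynamic_windows([0, 1, 2, 3, 4, 5], start=0, every=2, period=3)
for bounds, members in windows:
    print(bounds, members)
```

With `every=2` and `period=3`, the index values 2 and 4 each fall into two windows, which is the behaviour that distinguishes a dynamic group by from a normal one.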
Warning
The index column must be sorted in ascending order. If group_by is passed, then the index column must be sorted in ascending order within each group.
Changed in version 0.20.14: The by parameter was renamed group_by.
- Parameters:
- index_column
Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order (or, if group_by is specified, then it must be sorted in ascending order within each group).
In case of a dynamic group by on indices, dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.
- every
Interval of the window.
- period
Length of the window; if None, it equals every.
- offset
offset of the window, does not take effect if start_by is ‘datapoint’. Defaults to zero.
- include_boundaries
Add the lower and upper bound of the window to the “_lower_boundary” and “_upper_boundary” columns. This will impact performance because it’s harder to parallelize.
- closed{‘left’, ‘right’, ‘both’, ‘none’}
Define which sides of the temporal interval are closed (inclusive).
- label{‘left’, ‘right’, ‘datapoint’}
Define which label to use for the window:
‘left’: lower boundary of the window
‘right’: upper boundary of the window
‘datapoint’: the first value of the index column in the given window. If you don’t need the label to be at one of the boundaries, choose this option for maximum performance
- group_by
Also group by this column/these columns
- start_by{‘window’, ‘datapoint’, ‘monday’, ‘tuesday’, ‘wednesday’, ‘thursday’, ‘friday’, ‘saturday’, ‘sunday’}
The strategy to determine the start of the first window by.
‘window’: Start by taking the earliest timestamp, truncating it with every, and then adding offset. Note that weekly windows start on Monday.
‘datapoint’: Start from the first encountered data point.
a day of the week (only takes effect if every contains ‘w’):
‘monday’: Start the window on the Monday before the first data point.
‘tuesday’: Start the window on the Tuesday before the first data point.
…
‘sunday’: Start the window on the Sunday before the first data point.
The resulting window is then shifted back until the earliest datapoint is in or in front of it.
- Returns:
- DynamicGroupBy
Object you can call .agg on to aggregate by groups, the result of which will be sorted by index_column (but note that if group_by columns are passed, it will only be sorted within each group).
Notes
If you’re coming from pandas, then
# polars
df.group_by_dynamic("ts", every="1d").agg(pl.col("value").sum())
is equivalent to
# pandas
df.set_index("ts").resample("D")["value"].sum().reset_index()
though note that, unlike pandas, polars doesn’t add extra rows for empty windows. If you need index_column to be evenly spaced, then please combine with DataFrame.upsample().
The every, period and offset arguments are created with the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
1i (1 index count)
Or combine them (except in every): “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds
By “calendar day”, we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for “calendar week”, “calendar month”, “calendar quarter”, and “calendar year”.
In case of a group_by_dynamic on an integer column, the windows are defined by:
“1i” # length 1
“10i” # length 10
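To make the string language concrete, here is a small tokenizer for such duration strings. It is an illustrative sketch only: it splits a string into (count, unit) pairs, does not convert them to time spans, does not handle a leading minus sign, and is not the parser Polars uses. `parse_duration` is a hypothetical name.

```python
import re

# Two-letter units come first so that "1ms" or "1mo" is not read as "1m".
_UNITS = "ns|us|ms|mo|s|m|h|d|w|q|y|i"
_TOKEN = re.compile(rf"(\d+)({_UNITS})")


def parse_duration(text):
    """Split a duration string such as '3d12h4m25s' into (count, unit) pairs."""
    tokens = _TOKEN.findall(text)
    # Reject strings with leftover characters that no token accounts for.
    if not tokens or "".join(n + u for n, u in tokens) != text:
        raise ValueError(f"unrecognised duration string: {text!r}")
    return [(int(n), u) for n, u in tokens]


print(parse_duration("3d12h4m25s"))
print(parse_duration("1mo"))
```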
Examples
>>> from datetime import datetime >>> df = pl.DataFrame( ... { ... "time": pl.datetime_range( ... start=datetime(2021, 12, 16), ... end=datetime(2021, 12, 16, 3), ... interval="30m", ... eager=True, ... ), ... "n": range(7), ... } ... ) >>> df shape: (7, 2) ┌─────────────────────┬─────┐ │ time ┆ n │ │ --- ┆ --- │ │ datetime[μs] ┆ i64 │ ╞═════════════════════╪═════╡ │ 2021-12-16 00:00:00 ┆ 0 │ │ 2021-12-16 00:30:00 ┆ 1 │ │ 2021-12-16 01:00:00 ┆ 2 │ │ 2021-12-16 01:30:00 ┆ 3 │ │ 2021-12-16 02:00:00 ┆ 4 │ │ 2021-12-16 02:30:00 ┆ 5 │ │ 2021-12-16 03:00:00 ┆ 6 │ └─────────────────────┴─────┘
Group by windows of 1 hour.
>>> df.group_by_dynamic("time", every="1h", closed="right").agg(pl.col("n")) shape: (4, 2) ┌─────────────────────┬───────────┐ │ time ┆ n │ │ --- ┆ --- │ │ datetime[μs] ┆ list[i64] │ ╞═════════════════════╪═══════════╡ │ 2021-12-15 23:00:00 ┆ [0] │ │ 2021-12-16 00:00:00 ┆ [1, 2] │ │ 2021-12-16 01:00:00 ┆ [3, 4] │ │ 2021-12-16 02:00:00 ┆ [5, 6] │ └─────────────────────┴───────────┘
The window boundaries can also be added to the aggregation result
>>> df.group_by_dynamic( ... "time", every="1h", include_boundaries=True, closed="right" ... ).agg(pl.col("n").mean()) shape: (4, 4) ┌─────────────────────┬─────────────────────┬─────────────────────┬─────┐ │ _lower_boundary ┆ _upper_boundary ┆ time ┆ n │ │ --- ┆ --- ┆ --- ┆ --- │ │ datetime[μs] ┆ datetime[μs] ┆ datetime[μs] ┆ f64 │ ╞═════════════════════╪═════════════════════╪═════════════════════╪═════╡ │ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ 0.0 │ │ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 1.5 │ │ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 3.5 │ │ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 5.5 │ └─────────────────────┴─────────────────────┴─────────────────────┴─────┘
When closed="left", the window excludes the right end of the interval: [lower_bound, upper_bound)
>>> df.group_by_dynamic("time", every="1h", closed="left").agg(pl.col("n")) shape: (4, 2) ┌─────────────────────┬───────────┐ │ time ┆ n │ │ --- ┆ --- │ │ datetime[μs] ┆ list[i64] │ ╞═════════════════════╪═══════════╡ │ 2021-12-16 00:00:00 ┆ [0, 1] │ │ 2021-12-16 01:00:00 ┆ [2, 3] │ │ 2021-12-16 02:00:00 ┆ [4, 5] │ │ 2021-12-16 03:00:00 ┆ [6] │ └─────────────────────┴───────────┘
When closed="both", the time values at the window boundaries belong to 2 groups.
>>> df.group_by_dynamic("time", every="1h", closed="both").agg(pl.col("n")) shape: (4, 2) ┌─────────────────────┬───────────┐ │ time ┆ n │ │ --- ┆ --- │ │ datetime[μs] ┆ list[i64] │ ╞═════════════════════╪═══════════╡ │ 2021-12-16 00:00:00 ┆ [0, 1, 2] │ │ 2021-12-16 01:00:00 ┆ [2, 3, 4] │ │ 2021-12-16 02:00:00 ┆ [4, 5, 6] │ │ 2021-12-16 03:00:00 ┆ [6] │ └─────────────────────┴───────────┘
Dynamic group bys can also be combined with grouping on normal keys
>>> df = df.with_columns(groups=pl.Series(["a", "a", "a", "b", "b", "a", "a"])) >>> df shape: (7, 3) ┌─────────────────────┬─────┬────────┐ │ time ┆ n ┆ groups │ │ --- ┆ --- ┆ --- │ │ datetime[μs] ┆ i64 ┆ str │ ╞═════════════════════╪═════╪════════╡ │ 2021-12-16 00:00:00 ┆ 0 ┆ a │ │ 2021-12-16 00:30:00 ┆ 1 ┆ a │ │ 2021-12-16 01:00:00 ┆ 2 ┆ a │ │ 2021-12-16 01:30:00 ┆ 3 ┆ b │ │ 2021-12-16 02:00:00 ┆ 4 ┆ b │ │ 2021-12-16 02:30:00 ┆ 5 ┆ a │ │ 2021-12-16 03:00:00 ┆ 6 ┆ a │ └─────────────────────┴─────┴────────┘ >>> df.group_by_dynamic( ... "time", ... every="1h", ... closed="both", ... group_by="groups", ... include_boundaries=True, ... ).agg(pl.col("n")) shape: (6, 5) ┌────────┬─────────────────────┬─────────────────────┬─────────────────────┬───────────┐ │ groups ┆ _lower_boundary ┆ _upper_boundary ┆ time ┆ n │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ datetime[μs] ┆ datetime[μs] ┆ datetime[μs] ┆ list[i64] │ ╞════════╪═════════════════════╪═════════════════════╪═════════════════════╪═══════════╡ │ a ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ [0, 1, 2] │ │ a ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ [2] │ │ a ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ [5, 6] │ │ a ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 04:00:00 ┆ 2021-12-16 03:00:00 ┆ [6] │ │ b ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ [3, 4] │ │ b ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ [4] │ └────────┴─────────────────────┴─────────────────────┴─────────────────────┴───────────┘
Dynamic group by on an index column
>>> df = pl.DataFrame( ... { ... "idx": pl.int_range(0, 6, eager=True), ... "A": ["A", "A", "B", "B", "B", "C"], ... } ... ) >>> ( ... df.group_by_dynamic( ... "idx", ... every="2i", ... period="3i", ... include_boundaries=True, ... closed="right", ... ).agg(pl.col("A").alias("A_agg_list")) ... ) shape: (4, 4) ┌─────────────────┬─────────────────┬─────┬─────────────────┐ │ _lower_boundary ┆ _upper_boundary ┆ idx ┆ A_agg_list │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 ┆ list[str] │ ╞═════════════════╪═════════════════╪═════╪═════════════════╡ │ -2 ┆ 1 ┆ -2 ┆ ["A", "A"] │ │ 0 ┆ 3 ┆ 0 ┆ ["A", "B", "B"] │ │ 2 ┆ 5 ┆ 2 ┆ ["B", "B", "C"] │ │ 4 ┆ 7 ┆ 4 ┆ ["C"] │ └─────────────────┴─────────────────┴─────┴─────────────────┘
- hash_rows(seed: int = 0, seed_1: int | None = None, seed_2: int | None = None, seed_3: int | None = None) Series[source]¶
Hash and combine the rows in this DataFrame.
The hash value is of type UInt64.
- Parameters:
- seed
Random seed parameter. Defaults to 0.
- seed_1
Random seed parameter. Defaults to seed if not set.
- seed_2
Random seed parameter. Defaults to seed if not set.
- seed_3
Random seed parameter. Defaults to seed if not set.
Notes
This implementation of hash_rows does not guarantee stable results across different Polars versions. Its stability is only guaranteed within a single version.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, None, 3, 4], ... "ham": ["a", "b", None, "d"], ... } ... ) >>> df.hash_rows(seed=42) shape: (4,) Series: '' [u64] [ 10783150408545073287 1438741209321515184 10047419486152048166 2047317070637311557 ]
- head(n: int = 5) DataFrame[source]¶
Get the first n rows.
- Parameters:
- n
Number of rows to return. If a negative value is passed, return all rows except the last abs(n).
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3, 4, 5], ... "bar": [6, 7, 8, 9, 10], ... "ham": ["a", "b", "c", "d", "e"], ... } ... ) >>> df.head(3) shape: (3, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ │ 2 ┆ 7 ┆ b │ │ 3 ┆ 8 ┆ c │ └─────┴─────┴─────┘
Pass a negative value to get all rows except the last abs(n).
>>> df.head(-3) shape: (2, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ │ 2 ┆ 7 ┆ b │ └─────┴─────┴─────┘
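The handling of negative n mirrors Python's slice semantics. As a rough analogy on a plain list (the function below is a hypothetical stand-in, not the Polars method):

```python
def head(rows, n=5):
    """First n rows; for negative n, all rows except the last abs(n)."""
    return rows[:n]


rows = [1, 2, 3, 4, 5]
first_three = head(rows, 3)
all_but_last_three = head(rows, -3)
print(first_three)
print(all_but_last_three)
```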
- property height: int¶
Get the number of rows.
- Returns:
- int
Examples
>>> df = pl.DataFrame({"foo": [1, 2, 3, 4, 5]})
>>> df.height
5
- hstack(columns: list[Series] | DataFrame, *, in_place: bool = False) DataFrame[source]¶
Return a new DataFrame grown horizontally by stacking multiple Series to it.
- Parameters:
- columns
Series to stack.
- in_place
Modify in place.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> x = pl.Series("apple", [10, 20, 30]) >>> df.hstack([x]) shape: (3, 4) ┌─────┬─────┬─────┬───────┐ │ foo ┆ bar ┆ ham ┆ apple │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str ┆ i64 │ ╞═════╪═════╪═════╪═══════╡ │ 1 ┆ 6 ┆ a ┆ 10 │ │ 2 ┆ 7 ┆ b ┆ 20 │ │ 3 ┆ 8 ┆ c ┆ 30 │ └─────┴─────┴─────┴───────┘
- insert_column(index: int, column: IntoExprColumn) DataFrame[source]¶
Insert a Series (or expression) at a certain column index.
This operation is in place.
- Parameters:
- index
Index at which to insert the new column.
- column
Series or expression to insert.
Examples
Insert a new Series column at the given index:
>>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) >>> s = pl.Series("baz", [97, 98, 99]) >>> df.insert_column(1, s) shape: (3, 3) ┌─────┬─────┬─────┐ │ foo ┆ baz ┆ bar │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ 1 ┆ 97 ┆ 4 │ │ 2 ┆ 98 ┆ 5 │ │ 3 ┆ 99 ┆ 6 │ └─────┴─────┴─────┘
Insert a new expression column at the given index:
>>> df = pl.DataFrame( ... {"a": [2, 4, 2], "b": [0.5, 4, 10], "c": ["xx", "yy", "zz"]} ... ) >>> expr = (pl.col("b") / pl.col("a")).alias("b_div_a") >>> df.insert_column(2, expr) shape: (3, 4) ┌─────┬──────┬─────────┬─────┐ │ a ┆ b ┆ b_div_a ┆ c │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ f64 ┆ str │ ╞═════╪══════╪═════════╪═════╡ │ 2 ┆ 0.5 ┆ 0.25 ┆ xx │ │ 4 ┆ 4.0 ┆ 1.0 ┆ yy │ │ 2 ┆ 10.0 ┆ 5.0 ┆ zz │ └─────┴──────┴─────────┴─────┘
- interpolate() DataFrame[source]¶
Interpolate intermediate values. The interpolation method is linear.
Nulls at the beginning and end of the series remain null.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, None, 9, 10], ... "bar": [6, 7, 9, None], ... "baz": [1, None, None, 9], ... } ... ) >>> df.interpolate() shape: (4, 3) ┌──────┬──────┬──────────┐ │ foo ┆ bar ┆ baz │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ f64 │ ╞══════╪══════╪══════════╡ │ 1.0 ┆ 6.0 ┆ 1.0 │ │ 5.0 ┆ 7.0 ┆ 3.666667 │ │ 9.0 ┆ 9.0 ┆ 6.333333 │ │ 10.0 ┆ null ┆ 9.0 │ └──────┴──────┴──────────┘
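The values in the example follow from straight-line interpolation between the nearest non-null neighbours. The sketch below reproduces that arithmetic for a single column held as a Python list; it is an illustration of the rule described above, with `interpolate` as a hypothetical helper rather than the Polars implementation.

```python
def interpolate(values):
    """Linearly fill None gaps between known values; edge Nones stay None."""
    result = [None if v is None else float(v) for v in values]
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        # Constant increment per position between two known neighbours.
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = values[left] + step * (i - left)
    return result


foo = interpolate([1, None, 9, 10])
bar = interpolate([6, 7, 9, None])
baz = interpolate([1, None, None, 9])
print(foo, bar, baz)
```

For `baz`, the gap between 1 and 9 spans three positions, so each step adds 8/3, giving roughly 3.666667 and 6.333333 as in the table above.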
- is_duplicated() Series[source]¶
Get a mask of all duplicated rows in this DataFrame.
Examples
>>> df = pl.DataFrame( ... { ... "a": [1, 2, 3, 1], ... "b": ["x", "y", "z", "x"], ... } ... ) >>> df.is_duplicated() shape: (4,) Series: '' [bool] [ true false false true ]
This mask can be used to visualize the duplicated lines like this:
>>> df.filter(df.is_duplicated()) shape: (2, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ str │ ╞═════╪═════╡ │ 1 ┆ x │ │ 1 ┆ x │ └─────┴─────┘
- is_empty() bool[source]¶
Returns True if the DataFrame contains no rows.
Examples
>>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
>>> df.is_empty()
False
>>> df.filter(pl.col("foo") > 99).is_empty()
True
- is_unique() Series[source]¶
Get a mask of all unique rows in this DataFrame.
Examples
>>> df = pl.DataFrame( ... { ... "a": [1, 2, 3, 1], ... "b": ["x", "y", "z", "x"], ... } ... ) >>> df.is_unique() shape: (4,) Series: '' [bool] [ false true true false ]
This mask can be used to visualize the unique lines like this:
>>> df.filter(df.is_unique()) shape: (2, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ str │ ╞═════╪═════╡ │ 2 ┆ y │ │ 3 ┆ z │ └─────┴─────┘
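The two masks are complements of each other: a row is "duplicated" when it occurs more than once and "unique" when it occurs exactly once. A plain-Python sketch of that counting logic, with rows modelled as tuples and the function names reused purely for illustration:

```python
from collections import Counter


def is_duplicated(rows):
    """True for every row that occurs more than once."""
    counts = Counter(rows)
    return [counts[row] > 1 for row in rows]


def is_unique(rows):
    """True for every row that occurs exactly once."""
    counts = Counter(rows)
    return [counts[row] == 1 for row in rows]


rows = [(1, "x"), (2, "y"), (3, "z"), (1, "x")]
duplicated_mask = is_duplicated(rows)
unique_mask = is_unique(rows)
print(duplicated_mask)
print(unique_mask)
```

Note that both occurrences of a repeated row are flagged as duplicated, not only the second one.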
- item(row: int | None = None, column: int | str | None = None) Any[source]¶
Return the DataFrame as a scalar, or return the element at the given row/column.
- Parameters:
- row
Optional row index.
- column
Optional column index or name.
See also
row: Get the values of a single row, either by index or by predicate.
Notes
If row/column are not provided, this is equivalent to df[0,0], with a check that the shape is (1,1). With row/column, this is equivalent to df[row,column].
Examples
>>> df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> df.select((pl.col("a") * pl.col("b")).sum()).item()
32
>>> df.item(1, 1)
5
>>> df.item(2, "b")
6
- iter_columns() Iterator[Series][source]¶
Returns an iterator over the columns of this DataFrame.
- Yields:
- Series
Notes
Consider whether you can use all() instead. If you can, it will be more efficient.
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [1, 3, 5],
...         "b": [2, 4, 6],
...     }
... )
>>> [s.name for s in df.iter_columns()]
['a', 'b']
If you’re using this to modify a dataframe’s columns, e.g.
>>> # Do NOT do this
>>> pl.DataFrame(column * 2 for column in df.iter_columns())
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 4   │
│ 6   ┆ 8   │
│ 10  ┆ 12  │
└─────┴─────┘
then consider whether you can use all() instead:
>>> df.select(pl.all() * 2)
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 4   │
│ 6   ┆ 8   │
│ 10  ┆ 12  │
└─────┴─────┘
- iter_rows(*, named: bool = False, buffer_size: int = 512) Iterator[tuple[Any, ...]] | Iterator[dict[str, Any]][source]¶
Returns an iterator over the rows of the DataFrame, as Python-native values.
- Parameters:
- named
Return dictionaries instead of tuples. The dictionaries are a mapping of column name to row value. This is more expensive than returning a regular tuple, but allows for accessing values by column name.
- buffer_size
Determines the number of rows that are buffered internally while iterating over the data; you should only modify this in very specific cases where the default value is determined not to be a good fit to your access pattern, as the speedup from using the buffer is significant (~2-4x). Setting this value to zero disables row buffering (not recommended).
- Returns:
- iterator of tuples (default) or dictionaries (if named) of python row values
Warning
Row iteration is not optimal as the underlying data is stored in columnar form; where possible, prefer export via one of the dedicated export/output methods that deals with columnar data.
See also
rows: Materialises all frame data as a list of rows (potentially expensive).
rows_by_key: Materialises frame data as a key-indexed dictionary.
Notes
If you have ns-precision temporal values you should be aware that Python natively only supports up to μs-precision; ns-precision values will be truncated to microseconds on conversion to Python. If this matters to your use-case you should export to a different format (such as Arrow or NumPy).
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [1, 3, 5],
...         "b": [2, 4, 6],
...     }
... )
>>> [row[0] for row in df.iter_rows()]
[1, 3, 5]
>>> [row["b"] for row in df.iter_rows(named=True)]
[2, 4, 6]
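Because the data is stored column by column, producing rows means zipping the columns together, and the named variant additionally builds one dictionary per row. The following plain-Python sketch models that difference; it uses a dict of lists as a stand-in for a frame and does not model buffering (buffer_size).

```python
def iter_rows(columns, named=False):
    """Yield rows from a column-oriented mapping of name -> list of values."""
    names = list(columns)
    for values in zip(*columns.values()):
        # The named form builds a new dict per row, which costs more than a tuple.
        yield dict(zip(names, values)) if named else values


columns = {"a": [1, 3, 5], "b": [2, 4, 6]}
tuples = list(iter_rows(columns))
dicts = list(iter_rows(columns, named=True))
print(tuples)
print(dicts)
```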
- iter_slices(n_rows: int = 10000) Iterator[DataFrame][source]¶
Returns a non-copying iterator of slices over the underlying DataFrame.
- Parameters:
- n_rows
Determines the number of rows contained in each DataFrame slice.
See also
iter_rows: Row iterator over frame data (does not materialise all rows).
partition_by: Split into multiple DataFrames, partitioned by groups.
Examples
>>> from datetime import date
>>> df = pl.DataFrame(
...     data={
...         "a": range(17_500),
...         "b": date(2023, 1, 1),
...         "c": "klmnoopqrstuvwxyz",
...     },
...     schema_overrides={"a": pl.Int32},
... )
>>> for idx, frame in enumerate(df.iter_slices()):
...     print(f"{type(frame).__name__}:[{idx}]:{len(frame)}")
DataFrame:[0]:10000
DataFrame:[1]:7500
Using iter_slices is an efficient way to chunk-iterate over DataFrames and any supported frame export/conversion types; for example, as RecordBatches:
>>> for frame in df.iter_slices(n_rows=15_000):
...     record_batch = frame.to_arrow().to_batches()[0]
...     print(f"{record_batch.schema}\n<< {len(record_batch)}")
a: int32
b: date32[day]
c: large_string
<< 15000
a: int32
b: date32[day]
c: large_string
<< 2500
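The chunk sizes in the examples follow from stepping through the frame in strides of n_rows, with a shorter final slice. A minimal plain-Python sketch of that chunking is shown below; note that, unlike the non-copying slices described above, slicing a Python list does copy the elements, so this only illustrates the boundaries.

```python
def iter_slices(rows, n_rows=10_000):
    """Yield consecutive chunks of at most n_rows items."""
    for start in range(0, len(rows), n_rows):
        yield rows[start:start + n_rows]


rows = list(range(17_500))
default_sizes = [len(chunk) for chunk in iter_slices(rows)]
custom_sizes = [len(chunk) for chunk in iter_slices(rows, n_rows=15_000)]
print(default_sizes)
print(custom_sizes)
```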
- join(other: DataFrame, on: str | Expr | Sequence[str | Expr] | None = None, how: JoinStrategy = 'inner', *, left_on: str | Expr | Sequence[str | Expr] | None = None, right_on: str | Expr | Sequence[str | Expr] | None = None, suffix: str = '_right', validate: JoinValidation = 'm:m', nulls_equal: bool = False, coalesce: bool | None = None, maintain_order: MaintainOrderJoin | None = None) DataFrame[source]¶
Join in SQL-like fashion.
Changed in version 1.24: The join_nulls parameter was renamed nulls_equal.
- Parameters:
- other
DataFrame to join with.
- on
Name(s) of the join columns in both DataFrames. If set, left_on and right_on should be None. This should not be specified if how=’cross’.
- how{‘inner’, ‘left’, ‘right’, ‘full’, ‘semi’, ‘anti’, ‘cross’}
Join strategy.
inner
(Default) Returns rows that have matching values in both tables.
left
Returns all rows from the left table, and the matched rows from the right table.
right
Returns all rows from the right table, and the matched rows from the left table.
full
Returns all rows when there is a match in either left or right.
cross
Returns the Cartesian product of rows from both tables.
semi
Returns rows from the left table that have a match in the right table.
anti
Returns rows from the left table that have no match in the right table.
- left_on
Name(s) of the left join column(s).
- right_on
Name(s) of the right join column(s).
- suffix
Suffix to append to columns with a duplicate name.
- validate: {‘m:m’, ‘m:1’, ‘1:m’, ‘1:1’}
Checks if join is of specified type.
m:m
(Default) Many-to-many. Does not result in checks.
1:1
One-to-one. Checks if join keys are unique in both left and right datasets.
1:m
One-to-many. Checks if join keys are unique in left dataset.
m:1
Many-to-one. Checks if join keys are unique in right dataset.
Note
This is currently not supported by the streaming engine.
- nulls_equal
Join on null values. By default null values will never produce matches.
- coalesce
Coalescing behavior (merging of join columns).
None
(Default) Coalesce unless how=’full’ is specified.
True
Always coalesce join columns.
False
Never coalesce join columns.
Note
Joining on any expressions other than col will turn off coalescing.
- maintain_order{‘none’, ‘left’, ‘right’, ‘left_right’, ‘right_left’}
Which DataFrame row order to preserve, if any. Do not rely on any observed ordering without explicitly setting this parameter, as your code may break in a future release. Not specifying any ordering can improve performance. Supported for inner, left, right and full joins.
none
(Default) No specific ordering is desired. The ordering might differ across Polars versions or even between different runs.
left
Preserves the order of the left DataFrame.
right
Preserves the order of the right DataFrame.
left_right
First preserves the order of the left DataFrame, then the right.
right_left
First preserves the order of the right DataFrame, then the left.
Notes
For joining on columns with categorical data, see polars.StringCache.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6.0, 7.0, 8.0], ... "ham": ["a", "b", "c"], ... } ... ) >>> other_df = pl.DataFrame( ... { ... "apple": ["x", "y", "z"], ... "ham": ["a", "b", "d"], ... } ... ) >>> df.join(other_df, on="ham") shape: (2, 4) ┌─────┬─────┬─────┬───────┐ │ foo ┆ bar ┆ ham ┆ apple │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str ┆ str │ ╞═════╪═════╪═════╪═══════╡ │ 1 ┆ 6.0 ┆ a ┆ x │ │ 2 ┆ 7.0 ┆ b ┆ y │ └─────┴─────┴─────┴───────┘
>>> df.join(other_df, on="ham", how="full") shape: (4, 5) ┌──────┬──────┬──────┬───────┬───────────┐ │ foo ┆ bar ┆ ham ┆ apple ┆ ham_right │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str ┆ str ┆ str │ ╞══════╪══════╪══════╪═══════╪═══════════╡ │ 1 ┆ 6.0 ┆ a ┆ x ┆ a │ │ 2 ┆ 7.0 ┆ b ┆ y ┆ b │ │ null ┆ null ┆ null ┆ z ┆ d │ │ 3 ┆ 8.0 ┆ c ┆ null ┆ null │ └──────┴──────┴──────┴───────┴───────────┘
>>> df.join(other_df, on="ham", how="full", coalesce=True) shape: (4, 4) ┌──────┬──────┬─────┬───────┐ │ foo ┆ bar ┆ ham ┆ apple │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str ┆ str │ ╞══════╪══════╪═════╪═══════╡ │ 1 ┆ 6.0 ┆ a ┆ x │ │ 2 ┆ 7.0 ┆ b ┆ y │ │ null ┆ null ┆ d ┆ z │ │ 3 ┆ 8.0 ┆ c ┆ null │ └──────┴──────┴─────┴───────┘
>>> df.join(other_df, on="ham", how="left") shape: (3, 4) ┌─────┬─────┬─────┬───────┐ │ foo ┆ bar ┆ ham ┆ apple │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str ┆ str │ ╞═════╪═════╪═════╪═══════╡ │ 1 ┆ 6.0 ┆ a ┆ x │ │ 2 ┆ 7.0 ┆ b ┆ y │ │ 3 ┆ 8.0 ┆ c ┆ null │ └─────┴─────┴─────┴───────┘
>>> df.join(other_df, on="ham", how="semi") shape: (2, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6.0 ┆ a │ │ 2 ┆ 7.0 ┆ b │ └─────┴─────┴─────┘
>>> df.join(other_df, on="ham", how="anti") shape: (1, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str │ ╞═════╪═════╪═════╡ │ 3 ┆ 8.0 ┆ c │ └─────┴─────┴─────┘
>>> df.join(other_df, how="cross") shape: (9, 5) ┌─────┬─────┬─────┬───────┬───────────┐ │ foo ┆ bar ┆ ham ┆ apple ┆ ham_right │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str ┆ str ┆ str │ ╞═════╪═════╪═════╪═══════╪═══════════╡ │ 1 ┆ 6.0 ┆ a ┆ x ┆ a │ │ 1 ┆ 6.0 ┆ a ┆ y ┆ b │ │ 1 ┆ 6.0 ┆ a ┆ z ┆ d │ │ 2 ┆ 7.0 ┆ b ┆ x ┆ a │ │ 2 ┆ 7.0 ┆ b ┆ y ┆ b │ │ 2 ┆ 7.0 ┆ b ┆ z ┆ d │ │ 3 ┆ 8.0 ┆ c ┆ x ┆ a │ │ 3 ┆ 8.0 ┆ c ┆ y ┆ b │ │ 3 ┆ 8.0 ┆ c ┆ z ┆ d │ └─────┴─────┴─────┴───────┴───────────┘
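The validate parameter has no example above, so here is a plain-Python sketch of the uniqueness checks it describes. It operates on lists of join-key values and raises ValueError on a violation; the function name, the error type and the messages are assumptions made for illustration, and the treatment of null keys is not modelled.

```python
from collections import Counter


def check_join_validation(left_keys, right_keys, validate="m:m"):
    """Raise ValueError if the key cardinality does not match `validate`."""
    if validate not in {"m:m", "m:1", "1:m", "1:1"}:
        raise ValueError(f"unknown validation mode: {validate!r}")

    def keys_are_unique(keys):
        return all(count == 1 for count in Counter(keys).values())

    left_side, right_side = validate.split(":")
    # A "1" on a side means the keys on that side must be unique.
    if left_side == "1" and not keys_are_unique(left_keys):
        raise ValueError("join keys are not unique in the left dataset")
    if right_side == "1" and not keys_are_unique(right_keys):
        raise ValueError("join keys are not unique in the right dataset")


# Keys of the "ham" column in the two example frames above: unique on both sides.
check_join_validation(["a", "b", "c"], ["a", "b", "d"], validate="1:1")

# Repeated keys on the left are fine for many-to-one, but not for one-to-many.
check_join_validation(["a", "a"], ["a"], validate="m:1")
try:
    check_join_validation(["a", "a"], ["a"], validate="1:m")
    one_to_many_ok = True
except ValueError:
    one_to_many_ok = False
print(one_to_many_ok)
```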
- join_asof(other: DataFrame, *, left_on: str | None | Expr = None, right_on: str | None | Expr = None, on: str | None | Expr = None, by_left: str | Sequence[str] | None = None, by_right: str | Sequence[str] | None = None, by: str | Sequence[str] | None = None, strategy: AsofJoinStrategy = 'backward', suffix: str = '_right', tolerance: str | int | float | timedelta | None = None, allow_parallel: bool = True, force_parallel: bool = False, coalesce: bool = True, allow_exact_matches: bool = True, check_sortedness: bool = True) DataFrame[source]¶
Perform an asof join.
This is similar to a left-join except that we match on nearest key rather than equal keys.
Both DataFrames must be sorted by the on key (within each by group, if specified).
For each row in the left DataFrame:
A “backward” search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key.
A “forward” search selects the first row in the right DataFrame whose ‘on’ key is greater than or equal to the left’s key.
A “nearest” search selects the last row in the right DataFrame whose value is nearest to the left’s key. String keys are not currently supported for a nearest search.
The default is “backward”.
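The three search strategies can be sketched with binary search over a sorted list of right-hand keys. This is a plain-Python illustration of the rules stated above, not the Polars implementation: it returns the index of the matched right row (or None), ignores tolerance and by groups, and resolves ties in the "nearest" case in favour of the later row, which is an assumption read from the "last row" wording. `asof_match` is a hypothetical helper name.

```python
from bisect import bisect_left, bisect_right


def asof_match(right_keys, key, strategy="backward"):
    """Index of the row in sorted `right_keys` matched for `key`, or None."""
    if strategy == "backward":
        # Last right key that is <= key.
        idx = bisect_right(right_keys, key) - 1
        return idx if idx >= 0 else None
    if strategy == "forward":
        # First right key that is >= key.
        idx = bisect_left(right_keys, key)
        return idx if idx < len(right_keys) else None
    if strategy == "nearest":
        back = asof_match(right_keys, key, "backward")
        fwd = asof_match(right_keys, key, "forward")
        if back is None or fwd is None:
            return fwd if back is None else back
        d_back = key - right_keys[back]
        d_fwd = right_keys[fwd] - key
        if d_back != d_fwd:
            return back if d_back < d_fwd else fwd
        return max(back, fwd)  # assumption: ties go to the later row
    raise ValueError(f"unknown strategy: {strategy!r}")


right_keys = [1, 5, 10]
backward_idx = asof_match(right_keys, 6, "backward")
forward_idx = asof_match(right_keys, 6, "forward")
nearest_idx = asof_match(right_keys, 6, "nearest")
print(backward_idx, forward_idx, nearest_idx)
```

A left key smaller than every right key has no backward match, and a left key larger than every right key has no forward match; in a left-join-like result such rows would carry nulls for the right-hand columns.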
- Parameters:
- other
DataFrame to join with.
- left_on
Join column of the left DataFrame.
- right_on
Join column of the right DataFrame.
- on
Join column of both DataFrames. If set, left_on and right_on should be None.
- by
Join on these columns before doing the asof join.
- by_left
Columns of the left DataFrame to join on before doing the asof join.
- by_right
Columns of the right DataFrame to join on before doing the asof join.
- strategy{‘backward’, ‘forward’, ‘nearest’}
Join strategy.
- suffix
Suffix to append to columns with a duplicate name.
- tolerance
Numeric tolerance. By setting this the join will only be done if the near keys are within this distance. If an asof join is done on columns of dtype “Date”, “Datetime”, “Duration” or “Time”, use either a datetime.timedelta object or the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
Or combine them: “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds
By “calendar day”, we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for “calendar week”, “calendar month”, “calendar quarter”, and “calendar year”.
- allow_parallel
Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.
- force_parallel
Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.
- coalesce
Coalescing behavior (merging of on / left_on / right_on columns):
True: Always coalesce join columns.
False: Never coalesce join columns.
Note that joining on any other expressions than col will turn off coalescing.
- allow_exact_matches
Whether exact matches are valid join predicates.
- If True, allow matching with the same on value (i.e., less-than-or-equal-to / greater-than-or-equal-to).
- If False, don’t match the same on value (i.e., strictly less-than / strictly greater-than).
- check_sortedness
Check the sortedness of the asof keys. If the keys are not sorted Polars will error. Currently, sortedness cannot be checked if ‘by’ groups are provided.
Examples
>>> from datetime import date >>> gdp = pl.DataFrame( ... { ... "date": pl.date_range( ... date(2016, 1, 1), ... date(2020, 1, 1), ... "1y", ... eager=True, ... ), ... "gdp": [4164, 4411, 4566, 4696, 4827], ... } ... ) >>> gdp shape: (5, 2) ┌────────────┬──────┐ │ date ┆ gdp │ │ --- ┆ --- │ │ date ┆ i64 │ ╞════════════╪══════╡ │ 2016-01-01 ┆ 4164 │ │ 2017-01-01 ┆ 4411 │ │ 2018-01-01 ┆ 4566 │ │ 2019-01-01 ┆ 4696 │ │ 2020-01-01 ┆ 4827 │ └────────────┴──────┘
>>> population = pl.DataFrame( ... { ... "date": [date(2016, 3, 1), date(2018, 8, 1), date(2019, 1, 1)], ... "population": [82.19, 82.66, 83.12], ... } ... ).sort("date") >>> population shape: (3, 2) ┌────────────┬────────────┐ │ date ┆ population │ │ --- ┆ --- │ │ date ┆ f64 │ ╞════════════╪════════════╡ │ 2016-03-01 ┆ 82.19 │ │ 2018-08-01 ┆ 82.66 │ │ 2019-01-01 ┆ 83.12 │ └────────────┴────────────┘
Note how the dates don’t quite match. If we join them using join_asof and strategy=’backward’, then each date from population which doesn’t have an exact match is matched with the closest earlier date from gdp:
>>> population.join_asof(gdp, on="date", strategy="backward") shape: (3, 3) ┌────────────┬────────────┬──────┐ │ date ┆ population ┆ gdp │ │ --- ┆ --- ┆ --- │ │ date ┆ f64 ┆ i64 │ ╞════════════╪════════════╪══════╡ │ 2016-03-01 ┆ 82.19 ┆ 4164 │ │ 2018-08-01 ┆ 82.66 ┆ 4566 │ │ 2019-01-01 ┆ 83.12 ┆ 4696 │ └────────────┴────────────┴──────┘
Note how:
date 2016-03-01 from population is matched with 2016-01-01 from gdp;
date 2018-08-01 from population is matched with 2018-01-01 from gdp.
You can verify this by passing coalesce=False:
>>> population.join_asof(gdp, on="date", strategy="backward", coalesce=False) shape: (3, 4) ┌────────────┬────────────┬────────────┬──────┐ │ date ┆ population ┆ date_right ┆ gdp │ │ --- ┆ --- ┆ --- ┆ --- │ │ date ┆ f64 ┆ date ┆ i64 │ ╞════════════╪════════════╪════════════╪══════╡ │ 2016-03-01 ┆ 82.19 ┆ 2016-01-01 ┆ 4164 │ │ 2018-08-01 ┆ 82.66 ┆ 2018-01-01 ┆ 4566 │ │ 2019-01-01 ┆ 83.12 ┆ 2019-01-01 ┆ 4696 │ └────────────┴────────────┴────────────┴──────┘
If we instead use strategy=’forward’, then each date from population which doesn’t have an exact match is matched with the closest later date from gdp:
>>> population.join_asof(gdp, on="date", strategy="forward") shape: (3, 3) ┌────────────┬────────────┬──────┐ │ date ┆ population ┆ gdp │ │ --- ┆ --- ┆ --- │ │ date ┆ f64 ┆ i64 │ ╞════════════╪════════════╪══════╡ │ 2016-03-01 ┆ 82.19 ┆ 4411 │ │ 2018-08-01 ┆ 82.66 ┆ 4696 │ │ 2019-01-01 ┆ 83.12 ┆ 4696 │ └────────────┴────────────┴──────┘
Note how:
date 2016-03-01 from population is matched with 2017-01-01 from gdp;
date 2018-08-01 from population is matched with 2019-01-01 from gdp.
Finally, strategy=’nearest’ gives us a mix of the two results above, as each date from population which doesn’t have an exact match is matched with the closest date from gdp, regardless of whether it’s earlier or later:
>>> population.join_asof(gdp, on="date", strategy="nearest") shape: (3, 3) ┌────────────┬────────────┬──────┐ │ date ┆ population ┆ gdp │ │ --- ┆ --- ┆ --- │ │ date ┆ f64 ┆ i64 │ ╞════════════╪════════════╪══════╡ │ 2016-03-01 ┆ 82.19 ┆ 4164 │ │ 2018-08-01 ┆ 82.66 ┆ 4696 │ │ 2019-01-01 ┆ 83.12 ┆ 4696 │ └────────────┴────────────┴──────┘
Note how:
date 2016-03-01 from population is matched with 2016-01-01 from gdp;
date 2018-08-01 from population is matched with 2019-01-01 from gdp.
The by argument allows joining on another column first, before the asof join. In this example we join by country first, then asof join by date, as above.
>>> gdp_dates = pl.date_range( # fmt: skip ... date(2016, 1, 1), date(2020, 1, 1), "1y", eager=True ... ) >>> gdp2 = pl.DataFrame( ... { ... "country": ["Germany"] * 5 + ["Netherlands"] * 5, ... "date": pl.concat([gdp_dates, gdp_dates]), ... "gdp": [4164, 4411, 4566, 4696, 4827, 784, 833, 914, 910, 909], ... } ... ).sort("country", "date") >>> >>> gdp2 shape: (10, 3) ┌─────────────┬────────────┬──────┐ │ country ┆ date ┆ gdp │ │ --- ┆ --- ┆ --- │ │ str ┆ date ┆ i64 │ ╞═════════════╪════════════╪══════╡ │ Germany ┆ 2016-01-01 ┆ 4164 │ │ Germany ┆ 2017-01-01 ┆ 4411 │ │ Germany ┆ 2018-01-01 ┆ 4566 │ │ Germany ┆ 2019-01-01 ┆ 4696 │ │ Germany ┆ 2020-01-01 ┆ 4827 │ │ Netherlands ┆ 2016-01-01 ┆ 784 │ │ Netherlands ┆ 2017-01-01 ┆ 833 │ │ Netherlands ┆ 2018-01-01 ┆ 914 │ │ Netherlands ┆ 2019-01-01 ┆ 910 │ │ Netherlands ┆ 2020-01-01 ┆ 909 │ └─────────────┴────────────┴──────┘ >>> pop2 = pl.DataFrame( ... { ... "country": ["Germany"] * 3 + ["Netherlands"] * 3, ... "date": [ ... date(2016, 3, 1), ... date(2018, 8, 1), ... date(2019, 1, 1), ... date(2016, 3, 1), ... date(2018, 8, 1), ... date(2019, 1, 1), ... ], ... "population": [82.19, 82.66, 83.12, 17.11, 17.32, 17.40], ... } ... 
).sort("country", "date") >>> >>> pop2 shape: (6, 3) ┌─────────────┬────────────┬────────────┐ │ country ┆ date ┆ population │ │ --- ┆ --- ┆ --- │ │ str ┆ date ┆ f64 │ ╞═════════════╪════════════╪════════════╡ │ Germany ┆ 2016-03-01 ┆ 82.19 │ │ Germany ┆ 2018-08-01 ┆ 82.66 │ │ Germany ┆ 2019-01-01 ┆ 83.12 │ │ Netherlands ┆ 2016-03-01 ┆ 17.11 │ │ Netherlands ┆ 2018-08-01 ┆ 17.32 │ │ Netherlands ┆ 2019-01-01 ┆ 17.4 │ └─────────────┴────────────┴────────────┘ >>> pop2.join_asof(gdp2, by="country", on="date", strategy="nearest") shape: (6, 4) ┌─────────────┬────────────┬────────────┬──────┐ │ country ┆ date ┆ population ┆ gdp │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ date ┆ f64 ┆ i64 │ ╞═════════════╪════════════╪════════════╪══════╡ │ Germany ┆ 2016-03-01 ┆ 82.19 ┆ 4164 │ │ Germany ┆ 2018-08-01 ┆ 82.66 ┆ 4696 │ │ Germany ┆ 2019-01-01 ┆ 83.12 ┆ 4696 │ │ Netherlands ┆ 2016-03-01 ┆ 17.11 ┆ 784 │ │ Netherlands ┆ 2018-08-01 ┆ 17.32 ┆ 910 │ │ Netherlands ┆ 2019-01-01 ┆ 17.4 ┆ 910 │ └─────────────┴────────────┴────────────┴──────┘
- join_where(other: DataFrame, *predicates: Expr | Iterable[Expr], suffix: str = '_right') DataFrame[source]¶
Perform a join based on one or multiple (in)equality predicates.
This performs an inner join, so only rows where all predicates are true are included in the result, and a row from either DataFrame may be included multiple times in the result.
Note
The row order of the input DataFrames is not preserved.
Warning
This functionality is experimental. It may be changed at any point without it being considered a breaking change.
- Parameters:
- other
DataFrame to join with.
- *predicates
(In)Equality condition to join the two tables on. When a column name occurs in both tables, the proper suffix must be applied in the predicate.
- suffix
Suffix to append to columns with a duplicate name.
Examples
Join two dataframes together based on two predicates which get AND-ed together.
>>> east = pl.DataFrame( ... { ... "id": [100, 101, 102], ... "dur": [120, 140, 160], ... "rev": [12, 14, 16], ... "cores": [2, 8, 4], ... } ... ) >>> west = pl.DataFrame( ... { ... "t_id": [404, 498, 676, 742], ... "time": [90, 130, 150, 170], ... "cost": [9, 13, 15, 16], ... "cores": [4, 2, 1, 4], ... } ... ) >>> east.join_where( ... west, ... pl.col("dur") < pl.col("time"), ... pl.col("rev") < pl.col("cost"), ... ) shape: (5, 8) ┌─────┬─────┬─────┬───────┬──────┬──────┬──────┬─────────────┐ │ id ┆ dur ┆ rev ┆ cores ┆ t_id ┆ time ┆ cost ┆ cores_right │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╪═══════╪══════╪══════╪══════╪═════════════╡ │ 100 ┆ 120 ┆ 12 ┆ 2 ┆ 498 ┆ 130 ┆ 13 ┆ 2 │ │ 100 ┆ 120 ┆ 12 ┆ 2 ┆ 676 ┆ 150 ┆ 15 ┆ 1 │ │ 100 ┆ 120 ┆ 12 ┆ 2 ┆ 742 ┆ 170 ┆ 16 ┆ 4 │ │ 101 ┆ 140 ┆ 14 ┆ 8 ┆ 676 ┆ 150 ┆ 15 ┆ 1 │ │ 101 ┆ 140 ┆ 14 ┆ 8 ┆ 742 ┆ 170 ┆ 16 ┆ 4 │ └─────┴─────┴─────┴───────┴──────┴──────┴──────┴─────────────┘
To OR them together, use a single expression and the | operator.
>>> east.join_where( ... west, ... (pl.col("dur") < pl.col("time")) | (pl.col("rev") < pl.col("cost")), ... ) shape: (6, 8) ┌─────┬─────┬─────┬───────┬──────┬──────┬──────┬─────────────┐ │ id ┆ dur ┆ rev ┆ cores ┆ t_id ┆ time ┆ cost ┆ cores_right │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╪═══════╪══════╪══════╪══════╪═════════════╡ │ 100 ┆ 120 ┆ 12 ┆ 2 ┆ 498 ┆ 130 ┆ 13 ┆ 2 │ │ 100 ┆ 120 ┆ 12 ┆ 2 ┆ 676 ┆ 150 ┆ 15 ┆ 1 │ │ 100 ┆ 120 ┆ 12 ┆ 2 ┆ 742 ┆ 170 ┆ 16 ┆ 4 │ │ 101 ┆ 140 ┆ 14 ┆ 8 ┆ 676 ┆ 150 ┆ 15 ┆ 1 │ │ 101 ┆ 140 ┆ 14 ┆ 8 ┆ 742 ┆ 170 ┆ 16 ┆ 4 │ │ 102 ┆ 160 ┆ 16 ┆ 4 ┆ 742 ┆ 170 ┆ 16 ┆ 4 │ └─────┴─────┴─────┴───────┴──────┴──────┴──────┴─────────────┘
- lazy = None¶
- limit(n: int = 5) DataFrame[source]¶
Get the first n rows.
Alias for DataFrame.head().
- Parameters:
- n
Number of rows to return. If a negative value is passed, return all rows except the last abs(n).
Examples
Get the first 3 rows of a DataFrame.
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3, 4, 5], ... "bar": [6, 7, 8, 9, 10], ... "ham": ["a", "b", "c", "d", "e"], ... } ... ) >>> df.limit(3) shape: (3, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ │ 2 ┆ 7 ┆ b │ │ 3 ┆ 8 ┆ c │ └─────┴─────┴─────┘
- map_columns(column_names: str | Sequence[str] | pl.Selector, function: Callable[[Series], Series], *args: P.args, **kwargs: P.kwargs) DataFrame[source]¶
Apply eager functions to columns of a DataFrame.
Users should always prefer with_columns() unless they are using expressions that are only possible on Series and not on Expr. This is almost never the case, except for a very select few functions that cannot know the output datatype without looking at the data.
- Parameters:
- column_names
The columns to apply the UDF to.
- function
Callable; will receive a column series as the first parameter, followed by any given args/kwargs.
- *args
Arguments to pass to the UDF.
- **kwargs
Keyword arguments to pass to the UDF.
Examples
>>> df = pl.DataFrame({"a": [1, 2, 3, 4], "b": ["10", "20", "30", "40"]}) >>> df.map_columns("a", lambda s: s.shrink_dtype()) shape: (4, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i8 ┆ str │ ╞═════╪═════╡ │ 1 ┆ 10 │ │ 2 ┆ 20 │ │ 3 ┆ 30 │ │ 4 ┆ 40 │ └─────┴─────┘
>>> df = pl.DataFrame( ... { ... "a": ['{"x":"a"}', None, '{"x":"b"}', None], ... "b": ['{"a":1, "b": true}', None, '{"a":2, "b": false}', None], ... } ... ) >>> df.map_columns(["a", "b"], lambda s: s.str.json_decode()) shape: (4, 2) ┌───────────┬───────────┐ │ a ┆ b │ │ --- ┆ --- │ │ struct[1] ┆ struct[2] │ ╞═══════════╪═══════════╡ │ {"a"} ┆ {1,true} │ │ null ┆ null │ │ {"b"} ┆ {2,false} │ │ null ┆ null │ └───────────┴───────────┘ >>> import polars.selectors as cs >>> df.map_columns(cs.all(), lambda s: s.str.json_decode()) shape: (4, 2) ┌───────────┬───────────┐ │ a ┆ b │ │ --- ┆ --- │ │ struct[1] ┆ struct[2] │ ╞═══════════╪═══════════╡ │ {"a"} ┆ {1,true} │ │ null ┆ null │ │ {"b"} ┆ {2,false} │ │ null ┆ null │ └───────────┴───────────┘
- map_rows(function: Callable[[tuple[Any, ...]], Any], return_dtype: PolarsDataType | None = None, *, inference_size: int = 256) DataFrame[source]¶
Apply a custom/user-defined function (UDF) over the rows of the DataFrame.
Warning
This method is much slower than the native expressions API. Only use it if you cannot implement your logic otherwise.
The UDF will receive each row as a tuple of values: udf(row).
Implementing logic using a Python function is almost always significantly slower and more memory intensive than implementing the same logic using the native expression API because:
The native expression engine runs in Rust; UDFs run in Python.
Use of Python UDFs forces the DataFrame to be materialized in memory.
Polars-native expressions can be parallelised (UDFs typically cannot).
Polars-native expressions can be logically optimised (UDFs cannot).
Wherever possible you should strongly prefer the native expression API to achieve the best performance.
- Parameters:
- function
Custom function or lambda.
- return_dtype
Output type of the operation. If none given, Polars tries to infer the type.
- inference_size
Only used in the case when the custom function returns rows. This uses the first n rows to determine the output schema.
Notes
The frame-level map_rows cannot track column names (as the UDF is a black-box that may arbitrarily drop, rearrange, transform, or add new columns); if you want to apply a UDF such that column names are preserved, you should use the expression-level map_elements syntax instead.
If your function is expensive and you don’t want it to be called more than once for a given input, consider applying an @lru_cache decorator to it. If your data is suitable you may achieve significant speedups.
Examples
>>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [-1, 5, 8]})
Return a DataFrame by mapping each row to a tuple:
>>> df.map_rows(lambda t: (t[0] * 2, t[1] * 3)) shape: (3, 2) ┌──────────┬──────────┐ │ column_0 ┆ column_1 │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞══════════╪══════════╡ │ 2 ┆ -3 │ │ 4 ┆ 15 │ │ 6 ┆ 24 │ └──────────┴──────────┘
However, it is much better to implement this with a native expression:
>>> df.select( ... pl.col("foo") * 2, ... pl.col("bar") * 3, ... )
Return a DataFrame with a single column by mapping each row to a scalar:
>>> df.map_rows(lambda t: (t[0] * 2 + t[1])) shape: (3, 1) ┌─────┐ │ map │ │ --- │ │ i64 │ ╞═════╡ │ 1 │ │ 9 │ │ 14 │ └─────┘
In this case it is better to use the following native expression:
>>> df.select(pl.col("foo") * 2 + pl.col("bar"))
- match_to_schema(schema: SchemaDict | Schema, *, missing_columns: Literal['insert', 'raise'] | Mapping[str, Literal['insert', 'raise'] | Expr] = 'raise', missing_struct_fields: Literal['insert', 'raise'] | Mapping[str, Literal['insert', 'raise']] = 'raise', extra_columns: Literal['ignore', 'raise'] = 'raise', extra_struct_fields: Literal['ignore', 'raise'] | Mapping[str, Literal['ignore', 'raise']] = 'raise', integer_cast: Literal['upcast', 'forbid'] | Mapping[str, Literal['upcast', 'forbid']] = 'forbid', float_cast: Literal['upcast', 'forbid'] | Mapping[str, Literal['upcast', 'forbid']] = 'forbid') DataFrame[source]¶
Match or evolve the schema of a DataFrame into a specific schema.
By default, match_to_schema returns an error if the input schema does not exactly match the target schema. It also allows columns to be freely reordered, with additional coercion rules available through optional parameters.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- Parameters:
- schema
Target schema to match or evolve to.
- missing_columns
Raise or insert missing columns from the input with respect to the schema.
This can also be an expression per column with what to insert if it is missing.
- missing_struct_fields
Raise or insert missing struct fields from the input with respect to the schema.
- extra_columns
Raise or ignore extra columns from the input with respect to the schema.
- extra_struct_fields
Raise or ignore extra struct fields from the input with respect to the schema.
- integer_cast
Forbid or upcast integer columns from the input to the respective column in the schema.
- float_cast
Forbid or upcast float columns from the input to the respective column in the schema.
Examples
Ensuring the schema matches
>>> df = pl.DataFrame({"a": [1, 2, 3], "b": ["A", "B", "C"]}) >>> df.match_to_schema({"a": pl.Int64, "b": pl.String}) shape: (3, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ str │ ╞═════╪═════╡ │ 1 ┆ A │ │ 2 ┆ B │ │ 3 ┆ C │ └─────┴─────┘ >>> df.match_to_schema({"a": pl.Int64}) polars.exceptions.SchemaError: extra columns in `match_to_schema`: "b"
Adding missing columns
>>> ( ... pl.DataFrame({"a": [1, 2, 3]}).match_to_schema( ... {"a": pl.Int64, "b": pl.String}, ... missing_columns="insert", ... ) ... ) shape: (3, 2) ┌─────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ str │ ╞═════╪══════╡ │ 1 ┆ null │ │ 2 ┆ null │ │ 3 ┆ null │ └─────┴──────┘ >>> ( ... pl.DataFrame({"a": [1, 2, 3]}).match_to_schema( ... {"a": pl.Int64, "b": pl.String}, ... missing_columns={"b": pl.col.a.cast(pl.String)}, ... ) ... ) shape: (3, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ str │ ╞═════╪═════╡ │ 1 ┆ 1 │ │ 2 ┆ 2 │ │ 3 ┆ 3 │ └─────┴─────┘
Removing extra columns
>>> ( ... pl.DataFrame({"a": [1, 2, 3], "b": ["A", "B", "C"]}).match_to_schema( ... {"a": pl.Int64}, ... extra_columns="ignore", ... ) ... ) shape: (3, 1) ┌─────┐ │ a │ │ --- │ │ i64 │ ╞═════╡ │ 1 │ │ 2 │ │ 3 │ └─────┘
Upcasting integers and floats
>>> ( ... pl.DataFrame( ... {"a": [1, 2, 3], "b": [1.0, 2.0, 3.0]}, ... schema={"a": pl.Int32, "b": pl.Float32}, ... ).match_to_schema( ... {"a": pl.Int64, "b": pl.Float64}, ... integer_cast="upcast", ... float_cast="upcast", ... ) ... ) shape: (3, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ f64 │ ╞═════╪═════╡ │ 1 ┆ 1.0 │ │ 2 ┆ 2.0 │ │ 3 ┆ 3.0 │ └─────┴─────┘
- max() DataFrame[source]¶
Aggregate the columns of this DataFrame to their maximum value.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.max() shape: (1, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 3 ┆ 8 ┆ c │ └─────┴─────┴─────┘
- max_horizontal() Series[source]¶
Get the maximum value horizontally across columns.
- Returns:
- Series
A Series named “max”.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [4.0, 5.0, 6.0], ... } ... ) >>> df.max_horizontal() shape: (3,) Series: 'max' [f64] [ 4.0 5.0 6.0 ]
- mean() DataFrame[source]¶
Aggregate the columns of this DataFrame to their mean value.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... "spam": [True, False, None], ... } ... ) >>> df.mean() shape: (1, 4) ┌─────┬─────┬──────┬──────┐ │ foo ┆ bar ┆ ham ┆ spam │ │ --- ┆ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ str ┆ f64 │ ╞═════╪═════╪══════╪══════╡ │ 2.0 ┆ 7.0 ┆ null ┆ 0.5 │ └─────┴─────┴──────┴──────┘
- mean_horizontal(*, ignore_nulls: bool = True) Series[source]¶
Take the mean of all values horizontally across columns.
- Parameters:
- ignore_nulls
Ignore null values (default). If set to False, any null value in the input will lead to a null output.
- Returns:
- Series
A Series named “mean”.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [4.0, 5.0, 6.0], ... } ... ) >>> df.mean_horizontal() shape: (3,) Series: 'mean' [f64] [ 2.5 3.5 4.5 ]
- median() DataFrame[source]¶
Aggregate the columns of this DataFrame to their median value.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.median() shape: (1, 3) ┌─────┬─────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ str │ ╞═════╪═════╪══════╡ │ 2.0 ┆ 7.0 ┆ null │ └─────┴─────┴──────┘
- melt(id_vars: ColumnNameOrSelector | Sequence[ColumnNameOrSelector] | None = None, value_vars: ColumnNameOrSelector | Sequence[ColumnNameOrSelector] | None = None, variable_name: str | None = None, value_name: str | None = None) DataFrame[source]¶
Unpivot a DataFrame from wide to long format.
Optionally leaves identifiers set.
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars) while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis leaving just two non-identifier columns, ‘variable’ and ‘value’.
Deprecated since version 1.0.0: Use the unpivot() method instead.
- Parameters:
- id_vars
Column(s) or selector(s) to use as identifier variables.
- value_vars
Column(s) or selector(s) to use as values variables; if value_vars is empty all columns that are not in id_vars will be used.
- variable_name
Name to give to the variable column. Defaults to “variable”
- value_name
Name to give to the value column. Defaults to “value”
- merge_sorted(other: DataFrame, key: str) DataFrame[source]¶
Take two sorted DataFrames and merge them by the sorted key.
The output of this operation will also be sorted. It is the caller’s responsibility to ensure that the frames are sorted in ascending order by that key; otherwise the output will not make sense.
The schemas of both DataFrames must be equal.
- Parameters:
- other
Other DataFrame that must be merged
- key
Key that is sorted.
Notes
No guarantee is given about the output row order when the key is equal in both DataFrames.
The key must be sorted in ascending order.
Examples
>>> df0 = pl.DataFrame( ... {"name": ["steve", "elise", "bob"], "age": [42, 44, 18]} ... ).sort("age") >>> df0 shape: (3, 2) ┌───────┬─────┐ │ name ┆ age │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═══════╪═════╡ │ bob ┆ 18 │ │ steve ┆ 42 │ │ elise ┆ 44 │ └───────┴─────┘ >>> df1 = pl.DataFrame( ... {"name": ["anna", "megan", "steve", "thomas"], "age": [21, 33, 42, 20]} ... ).sort("age") >>> df1 shape: (4, 2) ┌────────┬─────┐ │ name ┆ age │ │ --- ┆ --- │ │ str ┆ i64 │ ╞════════╪═════╡ │ thomas ┆ 20 │ │ anna ┆ 21 │ │ megan ┆ 33 │ │ steve ┆ 42 │ └────────┴─────┘ >>> df0.merge_sorted(df1, key="age") shape: (7, 2) ┌────────┬─────┐ │ name ┆ age │ │ --- ┆ --- │ │ str ┆ i64 │ ╞════════╪═════╡ │ bob ┆ 18 │ │ thomas ┆ 20 │ │ anna ┆ 21 │ │ megan ┆ 33 │ │ steve ┆ 42 │ │ steve ┆ 42 │ │ elise ┆ 44 │ └────────┴─────┘
- min() DataFrame[source]¶
Aggregate the columns of this DataFrame to their minimum value.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.min() shape: (1, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ └─────┴─────┴─────┘
- min_horizontal() Series[source]¶
Get the minimum value horizontally across columns.
- Returns:
- Series
A Series named “min”.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [4.0, 5.0, 6.0], ... } ... ) >>> df.min_horizontal() shape: (3,) Series: 'min' [f64] [ 1.0 2.0 3.0 ]
- n_chunks(strategy: Literal['first', 'all'] = 'first') int | list[int][source]¶
Get number of chunks used by the ChunkedArrays of this DataFrame.
- Parameters:
- strategy{‘first’, ‘all’}
Return the number of chunks of the ‘first’ column, or ‘all’ columns in this DataFrame.
Examples
>>> df = pl.DataFrame( ... { ... "a": [1, 2, 3, 4], ... "b": [0.5, 4, 10, 13], ... "c": [True, True, False, True], ... } ... ) >>> df.n_chunks() 1 >>> df.n_chunks(strategy="all") [1, 1, 1]
- n_unique(subset: str | Expr | Sequence[str | Expr] | None = None) int[source]¶
Return the number of unique rows, or the number of unique row-subsets.
- Parameters:
- subset
One or more columns/expressions that define what to count; omit to return the count of unique rows.
Notes
This method operates at the DataFrame level; to operate on subsets at the expression level you can make use of struct-packing instead, for example:
>>> expr_unique_subset = pl.struct("a", "b").n_unique()
If instead you want to count the number of unique values per-column, you can also use expression-level syntax to return a new frame containing that result:
>>> df = pl.DataFrame( ... [[1, 2, 3], [1, 2, 4]], schema=["a", "b", "c"], orient="row" ... ) >>> df_nunique = df.select(pl.all().n_unique())
In aggregate context there is also an equivalent method for returning the number of unique values per group:
>>> df_agg_nunique = df.group_by("a").n_unique()
Examples
>>> df = pl.DataFrame( ... { ... "a": [1, 1, 2, 3, 4, 5], ... "b": [0.5, 0.5, 1.0, 2.0, 3.0, 3.0], ... "c": [True, True, True, False, True, True], ... } ... ) >>> df.n_unique() 5
Simple columns subset.
>>> df.n_unique(subset=["b", "c"]) 4
Expression subset.
>>> df.n_unique( ... subset=[ ... (pl.col("a") // 2), ... (pl.col("c") | (pl.col("b") >= 2)), ... ], ... ) 3
- null_count() DataFrame[source]¶
Create a new DataFrame that shows the null counts per column.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, None, 3], ... "bar": [6, 7, None], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.null_count() shape: (1, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ u32 ┆ u32 ┆ u32 │ ╞═════╪═════╪═════╡ │ 1 ┆ 1 ┆ 0 │ └─────┴─────┴─────┘
- partition_by(by: ColumnNameOrSelector | Sequence[ColumnNameOrSelector], *more_by: ColumnNameOrSelector, maintain_order: bool = True, include_key: bool = True, as_dict: bool = False) list[DataFrame] | dict[tuple[Any, ...], DataFrame][source]¶
Group by the given columns and return the groups as separate dataframes.
- Parameters:
- by
Column name(s) or selector(s) to group by.
- *more_by
Additional names of columns to group by, specified as positional arguments.
- maintain_order
Ensure that the order of the groups is consistent with the input data. This is slower than a default partition by operation.
- include_key
Include the columns used to partition the DataFrame in the output.
- as_dict
Return a dictionary instead of a list. The dictionary keys are tuples of the distinct group values that identify each group.
Examples
Pass a single column name to partition by that column.
>>> df = pl.DataFrame( ... { ... "a": ["a", "b", "a", "b", "c"], ... "b": [1, 2, 1, 3, 3], ... "c": [5, 4, 3, 2, 1], ... } ... ) >>> df.partition_by("a") [shape: (2, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ a ┆ 1 ┆ 5 │ │ a ┆ 1 ┆ 3 │ └─────┴─────┴─────┘, shape: (2, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ b ┆ 2 ┆ 4 │ │ b ┆ 3 ┆ 2 │ └─────┴─────┴─────┘, shape: (1, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ c ┆ 3 ┆ 1 │ └─────┴─────┴─────┘]
Partition by multiple columns by either passing a list of column names, or by specifying each column name as a positional argument.
>>> df.partition_by("a", "b") [shape: (2, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ a ┆ 1 ┆ 5 │ │ a ┆ 1 ┆ 3 │ └─────┴─────┴─────┘, shape: (1, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ b ┆ 2 ┆ 4 │ └─────┴─────┴─────┘, shape: (1, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ b ┆ 3 ┆ 2 │ └─────┴─────┴─────┘, shape: (1, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ c ┆ 3 ┆ 1 │ └─────┴─────┴─────┘]
Return the partitions as a dictionary by specifying as_dict=True.
>>> import polars.selectors as cs >>> df.partition_by(cs.string(), as_dict=True) {('a',): shape: (2, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ a ┆ 1 ┆ 5 │ │ a ┆ 1 ┆ 3 │ └─────┴─────┴─────┘, ('b',): shape: (2, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ b ┆ 2 ┆ 4 │ │ b ┆ 3 ┆ 2 │ └─────┴─────┴─────┘, ('c',): shape: (1, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ c ┆ 3 ┆ 1 │ └─────┴─────┴─────┘}
- pipe(function: Callable[Concatenate[DataFrame[S], P], R], *args: P.args, **kwargs: P.kwargs) R[source]¶
Offers a structured way to apply a sequence of user-defined functions (UDFs).
- Parameters:
- function
Callable; will receive the frame as the first parameter, followed by any given args/kwargs.
- *args
Arguments to pass to the UDF.
- **kwargs
Keyword arguments to pass to the UDF.
Notes
It is recommended to use LazyFrame when piping operations, in order to fully take advantage of query optimization and parallelization. See df.lazy().
Examples
>>> def cast_str_to_int(data, col_name): ... return data.with_columns(pl.col(col_name).cast(pl.Int64)) >>> df = pl.DataFrame({"a": [1, 2, 3, 4], "b": ["10", "20", "30", "40"]}) >>> df.pipe(cast_str_to_int, col_name="b") shape: (4, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 10 │ │ 2 ┆ 20 │ │ 3 ┆ 30 │ │ 4 ┆ 40 │ └─────┴─────┘
>>> df = pl.DataFrame({"b": [1, 2], "a": [3, 4]}) >>> df shape: (2, 2) ┌─────┬─────┐ │ b ┆ a │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 3 │ │ 2 ┆ 4 │ └─────┴─────┘ >>> df.pipe(lambda tdf: tdf.select(sorted(tdf.columns))) shape: (2, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 3 ┆ 1 │ │ 4 ┆ 2 │ └─────┴─────┘
- pivot(on: ColumnNameOrSelector | Sequence[ColumnNameOrSelector], *, index: ColumnNameOrSelector | Sequence[ColumnNameOrSelector] | None = None, values: ColumnNameOrSelector | Sequence[ColumnNameOrSelector] | None = None, aggregate_function: PivotAgg | Expr | None = None, maintain_order: bool = True, sort_columns: bool = False, separator: str = '_') DataFrame[source]¶
Create a spreadsheet-style pivot table as a DataFrame.
Only available in eager mode. See “Examples” section below for how to do a “lazy pivot” if you know the unique column values in advance.
Changed in version 1.0.0: The columns parameter was renamed on.
- Parameters:
- on
The column(s) whose values will be used as the new columns of the output DataFrame.
- index
The column(s) that remain from the input to the output. The output DataFrame will have one row for each unique combination of the index’s values. If None, all remaining columns not specified in on and values will be used. At least one of index and values must be specified.
- values
The existing column(s) of values which will be moved under the new columns from index. If an aggregation is specified, these are the values on which the aggregation will be computed. If None, all remaining columns not specified in on and index will be used. At least one of index and values must be specified.
- aggregate_function
Choose from:
None: no aggregation takes place, will raise error if multiple values are in group.
A predefined aggregate function string, one of {‘min’, ‘max’, ‘first’, ‘last’, ‘sum’, ‘mean’, ‘median’, ‘len’}
An expression to do the aggregation. The expression can only access data from the respective ‘values’ columns as generated by pivot, through pl.element().
- maintain_order
Ensure the values of index are sorted by discovery order.
- sort_columns
Sort the transposed columns by name. Default is by order of discovery.
- separator
Used as separator/delimiter in generated column names in case of multiple values columns.
- Returns:
- DataFrame
Notes
In some other frameworks, you might know this operation as pivot_wider.
Examples
You can use pivot to reshape a dataframe from “long” to “wide” format.
For example, suppose we have a dataframe of test scores achieved by some students, where each row represents a distinct test.
>>> df = pl.DataFrame( ... { ... "name": ["Cady", "Cady", "Karen", "Karen"], ... "subject": ["maths", "physics", "maths", "physics"], ... "test_1": [98, 99, 61, 58], ... "test_2": [100, 100, 60, 60], ... } ... ) >>> df shape: (4, 4) ┌───────┬─────────┬────────┬────────┐ │ name ┆ subject ┆ test_1 ┆ test_2 │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ str ┆ i64 ┆ i64 │ ╞═══════╪═════════╪════════╪════════╡ │ Cady ┆ maths ┆ 98 ┆ 100 │ │ Cady ┆ physics ┆ 99 ┆ 100 │ │ Karen ┆ maths ┆ 61 ┆ 60 │ │ Karen ┆ physics ┆ 58 ┆ 60 │ └───────┴─────────┴────────┴────────┘
Using pivot, we can reshape so we have one row per student, with different subjects as columns, and their test_1 scores as values:
>>> df.pivot("subject", index="name", values="test_1") shape: (2, 3) ┌───────┬───────┬─────────┐ │ name ┆ maths ┆ physics │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═══════╪═══════╪═════════╡ │ Cady ┆ 98 ┆ 99 │ │ Karen ┆ 61 ┆ 58 │ └───────┴───────┴─────────┘
You can use selectors too - here we include all test scores in the pivoted table:
>>> import polars.selectors as cs >>> df.pivot("subject", values=cs.starts_with("test")) shape: (2, 5) ┌───────┬──────────────┬────────────────┬──────────────┬────────────────┐ │ name ┆ test_1_maths ┆ test_1_physics ┆ test_2_maths ┆ test_2_physics │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 │ ╞═══════╪══════════════╪════════════════╪══════════════╪════════════════╡ │ Cady ┆ 98 ┆ 99 ┆ 100 ┆ 100 │ │ Karen ┆ 61 ┆ 58 ┆ 60 ┆ 60 │ └───────┴──────────────┴────────────────┴──────────────┴────────────────┘
If you end up with multiple values per cell, you can specify how to aggregate them with aggregate_function:
>>> df = pl.DataFrame( ... { ... "ix": [1, 1, 2, 2, 1, 2], ... "col": ["a", "a", "a", "a", "b", "b"], ... "foo": [0, 1, 2, 2, 7, 1], ... "bar": [0, 2, 0, 0, 9, 4], ... } ... ) >>> df.pivot("col", index="ix", aggregate_function="sum") shape: (2, 5) ┌─────┬───────┬───────┬───────┬───────┐ │ ix ┆ foo_a ┆ foo_b ┆ bar_a ┆ bar_b │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │ ╞═════╪═══════╪═══════╪═══════╪═══════╡ │ 1 ┆ 1 ┆ 7 ┆ 2 ┆ 9 │ │ 2 ┆ 4 ┆ 1 ┆ 0 ┆ 4 │ └─────┴───────┴───────┴───────┴───────┘
You can also pass a custom aggregation function using polars.element():
>>> df = pl.DataFrame( ... { ... "col1": ["a", "a", "a", "b", "b", "b"], ... "col2": ["x", "x", "x", "x", "y", "y"], ... "col3": [6, 7, 3, 2, 5, 7], ... } ... ) >>> df.pivot( ... "col2", ... index="col1", ... values="col3", ... aggregate_function=pl.element().tanh().mean(), ... ) shape: (2, 3) ┌──────┬──────────┬──────────┐ │ col1 ┆ x ┆ y │ │ --- ┆ --- ┆ --- │ │ str ┆ f64 ┆ f64 │ ╞══════╪══════════╪══════════╡ │ a ┆ 0.998347 ┆ null │ │ b ┆ 0.964028 ┆ 0.999954 │ └──────┴──────────┴──────────┘
Note that pivot is only available in eager mode. If you know the unique column values in advance, you can use polars.LazyFrame.group_by() to get the same result as above in lazy mode:
>>> index = pl.col("col1") >>> on = pl.col("col2") >>> values = pl.col("col3") >>> unique_column_values = ["x", "y"] >>> aggregate_function = lambda col: col.tanh().mean() >>> df.lazy().group_by(index).agg( ... aggregate_function(values.filter(on == value)).alias(value) ... for value in unique_column_values ... ).collect() shape: (2, 3) ┌──────┬──────────┬──────────┐ │ col1 ┆ x ┆ y │ │ --- ┆ --- ┆ --- │ │ str ┆ f64 ┆ f64 │ ╞══════╪══════════╪══════════╡ │ a ┆ 0.998347 ┆ null │ │ b ┆ 0.964028 ┆ 0.999954 │ └──────┴──────────┴──────────┘
- property plot: DataFramePlot¶
Create a plot namespace.
Warning
This functionality is currently considered unstable. It may be changed at any point without it being considered a breaking change.
Changed in version 1.6.0: In prior versions of Polars, HvPlot was the plotting backend. If you would like to restore the previous plotting functionality, all you need to do is add import hvplot.polars at the top of your script and replace df.plot with df.hvplot.
Polars does not implement plotting logic itself, but instead defers to Altair:
df.plot.line(**kwargs) is shorthand for alt.Chart(df).mark_line(tooltip=True).encode(**kwargs).interactive()
df.plot.point(**kwargs) is shorthand for alt.Chart(df).mark_point(tooltip=True).encode(**kwargs).interactive() (and plot.scatter is provided as an alias)
df.plot.bar(**kwargs) is shorthand for alt.Chart(df).mark_bar(tooltip=True).encode(**kwargs).interactive()
for any other attribute attr, df.plot.attr(**kwargs) is shorthand for alt.Chart(df).mark_attr(tooltip=True).encode(**kwargs).interactive()
For configuration, we suggest reading Chart Configuration. For example, you can:
Change the width/height/title with .properties(width=500, height=350, title="My amazing plot").
Change the x-axis label rotation with .configure_axisX(labelAngle=30).
Change the opacity of the points in your scatter plot with .configure_point(opacity=.5).
Examples
Scatter plot:
>>> df = pl.DataFrame( ... { ... "length": [1, 4, 6], ... "width": [4, 5, 6], ... "species": ["setosa", "setosa", "versicolor"], ... } ... ) >>> df.plot.point(x="length", y="width", color="species")
Set the x-axis title by using altair.X:
>>> import altair as alt >>> df.plot.point( ... x=alt.X("length", title="Length"), y="width", color="species" ... )
Line plot:
>>> from datetime import date >>> df = pl.DataFrame( ... { ... "date": [date(2020, 1, 2), date(2020, 1, 3), date(2020, 1, 4)] * 2, ... "price": [1, 4, 6, 1, 5, 2], ... "stock": ["a", "a", "a", "b", "b", "b"], ... } ... ) >>> df.plot.line(x="date", y="price", color="stock")
Bar plot:
>>> df = pl.DataFrame( ... { ... "day": ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"] * 2, ... "group": ["a"] * 7 + ["b"] * 7, ... "value": [1, 3, 2, 4, 5, 6, 1, 1, 3, 2, 4, 5, 1, 2], ... } ... ) >>> df.plot.bar( ... x="day", y="value", color="day", column="group" ... )
Or, to make a stacked version of the plot above:
>>> df.plot.bar(x="day", y="value", color="group")
- product() DataFrame[source]¶
Aggregate the columns of this DataFrame to their product values.
Examples
>>> df = pl.DataFrame( ... { ... "a": [1, 2, 3], ... "b": [0.5, 4, 10], ... "c": [True, True, False], ... } ... )
>>> df.product() shape: (1, 3) ┌─────┬──────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ i64 │ ╞═════╪══════╪═════╡ │ 6 ┆ 20.0 ┆ 0 │ └─────┴──────┴─────┘
- quantile(quantile: float, interpolation: QuantileMethod = 'nearest') DataFrame[source]¶
Aggregate the columns of this DataFrame to their quantile value.
- Parameters:
- quantile
Quantile between 0.0 and 1.0.
- interpolation{‘nearest’, ‘higher’, ‘lower’, ‘midpoint’, ‘linear’, ‘equiprobable’}
Interpolation method.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.quantile(0.5, "nearest") shape: (1, 3) ┌─────┬─────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ str │ ╞═════╪═════╪══════╡ │ 2.0 ┆ 7.0 ┆ null │ └─────┴─────┴──────┘
- rechunk = None¶
- remove(*predicates: IntoExprColumn | Iterable[IntoExprColumn] | bool | list[bool] | np.ndarray[Any, Any], **constraints: Any) DataFrame[source]¶
Remove rows, dropping those that match the given predicate expression(s).
The original order of the remaining rows is preserved.
Rows where the filter predicate does not evaluate to True are retained (this includes rows where the predicate evaluates as null).
- Parameters:
- predicates
Expression that evaluates to a boolean Series.
- constraints
Column filters; use name = value to filter columns using the supplied value. Each constraint behaves the same as pl.col(name).eq(value), and is implicitly joined with the other filter conditions using &.
Notes
If you are transitioning from Pandas, and performing filter operations based on the comparison of two or more columns, please note that in Polars any comparison involving null values will result in a null result, not boolean True or False. As a result, these rows will not be removed. Ensure that null values are handled appropriately to avoid unexpected behaviour (see examples below).
Examples
>>> df = pl.DataFrame( ... { ... "foo": [2, 3, None, 4, 0], ... "bar": [5, 6, None, None, 0], ... "ham": ["a", "b", None, "c", "d"], ... } ... )
Remove rows matching a condition:
>>> df.remove(pl.col("bar") >= 5) shape: (3, 3) ┌──────┬──────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞══════╪══════╪══════╡ │ null ┆ null ┆ null │ │ 4 ┆ null ┆ c │ │ 0 ┆ 0 ┆ d │ └──────┴──────┴──────┘
Discard rows based on multiple conditions, combined with and/or operators:
>>> df.remove( ... (pl.col("foo") >= 0) & (pl.col("bar") >= 0), ... ) shape: (2, 3) ┌──────┬──────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞══════╪══════╪══════╡ │ null ┆ null ┆ null │ │ 4 ┆ null ┆ c │ └──────┴──────┴──────┘
>>> df.remove( ... (pl.col("foo") >= 0) | (pl.col("bar") >= 0), ... ) shape: (1, 3) ┌──────┬──────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞══════╪══════╪══════╡ │ null ┆ null ┆ null │ └──────┴──────┴──────┘
Provide multiple constraints using *args syntax:
>>> df.remove( ... pl.col("ham").is_not_null(), ... pl.col("bar") >= 0, ... ) shape: (2, 3) ┌──────┬──────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞══════╪══════╪══════╡ │ null ┆ null ┆ null │ │ 4 ┆ null ┆ c │ └──────┴──────┴──────┘
Provide constraint(s) using **kwargs syntax:
>>> df.remove(foo=0, bar=0) shape: (4, 3) ┌──────┬──────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞══════╪══════╪══════╡ │ 2 ┆ 5 ┆ a │ │ 3 ┆ 6 ┆ b │ │ null ┆ null ┆ null │ │ 4 ┆ null ┆ c │ └──────┴──────┴──────┘
Remove rows by comparing two columns against each other:
>>> df.remove( ... pl.col("foo").ne_missing(pl.col("bar")), ... ) shape: (2, 3) ┌──────┬──────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞══════╪══════╪══════╡ │ null ┆ null ┆ null │ │ 0 ┆ 0 ┆ d │ └──────┴──────┴──────┘
- rename(mapping: Mapping[str, str] | Callable[[str], str], *, strict: bool = True) DataFrame[source]¶
Rename column names.
- Parameters:
- mapping
Key value pairs that map from old name to new name, or a function that takes the old name as input and returns the new name.
- strict
Validate that all column names exist in the current schema, and throw an exception if any do not. (Note that this parameter is a no-op when passing a function to mapping).
Examples
>>> df = pl.DataFrame( ... {"foo": [1, 2, 3], "bar": [6, 7, 8], "ham": ["a", "b", "c"]} ... ) >>> df.rename({"foo": "apple"}) shape: (3, 3) ┌───────┬─────┬─────┐ │ apple ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═══════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ │ 2 ┆ 7 ┆ b │ │ 3 ┆ 8 ┆ c │ └───────┴─────┴─────┘ >>> df.rename(lambda column_name: "c" + column_name[1:]) shape: (3, 3) ┌─────┬─────┬─────┐ │ coo ┆ car ┆ cam │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ │ 2 ┆ 7 ┆ b │ │ 3 ┆ 8 ┆ c │ └─────┴─────┴─────┘
- replace_column(index: int, column: Series) DataFrame[source]¶
Replace a column at an index location.
This operation is in place.
- Parameters:
- index
Column index.
- column
Series that will replace the column.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> s = pl.Series("apple", [10, 20, 30]) >>> df.replace_column(0, s) shape: (3, 3) ┌───────┬─────┬─────┐ │ apple ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═══════╪═════╪═════╡ │ 10 ┆ 6 ┆ a │ │ 20 ┆ 7 ┆ b │ │ 30 ┆ 8 ┆ c │ └───────┴─────┴─────┘
- reverse() DataFrame[source]¶
Reverse the DataFrame.
Examples
>>> df = pl.DataFrame( ... { ... "key": ["a", "b", "c"], ... "val": [1, 2, 3], ... } ... ) >>> df.reverse() shape: (3, 2) ┌─────┬─────┐ │ key ┆ val │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ c ┆ 3 │ │ b ┆ 2 │ │ a ┆ 1 │ └─────┴─────┘
- rolling(index_column: IntoExpr, *, period: str | timedelta, offset: str | timedelta | None = None, closed: ClosedInterval = 'right', group_by: IntoExpr | Iterable[IntoExpr] | None = None) RollingGroupBy[source]¶
Create rolling groups based on a temporal or integer column.
Unlike group_by_dynamic, the windows are determined by the individual values and are not of constant intervals. For constant intervals, use DataFrame.group_by_dynamic().
If you have a time series <t_0, t_1, …, t_n>, then by default the windows created will be
(t_0 - period, t_0]
(t_1 - period, t_1]
…
(t_n - period, t_n]
whereas if you pass a non-default offset, then the windows will be
(t_0 + offset, t_0 + offset + period]
(t_1 + offset, t_1 + offset + period]
…
(t_n + offset, t_n + offset + period]
The period and offset arguments are created either from a timedelta, or by using the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
1i (1 index count)
Or combine them: “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds
By “calendar day”, we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for “calendar week”, “calendar month”, “calendar quarter”, and “calendar year”.
Changed in version 0.20.14: The by parameter was renamed group_by.
- Parameters:
- index_column
Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order (or, if group_by is specified, then it must be sorted in ascending order within each group).
In case of a rolling operation on indices, dtype needs to be one of {UInt32, UInt64, Int32, Int64}. Note that the first three get temporarily cast to Int64, so if performance matters use an Int64 column.
- period
Length of the window - must be non-negative.
- offset
Offset of the window. Default is -period.
- closed{‘right’, ‘left’, ‘both’, ‘none’}
Define which sides of the temporal interval are closed (inclusive).
- group_by
Also group by this column/these columns
- Returns:
- RollingGroupBy
Object you can call .agg on to aggregate by groups, the result of which will be sorted by index_column (but note that if group_by columns are passed, it will only be sorted within each group).
Examples
>>> dates = [ ... "2020-01-01 13:45:48", ... "2020-01-01 16:42:13", ... "2020-01-01 16:45:09", ... "2020-01-02 18:12:48", ... "2020-01-03 19:45:32", ... "2020-01-08 23:16:43", ... ] >>> df = pl.DataFrame({"dt": dates, "a": [3, 7, 5, 9, 2, 1]}).with_columns( ... pl.col("dt").str.strptime(pl.Datetime).set_sorted() ... ) >>> out = df.rolling(index_column="dt", period="2d").agg( ... [ ... pl.sum("a").alias("sum_a"), ... pl.min("a").alias("min_a"), ... pl.max("a").alias("max_a"), ... ] ... ) >>> assert out["sum_a"].to_list() == [3, 10, 15, 24, 11, 1] >>> assert out["max_a"].to_list() == [3, 7, 7, 9, 9, 1] >>> assert out["min_a"].to_list() == [3, 3, 3, 3, 2, 1] >>> out shape: (6, 4) ┌─────────────────────┬───────┬───────┬───────┐ │ dt ┆ sum_a ┆ min_a ┆ max_a │ │ --- ┆ --- ┆ --- ┆ --- │ │ datetime[μs] ┆ i64 ┆ i64 ┆ i64 │ ╞═════════════════════╪═══════╪═══════╪═══════╡ │ 2020-01-01 13:45:48 ┆ 3 ┆ 3 ┆ 3 │ │ 2020-01-01 16:42:13 ┆ 10 ┆ 3 ┆ 7 │ │ 2020-01-01 16:45:09 ┆ 15 ┆ 3 ┆ 7 │ │ 2020-01-02 18:12:48 ┆ 24 ┆ 3 ┆ 9 │ │ 2020-01-03 19:45:32 ┆ 11 ┆ 2 ┆ 9 │ │ 2020-01-08 23:16:43 ┆ 1 ┆ 1 ┆ 1 │ └─────────────────────┴───────┴───────┴───────┘
If you use an index count in period or offset, then it’s based on the values in index_column:
>>> df = pl.DataFrame({"int": [0, 4, 5, 6, 8], "value": [1, 4, 2, 4, 1]}) >>> df.rolling("int", period="3i").agg(pl.col("int").alias("aggregated")) shape: (5, 2) ┌─────┬────────────┐ │ int ┆ aggregated │ │ --- ┆ --- │ │ i64 ┆ list[i64] │ ╞═════╪════════════╡ │ 0 ┆ [0] │ │ 4 ┆ [4] │ │ 5 ┆ [4, 5] │ │ 6 ┆ [4, 5, 6] │ │ 8 ┆ [6, 8] │ └─────┴────────────┘
If you want the index count to be based on row number, then you may want to combine rolling with with_row_index().
- row(index: int | None = None, *, by_predicate: Expr | None = None, named: bool = False) tuple[Any, ...] | dict[str, Any][source]¶
Get the values of a single row, either by index or by predicate.
- Parameters:
- index
Row index.
- by_predicate
Select the row according to a given expression/predicate.
- named
Return a dictionary instead of a tuple. The dictionary is a mapping of column name to row value. This is more expensive than returning a regular tuple, but allows for accessing values by column name.
- Returns:
- tuple (default) or dictionary of row values
Warning
You should NEVER use this method to iterate over a DataFrame; if you require row-iteration you should strongly prefer use of iter_rows() instead.
Notes
The index and by_predicate params are mutually exclusive. Additionally, to ensure clarity, the by_predicate parameter must be supplied by keyword.
When using by_predicate it is an error condition if anything other than one row is returned; more than one row raises TooManyRowsReturnedError, and zero rows will raise NoRowsReturnedError (both inherit from RowsError).
Examples
Specify an index to return the row at the given index as a tuple.
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.row(2) (3, 8, 'c')
Specify named=True to get a dictionary instead with a mapping of column names to row values.
>>> df.row(2, named=True) {'foo': 3, 'bar': 8, 'ham': 'c'}
Use by_predicate to return the row that matches the given predicate.
>>> df.row(by_predicate=(pl.col("ham") == "b")) (2, 7, 'b')
- rows(*, named: bool = False) list[tuple[Any, ...]] | list[dict[str, Any]][source]¶
Returns all data in the DataFrame as a list of rows of python-native values.
By default, each row is returned as a tuple of values given in the same order as the frame columns. Setting named=True will return rows of dictionaries instead.
- Parameters:
- named
Return dictionaries instead of tuples. The dictionaries are a mapping of column name to row value. This is more expensive than returning a regular tuple, but allows for accessing values by column name.
- Returns:
- list of row value tuples (default), or list of dictionaries (if named=True).
Warning
Row-iteration is not optimal as the underlying data is stored in columnar form; where possible, prefer export via one of the dedicated export/output methods. You should also consider using iter_rows instead, to avoid materialising all the data at once; there is little performance difference between the two, but peak memory can be reduced if processing rows in batches.
See also
iter_rows: Row iterator over frame data (does not materialise all rows).
rows_by_key: Materialises frame data as a key-indexed dictionary.
Notes
If you have ns-precision temporal values you should be aware that Python natively only supports up to μs-precision; ns-precision values will be truncated to microseconds on conversion to Python. If this matters to your use-case you should export to a different format (such as Arrow or NumPy).
Examples
>>> df = pl.DataFrame( ... { ... "x": ["a", "b", "b", "a"], ... "y": [1, 2, 3, 4], ... "z": [0, 3, 6, 9], ... } ... ) >>> df.rows() [('a', 1, 0), ('b', 2, 3), ('b', 3, 6), ('a', 4, 9)] >>> df.rows(named=True) [{'x': 'a', 'y': 1, 'z': 0}, {'x': 'b', 'y': 2, 'z': 3}, {'x': 'b', 'y': 3, 'z': 6}, {'x': 'a', 'y': 4, 'z': 9}]
- rows_by_key(key: ColumnNameOrSelector | Sequence[ColumnNameOrSelector], *, named: bool = False, include_key: bool = False, unique: bool = False) dict[Any, Any][source]¶
Returns all data as a dictionary of python-native values keyed by some column.
This method is like rows, but instead of returning rows in a flat list, rows are grouped by the values in the key column(s) and returned as a dictionary.
Note that this method should not be used in place of native operations, due to the high cost of materializing all frame data out into a dictionary; it should be used only when you need to move the values out into a Python data structure or other object that cannot operate directly with Polars/Arrow.
- Parameters:
- key
The column(s) to use as the key for the returned dictionary. If multiple columns are specified, the key will be a tuple of those values; otherwise it will be the value of the single key column.
- named
Return dictionary rows instead of tuples, mapping column name to row value.
- include_key
Include key values inline with the associated data (by default the key values are omitted as a memory/performance optimisation, as they can be reconstructed from the key).
- unique
Indicate that the key is unique; this will result in a 1:1 mapping from key to a single associated row. Note that if the key is not actually unique the last row with the given key will be returned.
Notes
If you have ns-precision temporal values you should be aware that Python natively only supports up to μs-precision; ns-precision values will be truncated to microseconds on conversion to Python. If this matters to your use-case you should export to a different format (such as Arrow or NumPy).
Examples
>>> df = pl.DataFrame( ... { ... "w": ["a", "b", "b", "a"], ... "x": ["q", "q", "q", "k"], ... "y": [1.0, 2.5, 3.0, 4.5], ... "z": [9, 8, 7, 6], ... } ... )
Group rows by the given key column(s):
>>> df.rows_by_key(key=["w"]) defaultdict(<class 'list'>, {'a': [('q', 1.0, 9), ('k', 4.5, 6)], 'b': [('q', 2.5, 8), ('q', 3.0, 7)]})
Return the same row groupings as dictionaries:
>>> df.rows_by_key(key=["w"], named=True) defaultdict(<class 'list'>, {'a': [{'x': 'q', 'y': 1.0, 'z': 9}, {'x': 'k', 'y': 4.5, 'z': 6}], 'b': [{'x': 'q', 'y': 2.5, 'z': 8}, {'x': 'q', 'y': 3.0, 'z': 7}]})
Return row groupings, assuming keys are unique:
>>> df.rows_by_key(key=["z"], unique=True) {9: ('a', 'q', 1.0), 8: ('b', 'q', 2.5), 7: ('b', 'q', 3.0), 6: ('a', 'k', 4.5)}
Return row groupings as dictionaries, assuming keys are unique:
>>> df.rows_by_key(key=["z"], named=True, unique=True) {9: {'w': 'a', 'x': 'q', 'y': 1.0}, 8: {'w': 'b', 'x': 'q', 'y': 2.5}, 7: {'w': 'b', 'x': 'q', 'y': 3.0}, 6: {'w': 'a', 'x': 'k', 'y': 4.5}}
Return dictionary rows grouped by a compound key, including key values:
>>> df.rows_by_key(key=["w", "x"], named=True, include_key=True) defaultdict(<class 'list'>, {('a', 'q'): [{'w': 'a', 'x': 'q', 'y': 1.0, 'z': 9}], ('b', 'q'): [{'w': 'b', 'x': 'q', 'y': 2.5, 'z': 8}, {'w': 'b', 'x': 'q', 'y': 3.0, 'z': 7}], ('a', 'k'): [{'w': 'a', 'x': 'k', 'y': 4.5, 'z': 6}]})
- sample(n: int | Series | None = None, *, fraction: float | Series | None = None, with_replacement: bool = False, shuffle: bool = False, seed: int | None = None) DataFrame[source]¶
Sample from this DataFrame.
- Parameters:
- n
Number of items to return. Cannot be used with fraction. Defaults to 1 if fraction is None.
- fraction
Fraction of items to return. Cannot be used with n.
- with_replacement
Allow values to be sampled more than once.
- shuffle
If set to True, the order of the sampled rows will be shuffled. If set to False (default), the order of the returned rows will be neither stable nor fully random.
- seed
Seed for the random number generator. If set to None (default), a random seed is generated for each sample operation.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.sample(n=2, seed=0) shape: (2, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 3 ┆ 8 ┆ c │ │ 2 ┆ 7 ┆ b │ └─────┴─────┴─────┘
- property schema: Schema¶
Get an ordered mapping of column names to their data type.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6.0, 7.0, 8.0], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.schema Schema({'foo': Int64, 'bar': Float64, 'ham': String})
- select(*exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr) DataFrame[source]¶
Select columns from this DataFrame.
- Parameters:
- *exprs
Column(s) to select, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.
- **named_exprs
Additional columns to select, specified as keyword arguments. The columns will be renamed to the keyword used.
Examples
Pass the name of a column to select that column.
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.select("foo") shape: (3, 1) ┌─────┐ │ foo │ │ --- │ │ i64 │ ╞═════╡ │ 1 │ │ 2 │ │ 3 │ └─────┘
Multiple columns can be selected by passing a list of column names.
>>> df.select(["foo", "bar"]) shape: (3, 2) ┌─────┬─────┐ │ foo ┆ bar │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 6 │ │ 2 ┆ 7 │ │ 3 ┆ 8 │ └─────┴─────┘
Multiple columns can also be selected using positional arguments instead of a list. Expressions are also accepted.
>>> df.select(pl.col("foo"), pl.col("bar") + 1) shape: (3, 2) ┌─────┬─────┐ │ foo ┆ bar │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 7 │ │ 2 ┆ 8 │ │ 3 ┆ 9 │ └─────┴─────┘
Use keyword arguments to easily name your expression inputs.
>>> df.select(threshold=pl.when(pl.col("foo") > 2).then(10).otherwise(0)) shape: (3, 1) ┌───────────┐ │ threshold │ │ --- │ │ i32 │ ╞═══════════╡ │ 0 │ │ 0 │ │ 10 │ └───────────┘
- select_seq(*exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr) DataFrame[source]¶
Select columns from this DataFrame.
This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap.
- Parameters:
- *exprs
Column(s) to select, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.
- **named_exprs
Additional columns to select, specified as keyword arguments. The columns will be renamed to the keyword used.
- serialize(file: IOBase | str | Path | None = None, *, format: SerializationFormat = 'binary') bytes | str | None[source]¶
Serialize this DataFrame to a file, or return it as bytes (binary format) or as a string (JSON format).
- Parameters:
- file
File path or writable file-like object to which the result will be written. If set to None (default), the output is returned instead: bytes for the binary format, a string for the JSON format.
- format
The format in which to serialize. Options:
“binary”: Serialize to binary format (bytes). This is the default.
“json”: Serialize to JSON format (string).
Notes
Serialization is not stable across Polars versions: a DataFrame serialized in one Polars version may not be deserializable in another Polars version.
Examples
Serialize the DataFrame into a binary representation.
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... } ... ) >>> bytes = df.serialize() >>> type(bytes) <class 'bytes'>
The bytes can later be deserialized back into a DataFrame.
>>> import io >>> pl.DataFrame.deserialize(io.BytesIO(bytes)) shape: (3, 2) ┌─────┬─────┐ │ foo ┆ bar │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 6 │ │ 2 ┆ 7 │ │ 3 ┆ 8 │ └─────┴─────┘
- set_sorted = None¶
- property shape: tuple[int, int]¶
Get the shape of the DataFrame.
Examples
>>> df = pl.DataFrame({"foo": [1, 2, 3, 4, 5]}) >>> df.shape (5, 1)
- shift(n: int = 1, *, fill_value: IntoExpr | None = None) DataFrame[source]¶
Shift values by the given number of indices.
- Parameters:
- n
Number of indices to shift forward. If a negative value is passed, values are shifted in the opposite direction instead.
- fill_value
Fill the resulting null values with this value. Accepts scalar expression input. Non-expression inputs are parsed as literals.
Notes
This method is similar to the LAG operation in SQL when the value for n is positive. With a negative value for n, it is similar to LEAD.
Examples
By default, values are shifted forward by one index.
>>> df = pl.DataFrame( ... { ... "a": [1, 2, 3, 4], ... "b": [5, 6, 7, 8], ... } ... ) >>> df.shift() shape: (4, 2) ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞══════╪══════╡ │ null ┆ null │ │ 1 ┆ 5 │ │ 2 ┆ 6 │ │ 3 ┆ 7 │ └──────┴──────┘
Pass a negative value to shift in the opposite direction instead.
>>> df.shift(-2) shape: (4, 2) ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞══════╪══════╡ │ 3 ┆ 7 │ │ 4 ┆ 8 │ │ null ┆ null │ │ null ┆ null │ └──────┴──────┘
Specify fill_value to fill the resulting null values.
>>> df.shift(-2, fill_value=100) shape: (4, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 3 ┆ 7 │ │ 4 ┆ 8 │ │ 100 ┆ 100 │ │ 100 ┆ 100 │ └─────┴─────┘
- shrink_to_fit = None¶
- slice(offset: int, length: int | None = None) DataFrame[source]¶
Get a slice of this DataFrame.
- Parameters:
- offset
Start index. Negative indexing is supported.
- length
Length of the slice. If set to None, all rows starting at the offset will be selected.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6.0, 7.0, 8.0], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.slice(1, 2) shape: (2, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str │ ╞═════╪═════╪═════╡ │ 2 ┆ 7.0 ┆ b │ │ 3 ┆ 8.0 ┆ c │ └─────┴─────┴─────┘
- sort(by: IntoExpr | Iterable[IntoExpr], *more_by: IntoExpr, descending: bool | Sequence[bool] = False, nulls_last: bool | Sequence[bool] = False, multithreaded: bool = True, maintain_order: bool = False) DataFrame[source]¶
Sort the dataframe by the given columns.
- Parameters:
- by
Column(s) to sort by. Accepts expression input, including selectors. Strings are parsed as column names.
- *more_by
Additional columns to sort by, specified as positional arguments.
- descending
Sort in descending order. When sorting by multiple columns, can be specified per column by passing a sequence of booleans.
- nulls_last
Place null values last; can specify a single boolean applying to all columns or a sequence of booleans for per-column control.
- multithreaded
Sort using multiple threads.
- maintain_order
Whether the order should be maintained if elements are equal.
Examples
Pass a single column name to sort by that column.
>>> df = pl.DataFrame( ... { ... "a": [1, 2, None], ... "b": [6.0, 5.0, 4.0], ... "c": ["a", "c", "b"], ... } ... ) >>> df.sort("a") shape: (3, 3) ┌──────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str │ ╞══════╪═════╪═════╡ │ null ┆ 4.0 ┆ b │ │ 1 ┆ 6.0 ┆ a │ │ 2 ┆ 5.0 ┆ c │ └──────┴─────┴─────┘
Sorting by expressions is also supported.
>>> df.sort(pl.col("a") + pl.col("b") * 2, nulls_last=True) shape: (3, 3) ┌──────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str │ ╞══════╪═════╪═════╡ │ 2 ┆ 5.0 ┆ c │ │ 1 ┆ 6.0 ┆ a │ │ null ┆ 4.0 ┆ b │ └──────┴─────┴─────┘
Sort by multiple columns by passing a list of columns.
>>> df.sort(["c", "a"], descending=True) shape: (3, 3) ┌──────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str │ ╞══════╪═════╪═════╡ │ 2 ┆ 5.0 ┆ c │ │ null ┆ 4.0 ┆ b │ │ 1 ┆ 6.0 ┆ a │ └──────┴─────┴─────┘
Or use positional arguments to sort by multiple columns in the same way.
>>> df.sort("c", "a", descending=[False, True]) shape: (3, 3) ┌──────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str │ ╞══════╪═════╪═════╡ │ 1 ┆ 6.0 ┆ a │ │ null ┆ 4.0 ┆ b │ │ 2 ┆ 5.0 ┆ c │ └──────┴─────┴─────┘
- sql(query: str, *, table_name: str = 'self') DataFrame[source]¶
Execute a SQL query against the DataFrame.
Added in version 0.20.24.
Warning
This functionality is considered unstable, although it is close to being considered stable. It may be changed at any point without it being considered a breaking change.
- Parameters:
- query
SQL query to execute.
- table_name
Optionally provide an explicit name for the table that represents the calling frame (defaults to “self”).
See also
SQLContext
Notes
The calling frame is automatically registered as a table in the SQL context under the name “self”. If you want access to the DataFrames and LazyFrames found in the current globals, use the top-level pl.sql.
More control over registration and execution behaviour is available by using the SQLContext object.
The SQL query executes in lazy mode before being collected and returned as a DataFrame.
Examples
>>> from datetime import date
>>> df1 = pl.DataFrame(
...     {
...         "a": [1, 2, 3],
...         "b": ["zz", "yy", "xx"],
...         "c": [date(1999, 12, 31), date(2010, 10, 10), date(2077, 8, 8)],
...     }
... )
Query the DataFrame using SQL:
>>> df1.sql("SELECT c, b FROM self WHERE a > 1") shape: (2, 2) ┌────────────┬─────┐ │ c ┆ b │ │ --- ┆ --- │ │ date ┆ str │ ╞════════════╪═════╡ │ 2010-10-10 ┆ yy │ │ 2077-08-08 ┆ xx │ └────────────┴─────┘
Apply transformations to a DataFrame using SQL, aliasing “self” to “frame”.
>>> df1.sql( ... query=''' ... SELECT ... a, ... (a % 2 == 0) AS a_is_even, ... CONCAT_WS(':', b, b) AS b_b, ... EXTRACT(year FROM c) AS year, ... 0::float4 AS "zero", ... FROM frame ... ''', ... table_name="frame", ... ) shape: (3, 5) ┌─────┬───────────┬───────┬──────┬──────┐ │ a ┆ a_is_even ┆ b_b ┆ year ┆ zero │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ bool ┆ str ┆ i32 ┆ f32 │ ╞═════╪═══════════╪═══════╪══════╪══════╡ │ 1 ┆ false ┆ zz:zz ┆ 1999 ┆ 0.0 │ │ 2 ┆ true ┆ yy:yy ┆ 2010 ┆ 0.0 │ │ 3 ┆ false ┆ xx:xx ┆ 2077 ┆ 0.0 │ └─────┴───────────┴───────┴──────┴──────┘
- std(ddof: int = 1) DataFrame[source]¶
Aggregate the columns of this DataFrame to their standard deviation value.
- Parameters:
- ddof
“Delta Degrees of Freedom”: the divisor used in the calculation is N - ddof, where N represents the number of elements. By default ddof is 1.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.std() shape: (1, 3) ┌─────┬─────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ str │ ╞═════╪═════╪══════╡ │ 1.0 ┆ 1.0 ┆ null │ └─────┴─────┴──────┘ >>> df.std(ddof=0) shape: (1, 3) ┌──────────┬──────────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ str │ ╞══════════╪══════════╪══════╡ │ 0.816497 ┆ 0.816497 ┆ null │ └──────────┴──────────┴──────┘
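The role of ddof is easiest to see by writing out the formula, with the divisor N - ddof as described above. The following is a minimal plain-Python sketch of that formula, not the Polars implementation; the function name std is chosen here only for illustration.

```python
import math


def std(values, ddof=1):
    """Standard deviation with divisor N - ddof (illustrative sketch)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)
    # ddof=1 gives the sample estimate, ddof=0 the population value.
    return math.sqrt(squared_deviations / (n - ddof))


sample_std = std([1, 2, 3])              # divisor is 3 - 1 = 2
population_std = std([1, 2, 3], ddof=0)  # divisor is 3 - 0 = 3
```

For the column foo above, the two calls reproduce the values 1.0 and 0.816497 shown in the examples.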
- property style: GT¶
Create a Great Table for styling.
Warning
This functionality is currently considered unstable. It may be changed at any point without it being considered a breaking change.
Polars does not implement styling logic itself, but instead defers to the Great Tables package. Please see the Great Tables reference for more information and documentation.
Examples
Import some styling helpers, and create example data:
>>> import polars.selectors as cs
>>> from great_tables import loc, style
>>> df = pl.DataFrame(
...     {
...         "site_id": [0, 1, 2],
...         "measure_a": [5, 4, 6],
...         "measure_b": [7, 3, 3],
...     }
... )
Emphasize the site_id as row names:
>>> df.style.tab_stub(rowname_col="site_id")
Fill the background for the highest measure_a value row:
>>> df.style.tab_style( ... style.fill("yellow"), ... loc.body(rows=pl.col("measure_a") == pl.col("measure_a").max()), ... )
Put a spanner (high-level label) over measure columns:
>>> df.style.tab_spanner( ... "Measures", cs.starts_with("measure") ... )
Format measure_b values to two decimal places:
>>> df.style.fmt_number("measure_b", decimals=2)
- sum() DataFrame[source]¶
Aggregate the columns of this DataFrame to their sum value.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.sum() shape: (1, 3) ┌─────┬─────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪══════╡ │ 6 ┆ 21 ┆ null │ └─────┴─────┴──────┘
- sum_horizontal(*, ignore_nulls: bool = True) Series[source]¶
Sum all values horizontally across columns.
- Parameters:
- ignore_nulls
Ignore null values (default). If set to False, any null value in the input will lead to a null output.
- Returns:
- Series
A Series named “sum”.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [4.0, 5.0, 6.0], ... } ... ) >>> df.sum_horizontal() shape: (3,) Series: 'sum' [f64] [ 5.0 7.0 9.0 ]
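The effect of ignore_nulls can be sketched in plain Python. The helper below is hypothetical and works on a dict of column lists rather than a DataFrame; it only mirrors the two behaviours described in the parameter documentation. How a row consisting solely of nulls is treated is not stated above, so the sketch simply uses Python's empty sum there.

```python
def sum_horizontal(columns, ignore_nulls=True):
    """Row-wise sum over a dict of equal-length column lists (sketch)."""
    sums = []
    for row in zip(*columns.values()):
        if not ignore_nulls and any(v is None for v in row):
            # Any null in the row makes the whole result null.
            sums.append(None)
        else:
            # Nulls are skipped; only the remaining values are added.
            sums.append(sum(v for v in row if v is not None))
    return sums


data = {"foo": [1, 2, None], "bar": [4.0, 5.0, 6.0]}
ignoring = sum_horizontal(data)
propagating = sum_horizontal(data, ignore_nulls=False)
```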
- tail(n: int = 5) DataFrame[source]¶
Get the last n rows.
- Parameters:
- n
Number of rows to return. If a negative value is passed, return all rows except the first abs(n).
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3, 4, 5], ... "bar": [6, 7, 8, 9, 10], ... "ham": ["a", "b", "c", "d", "e"], ... } ... ) >>> df.tail(3) shape: (3, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 3 ┆ 8 ┆ c │ │ 4 ┆ 9 ┆ d │ │ 5 ┆ 10 ┆ e │ └─────┴─────┴─────┘
Pass a negative value to get all rows except the first abs(n).
>>> df.tail(-3) shape: (2, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 4 ┆ 9 ┆ d │ │ 5 ┆ 10 ┆ e │ └─────┴─────┴─────┘
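The meaning of a negative n maps directly onto list slicing. This is a plain-Python sketch on a list of rows, with tail defined here only for illustration.

```python
def tail(rows, n=5):
    """Return the last n rows, or all but the first abs(n) when n is negative."""
    if n >= 0:
        # Clamp n so that asking for more rows than exist returns everything.
        return rows[len(rows) - min(n, len(rows)):]
    return rows[abs(n):]


rows = [1, 2, 3, 4, 5]
last_three = tail(rows, 3)
all_but_first_three = tail(rows, -3)
```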
- to_arrow(*, compat_level: CompatLevel | None = None) pa.Table[source]¶
Collect the underlying arrow arrays in an Arrow Table.
This operation is mostly zero copy.
- Data types that do copy:
CategoricalType
Changed in version 1.1: The future parameter was renamed compat_level.
- Parameters:
- compat_level
Use a specific compatibility level when exporting Polars’ internal data structures.
Examples
>>> df = pl.DataFrame( ... {"foo": [1, 2, 3, 4, 5, 6], "bar": ["a", "b", "c", "d", "e", "f"]} ... ) >>> df.to_arrow() pyarrow.Table foo: int64 bar: large_string ---- foo: [[1,2,3,4,5,6]] bar: [["a","b","c","d","e","f"]]
- to_dict(*, as_series: bool = True) dict[str, Series] | dict[str, list[Any]][source]¶
Convert DataFrame to a dictionary mapping column name to values.
- Parameters:
- as_series
True -> Values are Series.
False -> Values are List[Any].
Examples
>>> df = pl.DataFrame( ... { ... "A": [1, 2, 3, 4, 5], ... "fruits": ["banana", "banana", "apple", "apple", "banana"], ... "B": [5, 4, 3, 2, 1], ... "cars": ["beetle", "audi", "beetle", "beetle", "beetle"], ... "optional": [28, 300, None, 2, -30], ... } ... ) >>> df shape: (5, 5) ┌─────┬────────┬─────┬────────┬──────────┐ │ A ┆ fruits ┆ B ┆ cars ┆ optional │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ i64 ┆ str ┆ i64 │ ╞═════╪════════╪═════╪════════╪══════════╡ │ 1 ┆ banana ┆ 5 ┆ beetle ┆ 28 │ │ 2 ┆ banana ┆ 4 ┆ audi ┆ 300 │ │ 3 ┆ apple ┆ 3 ┆ beetle ┆ null │ │ 4 ┆ apple ┆ 2 ┆ beetle ┆ 2 │ │ 5 ┆ banana ┆ 1 ┆ beetle ┆ -30 │ └─────┴────────┴─────┴────────┴──────────┘ >>> df.to_dict(as_series=False) {'A': [1, 2, 3, 4, 5], 'fruits': ['banana', 'banana', 'apple', 'apple', 'banana'], 'B': [5, 4, 3, 2, 1], 'cars': ['beetle', 'audi', 'beetle', 'beetle', 'beetle'], 'optional': [28, 300, None, 2, -30]} >>> df.to_dict(as_series=True) {'A': shape: (5,) Series: 'A' [i64] [ 1 2 3 4 5 ], 'fruits': shape: (5,) Series: 'fruits' [str] [ "banana" "banana" "apple" "apple" "banana" ], 'B': shape: (5,) Series: 'B' [i64] [ 5 4 3 2 1 ], 'cars': shape: (5,) Series: 'cars' [str] [ "beetle" "audi" "beetle" "beetle" "beetle" ], 'optional': shape: (5,) Series: 'optional' [i64] [ 28 300 null 2 -30 ]}
- to_dicts() list[dict[str, Any]][source]¶
Convert every row to a dictionary of Python-native values.
Notes
If you have ns-precision temporal values you should be aware that Python natively only supports up to μs-precision; ns-precision values will be truncated to microseconds on conversion to Python. If this matters to your use-case you should export to a different format (such as Arrow or NumPy).
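The precision loss can be reproduced with the standard library alone, because datetime stores at most microseconds. The sketch below assumes a non-negative nanosecond count since the Unix epoch; ns_to_datetime is a hypothetical helper and not part of Polars.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ns_to_datetime(ns_since_epoch):
    """Convert nanoseconds since the epoch to a datetime (sketch)."""
    # datetime has microsecond resolution, so the last three digits are dropped.
    microseconds = ns_since_epoch // 1_000
    return EPOCH + timedelta(microseconds=microseconds)


ns_value = 1_700_000_000_123_456_789
converted = ns_to_datetime(ns_value)
lost_ns = ns_value % 1_000  # the part that cannot be represented
```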
Examples
>>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
>>> df.to_dicts()
[{'foo': 1, 'bar': 4}, {'foo': 2, 'bar': 5}, {'foo': 3, 'bar': 6}]
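Conceptually, this is a change from column-oriented to row-oriented data. A minimal plain-Python sketch of that reshaping, using a dict of column lists as a stand-in for the frame (the helper name is reused here purely for illustration):

```python
def to_dicts(columns):
    """Turn a dict of column lists into a list of row dicts (sketch)."""
    names = list(columns)
    # zip(*...) walks all columns in lockstep, yielding one tuple per row.
    return [dict(zip(names, row)) for row in zip(*columns.values())]


rows = to_dicts({"foo": [1, 2, 3], "bar": [4, 5, 6]})
```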
- to_dummies(columns: ColumnNameOrSelector | Sequence[ColumnNameOrSelector] | None = None, *, separator: str = '_', drop_first: bool = False, drop_nulls: bool = False) DataFrame[source]¶
Convert categorical variables into dummy/indicator variables.
- Parameters:
- columns
Column name(s) or selector(s) that should be converted to dummy variables. If set to None (default), convert all columns.
- separator
Separator/delimiter used when generating column names.
- drop_first
Remove the first category from the variables being encoded.
- drop_nulls
If set to True, no null column is generated for None values in the series.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2], ... "bar": [3, 4], ... "ham": ["a", "b"], ... } ... ) >>> df.to_dummies() shape: (2, 6) ┌───────┬───────┬───────┬───────┬───────┬───────┐ │ foo_1 ┆ foo_2 ┆ bar_3 ┆ bar_4 ┆ ham_a ┆ ham_b │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ u8 ┆ u8 ┆ u8 ┆ u8 ┆ u8 ┆ u8 │ ╞═══════╪═══════╪═══════╪═══════╪═══════╪═══════╡ │ 1 ┆ 0 ┆ 1 ┆ 0 ┆ 1 ┆ 0 │ │ 0 ┆ 1 ┆ 0 ┆ 1 ┆ 0 ┆ 1 │ └───────┴───────┴───────┴───────┴───────┴───────┘
>>> df.to_dummies(drop_first=True) shape: (2, 3) ┌───────┬───────┬───────┐ │ foo_2 ┆ bar_4 ┆ ham_b │ │ --- ┆ --- ┆ --- │ │ u8 ┆ u8 ┆ u8 │ ╞═══════╪═══════╪═══════╡ │ 0 ┆ 0 ┆ 0 │ │ 1 ┆ 1 ┆ 1 │ └───────┴───────┴───────┘
>>> import polars.selectors as cs >>> df.to_dummies(cs.integer(), separator=":") shape: (2, 5) ┌───────┬───────┬───────┬───────┬─────┐ │ foo:1 ┆ foo:2 ┆ bar:3 ┆ bar:4 ┆ ham │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ u8 ┆ u8 ┆ u8 ┆ u8 ┆ str │ ╞═══════╪═══════╪═══════╪═══════╪═════╡ │ 1 ┆ 0 ┆ 1 ┆ 0 ┆ a │ │ 0 ┆ 1 ┆ 0 ┆ 1 ┆ b │ └───────┴───────┴───────┴───────┴─────┘
>>> df.to_dummies(cs.integer(), drop_first=True, separator=":") shape: (2, 3) ┌───────┬───────┬─────┐ │ foo:2 ┆ bar:4 ┆ ham │ │ --- ┆ --- ┆ --- │ │ u8 ┆ u8 ┆ str │ ╞═══════╪═══════╪═════╡ │ 0 ┆ 0 ┆ a │ │ 1 ┆ 1 ┆ b │ └───────┴───────┴─────┘
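How separator and drop_first shape the output can be sketched for a single column in plain Python. The function below is hypothetical; it orders categories by first appearance, which is an assumption made for the sketch and matches the small examples above.

```python
def to_dummies(name, values, separator="_", drop_first=False):
    """One-hot encode one column given as a list (illustrative sketch)."""
    # Unique categories in order of first appearance (assumed ordering).
    categories = list(dict.fromkeys(values))
    if drop_first:
        categories = categories[1:]
    return {
        f"{name}{separator}{category}": [int(v == category) for v in values]
        for category in categories
    }


full = to_dummies("foo", [1, 2])
reduced = to_dummies("ham", ["a", "b"], separator=":", drop_first=True)
```

Dropping the first category removes a column that is fully determined by the remaining ones.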
- to_init_repr(n: int = 1000) str[source]¶
Convert DataFrame to instantiable string representation.
- Parameters:
- n
Only use first n rows.
See also
polars.Series.to_init_repr
polars.from_repr
Examples
>>> df = pl.DataFrame( ... [ ... pl.Series("foo", [1, 2, 3], dtype=pl.UInt8), ... pl.Series("bar", [6.0, 7.0, 8.0], dtype=pl.Float32), ... pl.Series("ham", ["a", "b", "c"], dtype=pl.String), ... ] ... ) >>> print(df.to_init_repr()) pl.DataFrame( [ pl.Series('foo', [1, 2, 3], dtype=pl.UInt8), pl.Series('bar', [6.0, 7.0, 8.0], dtype=pl.Float32), pl.Series('ham', ['a', 'b', 'c'], dtype=pl.String), ] )
>>> df_from_str_repr = eval(df.to_init_repr()) >>> df_from_str_repr shape: (3, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ u8 ┆ f32 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6.0 ┆ a │ │ 2 ┆ 7.0 ┆ b │ │ 3 ┆ 8.0 ┆ c │ └─────┴─────┴─────┘
- to_jax(return_type: JaxExportType = 'array', *, device: jax.Device | str | None = None, label: str | Expr | Sequence[str | Expr] | None = None, features: str | Expr | Sequence[str | Expr] | None = None, dtype: PolarsDataType | None = None, order: IndexOrder = 'fortran') jax.Array | dict[str, jax.Array][source]¶
Convert DataFrame to a Jax Array, or dict of Jax Arrays.
Added in version 0.20.27.
Warning
This functionality is currently considered unstable. It may be changed at any point without it being considered a breaking change.
- Parameters:
- return_type{“array”, “dict”}
Set return type; a Jax Array, or dict of Jax Arrays.
- device
Specify the jax Device on which the array will be created; can provide a string (such as “cpu”, “gpu”, or “tpu”) in which case the device is retrieved as jax.devices(string)[0]. For more specific control you can supply the instantiated Device directly. If None, arrays are created on the default device.
- label
One or more column names, expressions, or selectors that label the feature data; results in a {“label”: …, “features”: …} dict being returned when return_type is “dict” instead of a {“col”: array, } dict.
- features
One or more column names, expressions, or selectors that contain the feature data; if omitted, all columns that are not designated as part of the label are used. Only applies when return_type is “dict”.
- dtype
Unify the dtype of all returned arrays; this casts any column that is not already of the required dtype before converting to Array. Note that export will be single-precision (32bit) unless the Jax config/environment directs otherwise (eg: “jax_enable_x64” was set True in the config object at startup, or “JAX_ENABLE_X64” is set to “1” in the environment).
- order{“c”, “fortran”}
The index order of the returned Jax array, either C-like (row-major) or Fortran-like (column-major).
Examples
>>> df = pl.DataFrame(
...     {
...         "lbl": [0, 1, 2, 3],
...         "feat1": [1, 0, 0, 1],
...         "feat2": [1.5, -0.5, 0.0, -2.25],
...     }
... )
Standard return type (2D Array), on the standard device:
>>> df.to_jax() Array([[ 0. , 1. , 1.5 ], [ 1. , 0. , -0.5 ], [ 2. , 0. , 0. ], [ 3. , 1. , -2.25]], dtype=float32)
Create the Array on the default GPU device:
>>> a = df.to_jax(device="gpu") >>> a.device() GpuDevice(id=0, process_index=0)
Create the Array on a specific GPU device:
>>> gpu_device = jax.devices("gpu")[1] >>> a = df.to_jax(device=gpu_device) >>> a.device() GpuDevice(id=1, process_index=0)
As a dictionary of individual Arrays:
>>> df.to_jax("dict") {'lbl': Array([0, 1, 2, 3], dtype=int32), 'feat1': Array([1, 0, 0, 1], dtype=int32), 'feat2': Array([ 1.5 , -0.5 , 0. , -2.25], dtype=float32)}
As a “label” and “features” dictionary; note that as “features” is not declared, it defaults to all the columns that are not in “label”:
>>> df.to_jax("dict", label="lbl") {'label': Array([[0], [1], [2], [3]], dtype=int32), 'features': Array([[ 1. , 1.5 ], [ 0. , -0.5 ], [ 0. , 0. ], [ 1. , -2.25]], dtype=float32)}
As a “label” and “features” dictionary where each is designated using a col or selector expression (which can also be used to cast the data if the label and features are better-represented with different dtypes):
>>> import polars.selectors as cs >>> df.to_jax( ... return_type="dict", ... features=cs.float(), ... label=pl.col("lbl").cast(pl.UInt8), ... ) {'label': Array([[0], [1], [2], [3]], dtype=uint8), 'features': Array([[ 1.5 ], [-0.5 ], [ 0. ], [-2.25]], dtype=float32)}
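The shape of the “label”/“features” dictionary can be sketched without Jax. The helper below is hypothetical, works on a dict of column lists, supports a single label column name only, and performs no dtype unification; it only shows how omitted features default to every non-label column.

```python
def split_label_features(columns, label, features=None):
    """Build a {"label": ..., "features": ...} dict of row-major lists (sketch)."""
    if features is None:
        # Default: every column that is not the label is a feature.
        features = [name for name in columns if name != label]
    n_rows = len(columns[label])
    return {
        "label": [[columns[label][i]] for i in range(n_rows)],
        "features": [[columns[name][i] for name in features] for i in range(n_rows)],
    }


data = {
    "lbl": [0, 1, 2, 3],
    "feat1": [1, 0, 0, 1],
    "feat2": [1.5, -0.5, 0.0, -2.25],
}
split = split_label_features(data, label="lbl")
```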
- to_numpy(*, order: IndexOrder = 'fortran', writable: bool = False, allow_copy: bool = True, structured: bool = False, use_pyarrow: bool | None = None) np.ndarray[Any, Any][source]¶
Convert this DataFrame to a NumPy ndarray.
This operation copies data only when necessary. The conversion is zero copy when all of the following hold:
The DataFrame is fully contiguous in memory, with all Series back-to-back and all Series consisting of a single chunk.
The data type is an integer or float.
The DataFrame contains no null values.
The order parameter is set to fortran (default).
The writable parameter is set to False (default).
- Parameters:
- order
The index order of the returned NumPy array, either C-like or Fortran-like. In general, using the Fortran-like index order is faster. However, the C-like order might be more appropriate to use for downstream applications to prevent cloning data, e.g. when reshaping into a one-dimensional array.
- writable
Ensure the resulting array is writable. This will force a copy of the data if the array was created without copy, as the underlying Arrow data is immutable.
- allow_copy
Allow memory to be copied to perform the conversion. If set to False, causes conversions that are not zero-copy to fail.
- structured
Return a structured array with a data type that corresponds to the DataFrame schema. If set to False (default), a 2D ndarray is returned instead.
- use_pyarrow
Use a PyArrow function for the conversion to NumPy if necessary.
Deprecated since version 0.20.28: Polars now uses its native engine by default for conversion to NumPy.
Examples
Numeric data without nulls can be converted without copying data in some cases. The resulting array will not be writable.
>>> df = pl.DataFrame({"a": [1, 2, 3]})
>>> arr = df.to_numpy()
>>> arr
array([[1],
       [2],
       [3]])
>>> arr.flags.writeable
False
Set writable=True to force data copy to make the array writable.
>>> df.to_numpy(writable=True).flags.writeable
True
If the DataFrame contains different numeric data types, the resulting data type will be the supertype. This requires data to be copied. Integer types with nulls are cast to a float type with nan representing a null value.
>>> df = pl.DataFrame({"a": [1, 2, None], "b": [4.0, 5.0, 6.0]})
>>> df.to_numpy()
array([[ 1.,  4.],
       [ 2.,  5.],
       [nan,  6.]])
Set allow_copy=False to raise an error if data would be copied.
>>> df.to_numpy(allow_copy=False)
Traceback (most recent call last):
...
RuntimeError: copy not allowed: cannot convert to a NumPy array without copying data
Polars defaults to F-contiguous order. Use order=”c” to force the resulting array to be C-contiguous.
>>> df.to_numpy(order="c").flags.c_contiguous
True
DataFrames with mixed types will result in an array with an object dtype.
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6.5, 7.0, 8.5], ... "ham": ["a", "b", "c"], ... }, ... schema_overrides={"foo": pl.UInt8, "bar": pl.Float32}, ... ) >>> df.to_numpy() array([[1, 6.5, 'a'], [2, 7.0, 'b'], [3, 8.5, 'c']], dtype=object)
Set structured=True to convert to a structured array, which can better preserve individual column data such as name and data type.
>>> df.to_numpy(structured=True) array([(1, 6.5, 'a'), (2, 7. , 'b'), (3, 8.5, 'c')], dtype=[('foo', 'u1'), ('bar', '<f4'), ('ham', '<U1')])
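The difference between the two index orders can be sketched as the order in which a two-dimensional table is laid out in flat memory. This is a plain-Python illustration with a hypothetical flatten helper, not the conversion code itself.

```python
def flatten(columns, order="fortran"):
    """Lay out a list of columns as one flat list in the given index order."""
    if order == "fortran":
        # Column-major: each column's values stay adjacent, as in columnar storage.
        return [v for column in columns for v in column]
    # Row-major ("c"): values of one row are adjacent, so columns interleave.
    return [v for row in zip(*columns) for v in row]


columns = [[1, 2, 3], [4, 5, 6]]
fortran_layout = flatten(columns)
c_layout = flatten(columns, order="c")
```

Because a DataFrame stores each column contiguously, the Fortran-like layout keeps the existing column buffers intact, whereas the C-like layout has to interleave them.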
- to_pandas(*, use_pyarrow_extension_array: bool = False, **kwargs: Any) pd.DataFrame[source]¶
Convert this DataFrame to a pandas DataFrame.
This operation copies data if use_pyarrow_extension_array is not enabled.
- Parameters:
- use_pyarrow_extension_array
Use PyArrow-backed extension arrays instead of NumPy arrays for the columns of the pandas DataFrame. This allows zero copy operations and preservation of null values. Subsequent operations on the resulting pandas DataFrame may trigger conversion to NumPy if those operations are not supported by PyArrow compute functions.
- **kwargs
Additional keyword arguments to be passed to
pyarrow.Table.to_pandas().
- Returns:
pandas.DataFrame
Notes
This operation requires that both pandas and pyarrow are installed.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6.0, 7.0, 8.0], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.to_pandas() foo bar ham 0 1 6.0 a 1 2 7.0 b 2 3 8.0 c
Null values in numeric columns are converted to NaN.
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, None], ... "bar": [6.0, None, 8.0], ... "ham": [None, "b", "c"], ... } ... ) >>> df.to_pandas() foo bar ham 0 1.0 6.0 None 1 2.0 NaN b 2 NaN 8.0 c
Pass use_pyarrow_extension_array=True to get a pandas DataFrame with columns backed by PyArrow extension arrays. This will preserve null values.
>>> df.to_pandas(use_pyarrow_extension_array=True) foo bar ham 0 1 6.0 <NA> 1 2 <NA> b 2 <NA> 8.0 c >>> _.dtypes foo int64[pyarrow] bar double[pyarrow] ham large_string[pyarrow] dtype: object
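A NumPy integer array has no representation for a missing value, which is why the integer column foo appears as floats with NaN in the NumPy-backed result above. A plain-Python sketch of that conversion, using a hypothetical helper:

```python
import math


def to_float_column(values):
    """Replace None with NaN, which forces every value to be a float (sketch)."""
    return [math.nan if v is None else float(v) for v in values]


converted = to_float_column([1, 2, None])
```

With PyArrow-backed extension arrays the missing value is tracked separately from the data, so the integer dtype can be kept.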
- to_series(index: int = 0) Series[source]¶
Select column as Series at index location.
- Parameters:
- index
Location of selection.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.to_series(1) shape: (3,) Series: 'bar' [i64] [ 6 7 8 ]
- to_struct(name: str = '') Series[source]¶
Convert a DataFrame to a Series of type Struct.
- Parameters:
- name
Name for the struct Series.
Examples
>>> df = pl.DataFrame( ... { ... "a": [1, 2, 3, 4, 5], ... "b": ["one", "two", "three", "four", "five"], ... } ... ) >>> df.to_struct("nums") shape: (5,) Series: 'nums' [struct[2]] [ {1,"one"} {2,"two"} {3,"three"} {4,"four"} {5,"five"} ]
- to_torch(return_type: TorchExportType = 'tensor', *, label: str | Expr | Sequence[str | Expr] | None = None, features: str | Expr | Sequence[str | Expr] | None = None, dtype: PolarsDataType | None = None) torch.Tensor | dict[str, torch.Tensor] | PolarsDataset[source]¶
Convert DataFrame to a PyTorch Tensor, Dataset, or dict of Tensors.
Added in version 0.20.23.
Warning
This functionality is currently considered unstable. It may be changed at any point without it being considered a breaking change.
- Parameters:
- return_type{“tensor”, “dataset”, “dict”}
Set return type; a PyTorch Tensor, PolarsDataset (a frame-specialized TensorDataset), or dict of Tensors.
- label
One or more column names, expressions, or selectors that label the feature data; when return_type is “dataset”, the PolarsDataset will return (features, label) tensor tuples for each row. Otherwise, it returns (features,) tensor tuples where the feature contains all the row data.
- features
One or more column names, expressions, or selectors that contain the feature data; if omitted, all columns that are not designated as part of the label are used.
- dtype
Unify the dtype of all returned tensors; this casts any column that is not of the required dtype before converting to Tensor. This includes the label column unless the label is an expression (such as pl.col(“label_column”).cast(pl.Int16)).
Examples
>>> df = pl.DataFrame(
...     {
...         "lbl": [0, 1, 2, 3],
...         "feat1": [1, 0, 0, 1],
...         "feat2": [1.5, -0.5, 0.0, -2.25],
...     }
... )
Standard return type (Tensor), with f32 supertype:
>>> df.to_torch(dtype=pl.Float32) tensor([[ 0.0000, 1.0000, 1.5000], [ 1.0000, 0.0000, -0.5000], [ 2.0000, 0.0000, 0.0000], [ 3.0000, 1.0000, -2.2500]])
As a dictionary of individual Tensors:
>>> df.to_torch("dict") {'lbl': tensor([0, 1, 2, 3]), 'feat1': tensor([1, 0, 0, 1]), 'feat2': tensor([ 1.5000, -0.5000, 0.0000, -2.2500], dtype=torch.float64)}
As a “label” and “features” dictionary; note that as “features” is not declared, it defaults to all the columns that are not in “label”:
>>> df.to_torch("dict", label="lbl", dtype=pl.Float32) {'label': tensor([[0.], [1.], [2.], [3.]]), 'features': tensor([[ 1.0000, 1.5000], [ 0.0000, -0.5000], [ 0.0000, 0.0000], [ 1.0000, -2.2500]])}
As a PolarsDataset, with f64 supertype:
>>> ds = df.to_torch("dataset", dtype=pl.Float64) >>> ds[3] (tensor([ 3.0000, 1.0000, -2.2500], dtype=torch.float64),) >>> ds[:2] (tensor([[ 0.0000, 1.0000, 1.5000], [ 1.0000, 0.0000, -0.5000]], dtype=torch.float64),) >>> ds[[0, 3]] (tensor([[ 0.0000, 1.0000, 1.5000], [ 3.0000, 1.0000, -2.2500]], dtype=torch.float64),)
As a convenience the PolarsDataset can opt in to half-precision data for experimentation (usually this would be set on the model/pipeline):
>>> list(ds.half()) [(tensor([0.0000, 1.0000, 1.5000], dtype=torch.float16),), (tensor([ 1.0000, 0.0000, -0.5000], dtype=torch.float16),), (tensor([2., 0., 0.], dtype=torch.float16),), (tensor([ 3.0000, 1.0000, -2.2500], dtype=torch.float16),)]
Pass PolarsDataset to a DataLoader, designating the label:
>>> from torch.utils.data import DataLoader >>> ds = df.to_torch("dataset", label="lbl") >>> dl = DataLoader(ds, batch_size=2) >>> batches = list(dl) >>> batches[0] [tensor([[ 1.0000, 1.5000], [ 0.0000, -0.5000]], dtype=torch.float64), tensor([0, 1])]
Note that labels can be given as expressions, allowing them to have a dtype independent of the feature columns (multi-column labels are supported).
>>> ds = df.to_torch( ... return_type="dataset", ... dtype=pl.Float32, ... label=pl.col("lbl").cast(pl.Int16), ... ) >>> ds[:2] (tensor([[ 1.0000, 1.5000], [ 0.0000, -0.5000]]), tensor([0, 1], dtype=torch.int16))
Easily integrate with (for example) scikit-learn and other datasets:
>>> from sklearn.datasets import fetch_california_housing >>> housing = fetch_california_housing() >>> df = pl.DataFrame( ... data=housing.data, ... schema=housing.feature_names, ... ).with_columns( ... Target=housing.target, ... ) >>> train = df.to_torch("dataset", label="Target") >>> loader = DataLoader( ... train, ... shuffle=True, ... batch_size=64, ... )
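The per-row tuples returned for return_type “dataset” can be sketched without PyTorch. RowDataset below is a hypothetical stand-in, not PolarsDataset; it uses plain lists instead of tensors and only shows how the presence of a label changes the shape of each item.

```python
class RowDataset:
    """Minimal stand-in for a frame-backed dataset (hypothetical sketch)."""

    def __init__(self, columns, label=None):
        self.columns = columns
        self.label = label
        # Without explicit features, every non-label column is a feature.
        self.feature_names = [name for name in columns if name != label]
        self.n_rows = len(next(iter(columns.values())))

    def __len__(self):
        return self.n_rows

    def __getitem__(self, i):
        features = [self.columns[name][i] for name in self.feature_names]
        if self.label is None:
            # No label: a one-element tuple holding all the row data.
            return (features,)
        return (features, self.columns[self.label][i])


data = {
    "lbl": [0, 1, 2, 3],
    "feat1": [1, 0, 0, 1],
    "feat2": [1.5, -0.5, 0.0, -2.25],
}
unlabelled = RowDataset(data)
labelled = RowDataset(data, label="lbl")
```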
- top_k(k: int, *, by: IntoExpr | Iterable[IntoExpr], reverse: bool | Sequence[bool] = False) DataFrame[source]¶
Return the k largest rows.
Non-null elements are always preferred over null elements, regardless of the value of reverse. The output is not guaranteed to be in any particular order; call sort() after this function if you wish the output to be sorted.
Changed in version 1.0.0: The descending parameter was renamed reverse.
- Parameters:
- k
Number of rows to return.
- by
Column(s) used to determine the top rows. Accepts expression input. Strings are parsed as column names.
- reverse
Consider the k smallest elements of the by column(s) (instead of the k largest). This can be specified per column by passing a sequence of booleans.
Examples
>>> df = pl.DataFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [2, 1, 1, 3, 2, 1],
...     }
... )
Get the rows which contain the 4 largest values in column b.
>>> df.top_k(4, by="b") shape: (4, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ b ┆ 3 │ │ a ┆ 2 │ │ b ┆ 2 │ │ b ┆ 1 │ └─────┴─────┘
Get the rows which contain the 4 largest values when sorting on column b and a.
>>> df.top_k(4, by=["b", "a"]) shape: (4, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ b ┆ 3 │ │ b ┆ 2 │ │ a ┆ 2 │ │ c ┆ 1 │ └─────┴─────┘
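The rule that non-null elements are preferred regardless of reverse can be sketched for a single list of values using the standard heapq module. This is an illustration of the described behaviour on one column, not the Polars algorithm; top_k here is a hypothetical helper, and unlike the real method it returns its result in sorted order.

```python
import heapq


def top_k(values, k, reverse=False):
    """Pick the k largest (or smallest, if reverse) values, nulls last (sketch)."""
    non_null = [v for v in values if v is not None]
    nulls = [v for v in values if v is None]
    pick = heapq.nsmallest if reverse else heapq.nlargest
    best = pick(k, non_null)
    # Nulls are only used to fill slots that non-null values cannot fill.
    return best + nulls[: max(0, k - len(best))]


largest = top_k([2, 1, 1, 3, 2, 1], 4)
smallest_with_null = top_k([2, None, 1], 3, reverse=True)
```

A heap-based selection avoids sorting the full input when k is small relative to the number of rows.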
- transpose(*, include_header: bool = False, header_name: str = 'column', column_names: str | Iterable[str] | None = None) DataFrame[source]¶
Transpose a DataFrame over the diagonal.
- Parameters:
- include_header
If set, the column names will be added as first column.
- header_name
If include_header is set, this determines the name of the column that will be inserted.
- column_names
Optional iterable yielding strings or a string naming an existing column. These will name the value (non-header) columns in the transposed data.
- Returns:
- DataFrame
Notes
This is a very expensive operation. Consider whether the same result can be achieved without transposing.
Examples
>>> df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}) >>> df.transpose(include_header=True) shape: (2, 4) ┌────────┬──────────┬──────────┬──────────┐ │ column ┆ column_0 ┆ column_1 ┆ column_2 │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ i64 │ ╞════════╪══════════╪══════════╪══════════╡ │ a ┆ 1 ┆ 2 ┆ 3 │ │ b ┆ 4 ┆ 5 ┆ 6 │ └────────┴──────────┴──────────┴──────────┘
Replace the auto-generated column names with a list:
>>> df.transpose(include_header=False, column_names=["x", "y", "z"]) shape: (2, 3) ┌─────┬─────┬─────┐ │ x ┆ y ┆ z │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ 1 ┆ 2 ┆ 3 │ │ 4 ┆ 5 ┆ 6 │ └─────┴─────┴─────┘
Include the header as a separate column:
>>> df.transpose( ... include_header=True, header_name="foo", column_names=["x", "y", "z"] ... ) shape: (2, 4) ┌─────┬─────┬─────┬─────┐ │ foo ┆ x ┆ y ┆ z │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╪═════╡ │ a ┆ 1 ┆ 2 ┆ 3 │ │ b ┆ 4 ┆ 5 ┆ 6 │ └─────┴─────┴─────┴─────┘
Replace the auto-generated column names with names from a generator function:
>>> def name_generator(): ... base_name = "my_column_" ... count = 0 ... while True: ... yield f"{base_name}{count}" ... count += 1 >>> df.transpose(include_header=False, column_names=name_generator()) shape: (2, 3) ┌─────────────┬─────────────┬─────────────┐ │ my_column_0 ┆ my_column_1 ┆ my_column_2 │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 │ ╞═════════════╪═════════════╪═════════════╡ │ 1 ┆ 2 ┆ 3 │ │ 4 ┆ 5 ┆ 6 │ └─────────────┴─────────────┴─────────────┘
Use an existing column as the new column names:
>>> df = pl.DataFrame(dict(id=["i", "j", "k"], a=[1, 2, 3], b=[4, 5, 6])) >>> df.transpose(column_names="id") shape: (2, 3) ┌─────┬─────┬─────┐ │ i ┆ j ┆ k │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ 1 ┆ 2 ┆ 3 │ │ 4 ┆ 5 ┆ 6 │ └─────┴─────┴─────┘ >>> df.transpose(include_header=True, header_name="new_id", column_names="id") shape: (2, 4) ┌────────┬─────┬─────┬─────┐ │ new_id ┆ i ┆ j ┆ k │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ i64 │ ╞════════╪═════╪═════╪═════╡ │ a ┆ 1 ┆ 2 ┆ 3 │ │ b ┆ 4 ┆ 5 ┆ 6 │ └────────┴─────┴─────┴─────┘
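The roles of include_header, header_name, and column_names can be sketched on a dict of column lists. The helper below is a hypothetical plain-Python illustration; it accepts an iterable of names but does not support naming an existing column by string.

```python
def transpose(columns, include_header=False, header_name="column", column_names=None):
    """Transpose a dict of equal-length column lists (illustrative sketch)."""
    rows = list(zip(*columns.values()))
    if column_names is None:
        # Auto-generated names, one per original row.
        column_names = [f"column_{i}" for i in range(len(rows))]
    result = {}
    if include_header:
        # The original column names become the first column.
        result[header_name] = list(columns)
    # Each original row becomes one column of the result.
    for name, row in zip(column_names, rows):
        result[name] = list(row)
    return result


with_header = transpose({"a": [1, 2, 3], "b": [4, 5, 6]}, include_header=True)
renamed = transpose({"a": [1, 2, 3], "b": [4, 5, 6]}, column_names=["x", "y", "z"])
```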
- unique(subset: ColumnNameOrSelector | Collection[ColumnNameOrSelector] | None = None, *, keep: UniqueKeepStrategy = 'any', maintain_order: bool = False) DataFrame[source]¶
Drop duplicate rows from this dataframe.
- Parameters:
- subset
Column name(s) or selector(s), to consider when identifying duplicate rows. If set to None (default), use all columns.
- keep{‘first’, ‘last’, ‘any’, ‘none’}
Which of the duplicate rows to keep.
- ‘any’: Does not give any guarantee of which row is kept. This allows more optimizations.
- ‘none’: Don’t keep duplicate rows.
- ‘first’: Keep first unique row.
- ‘last’: Keep last unique row.
- maintain_order
Keep the same order as the original DataFrame. This is more expensive to compute. Setting this to True prevents the operation from running on the streaming engine.
- Returns:
- DataFrame
DataFrame with unique rows.
Warning
This method will fail if there is a column of type List in the DataFrame or subset.
Notes
If you’re coming from pandas, this is similar to pandas.DataFrame.drop_duplicates.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3, 1], ... "bar": ["a", "a", "a", "a"], ... "ham": ["b", "b", "b", "b"], ... } ... ) >>> df.unique(maintain_order=True) shape: (3, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ a ┆ b │ │ 2 ┆ a ┆ b │ │ 3 ┆ a ┆ b │ └─────┴─────┴─────┘ >>> df.unique(subset=["bar", "ham"], maintain_order=True) shape: (1, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ a ┆ b │ └─────┴─────┴─────┘ >>> df.unique(keep="last", maintain_order=True) shape: (3, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ str │ ╞═════╪═════╪═════╡ │ 2 ┆ a ┆ b │ │ 3 ┆ a ┆ b │ │ 1 ┆ a ┆ b │ └─────┴─────┴─────┘
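The keep strategies can be sketched in plain Python on a list of row tuples, always preserving the original order (as with maintain_order=True). The helper unique is hypothetical; treating ‘any’ like ‘first’ is a choice of the sketch, which the documented lack of guarantees permits.

```python
from collections import Counter


def unique(rows, keep="first"):
    """Drop duplicate rows from a list of hashable rows (illustrative sketch)."""
    if keep == "none":
        # Rows that occur more than once are dropped entirely.
        counts = Counter(rows)
        return [r for r in rows if counts[r] == 1]
    # For "last", scan from the end so the final occurrence is seen first.
    ordered = reversed(rows) if keep == "last" else rows
    seen = set()
    kept = []
    for r in ordered:
        if r not in seen:
            seen.add(r)
            kept.append(r)
    return kept[::-1] if keep == "last" else kept


rows = [(1, "a", "b"), (2, "a", "b"), (3, "a", "b"), (1, "a", "b")]
keep_first = unique(rows, keep="first")
keep_last = unique(rows, keep="last")
keep_none = unique(rows, keep="none")
```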
- unnest(columns: ColumnNameOrSelector | Collection[ColumnNameOrSelector], *more_columns: ColumnNameOrSelector) DataFrame[source]¶
Decompose struct columns into separate columns for each of their fields.
The new columns will be inserted into the dataframe at the location of the struct column.
- Parameters:
- columns
Name of the struct column(s) that should be unnested.
- *more_columns
Additional columns to unnest, specified as positional arguments.
Examples
>>> df = pl.DataFrame( ... { ... "before": ["foo", "bar"], ... "t_a": [1, 2], ... "t_b": ["a", "b"], ... "t_c": [True, None], ... "t_d": [[1, 2], [3]], ... "after": ["baz", "womp"], ... } ... ).select("before", pl.struct(pl.col("^t_.$")).alias("t_struct"), "after") >>> df shape: (2, 3) ┌────────┬─────────────────────┬───────┐ │ before ┆ t_struct ┆ after │ │ --- ┆ --- ┆ --- │ │ str ┆ struct[4] ┆ str │ ╞════════╪═════════════════════╪═══════╡ │ foo ┆ {1,"a",true,[1, 2]} ┆ baz │ │ bar ┆ {2,"b",null,[3]} ┆ womp │ └────────┴─────────────────────┴───────┘ >>> df.unnest("t_struct") shape: (2, 6) ┌────────┬─────┬─────┬──────┬───────────┬───────┐ │ before ┆ t_a ┆ t_b ┆ t_c ┆ t_d ┆ after │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ str ┆ bool ┆ list[i64] ┆ str │ ╞════════╪═════╪═════╪══════╪═══════════╪═══════╡ │ foo ┆ 1 ┆ a ┆ true ┆ [1, 2] ┆ baz │ │ bar ┆ 2 ┆ b ┆ null ┆ [3] ┆ womp │ └────────┴─────┴─────┴──────┴───────────┴───────┘
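That the new columns take the position of the struct column can be sketched on a single row represented as a dict. This is a hypothetical plain-Python helper, relying on the insertion order of Python dicts.

```python
def unnest(row, column):
    """Expand one nested dict field of a row in place (illustrative sketch)."""
    result = {}
    for name, value in row.items():
        if name == column:
            # The struct's fields are inserted where the struct column was.
            result.update(value)
        else:
            result[name] = value
    return result


row = {"before": "foo", "t_struct": {"t_a": 1, "t_b": "a"}, "after": "baz"}
flat = unnest(row, "t_struct")
```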
- unpivot(on: ColumnNameOrSelector | Sequence[ColumnNameOrSelector] | None = None, *, index: ColumnNameOrSelector | Sequence[ColumnNameOrSelector] | None = None, variable_name: str | None = None, value_name: str | None = None) DataFrame[source]¶
Unpivot a DataFrame from wide to long format.
Optionally leaves identifiers set.
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (index) while all other columns, considered measured variables (on), are “unpivoted” to the row axis leaving just two non-identifier columns, ‘variable’ and ‘value’.
- Parameters:
- on
Column(s) or selector(s) to use as value variables; if on is empty, all columns that are not in index will be used.
- index
Column(s) or selector(s) to use as identifier variables.
- variable_name
Name to give to the variable column. Defaults to “variable”.
- value_name
Name to give to the value column. Defaults to “value”.
Notes
If you’re coming from pandas, this is similar to pandas.DataFrame.melt, but with index replacing id_vars and on replacing value_vars. In other frameworks, you might know this operation as pivot_longer.
Examples
>>> df = pl.DataFrame(
...     {
...         "a": ["x", "y", "z"],
...         "b": [1, 3, 5],
...         "c": [2, 4, 6],
...     }
... )
>>> import polars.selectors as cs
>>> df.unpivot(cs.numeric(), index="a")
shape: (6, 3)
┌─────┬──────────┬───────┐
│ a   ┆ variable ┆ value │
│ --- ┆ ---      ┆ ---   │
│ str ┆ str      ┆ i64   │
╞═════╪══════════╪═══════╡
│ x   ┆ b        ┆ 1     │
│ y   ┆ b        ┆ 3     │
│ z   ┆ b        ┆ 5     │
│ x   ┆ c        ┆ 2     │
│ y   ┆ c        ┆ 4     │
│ z   ┆ c        ┆ 6     │
└─────┴──────────┴───────┘
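The wide-to-long transformation can also be written out by hand. The sketch below is a pure-Python approximation that uses lists of dicts in place of a DataFrame. The function unpivot_rows is a hypothetical name, and the row ordering (one block of rows per unpivoted column) is taken from the example output above.

```python
# Pure-Python illustration of unpivoting; not the Polars implementation.

def unpivot_rows(rows, on, index, variable_name="variable", value_name="value"):
    """Turn each column in `on` into (variable, value) rows, keeping `index`."""
    long_rows = []
    for column in on:  # one block of output rows per unpivoted column
        for row in rows:
            out = {key: row[key] for key in index}
            out[variable_name] = column
            out[value_name] = row[column]
            long_rows.append(out)
    return long_rows


rows = [
    {"a": "x", "b": 1, "c": 2},
    {"a": "y", "b": 3, "c": 4},
    {"a": "z", "b": 5, "c": 6},
]
long_rows = unpivot_rows(rows, on=["b", "c"], index=["a"])
print(long_rows[0], long_rows[3])
```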
- unstack(*, step: int, how: UnstackDirection = 'vertical', columns: ColumnNameOrSelector | Sequence[ColumnNameOrSelector] | None = None, fill_values: list[Any] | None = None) DataFrame[source]¶
Unstack a long table to a wide form without doing an aggregation.
This can be much faster than a pivot, because it can skip the grouping phase.
- Parameters:
- step
Number of rows in the unstacked frame.
- how{ ‘vertical’, ‘horizontal’ }
Direction of the unstack.
- columns
Column name(s) or selector(s) to include in the operation. If set to None (default), use all columns.
- fill_values
Fill values that don’t fit the new size with this value.
Examples
>>> from string import ascii_uppercase
>>> df = pl.DataFrame(
...     {
...         "x": list(ascii_uppercase[0:8]),
...         "y": pl.int_range(1, 9, eager=True),
...     }
... ).with_columns(
...     z=pl.int_ranges(pl.col("y"), pl.col("y") + 2, dtype=pl.UInt8),
... )
>>> df
shape: (8, 3)
┌─────┬─────┬──────────┐
│ x   ┆ y   ┆ z        │
│ --- ┆ --- ┆ ---      │
│ str ┆ i64 ┆ list[u8] │
╞═════╪═════╪══════════╡
│ A   ┆ 1   ┆ [1, 2]   │
│ B   ┆ 2   ┆ [2, 3]   │
│ C   ┆ 3   ┆ [3, 4]   │
│ D   ┆ 4   ┆ [4, 5]   │
│ E   ┆ 5   ┆ [5, 6]   │
│ F   ┆ 6   ┆ [6, 7]   │
│ G   ┆ 7   ┆ [7, 8]   │
│ H   ┆ 8   ┆ [8, 9]   │
└─────┴─────┴──────────┘
>>> df.unstack(step=4, how="vertical")
shape: (4, 6)
┌─────┬─────┬─────┬─────┬──────────┬──────────┐
│ x_0 ┆ x_1 ┆ y_0 ┆ y_1 ┆ z_0      ┆ z_1      │
│ --- ┆ --- ┆ --- ┆ --- ┆ ---      ┆ ---      │
│ str ┆ str ┆ i64 ┆ i64 ┆ list[u8] ┆ list[u8] │
╞═════╪═════╪═════╪═════╪══════════╪══════════╡
│ A   ┆ E   ┆ 1   ┆ 5   ┆ [1, 2]   ┆ [5, 6]   │
│ B   ┆ F   ┆ 2   ┆ 6   ┆ [2, 3]   ┆ [6, 7]   │
│ C   ┆ G   ┆ 3   ┆ 7   ┆ [3, 4]   ┆ [7, 8]   │
│ D   ┆ H   ┆ 4   ┆ 8   ┆ [4, 5]   ┆ [8, 9]   │
└─────┴─────┴─────┴─────┴──────────┴──────────┘
>>> df.unstack(step=2, how="horizontal")
shape: (4, 6)
┌─────┬─────┬─────┬─────┬──────────┬──────────┐
│ x_0 ┆ x_1 ┆ y_0 ┆ y_1 ┆ z_0      ┆ z_1      │
│ --- ┆ --- ┆ --- ┆ --- ┆ ---      ┆ ---      │
│ str ┆ str ┆ i64 ┆ i64 ┆ list[u8] ┆ list[u8] │
╞═════╪═════╪═════╪═════╪══════════╪══════════╡
│ A   ┆ B   ┆ 1   ┆ 2   ┆ [1, 2]   ┆ [2, 3]   │
│ C   ┆ D   ┆ 3   ┆ 4   ┆ [3, 4]   ┆ [4, 5]   │
│ E   ┆ F   ┆ 5   ┆ 6   ┆ [5, 6]   ┆ [6, 7]   │
│ G   ┆ H   ┆ 7   ┆ 8   ┆ [7, 8]   ┆ [8, 9]   │
└─────┴─────┴─────┴─────┴──────────┴──────────┘
>>> import polars.selectors as cs
>>> df.unstack(step=5, columns=cs.numeric(), fill_values=0)
shape: (5, 2)
┌─────┬─────┐
│ y_0 ┆ y_1 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 6   │
│ 2   ┆ 7   │
│ 3   ┆ 8   │
│ 4   ┆ 0   │
│ 5   ┆ 0   │
└─────┴─────┘
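The difference between the two directions is easier to see on a single column. The sketch below is a pure-Python approximation whose behaviour is inferred from the example outputs above, not from the Polars source. The helper unstack_values is a hypothetical name, and it returns the output columns (corresponding to y_0, y_1, ...) as lists.

```python
# Pure-Python illustration of unstacking one column; behaviour is inferred
# from the documented examples, not taken from the Polars implementation.

def unstack_values(values, step, how="vertical", fill_value=None):
    """Split `values` into chunks of `step` and return the output columns."""
    n_chunks = -(-len(values) // step)  # ceiling division
    padded = list(values) + [fill_value] * (n_chunks * step - len(values))
    chunks = [padded[i * step:(i + 1) * step] for i in range(n_chunks)]
    if how == "vertical":
        # Each chunk of consecutive values becomes one output column.
        return chunks
    # Horizontal: each chunk is one output row, so transpose to get columns.
    return [list(column) for column in zip(*chunks)]


y = list(range(1, 9))
vertical = unstack_values(y, step=4, how="vertical")
horizontal = unstack_values(y, step=2, how="horizontal")
filled = unstack_values(y, step=5, fill_value=0)
print(vertical, horizontal, filled)
```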
- update(other: DataFrame, on: str | Sequence[str] | None = None, how: Literal['left', 'inner', 'full'] = 'left', *, left_on: str | Sequence[str] | None = None, right_on: str | Sequence[str] | None = None, include_nulls: bool = False, maintain_order: MaintainOrderJoin | None = 'left') DataFrame[source]¶
Update the values in this DataFrame with the values in other.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- Parameters:
- other
DataFrame that will be used to update the values
- on
Column names that will be joined on. If set to None (default), the implicit row index of each frame is used as a join key.
- how{‘left’, ‘inner’, ‘full’}
‘left’ will keep all rows from the left table; rows may be duplicated if multiple rows in the right frame match the left row’s key.
‘inner’ keeps only those rows where the key exists in both frames.
‘full’ will update existing rows where the key matches while also adding any new rows contained in the given frame.
- left_on
Join column(s) of the left DataFrame.
- right_on
Join column(s) of the right DataFrame.
- include_nulls
Overwrite values in the left frame with null values from the right frame. If set to False (default), null values in the right frame are ignored.
- maintain_order{‘none’, ‘left’, ‘right’, ‘left_right’, ‘right_left’}
Which order of rows from the inputs to preserve. See join() for details. Unlike join, this function preserves the left order by default.
Notes
This is syntactic sugar for a left/inner join that preserves the order of the left DataFrame by default, with an optional coalesce when include_nulls = False.
Examples
>>> df = pl.DataFrame(
...     {
...         "A": [1, 2, 3, 4],
...         "B": [400, 500, 600, 700],
...     }
... )
>>> df
shape: (4, 2)
┌─────┬─────┐
│ A   ┆ B   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 400 │
│ 2   ┆ 500 │
│ 3   ┆ 600 │
│ 4   ┆ 700 │
└─────┴─────┘
>>> new_df = pl.DataFrame(
...     {
...         "B": [-66, None, -99],
...         "C": [5, 3, 1],
...     }
... )
Update df values with the non-null values in new_df, by row index:
>>> df.update(new_df)
shape: (4, 2)
┌─────┬─────┐
│ A   ┆ B   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ -66 │
│ 2   ┆ 500 │
│ 3   ┆ -99 │
│ 4   ┆ 700 │
└─────┴─────┘
Update df values with the non-null values in new_df, by row index, but only keeping those rows that are common to both frames:
>>> df.update(new_df, how="inner")
shape: (3, 2)
┌─────┬─────┐
│ A   ┆ B   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ -66 │
│ 2   ┆ 500 │
│ 3   ┆ -99 │
└─────┴─────┘
Update df values with the non-null values in new_df, using a full outer join strategy that defines explicit join columns in each frame:
>>> df.update(new_df, left_on=["A"], right_on=["C"], how="full")
shape: (5, 2)
┌─────┬─────┐
│ A   ┆ B   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ -99 │
│ 2   ┆ 500 │
│ 3   ┆ 600 │
│ 4   ┆ 700 │
│ 5   ┆ -66 │
└─────┴─────┘
Update df values including null values in new_df, using a full outer join strategy that defines explicit join columns in each frame:
>>> df.update(new_df, left_on="A", right_on="C", how="full", include_nulls=True)
shape: (5, 2)
┌─────┬──────┐
│ A   ┆ B    │
│ --- ┆ ---  │
│ i64 ┆ i64  │
╞═════╪══════╡
│ 1   ┆ -99  │
│ 2   ┆ 500  │
│ 3   ┆ null │
│ 4   ┆ 700  │
│ 5   ┆ -66  │
└─────┴──────┘
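The effect of include_nulls on a single column, when rows are matched by their implicit row index, can be sketched in plain Python. This is only an approximation of the documented behaviour, with None standing in for null. The helper update_column is a hypothetical name and covers the default how="left" case only.

```python
# Pure-Python illustration of a row-index update on one column
# (how="left"); not the Polars implementation.

def update_column(left, right, include_nulls=False):
    """Overwrite `left` values with `right` values at the same row index."""
    updated = []
    for i, old in enumerate(left):
        if i >= len(right):
            updated.append(old)  # no matching right row: keep the left value
        elif right[i] is None and not include_nulls:
            updated.append(old)  # nulls in the right frame are ignored
        else:
            updated.append(right[i])
    return updated


b_left = [400, 500, 600, 700]
b_right = [-66, None, -99]
default_result = update_column(b_left, b_right)
with_nulls = update_column(b_left, b_right, include_nulls=True)
print(default_result, with_nulls)
```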
- upsample(time_column: str, *, every: str | timedelta, group_by: str | Sequence[str] | None = None, maintain_order: bool = False) DataFrame[source]¶
Upsample a DataFrame at a regular frequency.
The every argument is created with the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
1i (1 index count)
Or combine them:
“3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds
By “calendar day”, we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for “calendar week”, “calendar month”, “calendar quarter”, and “calendar year”.
Changed in version 0.20.14: The by parameter was renamed group_by.
- Parameters:
- time_column
Time column will be used to determine a date_range. Note that this column has to be sorted for the output to make sense.
- every
Interval will start ‘every’ duration.
- group_by
First group by these columns and then upsample for every group.
- maintain_order
Keep the ordering predictable. This is slower.
- Returns:
- DataFrame
Result will be sorted by time_column (but note that if group_by columns are passed, it will only be sorted within each group).
Examples
Upsample a DataFrame by a certain interval.
>>> from datetime import datetime
>>> df = pl.DataFrame(
...     {
...         "time": [
...             datetime(2021, 2, 1),
...             datetime(2021, 4, 1),
...             datetime(2021, 5, 1),
...             datetime(2021, 6, 1),
...         ],
...         "groups": ["A", "B", "A", "B"],
...         "values": [0, 1, 2, 3],
...     }
... ).set_sorted("time")
>>> df.upsample(
...     time_column="time", every="1mo", group_by="groups", maintain_order=True
... ).select(pl.all().fill_null(strategy="forward"))
shape: (7, 3)
┌─────────────────────┬────────┬────────┐
│ time                ┆ groups ┆ values │
│ ---                 ┆ ---    ┆ ---    │
│ datetime[μs]        ┆ str    ┆ i64    │
╞═════════════════════╪════════╪════════╡
│ 2021-02-01 00:00:00 ┆ A      ┆ 0      │
│ 2021-03-01 00:00:00 ┆ A      ┆ 0      │
│ 2021-04-01 00:00:00 ┆ A      ┆ 0      │
│ 2021-05-01 00:00:00 ┆ A      ┆ 2      │
│ 2021-04-01 00:00:00 ┆ B      ┆ 1      │
│ 2021-05-01 00:00:00 ┆ B      ┆ 1      │
│ 2021-06-01 00:00:00 ┆ B      ┆ 3      │
└─────────────────────┴────────┴────────┘
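As a reading aid for the string language accepted by every, the sketch below splits a combined duration such as "3d12h4m25s" into (count, unit) pairs. It is a hypothetical helper, written here with the standard re module, and is not the parser Polars uses. It only shows how the unit suffixes listed for upsample compose.

```python
import re

# Two-letter units come first so that "ms" and "mo" are not read as "m".
_TOKEN = re.compile(r"(\d+)(ns|us|ms|mo|s|m|h|d|w|q|y|i)")


def parse_duration(text):
    """Split a duration string into a list of (count, unit) pairs."""
    if not text:
        raise ValueError("empty duration string")
    parts, pos = [], 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"cannot parse {text!r} at position {pos}")
        parts.append((int(match.group(1)), match.group(2)))
        pos = match.end()
    return parts


parts = parse_duration("3d12h4m25s")
print(parts)
```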
- var(ddof: int = 1) DataFrame[source]¶
Aggregate the columns of this DataFrame to their variance value.
- Parameters:
- ddof
“Delta Degrees of Freedom”: the divisor used in the calculation is N - ddof, where N represents the number of elements. By default ddof is 1.
Examples
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.var()
shape: (1, 3)
┌─────┬─────┬──────┐
│ foo ┆ bar ┆ ham  │
│ --- ┆ --- ┆ ---  │
│ f64 ┆ f64 ┆ str  │
╞═════╪═════╪══════╡
│ 1.0 ┆ 1.0 ┆ null │
└─────┴─────┴──────┘
>>> df.var(ddof=0)
shape: (1, 3)
┌──────────┬──────────┬──────┐
│ foo      ┆ bar      ┆ ham  │
│ ---      ┆ ---      ┆ ---  │
│ f64      ┆ f64      ┆ str  │
╞══════════╪══════════╪══════╡
│ 0.666667 ┆ 0.666667 ┆ null │
└──────────┴──────────┴──────┘
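The role of ddof follows directly from the N - ddof divisor. The sketch below recomputes the two results for the foo column by hand and cross-checks them against the standard statistics module. The function name variance is hypothetical and chosen only for this illustration.

```python
import math
import statistics


def variance(values, ddof=1):
    """Sum of squared deviations from the mean, divided by N - ddof."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - ddof)


foo = [1, 2, 3]
sample_var = variance(foo)              # divisor N - 1
population_var = variance(foo, ddof=0)  # divisor N

# statistics.variance uses N - 1 and statistics.pvariance uses N.
assert math.isclose(sample_var, statistics.variance(foo))
assert math.isclose(population_var, statistics.pvariance(foo))
print(sample_var, round(population_var, 6))
```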
- vstack(other: DataFrame, *, in_place: bool = False) DataFrame[source]¶
Grow this DataFrame vertically by stacking a DataFrame to it.
- Parameters:
- other
DataFrame to stack.
- in_place
Modify in place.
Examples
>>> df1 = pl.DataFrame(
...     {
...         "foo": [1, 2],
...         "bar": [6, 7],
...         "ham": ["a", "b"],
...     }
... )
>>> df2 = pl.DataFrame(
...     {
...         "foo": [3, 4],
...         "bar": [8, 9],
...         "ham": ["c", "d"],
...     }
... )
>>> df1.vstack(df2)
shape: (4, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
│ 2   ┆ 7   ┆ b   │
│ 3   ┆ 8   ┆ c   │
│ 4   ┆ 9   ┆ d   │
└─────┴─────┴─────┘
- property width: int¶
Get the number of columns.
- Returns:
- int
Examples
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [4, 5, 6],
...     }
... )
>>> df.width
2
- with_columns(*exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr) DataFrame[source]¶
Add columns to this DataFrame.
Added columns will replace existing columns with the same name.
- Parameters:
- *exprs
Column(s) to add, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.
- **named_exprs
Additional columns to add, specified as keyword arguments. The columns will be renamed to the keyword used.
- Returns:
- DataFrame
A new DataFrame with the columns added.
Notes
Creating a new DataFrame using this method does not create a new copy of existing data.
Examples
Pass an expression to add it as a new column.
>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [0.5, 4, 10, 13],
...         "c": [True, True, False, True],
...     }
... )
>>> df.with_columns((pl.col("a") ** 2).alias("a^2"))
shape: (4, 4)
┌─────┬──────┬───────┬─────┐
│ a   ┆ b    ┆ c     ┆ a^2 │
│ --- ┆ ---  ┆ ---   ┆ --- │
│ i64 ┆ f64  ┆ bool  ┆ i64 │
╞═════╪══════╪═══════╪═════╡
│ 1   ┆ 0.5  ┆ true  ┆ 1   │
│ 2   ┆ 4.0  ┆ true  ┆ 4   │
│ 3   ┆ 10.0 ┆ false ┆ 9   │
│ 4   ┆ 13.0 ┆ true  ┆ 16  │
└─────┴──────┴───────┴─────┘
Added columns will replace existing columns with the same name.
>>> df.with_columns(pl.col("a").cast(pl.Float64))
shape: (4, 3)
┌─────┬──────┬───────┐
│ a   ┆ b    ┆ c     │
│ --- ┆ ---  ┆ ---   │
│ f64 ┆ f64  ┆ bool  │
╞═════╪══════╪═══════╡
│ 1.0 ┆ 0.5  ┆ true  │
│ 2.0 ┆ 4.0  ┆ true  │
│ 3.0 ┆ 10.0 ┆ false │
│ 4.0 ┆ 13.0 ┆ true  │
└─────┴──────┴───────┘
Multiple columns can be added using positional arguments.
>>> df.with_columns(
...     (pl.col("a") ** 2).alias("a^2"),
...     (pl.col("b") / 2).alias("b/2"),
...     (pl.col("c").not_()).alias("not c"),
... )
shape: (4, 6)
┌─────┬──────┬───────┬─────┬──────┬───────┐
│ a   ┆ b    ┆ c     ┆ a^2 ┆ b/2  ┆ not c │
│ --- ┆ ---  ┆ ---   ┆ --- ┆ ---  ┆ ---   │
│ i64 ┆ f64  ┆ bool  ┆ i64 ┆ f64  ┆ bool  │
╞═════╪══════╪═══════╪═════╪══════╪═══════╡
│ 1   ┆ 0.5  ┆ true  ┆ 1   ┆ 0.25 ┆ false │
│ 2   ┆ 4.0  ┆ true  ┆ 4   ┆ 2.0  ┆ false │
│ 3   ┆ 10.0 ┆ false ┆ 9   ┆ 5.0  ┆ true  │
│ 4   ┆ 13.0 ┆ true  ┆ 16  ┆ 6.5  ┆ false │
└─────┴──────┴───────┴─────┴──────┴───────┘
Multiple columns can also be added by passing a list of expressions.
>>> df.with_columns(
...     [
...         (pl.col("a") ** 2).alias("a^2"),
...         (pl.col("b") / 2).alias("b/2"),
...         (pl.col("c").not_()).alias("not c"),
...     ]
... )
shape: (4, 6)
┌─────┬──────┬───────┬─────┬──────┬───────┐
│ a   ┆ b    ┆ c     ┆ a^2 ┆ b/2  ┆ not c │
│ --- ┆ ---  ┆ ---   ┆ --- ┆ ---  ┆ ---   │
│ i64 ┆ f64  ┆ bool  ┆ i64 ┆ f64  ┆ bool  │
╞═════╪══════╪═══════╪═════╪══════╪═══════╡
│ 1   ┆ 0.5  ┆ true  ┆ 1   ┆ 0.25 ┆ false │
│ 2   ┆ 4.0  ┆ true  ┆ 4   ┆ 2.0  ┆ false │
│ 3   ┆ 10.0 ┆ false ┆ 9   ┆ 5.0  ┆ true  │
│ 4   ┆ 13.0 ┆ true  ┆ 16  ┆ 6.5  ┆ false │
└─────┴──────┴───────┴─────┴──────┴───────┘
Use keyword arguments to easily name your expression inputs.
>>> df.with_columns(
...     ab=pl.col("a") * pl.col("b"),
...     not_c=pl.col("c").not_(),
... )
shape: (4, 5)
┌─────┬──────┬───────┬──────┬───────┐
│ a   ┆ b    ┆ c     ┆ ab   ┆ not_c │
│ --- ┆ ---  ┆ ---   ┆ ---  ┆ ---   │
│ i64 ┆ f64  ┆ bool  ┆ f64  ┆ bool  │
╞═════╪══════╪═══════╪══════╪═══════╡
│ 1   ┆ 0.5  ┆ true  ┆ 0.5  ┆ false │
│ 2   ┆ 4.0  ┆ true  ┆ 8.0  ┆ false │
│ 3   ┆ 10.0 ┆ false ┆ 30.0 ┆ true  │
│ 4   ┆ 13.0 ┆ true  ┆ 52.0 ┆ false │
└─────┴──────┴───────┴──────┴───────┘
- with_columns_seq(*exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr) DataFrame[source]¶
Add columns to this DataFrame.
Added columns will replace existing columns with the same name.
This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap.
- Parameters:
- *exprs
Column(s) to add, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.
- **named_exprs
Additional columns to add, specified as keyword arguments. The columns will be renamed to the keyword used.
- Returns:
- DataFrame
A new DataFrame with the columns added.
- with_row_count(name: str = 'row_nr', offset: int = 0) DataFrame[source]¶
Add a column at index 0 that counts the rows.
Deprecated since version 0.20.4: Use the with_row_index() method instead. Note that the default column name has changed from ‘row_nr’ to ‘index’.
- Parameters:
- name
Name of the column to add.
- offset
Start the row count at this offset. Default = 0
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [1, 3, 5],
...         "b": [2, 4, 6],
...     }
... )
>>> df.with_row_count()
shape: (3, 3)
┌────────┬─────┬─────┐
│ row_nr ┆ a   ┆ b   │
│ ---    ┆ --- ┆ --- │
│ u32    ┆ i64 ┆ i64 │
╞════════╪═════╪═════╡
│ 0      ┆ 1   ┆ 2   │
│ 1      ┆ 3   ┆ 4   │
│ 2      ┆ 5   ┆ 6   │
└────────┴─────┴─────┘
- with_row_index(name: str = 'index', offset: int = 0) DataFrame[source]¶
Add a row index as the first column in the DataFrame.
- Parameters:
- name
Name of the index column.
- offset
Start the index at this offset. Cannot be negative.
Notes
The resulting column does not have any special properties. It is a regular column of type UInt32 (or UInt64 in polars-u64-idx).
Examples
>>> df = pl.DataFrame(
...     {
...         "a": [1, 3, 5],
...         "b": [2, 4, 6],
...     }
... )
>>> df.with_row_index()
shape: (3, 3)
┌───────┬─────┬─────┐
│ index ┆ a   ┆ b   │
│ ---   ┆ --- ┆ --- │
│ u32   ┆ i64 ┆ i64 │
╞═══════╪═════╪═════╡
│ 0     ┆ 1   ┆ 2   │
│ 1     ┆ 3   ┆ 4   │
│ 2     ┆ 5   ┆ 6   │
└───────┴─────┴─────┘
>>> df.with_row_index("id", offset=1000)
shape: (3, 3)
┌──────┬─────┬─────┐
│ id   ┆ a   ┆ b   │
│ ---  ┆ --- ┆ --- │
│ u32  ┆ i64 ┆ i64 │
╞══════╪═════╪═════╡
│ 1000 ┆ 1   ┆ 2   │
│ 1001 ┆ 3   ┆ 4   │
│ 1002 ┆ 5   ┆ 6   │
└──────┴─────┴─────┘
An index column can also be created using the expressions int_range() and len().
>>> df.select(
...     pl.int_range(pl.len(), dtype=pl.UInt32).alias("index"),
...     pl.all(),
... )
shape: (3, 3)
┌───────┬─────┬─────┐
│ index ┆ a   ┆ b   │
│ ---   ┆ --- ┆ --- │
│ u32   ┆ i64 ┆ i64 │
╞═══════╪═════╪═════╡
│ 0     ┆ 1   ┆ 2   │
│ 1     ┆ 3   ┆ 4   │
│ 2     ┆ 5   ┆ 6   │
└───────┴─────┴─────┘
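The name and offset parameters behave much like Python's enumerate with a start value. The following pure-Python sketch uses dicts in place of rows. The helper add_row_index is a hypothetical name and is not part of Polars.

```python
# Pure-Python illustration of adding a row index as the first column.

def add_row_index(rows, name="index", offset=0):
    """Prepend a running index, starting at `offset`, to every row."""
    if offset < 0:
        raise ValueError("offset cannot be negative")
    # Putting `name` first in the dict literal makes it the first column.
    return [{name: i, **row} for i, row in enumerate(rows, start=offset)]


rows = [{"a": 1, "b": 2}, {"a": 3, "b": 4}, {"a": 5, "b": 6}]
indexed = add_row_index(rows, "id", offset=1000)
print(indexed[0])
```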
- write_avro(file: str | Path | IO[bytes], compression: AvroCompression = 'uncompressed', name: str = '') None[source]¶
Write to Apache Avro file.
- Parameters:
- file
File path or writable file-like object to which the data will be written.
- compression{‘uncompressed’, ‘snappy’, ‘deflate’}
Compression method. Defaults to “uncompressed”.
- name
Schema name. Defaults to empty string.
Examples
>>> import pathlib
>>>
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3, 4, 5],
...         "bar": [6, 7, 8, 9, 10],
...         "ham": ["a", "b", "c", "d", "e"],
...     }
... )
>>> path: pathlib.Path = dirpath / "new_file.avro"
>>> df.write_avro(path)
- write_clipboard(*, separator: str = '\t', **kwargs: Any) None[source]¶
Copy DataFrame in csv format to the system clipboard with write_csv.
Useful for pasting into Excel or other similar spreadsheet software.
- Parameters:
- separator
Separate CSV fields with this symbol.
- kwargs
Additional arguments to pass to write_csv.
See also
polars.read_clipboardRead a DataFrame from the clipboard.
write_csvWrite to comma-separated values (CSV) file.
- write_csv(file: str | Path | IO[str] | IO[bytes] | None = None, *, include_bom: bool = False, include_header: bool = True, separator: str = ',', line_terminator: str = '\n', quote_char: str = '"', batch_size: int = 1024, datetime_format: str | None = None, date_format: str | None = None, time_format: str | None = None, float_scientific: bool | None = None, float_precision: int | None = None, decimal_comma: bool = False, null_value: str | None = None, quote_style: CsvQuoteStyle | None = None, storage_options: dict[str, Any] | None = None, credential_provider: CredentialProviderFunction | Literal['auto'] | None = 'auto', retries: int = 2) str | None[source]¶
Write to comma-separated values (CSV) file.
- Parameters:
- file
File path or writable file-like object to which the result will be written. If set to None (default), the output is returned as a string instead.
- include_bom
Whether to include UTF-8 BOM in the CSV output.
- include_header
Whether to include header in the CSV output.
- separator
Separate CSV fields with this symbol.
- line_terminator
String used to end each row.
- quote_char
Byte to use as quoting character.
- batch_size
Number of rows that will be processed per thread.
- datetime_format
A format string, with the specifiers defined by the chrono Rust crate. If no format specified, the default fractional-second precision is inferred from the maximum timeunit found in the frame’s Datetime cols (if any).
- date_format
A format string, with the specifiers defined by the chrono Rust crate.
- time_format
A format string, with the specifiers defined by the chrono Rust crate.
- float_scientific
Whether to use scientific form always (true), never (false), or automatically (None) for Float32 and Float64 datatypes.
- float_precision
Number of decimal places to write, applied to both Float32 and Float64 datatypes.
- decimal_comma
Use a comma as the decimal separator instead of a point in standard notation. Floats will be encapsulated in quotes if necessary; set the field separator to override.
- null_value
A string representing null values (defaulting to the empty string).
- quote_style{‘necessary’, ‘always’, ‘non_numeric’, ‘never’}
Determines the quoting strategy used.
necessary (default): This puts quotes around fields only when necessary. They are necessary when fields contain a quote, separator or record terminator. Quotes are also necessary when writing an empty record (which is indistinguishable from a record with one empty field).
always: This puts quotes around every field. Always.
never: This never puts quotes around fields, even if that results in invalid CSV data (e.g.: by not quoting strings containing the separator).
non_numeric: This puts quotes around all fields that are non-numeric. Namely, when writing a field that does not parse as a valid float or integer, then quotes will be used even if they aren’t strictly necessary.
- storage_options
Options that indicate how to connect to a cloud provider.
The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:
Hugging Face (hf://): Accepts an API key under the token parameter: {‘token’: ‘…’}, or by setting the HF_TOKEN environment variable.
If storage_options is not provided, Polars will try to infer the information from environment variables.
- credential_provider
Provide a function that can be called to provide cloud storage credentials. The function is expected to return a dictionary of credential keys along with an optional credential expiry time.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- retries
Number of retries if accessing a cloud instance fails.
Examples
>>> import pathlib
>>>
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3, 4, 5],
...         "bar": [6, 7, 8, 9, 10],
...         "ham": ["a", "b", "c", "d", "e"],
...     }
... )
>>> path: pathlib.Path = dirpath / "new_file.csv"
>>> df.write_csv(path, separator=",")
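The quoting strategies are easiest to compare side by side. The sketch below uses Python's standard csv module as an analogy, not Polars itself: QUOTE_MINIMAL, QUOTE_ALL and QUOTE_NONNUMERIC are assumed here to correspond roughly to ‘necessary’, ‘always’ and ‘non_numeric’. The correspondence is approximate, since the standard library decides "numeric" by Python type rather than by parsing the field.

```python
import csv
import io


def render(rows, quoting):
    """Write rows to an in-memory buffer and return the CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, quoting=quoting, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# One header row, then a row whose string field contains the separator.
rows = [["foo", "ham"], [1, "a,b"]]

necessary = render(rows, csv.QUOTE_MINIMAL)       # quote only where required
always = render(rows, csv.QUOTE_ALL)              # quote every field
non_numeric = render(rows, csv.QUOTE_NONNUMERIC)  # quote non-numeric fields
print(necessary, always, non_numeric, sep="")
```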
- write_database(table_name: str, connection: ConnectionOrCursor | str, *, if_table_exists: DbWriteMode = 'fail', engine: DbWriteEngine | None = None, engine_options: dict[str, Any] | None = None) int[source]¶
Write the data in a Polars DataFrame to a database.
Added in version 0.20.26: Support for instantiated connection objects in addition to URI strings, and a new engine_options parameter.
- Parameters:
- table_name
Schema-qualified name of the table to create or append to in the target SQL database. If your table name contains special characters, it should be quoted.
- connection
An existing SQLAlchemy or ADBC connection against the target database, or a URI string that will be used to instantiate such a connection, such as:
“postgresql://user:pass@server:port/database”
“sqlite:////path/to/database.db”
- if_table_exists{‘append’, ‘replace’, ‘fail’}
The insert mode:
‘replace’ will create a new database table, overwriting an existing one.
‘append’ will append to an existing table.
‘fail’ will fail if table already exists.
- engine{‘sqlalchemy’, ‘adbc’}
Select the engine to use for writing frame data; only necessary when supplying a URI string (defaults to ‘sqlalchemy’ if unset)
- engine_options
Additional options to pass to the insert method associated with the engine specified by the option engine.
Setting engine to “sqlalchemy” currently inserts using Pandas’ to_sql method (though this will eventually be phased out in favor of a native solution).
Setting engine to “adbc” inserts using the ADBC cursor’s adbc_ingest method.
- Returns:
- int
The number of rows affected, if the driver provides this information. Otherwise, returns -1.
Examples
Insert into a temporary table using a PostgreSQL URI and the ADBC engine:
>>> df.write_database(
...     table_name="target_table",
...     connection="postgresql://user:pass@server:port/database",
...     engine="adbc",
...     engine_options={"temporary": True},
... )
Insert into a table using a pyodbc SQLAlchemy connection to SQL Server that was instantiated with “fast_executemany=True” to improve performance:
>>> from sqlalchemy import create_engine
>>> pyodbc_uri = (
...     "mssql+pyodbc://user:pass@server:1433/test?"
...     "driver=ODBC+Driver+18+for+SQL+Server"
... )
>>> engine = create_engine(pyodbc_uri, fast_executemany=True)
>>> df.write_database(
...     table_name="target_table",
...     connection=engine,
... )
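The three if_table_exists modes can be illustrated without an external database. The sketch below uses the standard sqlite3 module with an in-memory database. The helper write_rows and its fixed two-column schema are hypothetical and only mimic the documented ‘fail’, ‘append’ and ‘replace’ semantics; they say nothing about how Polars or its engines implement them.

```python
import sqlite3


def write_rows(conn, table, rows, if_table_exists="fail"):
    """Hypothetical helper mirroring the 'fail' / 'append' / 'replace' modes."""
    if if_table_exists not in {"fail", "append", "replace"}:
        raise ValueError(f"unknown mode: {if_table_exists!r}")
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (table,)
    ).fetchone() is not None
    if exists and if_table_exists == "fail":
        raise ValueError(f"table {table!r} already exists")
    if exists and if_table_exists == "replace":
        conn.execute(f'DROP TABLE "{table}"')
        exists = False
    if not exists:
        # Fixed two-column schema, chosen only for this illustration.
        conn.execute(f'CREATE TABLE "{table}" (foo INTEGER, ham TEXT)')
    cursor = conn.executemany(f'INSERT INTO "{table}" VALUES (?, ?)', rows)
    return cursor.rowcount  # number of rows inserted


conn = sqlite3.connect(":memory:")
write_rows(conn, "target_table", [(1, "a"), (2, "b")])  # creates the table
write_rows(conn, "target_table", [(3, "c")], if_table_exists="append")
count_after_append = conn.execute("SELECT COUNT(*) FROM target_table").fetchone()[0]
write_rows(conn, "target_table", [(9, "z")], if_table_exists="replace")
count_after_replace = conn.execute("SELECT COUNT(*) FROM target_table").fetchone()[0]
print(count_after_append, count_after_replace)
```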
- write_delta(target: str | Path | deltalake.DeltaTable, *, mode: Literal['error', 'append', 'overwrite', 'ignore', 'merge'] = 'error', overwrite_schema: bool | None = None, storage_options: dict[str, str] | None = None, credential_provider: CredentialProviderFunction | Literal['auto'] | None = 'auto', delta_write_options: dict[str, Any] | None = None, delta_merge_options: dict[str, Any] | None = None) deltalake.table.TableMerger | None[source]¶
Write DataFrame as delta table.
- Parameters:
- target
URI of a table or a DeltaTable object.
- mode{‘error’, ‘append’, ‘overwrite’, ‘ignore’, ‘merge’}
How to handle existing data.
If ‘error’, throw an error if the table already exists (default).
If ‘append’, will add new data.
If ‘overwrite’, will replace table with new data.
If ‘ignore’, will not write anything if table already exists.
If ‘merge’, return a TableMerger object to merge data from the DataFrame with the existing data.
- overwrite_schema
If True, allows updating the schema of the table.
Deprecated since version 0.20.14: Use the parameter delta_write_options instead and pass {“schema_mode”: “overwrite”}.
- storage_options
Extra options for the storage backends supported by deltalake. For cloud storages, this may include configurations for authentication etc.
- credential_provider
Provide a function that can be called to provide cloud storage credentials. The function is expected to return a dictionary of credential keys along with an optional credential expiry time.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- delta_write_options
Additional keyword arguments while writing a Delta lake Table. See a list of supported write options here.
- delta_merge_options
Keyword arguments which are required to MERGE a Delta lake Table. See a list of supported merge options here.
- Raises:
- TypeError
If the DataFrame contains unsupported data types.
- ArrowInvalidError
If the DataFrame contains data types that could not be cast to their primitive type.
- TableNotFoundError
If the delta table doesn’t exist and MERGE action is triggered
Notes
The Polars data types Null and Time are not supported by the delta protocol specification and will raise a TypeError. Columns using the Categorical data type will be converted to normal (non-categorical) strings when written.
Polars columns are always nullable. To write data to a delta table with non-nullable columns, a custom pyarrow schema has to be passed to the delta_write_options. See the non-nullable columns example below.
Examples
Write a dataframe to the local filesystem as a Delta Lake table.
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3, 4, 5],
...         "bar": [6, 7, 8, 9, 10],
...         "ham": ["a", "b", "c", "d", "e"],
...     }
... )
>>> table_path = "/path/to/delta-table/"
>>> df.write_delta(table_path)
Append data to an existing Delta Lake table on the local filesystem. Note that this will fail if the schema of the new data does not match the schema of the existing table.
>>> df.write_delta(table_path, mode="append")
Overwrite a Delta Lake table as a new version. If the schemas of the new and old data are the same, specifying the schema_mode is not required.
>>> existing_table_path = "/path/to/delta-table/"
>>> df.write_delta(
...     existing_table_path,
...     mode="overwrite",
...     delta_write_options={"schema_mode": "overwrite"},
... )
Write a DataFrame as a Delta Lake table to a cloud object store like S3.
>>> table_path = "s3://bucket/prefix/to/delta-table/"
>>> df.write_delta(
...     table_path,
...     storage_options={
...         "AWS_REGION": "THE_AWS_REGION",
...         "AWS_ACCESS_KEY_ID": "THE_AWS_ACCESS_KEY_ID",
...         "AWS_SECRET_ACCESS_KEY": "THE_AWS_SECRET_ACCESS_KEY",
...     },
... )
Write DataFrame as a Delta Lake table with non-nullable columns.
>>> import pyarrow as pa
>>> existing_table_path = "/path/to/delta-table/"
>>> df.write_delta(
...     existing_table_path,
...     delta_write_options={
...         "schema": pa.schema([pa.field("foo", pa.int64(), nullable=False)])
...     },
... )
Write DataFrame as a Delta Lake table with zstd compression. For all delta_write_options keyword arguments, check the deltalake docs here, and for Writer Properties in particular here.
>>> import deltalake
>>> df.write_delta(
...     table_path,
...     delta_write_options={
...         "writer_properties": deltalake.WriterProperties(compression="zstd"),
...     },
... )
Merge the DataFrame with an existing Delta Lake table. For all TableMerger methods, check the deltalake docs here.
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3, 4, 5],
...         "bar": [6, 7, 8, 9, 10],
...         "ham": ["a", "b", "c", "d", "e"],
...     }
... )
>>> table_path = "/path/to/delta-table/"
>>> (
...     df.write_delta(
...         table_path,
...         mode="merge",
...         delta_merge_options={
...             "predicate": "s.foo = t.foo",
...             "source_alias": "s",
...             "target_alias": "t",
...         },
...     )
...     .when_matched_update_all()
...     .when_not_matched_insert_all()
...     .execute()
... )
- write_excel(workbook: str | Workbook | IO[bytes] | Path | None = None, worksheet: str | Worksheet | None = None, *, position: tuple[int, int] | str = 'A1', table_style: str | dict[str, Any] | None = None, table_name: str | None = None, column_formats: ColumnFormatDict | None = None, dtype_formats: dict[OneOrMoreDataTypes, str] | None = None, conditional_formats: ConditionalFormatDict | None = None, header_format: dict[str, Any] | None = None, column_totals: ColumnTotalsDefinition | None = None, column_widths: ColumnWidthsDefinition | None = None, row_totals: RowTotalsDefinition | None = None, row_heights: dict[int | tuple[int, ...], int] | int | None = None, sparklines: dict[str, Sequence[str] | dict[str, Any]] | None = None, formulas: dict[str, str | dict[str, str]] | None = None, float_precision: int = 3, include_header: bool = True, autofilter: bool = True, autofit: bool = False, hidden_columns: Sequence[str] | SelectorType | None = None, hide_gridlines: bool = False, sheet_zoom: int | None = None, freeze_panes: str | tuple[int, int] | tuple[str, int, int] | tuple[int, int, int, int] | None = None) Workbook[source]¶
Write frame data to a table in an Excel workbook/worksheet.
- Parameters:
- workbook{str, Workbook}
String name or path of the workbook to create, BytesIO object, file opened in binary-mode, or an xlsxwriter.Workbook object that has not been closed. If None, writes to dataframe.xlsx in the working directory.
- worksheet{str, Worksheet}
Name of target worksheet or an xlsxwriter.Worksheet object (in which case workbook must be the parent xlsxwriter.Workbook object); if None, writes to “Sheet1” when creating a new workbook (note that writing to an existing workbook requires a valid existing -or new- worksheet name).
- position{str, tuple}
Table position in Excel notation (eg: “A1”), or a (row,col) integer tuple.
- table_style{str, dict}
A named Excel table style, such as “Table Style Medium 4”, or a dictionary of {“key”:value,} options containing one or more of the following keys: “style”, “first_column”, “last_column”, “banded_columns”, “banded_rows”.
- table_namestr
Name of the output table object in the worksheet; can then be referred to in the sheet by formulae/charts, or by subsequent xlsxwriter operations.
- column_formatsdict
A {colname(s):str,} or {selector:str,} dictionary for applying an Excel format string to the given columns. Formats defined here (such as “dd/mm/yyyy”, “0.00%”, etc) will override any defined in dtype_formats.
- dtype_formatsdict
A {dtype:str,} dictionary that sets the default Excel format for the given dtype. (This can be overridden on a per-column basis by the column_formats param).
- conditional_formatsdict
A dictionary of colname (or selector) keys to a format str, dict, or list that defines conditional formatting options for the specified columns.
If supplying a string typename, should be one of the valid xlsxwriter types such as “3_color_scale”, “data_bar”, etc.
If supplying a dictionary you can make use of any/all xlsxwriter supported options, including icon sets, formulae, etc.
Supplying multiple columns as a tuple/key will apply a single format across all columns - this is effective in creating a heatmap, as the min/max values will be determined across the entire range, not per-column.
Finally, you can also supply a list made up from the above options in order to apply more than one conditional format to the same range.
- header_formatdict
A {key:value,} dictionary of xlsxwriter format options to apply to the table header row, such as {“bold”: True, “font_color”: “#702963”}.
- column_totals{bool, list, dict}
Add a column-total row to the exported table.
If True, all numeric columns will have an associated total using “sum”.
If passing a string, it must be one of the valid total function names and all numeric columns will have an associated total using that function.
If passing a list of colnames, only those given will have a total.
For more control, pass a {colname:funcname,} dict.
Valid column-total function names are “average”, “count_nums”, “count”, “max”, “min”, “std_dev”, “sum”, and “var”.
- column_widths{dict, int}
A {colname:int,} or {selector:int,} dict or a single integer that sets (or overrides if autofitting) table column widths, in integer pixel units. If given as an integer the same value is used for all table columns.
- row_totals{dict, list, bool}
Add a row-total column to the right-hand side of the exported table.
If True, a column called “total” will be added at the end of the table that applies a “sum” function row-wise across all numeric columns.
If passing a list/sequence of column names, only the matching columns will participate in the sum.
Can also pass a {colname:columns,} dictionary to create one or more total columns with distinct names, referencing different columns.
- row_heights{dict, int}
An int or {row_index:int,} dictionary that sets the height of the given rows (if providing a dictionary) or all rows (if providing an integer) that intersect with the table body (including any header and total row) in integer pixel units. Note that row_index starts at zero and will be the header row (unless include_header is False).
- sparklinesdict
A {colname:list,} or {colname:dict,} dictionary defining one or more sparklines to be written into a new column in the table.
If passing a list of colnames (used as the source of the sparkline data) the default sparkline settings are used (eg: line chart with no markers).
For more control an xlsxwriter-compliant options dict can be supplied, in which case three additional polars-specific keys are available: “columns”, “insert_before”, and “insert_after”. These allow you to define the source columns and position the sparkline(s) with respect to other table columns. If no position directive is given, sparklines are added to the end of the table (eg: to the far right) in the order they are given.
- formulasdict
A {colname:formula,} or {colname:dict,} dictionary defining one or more formulas to be written into a new column in the table. Note that you are strongly advised to use structured references in your formulae wherever possible to make it simple to reference columns by name.
If providing a string formula (such as “=[@colx]*[@coly]”) the column will be added to the end of the table (eg: to the far right), after any default sparklines and before any row_totals.
For the most control supply an options dictionary with the following keys: “formula” (mandatory), one of “insert_before” or “insert_after”, and optionally “return_dtype”. The latter is used to appropriately format the output of the formula and allow it to participate in row/column totals.
- float_precisionint
Default number of decimals displayed for floating point columns (note that this is purely a formatting directive; the actual values are not rounded).
- include_headerbool
Indicate if the table should be created with a header row.
- autofilterbool
If the table has headers, provide autofilter capability.
- autofitbool
Calculate individual column widths from the data.
- hidden_columnsstr | list
A column name, list of column names, or a selector representing table columns to mark as hidden in the output worksheet.
- hide_gridlinesbool
Do not display any gridlines on the output worksheet.
- sheet_zoomint
Set the default zoom level of the output worksheet.
- freeze_panesstr | (str, int, int) | (int, int) | (int, int, int, int)
Freeze workbook panes.
If (row, col) is supplied, panes are split at the top-left corner of the specified cell, which are 0-indexed. Thus, to freeze only the top row, supply (1, 0).
Alternatively, cell notation can be used to supply the cell. For example, “A2” indicates the split occurs at the top-left of cell A2, which is the equivalent of (1, 0).
If (row, col, top_row, top_col) are supplied, the panes are split based on the row and col, and the scrolling region is initialized to begin at the top_row and top_col. Thus, to freeze only the top row and have the scrolling region begin at row 10, column E (5th col), supply (1, 0, 9, 4). Using cell notation for (row, col), supplying (“A2”, 9, 4) is equivalent.
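The correspondence between cell notation and zero-indexed (row, col) tuples used by position and freeze_panes can be sketched in plain Python. The helper name cell_to_rowcol is hypothetical and is not part of polars or xlsxwriter; it only illustrates the mapping described above.

```python
import re


def cell_to_rowcol(cell: str) -> tuple[int, int]:
    """Convert Excel cell notation such as "A2" to a zero-indexed (row, col).

    Hypothetical helper for illustration only.
    """
    match = re.fullmatch(r"([A-Za-z]+)([1-9][0-9]*)", cell)
    if match is None:
        raise ValueError(f"not a valid cell reference: {cell!r}")
    letters, digits = match.groups()
    col = 0
    # Column letters form a bijective base-26 number: A=1, ..., Z=26, AA=27.
    for ch in letters.upper():
        col = col * 26 + (ord(ch) - ord("A") + 1)
    # Excel rows and columns are 1-based; the tuples are 0-based.
    return int(digits) - 1, col - 1


print(cell_to_rowcol("A2"))  # (1, 0): freezes only the top row
print(cell_to_rowcol("B4"))  # (3, 1)
```

Under this mapping, freeze_panes="A2" and freeze_panes=(1, 0) describe the same split.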
Notes
A list of compatible xlsxwriter format property names can be found here: https://xlsxwriter.readthedocs.io/format.html#format-methods-and-format-properties
Conditional formatting dictionaries should provide xlsxwriter-compatible definitions; polars will take care of how they are applied on the worksheet with respect to the relative sheet/column position. For supported options, see: https://xlsxwriter.readthedocs.io/working_with_conditional_formats.html
Similarly, sparkline option dictionaries should contain xlsxwriter-compatible key/values, as well as a mandatory polars “columns” key that defines the sparkline source data; these source columns should all be adjacent. Two other polars-specific keys are available to help define where the sparkline appears in the table: “insert_after”, and “insert_before”. The value associated with these keys should be the name of a column in the exported table. https://xlsxwriter.readthedocs.io/working_with_sparklines.html
Formula dictionaries must contain a key called “formula”, and then optional “insert_after”, “insert_before”, and/or “return_dtype” keys. These additional keys allow the column to be injected into the table at a specific location, and/or to define the return type of the formula (eg: “Int64”, “Float64”, etc). Formulas that refer to table columns should use Excel’s structured references syntax to ensure the formula is applied correctly and is table-relative. https://support.microsoft.com/en-us/office/using-structured-references-with-excel-tables-f5ed2452-2337-4f71-bed3-c8ae6d2b276e
If you want unformatted output, you can use a selector to apply the “General” format to all columns (or all non-temporal columns to preserve formatting of date/datetime columns), eg: column_formats={~cs.temporal(): “General”}.
Examples
Instantiate a basic DataFrame:
>>> from random import uniform
>>> from datetime import date
>>>
>>> df = pl.DataFrame(
...     {
...         "dtm": [date(2023, 1, 1), date(2023, 1, 2), date(2023, 1, 3)],
...         "num": [uniform(-500, 500), uniform(-500, 500), uniform(-500, 500)],
...         "val": [10_000, 20_000, 30_000],
...     }
... )
Export to “dataframe.xlsx” (the default workbook name, if not specified) in the working directory, add column totals on all numeric columns (“sum” by default), then autofit:
>>> df.write_excel(column_totals=True, autofit=True)
Write frame to a specific location on the sheet, set a named table style, apply US-style date formatting, increase floating point formatting precision, apply a non-default column total function to a specific column, autofit:
>>> df.write_excel(
...     position="B4",
...     table_style="Table Style Light 16",
...     dtype_formats={pl.Date: "mm/dd/yyyy"},
...     column_totals={"num": "average"},
...     float_precision=6,
...     autofit=True,
... )
Write the same frame to a named worksheet twice, applying different styles and conditional formatting to each table, adding custom-formatted table titles using explicit xlsxwriter integration:
>>> from xlsxwriter import Workbook
>>> with Workbook("multi_frame.xlsx") as wb:
...     # basic/default conditional formatting
...     df.write_excel(
...         workbook=wb,
...         worksheet="data",
...         position=(3, 1),  # specify position as (row,col) coordinates
...         conditional_formats={"num": "3_color_scale", "val": "data_bar"},
...         table_style="Table Style Medium 4",
...     )
...
...     # advanced conditional formatting, custom styles
...     df.write_excel(
...         workbook=wb,
...         worksheet="data",
...         position=(df.height + 7, 1),
...         table_style={
...             "style": "Table Style Light 4",
...             "first_column": True,
...         },
...         conditional_formats={
...             "num": {
...                 "type": "3_color_scale",
...                 "min_color": "#76933c",
...                 "mid_color": "#c4d79b",
...                 "max_color": "#ebf1de",
...             },
...             "val": {
...                 "type": "data_bar",
...                 "data_bar_2010": True,
...                 "bar_color": "#9bbb59",
...                 "bar_negative_color_same": True,
...                 "bar_negative_border_color_same": True,
...             },
...         },
...         column_formats={"num": "#,##0.000;[White]-#,##0.000"},
...         column_widths={"val": 125},
...         autofit=True,
...     )
...
...     # add some table titles (with a custom format)
...     ws = wb.get_worksheet_by_name("data")
...     fmt_title = wb.add_format(
...         {
...             "font_color": "#4f6228",
...             "font_size": 12,
...             "italic": True,
...             "bold": True,
...         }
...     )
...     ws.write(2, 1, "Basic/default conditional formatting", fmt_title)
...     ws.write(df.height + 6, 1, "Custom conditional formatting", fmt_title)
Export a table containing two different types of sparklines. Use default options for the “trend” sparkline and customized options (and positioning) for the “+/-” win_loss sparkline, with non-default integer formatting, column totals, a subtle two-tone heatmap and hidden worksheet gridlines:
>>> df = pl.DataFrame(
...     {
...         "id": ["aaa", "bbb", "ccc", "ddd", "eee"],
...         "q1": [100, 55, -20, 0, 35],
...         "q2": [30, -10, 15, 60, 20],
...         "q3": [-50, 0, 40, 80, 80],
...         "q4": [75, 55, 25, -10, -55],
...     }
... )
>>> df.write_excel(
...     table_style="Table Style Light 2",
...     # apply accounting format to all flavours of integer
...     dtype_formats={dt: "#,##0_);(#,##0)" for dt in [pl.Int32, pl.Int64]},
...     sparklines={
...         # default options; just provide source cols
...         "trend": ["q1", "q2", "q3", "q4"],
...         # customized sparkline type, with positioning directive
...         "+/-": {
...             "columns": ["q1", "q2", "q3", "q4"],
...             "insert_after": "id",
...             "type": "win_loss",
...         },
...     },
...     conditional_formats={
...         # create a unified multi-column heatmap
...         ("q1", "q2", "q3", "q4"): {
...             "type": "2_color_scale",
...             "min_color": "#95b3d7",
...             "max_color": "#ffffff",
...         },
...     },
...     column_totals=["q1", "q2", "q3", "q4"],
...     row_totals=True,
...     hide_gridlines=True,
... )
Export a table containing an Excel formula-based column that calculates a standardised Z-score, showing use of structured references in conjunction with positioning directives, column totals, and custom formatting.
>>> df = pl.DataFrame(
...     {
...         "id": ["a123", "b345", "c567", "d789", "e101"],
...         "points": [99, 45, 50, 85, 35],
...     }
... )
>>> df.write_excel(
...     table_style={
...         "style": "Table Style Medium 15",
...         "first_column": True,
...     },
...     column_formats={
...         "id": {"font": "Consolas"},
...         "points": {"align": "center"},
...         "z-score": {"align": "center"},
...     },
...     column_totals="average",
...     formulas={
...         "z-score": {
...             # use structured references to refer to the table columns and 'totals' row
...             "formula": "=STANDARDIZE([@points], [[#Totals],[points]], STDEV([points]))",
...             "insert_after": "points",
...             "return_dtype": pl.Float64,
...         }
...     },
...     hide_gridlines=True,
...     sheet_zoom=125,
... )
Create and reference a Worksheet object directly, adding a basic chart. Taking advantage of structured references to set chart series values and categories is strongly recommended so you do not have to calculate cell positions with respect to the frame data and worksheet:
>>> with Workbook("basic_chart.xlsx") as wb:
...     # create worksheet object and write frame data to it
...     ws = wb.add_worksheet("demo")
...     df.write_excel(
...         workbook=wb,
...         worksheet=ws,
...         table_name="DataTable",
...         table_style="Table Style Medium 26",
...         hide_gridlines=True,
...     )
...     # create chart object, point to the written table
...     # data using structured references, and style it
...     chart = wb.add_chart({"type": "column"})
...     chart.set_title({"name": "Example Chart"})
...     chart.set_legend({"none": True})
...     chart.set_style(38)
...     chart.add_series(
...         {  # note the use of structured references
...             "values": "=DataTable[points]",
...             "categories": "=DataTable[id]",
...             "data_labels": {"value": True},
...         }
...     )
...     # add chart to the worksheet
...     ws.insert_chart("D1", chart)
Export almost entirely unformatted data (no numeric styling or standardised floating point precision), omit autofilter, but keep date/datetime formatting:
>>> import polars.selectors as cs
>>> df = pl.DataFrame(
...     {
...         "n1": [-100, None, 200, 555],
...         "n2": [987.4321, -200, 44.444, 555.5],
...     }
... )
>>> df.write_excel(
...     column_formats={~cs.temporal(): "General"},
...     autofilter=False,
... )
- write_iceberg(target: str | pyiceberg.table.Table, mode: Literal['append', 'overwrite']) None[source]¶
Write DataFrame to an Iceberg table.
Warning
This functionality is currently considered unstable. It may be changed at any point without it being considered a breaking change.
- Parameters:
- target
Name of the table or the Table object representing an Iceberg table.
- mode{‘append’, ‘overwrite’}
How to handle existing data.
If ‘append’, will add new data.
If ‘overwrite’, will replace table with new data.
- write_ipc(file: str | Path | IO[bytes] | None, *, compression: IpcCompression = 'uncompressed', compat_level: CompatLevel | None = None, storage_options: dict[str, Any] | None = None, credential_provider: CredentialProviderFunction | Literal['auto'] | None = 'auto', retries: int = 2) BytesIO | None[source]¶
Write to Arrow IPC binary stream or Feather file.
See “File or Random Access format” in https://arrow.apache.org/docs/python/ipc.html.
Changed in version 1.1: The future parameter was renamed compat_level.
- Parameters:
- file
Path or writable file-like object to which the IPC data will be written. If set to None, the output is returned as a BytesIO object.
- compression{‘uncompressed’, ‘lz4’, ‘zstd’}
Compression method. Defaults to “uncompressed”.
- compat_level
Use a specific compatibility level when exporting Polars’ internal data structures.
- storage_options
Options that indicate how to connect to a cloud provider.
The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:
Hugging Face (hf://): Accepts an API key under the token parameter: {‘token’: ‘…’}, or by setting the HF_TOKEN environment variable.
If storage_options is not provided, Polars will try to infer the information from environment variables.
- credential_provider
Provide a function that can be called to provide cloud storage credentials. The function is expected to return a dictionary of credential keys along with an optional credential expiry time.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- retries
Number of retries if accessing a cloud instance fails.
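As a rough sketch of the shape described for credential_provider, the function below returns a dictionary of credential keys together with an expiry time. Several details are assumptions not stated in the text above: that the two values are returned as a tuple, that the expiry is a Unix timestamp in seconds, and that the AWS-style key names are accepted. The function name and the placeholder values are hypothetical.

```python
import time


def static_credential_provider():
    """Hypothetical provider returning (credentials, expiry).

    Assumption: expiry is a Unix timestamp in seconds, or None if the
    credentials never expire.
    """
    credentials = {
        # Placeholder values; key names are assumed, not taken from the text.
        "aws_access_key_id": "EXAMPLE_KEY_ID",
        "aws_secret_access_key": "EXAMPLE_SECRET",
    }
    expiry = int(time.time()) + 3600  # valid for one hour from now
    return credentials, expiry


creds, expiry = static_credential_provider()
print(sorted(creds))
```

Such a function would be passed as credential_provider=static_credential_provider in place of the default 'auto'.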
Examples
>>> import pathlib
>>>
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3, 4, 5],
...         "bar": [6, 7, 8, 9, 10],
...         "ham": ["a", "b", "c", "d", "e"],
...     }
... )
>>> path: pathlib.Path = dirpath / "new_file.arrow"
>>> df.write_ipc(path)
- write_ipc_stream(file: str | Path | IO[bytes] | None, *, compression: IpcCompression = 'uncompressed', compat_level: CompatLevel | None = None) BytesIO | None[source]¶
Write to Arrow IPC record batch stream.
See “Streaming format” in https://arrow.apache.org/docs/python/ipc.html.
Changed in version 1.1: The future parameter was renamed compat_level.
- Parameters:
- file
Path or writable file-like object to which the IPC record batch data will be written. If set to None, the output is returned as a BytesIO object.
- compression{‘uncompressed’, ‘lz4’, ‘zstd’}
Compression method. Defaults to “uncompressed”.
- compat_level
Use a specific compatibility level when exporting Polars’ internal data structures.
Examples
>>> import pathlib
>>>
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3, 4, 5],
...         "bar": [6, 7, 8, 9, 10],
...         "ham": ["a", "b", "c", "d", "e"],
...     }
... )
>>> path: pathlib.Path = dirpath / "new_file.arrow"
>>> df.write_ipc_stream(path)
- write_json(file: IOBase | str | Path | None = None) str | None[source]¶
Serialize to JSON representation.
- Parameters:
- file
File path or writable file-like object to which the result will be written. If set to None (default), the output is returned as a string instead.
Examples
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...     }
... )
>>> df.write_json()
'[{"foo":1,"bar":6},{"foo":2,"bar":7},{"foo":3,"bar":8}]'
- write_ndjson(file: str | Path | IO[str] | IO[bytes] | None = None) str | None[source]¶
Serialize to newline delimited JSON representation.
- Parameters:
- file
File path or writable file-like object to which the result will be written. If set to None (default), the output is returned as a string instead.
Examples
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...     }
... )
>>> df.write_ndjson()
'{"foo":1,"bar":6}\n{"foo":2,"bar":7}\n{"foo":3,"bar":8}\n'
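To make the difference between the two serializations concrete, the sketch below parses the example outputs of write_json and write_ndjson using only the standard library. The strings are copied from the examples above; no polars call is involved.

```python
import json

# Output strings as shown in the write_json and write_ndjson examples.
json_text = '[{"foo":1,"bar":6},{"foo":2,"bar":7},{"foo":3,"bar":8}]'
ndjson_text = '{"foo":1,"bar":6}\n{"foo":2,"bar":7}\n{"foo":3,"bar":8}\n'

# JSON: the whole document is a single array of row objects.
rows_from_json = json.loads(json_text)

# NDJSON: each non-empty line is an independent JSON document (one row).
rows_from_ndjson = [json.loads(line) for line in ndjson_text.splitlines() if line]

print(rows_from_json == rows_from_ndjson)
```

Both forms carry the same rows; NDJSON can be consumed line by line without loading the whole document first.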
- write_parquet(file: str | Path | IO[bytes], *, compression: ParquetCompression = 'zstd', compression_level: int | None = None, statistics: bool | str | dict[str, bool] = True, row_group_size: int | None = None, data_page_size: int | None = None, use_pyarrow: bool = False, pyarrow_options: dict[str, Any] | None = None, partition_by: str | Sequence[str] | None = None, partition_chunk_size_bytes: int = 4294967296, storage_options: dict[str, Any] | None = None, credential_provider: CredentialProviderFunction | Literal['auto'] | None = 'auto', retries: int = 2, metadata: ParquetMetadata | None = None, mkdir: bool = False) None[source]¶
Write to Apache Parquet file.
- Parameters:
- file
File path or writable file-like object to which the result will be written. This should be a path to a directory if writing a partitioned dataset.
- compression{‘lz4’, ‘uncompressed’, ‘snappy’, ‘gzip’, ‘lzo’, ‘brotli’, ‘zstd’}
Choose “zstd” for good compression performance. Choose “lz4” for fast compression/decompression. Choose “snappy” for more backwards compatibility guarantees when you deal with older parquet readers.
- compression_level
The level of compression to use. Higher compression means smaller files on disk.
“gzip” : min-level: 0, max-level: 9.
“brotli” : min-level: 0, max-level: 11.
“zstd” : min-level: 1, max-level: 22.
- statistics
Write statistics to the parquet headers. This is the default behavior.
Possible values:
True: enable default set of statistics (default). Some statistics may be disabled.
False: disable all statistics
“full”: calculate and write all available statistics. Cannot be combined with use_pyarrow.
{ “statistic-key”: True / False, … }. Cannot be combined with use_pyarrow. Available keys:
“min”: column minimum value (default: True)
“max”: column maximum value (default: True)
“distinct_count”: number of unique column values (default: False)
“null_count”: number of null values in column (default: True)
- row_group_size
Size of the row groups in number of rows. Defaults to 512^2 rows.
- data_page_size
Size of the data page in bytes. Defaults to 1024^2 bytes.
- use_pyarrow
Use the C++ parquet implementation (via pyarrow) instead of the Rust parquet implementation. At the moment the C++ implementation supports more features.
- pyarrow_options
Arguments passed to pyarrow.parquet.write_table.
If you pass partition_cols here, the dataset will be written using pyarrow.parquet.write_to_dataset. Passing partition_cols causes the dataset to be written to a directory, similar to Spark’s partitioned datasets.
- partition_by
Column(s) to partition by. A partitioned dataset will be written if this is specified. This parameter is considered unstable and is subject to change.
- partition_chunk_size_bytes
Approximate size to split DataFrames within a single partition when writing. Note this is calculated using the size of the DataFrame in memory - the size of the output file may differ depending on the file format / compression.
- storage_options
Options that indicate how to connect to a cloud provider.
The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:
Hugging Face (hf://): Accepts an API key under the token parameter: {‘token’: ‘…’}, or by setting the HF_TOKEN environment variable.
If storage_options is not provided, Polars will try to infer the information from environment variables.
- credential_provider
Provide a function that can be called to provide cloud storage credentials. The function is expected to return a dictionary of credential keys along with an optional credential expiry time.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- retries
Number of retries if accessing a cloud instance fails.
- metadata
A dictionary or callback to add key-values to the file-level Parquet metadata.
Warning
This functionality is considered experimental. It may be removed or changed at any point without it being considered a breaking change.
- mkdir: bool
Recursively create all the directories in the path.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
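The default sizes quoted in the parameter descriptions are written as powers. A quick calculation spells them out; the variable names are only labels for this sketch. Note that Python uses ** for exponentiation, whereas ^ is bitwise XOR.

```python
# Defaults as stated in the parameter descriptions above.
row_group_size_default = 512**2  # rows per row group
data_page_size_default = 1024**2  # bytes per data page
partition_chunk_size_bytes_default = 4294967296  # from the signature

print(row_group_size_default)  # 262144 rows
print(data_page_size_default)  # 1048576 bytes (1 MiB)
print(partition_chunk_size_bytes_default // 1024**3)  # 4 (GiB)
```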
Examples
>>> import pathlib
>>>
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3, 4, 5],
...         "bar": [6, 7, 8, 9, 10],
...         "ham": ["a", "b", "c", "d", "e"],
...     }
... )
>>> path: pathlib.Path = dirpath / "new_file.parquet"
>>> df.write_parquet(path)
We can use pyarrow with use_pyarrow=True and a partition_cols entry in pyarrow_options to write partitioned datasets. The following example will write the first row to ../watermark=1/.parquet and the other rows to ../watermark=2/.parquet.
>>> df = pl.DataFrame({"a": [1, 2, 3], "watermark": [1, 2, 2]})
>>> path: pathlib.Path = dirpath / "partitioned_object"
>>> df.write_parquet(
...     path,
...     use_pyarrow=True,
...     pyarrow_options={"partition_cols": ["watermark"]},
... )
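The directory layout produced by the partitioned example can be sketched without polars or pyarrow. The helper partition_dir is hypothetical and only mimics the column=value directory naming shown above; it does not write any files.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Same rows as the partitioned example above.
rows = [
    {"a": 1, "watermark": 1},
    {"a": 2, "watermark": 2},
    {"a": 3, "watermark": 2},
]


def partition_dir(root: str, row: dict, partition_cols: list[str]) -> PurePosixPath:
    """Build the column=value directory for a row (hypothetical helper)."""
    path = PurePosixPath(root)
    for col in partition_cols:
        path = path / f"{col}={row[col]}"
    return path


# Group the values of column "a" by the directory each row would land in.
groups = defaultdict(list)
for row in rows:
    key = str(partition_dir("partitioned_object", row, ["watermark"]))
    groups[key].append(row["a"])

for directory, values in groups.items():
    print(directory, values)
```

The first row ends up alone under watermark=1, and the remaining two rows share watermark=2.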
- class dataframely.Date(*, nullable: bool | None = None, primary_key: bool = False, min: date | None = None, min_exclusive: date | None = None, max: date | None = None, max_exclusive: date | None = None, resolution: str | None = None, check: Callable[[Expr], Expr] | Sequence[Callable[[Expr], Expr]] | Mapping[str, Callable[[Expr], Expr]] | None = None, alias: str | None = None, metadata: dict[str, Any] | None = None)[source]¶
Bases: OrdinalMixin[date], Column
A column of dates (without time).
- Attributes:
col: Obtain a Polars column expression for the column.
dtype: The polars dtype equivalent of this column definition’s data type.
name: Get the name of the column in a schema.
pyarrow_dtype: The pyarrow dtype equivalent of this column data type.
Methods
as_dict(expr): Turn the column definition into a dictionary.
from_dict(data): Read the column definition from a dictionary.
matches(other, expr): Check whether this column semantically matches another column.
pyarrow_field(name): Obtain the pyarrow field of this column definition.
sample(generator[, n]): Sample random elements adhering to the constraints of this column.
sqlalchemy_column(name, dialect): Obtain the SQL column specification of this column definition.
sqlalchemy_dtype(dialect): The sqlalchemy dtype equivalent of this column data type.
validate_dtype(dtype): Validate if the polars data type satisfies the column definition.
validation_rules(expr): A set of rules evaluating whether a data frame column satisfies the column's constraints.
- as_dict(expr: Expr) dict[str, Any][source]¶
Turn the column definition into a dictionary.
If the column definition references other column definitions, they will be turned into dictionaries recursively.
- Args:
- expr: An expression referencing the column to turn into a dictionary. This is required to properly encode custom checks.
- Returns:
The column definition as dictionary.
- Note:
This method stores custom checks as expressions rather than callables to allow for serialization.
- Note:
Do NOT use the returned object to evaluate semantic equality of two columns. It may yield different results than matches().
- Attention:
This method is only intended for internal use.
- property col: Expr¶
Obtain a Polars column expression for the column.
- property dtype: DataType¶
The polars dtype equivalent of this column definition’s data type.
This is primarily used for creating empty data frames with an appropriate schema. Thus, it should describe the default dtype equivalent if this data type encompasses multiple underlying data types.
- classmethod from_dict(data: dict[str, Any]) Self[source]¶
Read the column definition from a dictionary.
- Args:
data: The dictionary that was created via as_dict().
- Returns:
The column definition read from the dictionary.
- Attention:
This method is only intended for internal use.
- matches(other: Column, expr: Expr) bool[source]¶
Check whether this column semantically matches another column.
- Args:
other: The column to compare with.
expr: An expression referencing the column. This is required to properly evaluate the equivalence of custom checks.
- Returns:
Whether the columns are semantically equal.
- property name: str¶
Get the name of the column in a schema.
- property pyarrow_dtype: pa.DataType¶
The pyarrow dtype equivalent of this column data type.
- pyarrow_field(name: str) pa.Field[source]¶
Obtain the pyarrow field of this column definition.
- Args:
name: The name of the column.
- Returns:
The pyarrow field definition.
- sample(generator: Generator, n: int = 1) Series[source]¶
Sample random elements adhering to the constraints of this column.
- Args:
generator: The generator to use for sampling elements.
n: The number of elements to sample.
- Returns:
A series with the predefined number of elements. All elements are guaranteed to adhere to the column’s constraints.
- Raises:
- ValueError: If this column has a custom check. In this case, random values cannot be guaranteed to adhere to the column’s constraints while providing any guarantees on the computational complexity.
- sqlalchemy_column(name: str, dialect: sa.Dialect) sa.Column[source]¶
Obtain the SQL column specification of this column definition.
- Args:
name: The name of the column.
dialect: The SQL dialect for which to generate the column specification.
- Returns:
The column as specified in sqlalchemy.
- sqlalchemy_dtype(dialect: sa.Dialect) sa_TypeEngine[source]¶
The sqlalchemy dtype equivalent of this column data type.
- validate_dtype(dtype: DataType | DataTypeClass) bool[source]¶
Validate if the polars data type satisfies the column definition.
- Args:
dtype: The dtype to validate.
- Returns:
Whether the dtype is valid.
- validation_rules(expr: Expr) dict[str, Expr][source]¶
A set of rules evaluating whether a data frame column satisfies the column’s constraints.
- Args:
- expr: An expression referencing the column of the data frame, i.e. an expression created by calling polars.col().
- Returns:
A mapping from validation rule names to expressions that provide exactly one boolean value per column item indicating whether validation with respect to the rule is successful. A value of False indicates invalid data, i.e. unsuccessful validation.
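The shape of such a mapping can be illustrated with a plain-Python analogy for a date column with an inclusive lower bound and an exclusive upper bound. This is not dataframely's implementation: the function bound_rules, the rule names "min" and "max_exclusive", and the use of Python lists instead of Polars expressions are all assumptions made for illustration.

```python
from datetime import date


def bound_rules(values, *, lower, upper_exclusive):
    """Return one boolean per item for each (hypothetically named) rule."""
    return {
        "min": [v >= lower for v in values],  # inclusive lower bound
        "max_exclusive": [v < upper_exclusive for v in values],  # exclusive upper bound
    }


values = [date(2022, 12, 31), date(2023, 1, 1), date(2023, 6, 15), date(2024, 1, 1)]
rules = bound_rules(values, lower=date(2023, 1, 1), upper_exclusive=date(2024, 1, 1))

# An item is valid only if every rule evaluates to True for it.
valid = [all(flags) for flags in zip(*rules.values())]
print(rules["min"])            # [False, True, True, True]
print(rules["max_exclusive"])  # [True, True, True, False]
print(valid)                   # [False, True, True, False]
```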
- class dataframely.Datetime(*, nullable: bool | None = None, primary_key: bool = False, min: datetime | None = None, min_exclusive: datetime | None = None, max: datetime | None = None, max_exclusive: datetime | None = None, resolution: str | None = None, time_zone: str | tzinfo | None = None, time_unit: Literal['ns', 'us', 'ms'] = 'us', check: Callable[[Expr], Expr] | Sequence[Callable[[Expr], Expr]] | Mapping[str, Callable[[Expr], Expr]] | None = None, alias: str | None = None, metadata: dict[str, Any] | None = None)[source]¶
Bases: OrdinalMixin[datetime], Column
A column of datetimes.
- Attributes:
col: Obtain a Polars column expression for the column.
dtype: The polars dtype equivalent of this column definition’s data type.
name: Get the name of the column in a schema.
pyarrow_dtype: The pyarrow dtype equivalent of this column data type.
Methods
as_dict(expr): Turn the column definition into a dictionary.
from_dict(data): Read the column definition from a dictionary.
matches(other, expr): Check whether this column semantically matches another column.
pyarrow_field(name): Obtain the pyarrow field of this column definition.
sample(generator[, n]): Sample random elements adhering to the constraints of this column.
sqlalchemy_column(name, dialect): Obtain the SQL column specification of this column definition.
sqlalchemy_dtype(dialect): The sqlalchemy dtype equivalent of this column data type.
validate_dtype(dtype): Validate if the polars data type satisfies the column definition.
validation_rules(expr): A set of rules evaluating whether a data frame column satisfies the column's constraints.
- as_dict(expr: Expr) dict[str, Any][source]¶
Turn the column definition into a dictionary.
If the column definition references other column definitions, they will be turned into dictionaries recursively.
- Args:
- expr: An expression referencing the column to turn into a dictionary. This is required to properly encode custom checks.
- Returns:
The column definition as dictionary.
- Note:
This method stores custom checks as expressions rather than callables to allow for serialization.
- Note:
Do NOT use the returned object to evaluate semantic equality of two columns. It may yield different results than matches().
- Attention:
This method is only intended for internal use.
- property col: Expr¶
Obtain a Polars column expression for the column.
- property dtype: DataType¶
The polars dtype equivalent of this column definition’s data type.
This is primarily used for creating empty data frames with an appropriate schema. Thus, it should describe the default dtype equivalent if this data type encompasses multiple underlying data types.
- classmethod from_dict(data: dict[str, Any]) Self[source]¶
Read the column definition from a dictionary.
- Args:
data: The dictionary that was created via as_dict().
- Returns:
The column definition read from the dictionary.
- Attention:
This method is only intended for internal use.
- matches(other: Column, expr: Expr) bool[source]¶
Check whether this column semantically matches another column.
- Args:
other: The column to compare with.
expr: An expression referencing the column. This is required to properly evaluate the equivalence of custom checks.
- Returns:
Whether the columns are semantically equal.
- property name: str¶
Get the name of the column in a schema.
- property pyarrow_dtype: pa.DataType¶
The pyarrow dtype equivalent of this column data type.
- pyarrow_field(name: str) pa.Field[source]¶
Obtain the pyarrow field of this column definition.
- Args:
name: The name of the column.
- Returns:
The pyarrow field definition.
- sample(generator: Generator, n: int = 1) Series[source]¶
Sample random elements adhering to the constraints of this column.
- Args:
generator: The generator to use for sampling elements.
n: The number of elements to sample.
- Returns:
A series with the predefined number of elements. All elements are guaranteed to adhere to the column’s constraints.
- Raises:
- ValueError: If this column has a custom check. In this case, random values cannot be guaranteed to adhere to the column’s constraints while providing any guarantees on the computational complexity.
- sqlalchemy_column(name: str, dialect: sa.Dialect) sa.Column[source]¶
Obtain the SQL column specification of this column definition.
- Args:
name: The name of the column.
dialect: The SQL dialect for which to generate the column specification.
- Returns:
The column as specified in sqlalchemy.
- sqlalchemy_dtype(dialect: sa.Dialect) sa_TypeEngine[source]¶
The sqlalchemy dtype equivalent of this column data type.
- validate_dtype(dtype: DataType | DataTypeClass) bool[source]¶
Validate if the polars data type satisfies the column definition.
- Args:
dtype: The dtype to validate.
- Returns:
Whether the dtype is valid.
- validation_rules(expr: Expr) dict[str, Expr][source]¶
A set of rules evaluating whether a data frame column satisfies the column’s constraints.
- Args:
- expr: An expression referencing the column of the data frame, i.e. an expression created by calling polars.col().
- Returns:
A mapping from validation rule names to expressions that provide exactly one boolean value per column item indicating whether validation with respect to the rule is successful. A value of False indicates invalid data, i.e. unsuccessful validation.
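The contract of validation_rules() (rule names mapped to one boolean per column item, with False marking invalid data) can be illustrated without polars. The following is a minimal pure-Python sketch of that shape only: the rule names and the helper evaluate_rules are hypothetical and are not part of dataframely, which returns polars expressions rather than lists.

```python
from typing import Callable, Optional

# Hypothetical rules: each maps a rule name to a per-item predicate.
Rule = Callable[[Optional[int]], bool]

rules: dict[str, Rule] = {
    "nullability": lambda value: value is not None,
    "min": lambda value: value is None or value >= 0,
}


def evaluate_rules(
    values: list[Optional[int]], rules: dict[str, Rule]
) -> dict[str, list[bool]]:
    """Return one boolean per item and rule; False marks invalid data."""
    return {name: [rule(value) for value in values] for name, rule in rules.items()}


# Three items: a valid value, a null, and a value below the assumed minimum.
result = evaluate_rules([3, None, -1], rules)
```

Each rule is evaluated independently, so a single item can fail several rules at once.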
- class dataframely.Decimal(precision: int | None = None, scale: int = 0, *, nullable: bool | None = None, primary_key: bool = False, min: Decimal | None = None, min_exclusive: Decimal | None = None, max: Decimal | None = None, max_exclusive: Decimal | None = None, check: Callable[[Expr], Expr] | Sequence[Callable[[Expr], Expr]] | Mapping[str, Callable[[Expr], Expr]] | None = None, alias: str | None = None, metadata: dict[str, Any] | None = None)[source]¶
Bases: OrdinalMixin[Decimal], Column
A column of decimal values with given precision and scale.
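To make precision and scale concrete, here is a minimal sketch using only the standard library. It assumes the common convention that precision is the total number of digits and scale is the number of digits after the decimal point. The helper fits_decimal is hypothetical and not part of dataframely.

```python
from decimal import Decimal
from typing import Optional


def fits_decimal(value: Decimal, precision: Optional[int], scale: int = 0) -> bool:
    """Check whether `value` is representable with the given precision and scale.

    Assumption: `precision` counts all digits, `scale` counts fractional digits.
    """
    if not value.is_finite():
        return False  # NaN and infinities are never representable here
    # Drop trailing zeros so that 1.500 counts as one fractional digit.
    exponent = value.normalize().as_tuple().exponent
    fractional_digits = max(-exponent, 0)
    if fractional_digits > scale:
        return False
    if precision is None:
        return True
    # Digits before the decimal point must fit into `precision - scale`.
    integer_part = abs(int(value))
    integer_digits = len(str(integer_part)) if integer_part else 0
    return integer_digits <= precision - scale
```

For example, with precision=5 and scale=2, the value 123.45 fits while 1234.5 does not, because only three digits remain for the integer part.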
- Attributes:
col: Obtain a Polars column expression for the column.
dtype: The polars dtype equivalent of this column definition’s data type.
name: Get the name of the column in a schema.
pyarrow_dtype: The pyarrow dtype equivalent of this column data type.
Methods
as_dict(expr): Turn the column definition into a dictionary.
from_dict(data): Read the column definition from a dictionary.
matches(other, expr): Check whether this column semantically matches another column.
pyarrow_field(name): Obtain the pyarrow field of this column definition.
sample(generator[, n]): Sample random elements adhering to the constraints of this column.
sqlalchemy_column(name, dialect): Obtain the SQL column specification of this column definition.
sqlalchemy_dtype(dialect): The sqlalchemy dtype equivalent of this column data type.
validate_dtype(dtype): Validate if the polars data type satisfies the column definition.
validation_rules
- as_dict(expr: Expr) dict[str, Any][source]¶
Turn the column definition into a dictionary.
If the column definition references other column definitions, they will be turned into dictionaries recursively.
- Args:
- expr: An expression referencing the column to turn into a dictionary. This
is required to properly encode custom checks.
- Returns:
The column definition as dictionary.
- Note:
This method stores custom checks as expressions rather than callables to allow for serialization.
- Note:
Do NOT use the returned object to evaluate semantic equality of two columns. It may yield different results than matches().
- Attention:
This method is only intended for internal use.
- property col: Expr¶
Obtain a Polars column expression for the column.
- property dtype: DataType¶
The polars dtype equivalent of this column definition’s data type.
This is primarily used for creating empty data frames with an appropriate schema. Thus, it should describe the default dtype equivalent if this data type encompasses multiple underlying data types.
- classmethod from_dict(data: dict[str, Any]) Self[source]¶
Read the column definition from a dictionary.
- Args:
data: The dictionary that was created via as_dict().
- Returns:
The column definition read from the dictionary.
- Attention:
This method is only intended for internal use.
- matches(other: Column, expr: Expr) bool[source]¶
Check whether this column semantically matches another column.
- Args:
other: The column to compare with.
expr: An expression referencing the column. This is required to properly evaluate the equivalence of custom checks.
- Returns:
Whether the columns are semantically equal.
- property name: str¶
Get the name of the column in a schema.
- property pyarrow_dtype: pa.DataType¶
The pyarrow dtype equivalent of this column data type.
- pyarrow_field(name: str) pa.Field[source]¶
Obtain the pyarrow field of this column definition.
- Args:
name: The name of the column.
- Returns:
The pyarrow field definition.
- sample(generator: Generator, n: int = 1) Series[source]¶
Sample random elements adhering to the constraints of this column.
- Args:
generator: The generator to use for sampling elements.
n: The number of elements to sample.
- Returns:
A series with the predefined number of elements. All elements are guaranteed to adhere to the column’s constraints.
- Raises:
- ValueError: If this column has a custom check. In this case, random values
cannot be guaranteed to adhere to the column’s constraints while providing any guarantees on the computational complexity.
- sqlalchemy_column(name: str, dialect: sa.Dialect) sa.Column[source]¶
Obtain the SQL column specification of this column definition.
- Args:
name: The name of the column.
dialect: The SQL dialect for which to generate the column specification.
- Returns:
The column as specified in sqlalchemy.
- sqlalchemy_dtype(dialect: sa.Dialect) sa_TypeEngine[source]¶
The sqlalchemy dtype equivalent of this column data type.
- validate_dtype(dtype: DataType | DataTypeClass) bool[source]¶
Validate if the polars data type satisfies the column definition.
- Args:
dtype: The dtype to validate.
- Returns:
Whether the dtype is valid.
- validation_rules(expr: Expr) dict[str, Expr][source]¶
A set of rules evaluating whether a data frame column satisfies the column’s constraints.
- Args:
- expr: An expression referencing the column of the data frame, i.e. an expression created by calling polars.col().
- Returns:
A mapping from validation rule names to expressions that provide exactly one boolean value per column item indicating whether validation with respect to the rule is successful. A value of False indicates invalid data, i.e. unsuccessful validation.
- class dataframely.Duration(*, nullable: bool | None = None, primary_key: bool = False, min: timedelta | None = None, min_exclusive: timedelta | None = None, max: timedelta | None = None, max_exclusive: timedelta | None = None, resolution: str | None = None, check: Callable[[Expr], Expr] | Sequence[Callable[[Expr], Expr]] | Mapping[str, Callable[[Expr], Expr]] | None = None, alias: str | None = None, metadata: dict[str, Any] | None = None)[source]¶
Bases: OrdinalMixin[timedelta], Column
A column of durations.
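The four bound parameters in the signature (min, min_exclusive, max, max_exclusive) suggest inclusive and exclusive limits. A minimal standard-library sketch of that reading follows. The helper within_bounds is hypothetical, and the inclusive or exclusive semantics are inferred from the parameter names rather than taken from dataframely's implementation.

```python
from datetime import timedelta
from typing import Optional


def within_bounds(
    value: timedelta,
    *,
    min: Optional[timedelta] = None,  # mirrors the constructor; shadows the builtin locally
    min_exclusive: Optional[timedelta] = None,
    max: Optional[timedelta] = None,  # mirrors the constructor; shadows the builtin locally
    max_exclusive: Optional[timedelta] = None,
) -> bool:
    """Check a duration against assumed inclusive and exclusive bounds."""
    if min is not None and value < min:
        return False
    if min_exclusive is not None and value <= min_exclusive:
        return False
    if max is not None and value > max:
        return False
    if max_exclusive is not None and value >= max_exclusive:
        return False
    return True
```

Under this reading, a value of exactly one hour satisfies min=timedelta(hours=1) but not min_exclusive=timedelta(hours=1).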
- Attributes:
col: Obtain a Polars column expression for the column.
dtype: The polars dtype equivalent of this column definition’s data type.
name: Get the name of the column in a schema.
pyarrow_dtype: The pyarrow dtype equivalent of this column data type.
Methods
as_dict(expr): Turn the column definition into a dictionary.
from_dict(data): Read the column definition from a dictionary.
matches(other, expr): Check whether this column semantically matches another column.
pyarrow_field(name): Obtain the pyarrow field of this column definition.
sample(generator[, n]): Sample random elements adhering to the constraints of this column.
sqlalchemy_column(name, dialect): Obtain the SQL column specification of this column definition.
sqlalchemy_dtype(dialect): The sqlalchemy dtype equivalent of this column data type.
validate_dtype(dtype): Validate if the polars data type satisfies the column definition.
validation_rules(expr): A set of rules evaluating whether a data frame column satisfies the column's constraints.
- as_dict(expr: Expr) dict[str, Any][source]¶
Turn the column definition into a dictionary.
If the column definition references other column definitions, they will be turned into dictionaries recursively.
- Args:
- expr: An expression referencing the column to turn into a dictionary. This
is required to properly encode custom checks.
- Returns:
The column definition as dictionary.
- Note:
This method stores custom checks as expressions rather than callables to allow for serialization.
- Note:
Do NOT use the returned object to evaluate semantic equality of two columns. It may yield different results than matches().
- Attention:
This method is only intended for internal use.
- property col: Expr¶
Obtain a Polars column expression for the column.
- property dtype: DataType¶
The polars dtype equivalent of this column definition’s data type.
This is primarily used for creating empty data frames with an appropriate schema. Thus, it should describe the default dtype equivalent if this data type encompasses multiple underlying data types.
- classmethod from_dict(data: dict[str, Any]) Self[source]¶
Read the column definition from a dictionary.
- Args:
data: The dictionary that was created via as_dict().
- Returns:
The column definition read from the dictionary.
- Attention:
This method is only intended for internal use.
- matches(other: Column, expr: Expr) bool[source]¶
Check whether this column semantically matches another column.
- Args:
other: The column to compare with.
expr: An expression referencing the column. This is required to properly evaluate the equivalence of custom checks.
- Returns:
Whether the columns are semantically equal.
- property name: str¶
Get the name of the column in a schema.
- property pyarrow_dtype: pa.DataType¶
The pyarrow dtype equivalent of this column data type.
- pyarrow_field(name: str) pa.Field[source]¶
Obtain the pyarrow field of this column definition.
- Args:
name: The name of the column.
- Returns:
The pyarrow field definition.
- sample(generator: Generator, n: int = 1) Series[source]¶
Sample random elements adhering to the constraints of this column.
- Args:
generator: The generator to use for sampling elements.
n: The number of elements to sample.
- Returns:
A series with the predefined number of elements. All elements are guaranteed to adhere to the column’s constraints.
- Raises:
- ValueError: If this column has a custom check. In this case, random values
cannot be guaranteed to adhere to the column’s constraints while providing any guarantees on the computational complexity.
- sqlalchemy_column(name: str, dialect: sa.Dialect) sa.Column[source]¶
Obtain the SQL column specification of this column definition.
- Args:
name: The name of the column.
dialect: The SQL dialect for which to generate the column specification.
- Returns:
The column as specified in sqlalchemy.
- sqlalchemy_dtype(dialect: sa.Dialect) sa_TypeEngine[source]¶
The sqlalchemy dtype equivalent of this column data type.
- validate_dtype(dtype: DataType | DataTypeClass) bool[source]¶
Validate if the polars data type satisfies the column definition.
- Args:
dtype: The dtype to validate.
- Returns:
Whether the dtype is valid.
- validation_rules(expr: Expr) dict[str, Expr][source]¶
A set of rules evaluating whether a data frame column satisfies the column’s constraints.
- Args:
- expr: An expression referencing the column of the data frame, i.e. an expression created by calling polars.col().
- Returns:
A mapping from validation rule names to expressions that provide exactly one boolean value per column item indicating whether validation with respect to the rule is successful. A value of False indicates invalid data, i.e. unsuccessful validation.
- class dataframely.Enum(categories: Series | Iterable[str] | type[Enum], *, nullable: bool | None = None, primary_key: bool = False, check: Callable[[Expr], Expr] | Sequence[Callable[[Expr], Expr]] | Mapping[str, Callable[[Expr], Expr]] | None = None, alias: str | None = None, metadata: dict[str, Any] | None = None)[source]¶
Bases: Column
A column of enum (string) values.
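The categories parameter accepts a Series, an iterable of strings, or an enum class. The sketch below shows one plausible way to reduce the latter two to a plain list of strings. It assumes that the values of the enum members (not their names) serve as categories; that is an assumption, and normalize_categories is a hypothetical helper, not dataframely's code.

```python
import enum
from collections.abc import Iterable
from typing import Union


class Color(enum.Enum):
    """Example enum whose member values are strings."""

    RED = "red"
    GREEN = "green"


def normalize_categories(
    categories: Union[Iterable[str], type[enum.Enum]],
) -> list[str]:
    """Turn an iterable of strings or an enum class into a list of categories."""
    if isinstance(categories, type) and issubclass(categories, enum.Enum):
        # Assumption: member values, converted to strings, are the categories.
        return [str(member.value) for member in categories]
    return list(categories)
```

With this helper, normalize_categories(Color) and normalize_categories(["red", "green"]) describe the same set of categories.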
- Attributes:
col: Obtain a Polars column expression for the column.
dtype: The polars dtype equivalent of this column definition’s data type.
name: Get the name of the column in a schema.
pyarrow_dtype: The pyarrow dtype equivalent of this column data type.
Methods
as_dict(expr): Turn the column definition into a dictionary.
from_dict(data): Read the column definition from a dictionary.
matches(other, expr): Check whether this column semantically matches another column.
pyarrow_field(name): Obtain the pyarrow field of this column definition.
sample(generator[, n]): Sample random elements adhering to the constraints of this column.
sqlalchemy_column(name, dialect): Obtain the SQL column specification of this column definition.
sqlalchemy_dtype(dialect): The sqlalchemy dtype equivalent of this column data type.
validate_dtype(dtype): Validate if the polars data type satisfies the column definition.
validation_rules(expr): A set of rules evaluating whether a data frame column satisfies the column's constraints.
- as_dict(expr: Expr) dict[str, Any][source]¶
Turn the column definition into a dictionary.
If the column definition references other column definitions, they will be turned into dictionaries recursively.
- Args:
- expr: An expression referencing the column to turn into a dictionary. This
is required to properly encode custom checks.
- Returns:
The column definition as dictionary.
- Note:
This method stores custom checks as expressions rather than callables to allow for serialization.
- Note:
Do NOT use the returned object to evaluate semantic equality of two columns. It may yield different results than matches().
- Attention:
This method is only intended for internal use.
- property col: Expr¶
Obtain a Polars column expression for the column.
- property dtype: DataType¶
The polars dtype equivalent of this column definition’s data type.
This is primarily used for creating empty data frames with an appropriate schema. Thus, it should describe the default dtype equivalent if this data type encompasses multiple underlying data types.
- classmethod from_dict(data: dict[str, Any]) Self[source]¶
Read the column definition from a dictionary.
- Args:
data: The dictionary that was created via as_dict().
- Returns:
The column definition read from the dictionary.
- Attention:
This method is only intended for internal use.
- matches(other: Column, expr: Expr) bool[source]¶
Check whether this column semantically matches another column.
- Args:
other: The column to compare with.
expr: An expression referencing the column. This is required to properly evaluate the equivalence of custom checks.
- Returns:
Whether the columns are semantically equal.
- property name: str¶
Get the name of the column in a schema.
- property pyarrow_dtype: pa.DataType¶
The pyarrow dtype equivalent of this column data type.
- pyarrow_field(name: str) pa.Field[source]¶
Obtain the pyarrow field of this column definition.
- Args:
name: The name of the column.
- Returns:
The pyarrow field definition.
- sample(generator: Generator, n: int = 1) Series[source]¶
Sample random elements adhering to the constraints of this column.
- Args:
generator: The generator to use for sampling elements.
n: The number of elements to sample.
- Returns:
A series with the predefined number of elements. All elements are guaranteed to adhere to the column’s constraints.
- Raises:
- ValueError: If this column has a custom check. In this case, random values
cannot be guaranteed to adhere to the column’s constraints while providing any guarantees on the computational complexity.
- sqlalchemy_column(name: str, dialect: sa.Dialect) sa.Column[source]¶
Obtain the SQL column specification of this column definition.
- Args:
name: The name of the column.
dialect: The SQL dialect for which to generate the column specification.
- Returns:
The column as specified in sqlalchemy.
- sqlalchemy_dtype(dialect: sa.Dialect) sa_TypeEngine[source]¶
The sqlalchemy dtype equivalent of this column data type.
- validate_dtype(dtype: DataType | DataTypeClass) bool[source]¶
Validate if the polars data type satisfies the column definition.
- Args:
dtype: The dtype to validate.
- Returns:
Whether the dtype is valid.
- validation_rules(expr: Expr) dict[str, Expr][source]¶
A set of rules evaluating whether a data frame column satisfies the column’s constraints.
- Args:
- expr: An expression referencing the column of the data frame, i.e. an expression created by calling polars.col().
- Returns:
A mapping from validation rule names to expressions that provide exactly one boolean value per column item indicating whether validation with respect to the rule is successful. A value of False indicates invalid data, i.e. unsuccessful validation.
- class dataframely.FailureInfo(lf: LazyFrame, rule_columns: list[str], schema: type[S])[source]¶
Bases: Generic[S]
A container carrying information about rows failing validation in Schema.filter().
Methods
cooccurrence_counts(): The number of validation failures per co-occurring rule validation failure.
counts(): The number of validation failures for each individual rule.
invalid(): The rows of the original data frame containing the invalid rows.
read_delta(source, **kwargs): Read a delta lake table with the failure info.
read_parquet(source, **kwargs): Read a parquet file with the failure info.
scan_delta(source, **kwargs): Lazily read a delta lake table with the failure info.
scan_parquet(source, **kwargs): Lazily read a parquet file with the failure info.
sink_parquet(file, **kwargs): Stream the failure info to a parquet file.
write_delta(target, **kwargs): Write the failure info to a delta lake table.
write_parquet(file, **kwargs): Write the failure info to a parquet file.
- cooccurrence_counts() dict[frozenset[str], int][source]¶
The number of validation failures per co-occurring rule validation failure.
In contrast to counts(), this method provides additional information on whether a rule often fails because of another rule failing.
- Returns:
A mapping from each set of co-occurring rule validation failures to the count of such failures.
- Attention:
This method should primarily be used for debugging as it is much slower than counts().
- counts() dict[str, int][source]¶
The number of validation failures for each individual rule.
- Returns:
A mapping from rule name to counts. If a rule’s failure count is 0, it is not included here.
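The difference between counts() and cooccurrence_counts() can be sketched with the standard library. The input below (one set of failed rule names per invalid row) and the rule names themselves are hypothetical. The sketch only mirrors the documented return types, dict[str, int] and dict[frozenset[str], int], and does not reflect how dataframely computes them.

```python
from collections import Counter

# Hypothetical input: for each invalid row, the set of rules that failed.
failed_rules_per_row = [
    {"min"},
    {"min", "primary_key"},
    {"min", "primary_key"},
    {"nullability"},
]

# Per-rule view: how often each individual rule failed.
counts: dict[str, int] = dict(
    Counter(rule for failed in failed_rules_per_row for rule in failed)
)

# Co-occurrence view: how often each exact combination of rules failed together.
cooccurrence_counts: dict[frozenset[str], int] = dict(
    Counter(frozenset(failed) for failed in failed_rules_per_row)
)
```

Here "min" fails three times overall, but the co-occurrence view shows that two of those failures coincide with a "primary_key" failure.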
- classmethod read_delta(source: str | Path | deltalake.DeltaTable, **kwargs: Any) FailureInfo[Schema][source]¶
Read a delta lake table with the failure info.
- Args:
source: Path or DeltaTable from which to read the data.
kwargs: Additional keyword arguments passed directly to polars.read_delta().
- Returns:
The failure info object.
- Raises:
ValueError: If no appropriate metadata can be found.
- Attention:
Be aware that this method suffers from the same limitations as Schema.serialize().
- classmethod read_parquet(source: str | Path | IO[bytes], **kwargs: Any) FailureInfo[Schema][source]¶
Read a parquet file with the failure info.
- Args:
source: Path, directory, or file-like object from which to read the data.
kwargs: Additional keyword arguments passed directly to polars.read_parquet().
- Returns:
The failure info object.
- Raises:
ValueError: If no appropriate metadata can be found.
- Attention:
Be aware that this method suffers from the same limitations as Schema.serialize().
- classmethod scan_delta(source: str | Path | deltalake.DeltaTable, **kwargs: Any) FailureInfo[Schema][source]¶
Lazily read a delta lake table with the failure info.
- Args:
source: Path or DeltaTable from which to read the data.
kwargs: Additional keyword arguments passed directly to polars.scan_delta().
- Returns:
The failure info object.
- Raises:
ValueError: If no appropriate metadata can be found.
- Attention:
Be aware that this method suffers from the same limitations as Schema.serialize().
- classmethod scan_parquet(source: str | Path | IO[bytes], **kwargs: Any) FailureInfo[Schema][source]¶
Lazily read a parquet file with the failure info.
- Args:
source: Path, directory, or file-like object from which to read the data.
- Returns:
The failure info object.
- Raises:
ValueError: If no appropriate metadata can be found.
- Attention:
Be aware that this method suffers from the same limitations as Schema.serialize().
- schema: type[S]¶
The schema used to create the input data frame.
- sink_parquet(file: str | Path | IO[bytes] | PartitioningScheme, **kwargs: Any) None[source]¶
Stream the failure info to a parquet file.
- Args:
- file: The file path or writable file-like object to which to write the parquet file. This should be a path to a directory if writing a partitioned dataset.
- kwargs: Additional keyword arguments passed directly to polars.sink_parquet(). metadata may only be provided if it is a dictionary.
- Attention:
Be aware that this method suffers from the same limitations as Schema.serialize().
- write_delta(target: str | Path | deltalake.DeltaTable, **kwargs: Any) None[source]¶
Write the failure info to a delta lake table.
- Args:
target: The file path or DeltaTable to which to write the delta lake data.
kwargs: Additional keyword arguments passed directly to polars.write_delta().
- Attention:
Be aware that this method suffers from the same limitations as Schema.serialize().
- write_parquet(file: str | Path | IO[bytes], **kwargs: Any) None[source]¶
Write the failure info to a parquet file.
- Args:
- file: The file path or writable file-like object to which to write the parquet file. This should be a path to a directory if writing a partitioned dataset.
- kwargs: Additional keyword arguments passed directly to polars.write_parquet(). metadata may only be provided if it is a dictionary.
- Attention:
Be aware that this method suffers from the same limitations as Schema.serialize().
- class dataframely.Float(*, nullable: bool | None = None, primary_key: bool = False, allow_inf_nan: bool = False, min: float | None = None, min_exclusive: float | None = None, max: float | None = None, max_exclusive: float | None = None, check: Callable[[Expr], Expr] | Sequence[Callable[[Expr], Expr]] | Mapping[str, Callable[[Expr], Expr]] | None = None, alias: str | None = None, metadata: dict[str, Any] | None = None)[source]¶
Bases: _BaseFloat
A column of floats (with any number of bytes).
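The allow_inf_nan flag in the signature defaults to False. A minimal sketch of the check this name suggests is shown below. It assumes that infinities and NaN count as invalid unless the flag is set; is_valid_float is a hypothetical helper, not a dataframely function.

```python
import math


def is_valid_float(value: float, *, allow_inf_nan: bool = False) -> bool:
    """Accept finite values always; accept inf and NaN only when allowed."""
    return allow_inf_nan or math.isfinite(value)
```

Under this assumption, float("nan") and float("inf") are rejected by default and accepted with allow_inf_nan=True.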
- Attributes:
col: Obtain a Polars column expression for the column.
dtype: The polars dtype equivalent of this column definition’s data type.
name: Get the name of the column in a schema.
pyarrow_dtype: The pyarrow dtype equivalent of this column data type.
Methods
as_dict(expr): Turn the column definition into a dictionary.
from_dict(data): Read the column definition from a dictionary.
matches(other, expr): Check whether this column semantically matches another column.
pyarrow_field(name): Obtain the pyarrow field of this column definition.
sample(generator[, n]): Sample random elements adhering to the constraints of this column.
sqlalchemy_column(name, dialect): Obtain the SQL column specification of this column definition.
sqlalchemy_dtype(dialect): The sqlalchemy dtype equivalent of this column data type.
validate_dtype(dtype): Validate if the polars data type satisfies the column definition.
validation_rules(expr): A set of rules evaluating whether a data frame column satisfies the column's constraints.
- as_dict(expr: Expr) dict[str, Any][source]¶
Turn the column definition into a dictionary.
If the column definition references other column definitions, they will be turned into dictionaries recursively.
- Args:
- expr: An expression referencing the column to turn into a dictionary. This
is required to properly encode custom checks.
- Returns:
The column definition as dictionary.
- Note:
This method stores custom checks as expressions rather than callables to allow for serialization.
- Note:
Do NOT use the returned object to evaluate semantic equality of two columns. It may yield different results than matches().
- Attention:
This method is only intended for internal use.
- property col: Expr¶
Obtain a Polars column expression for the column.
- property dtype: DataType¶
The polars dtype equivalent of this column definition’s data type.
This is primarily used for creating empty data frames with an appropriate schema. Thus, it should describe the default dtype equivalent if this data type encompasses multiple underlying data types.
- classmethod from_dict(data: dict[str, Any]) Self[source]¶
Read the column definition from a dictionary.
- Args:
data: The dictionary that was created via as_dict().
- Returns:
The column definition read from the dictionary.
- Attention:
This method is only intended for internal use.
- matches(other: Column, expr: Expr) bool[source]¶
Check whether this column semantically matches another column.
- Args:
other: The column to compare with.
expr: An expression referencing the column. This is required to properly evaluate the equivalence of custom checks.
- Returns:
Whether the columns are semantically equal.
- max_value = 1.7976931348623157e+308¶
- min_value = -1.7976931348623157e+308¶
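The max_value and min_value shown for Float coincide with the largest finite IEEE 754 double and its negation. A short standard-library check makes this visible. It is an observation about the numbers, not a statement about how dataframely derives them.

```python
import sys

# Largest finite double-precision value and its negation.
float64_max = sys.float_info.max
float64_min = -sys.float_info.max

# Note: sys.float_info.min is the smallest positive normal double,
# not the most negative value, so it is not the counterpart of min_value.
smallest_positive_normal = sys.float_info.min
```

In other words, min_value here denotes the most negative finite value rather than the value closest to zero.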
- property name: str¶
Get the name of the column in a schema.
- property pyarrow_dtype: pa.DataType¶
The pyarrow dtype equivalent of this column data type.
- pyarrow_field(name: str) pa.Field[source]¶
Obtain the pyarrow field of this column definition.
- Args:
name: The name of the column.
- Returns:
The pyarrow field definition.
- sample(generator: Generator, n: int = 1) Series[source]¶
Sample random elements adhering to the constraints of this column.
- Args:
generator: The generator to use for sampling elements.
n: The number of elements to sample.
- Returns:
A series with the predefined number of elements. All elements are guaranteed to adhere to the column’s constraints.
- Raises:
- ValueError: If this column has a custom check. In this case, random values
cannot be guaranteed to adhere to the column’s constraints while providing any guarantees on the computational complexity.
- sqlalchemy_column(name: str, dialect: sa.Dialect) sa.Column[source]¶
Obtain the SQL column specification of this column definition.
- Args:
name: The name of the column.
dialect: The SQL dialect for which to generate the column specification.
- Returns:
The column as specified in sqlalchemy.
- sqlalchemy_dtype(dialect: sa.Dialect) sa_TypeEngine[source]¶
The sqlalchemy dtype equivalent of this column data type.
- validate_dtype(dtype: DataType | DataTypeClass) bool[source]¶
Validate if the polars data type satisfies the column definition.
- Args:
dtype: The dtype to validate.
- Returns:
Whether the dtype is valid.
- validation_rules(expr: Expr) dict[str, Expr][source]¶
A set of rules evaluating whether a data frame column satisfies the column’s constraints.
- Args:
- expr: An expression referencing the column of the data frame, i.e. an expression created by calling polars.col().
- Returns:
A mapping from validation rule names to expressions that provide exactly one boolean value per column item indicating whether validation with respect to the rule is successful. A value of False indicates invalid data, i.e. unsuccessful validation.
- class dataframely.Float32(*, nullable: bool | None = None, primary_key: bool = False, allow_inf_nan: bool = False, min: float | None = None, min_exclusive: float | None = None, max: float | None = None, max_exclusive: float | None = None, check: Callable[[Expr], Expr] | Sequence[Callable[[Expr], Expr]] | Mapping[str, Callable[[Expr], Expr]] | None = None, alias: str | None = None, metadata: dict[str, Any] | None = None)[source]¶
Bases: _BaseFloat
A column of float32 (“float”) values.
- Attributes:
col: Obtain a Polars column expression for the column.
dtype: The polars dtype equivalent of this column definition’s data type.
name: Get the name of the column in a schema.
pyarrow_dtype: The pyarrow dtype equivalent of this column data type.
Methods
as_dict(expr): Turn the column definition into a dictionary.
from_dict(data): Read the column definition from a dictionary.
matches(other, expr): Check whether this column semantically matches another column.
pyarrow_field(name): Obtain the pyarrow field of this column definition.
sample(generator[, n]): Sample random elements adhering to the constraints of this column.
sqlalchemy_column(name, dialect): Obtain the SQL column specification of this column definition.
sqlalchemy_dtype(dialect): The sqlalchemy dtype equivalent of this column data type.
validate_dtype(dtype): Validate if the polars data type satisfies the column definition.
validation_rules(expr): A set of rules evaluating whether a data frame column satisfies the column's constraints.
- as_dict(expr: Expr) dict[str, Any][source]¶
Turn the column definition into a dictionary.
If the column definition references other column definitions, they will be turned into dictionaries recursively.
- Args:
- expr: An expression referencing the column to turn into a dictionary. This
is required to properly encode custom checks.
- Returns:
The column definition as dictionary.
- Note:
This method stores custom checks as expressions rather than callables to allow for serialization.
- Note:
Do NOT use the returned object to evaluate semantic equality of two columns. It may yield different results than matches().
- Attention:
This method is only intended for internal use.
- property col: Expr¶
Obtain a Polars column expression for the column.
- property dtype: DataType¶
The polars dtype equivalent of this column definition’s data type.
This is primarily used for creating empty data frames with an appropriate schema. Thus, it should describe the default dtype equivalent if this data type encompasses multiple underlying data types.
- classmethod from_dict(data: dict[str, Any]) Self[source]¶
Read the column definition from a dictionary.
- Args:
data: The dictionary that was created via as_dict().
- Returns:
The column definition read from the dictionary.
- Attention:
This method is only intended for internal use.
- matches(other: Column, expr: Expr) bool[source]¶
Check whether this column semantically matches another column.
- Args:
other: The column to compare with.
expr: An expression referencing the column. This is required to properly evaluate the equivalence of custom checks.
- Returns:
Whether the columns are semantically equal.
- max_value = 3.4028234663852886e+38¶
- min_value = -3.4028234663852886e+38¶
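The Float32 limits above match the largest finite IEEE 754 single-precision value, whose bit pattern is 0x7F7FFFFF. The sketch below reconstructs that number with the standard library. It is meant as a way to verify the constant, not as dataframely's own derivation.

```python
import struct

# Bit pattern of the largest finite float32: exponent 0xFE, mantissa all ones.
FLOAT32_MAX_BITS = 0x7F7FFFFF

# Reinterpret the 32-bit integer as a big-endian float32.
float32_max = struct.unpack(">f", struct.pack(">I", FLOAT32_MAX_BITS))[0]
float32_min = -float32_max

# Packing the value back into 4 bytes reproduces the original bit pattern.
round_trip_bits = struct.unpack(">I", struct.pack(">f", float32_max))[0]
```

The same value can be written in closed form as (2 - 2**-23) * 2**127.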
- property name: str¶
Get the name of the column in a schema.
- property pyarrow_dtype: pa.DataType¶
The pyarrow dtype equivalent of this column data type.
- pyarrow_field(name: str) pa.Field[source]¶
Obtain the pyarrow field of this column definition.
- Args:
name: The name of the column.
- Returns:
The pyarrow field definition.
- sample(generator: Generator, n: int = 1) Series[source]¶
Sample random elements adhering to the constraints of this column.
- Args:
generator: The generator to use for sampling elements.
n: The number of elements to sample.
- Returns:
A series with the predefined number of elements. All elements are guaranteed to adhere to the column’s constraints.
- Raises:
- ValueError: If this column has a custom check. In this case, random values
cannot be guaranteed to adhere to the column’s constraints while providing any guarantees on the computational complexity.
- sqlalchemy_column(name: str, dialect: sa.Dialect) sa.Column[source]¶
Obtain the SQL column specification of this column definition.
- Args:
name: The name of the column.
dialect: The SQL dialect for which to generate the column specification.
- Returns:
The column as specified in sqlalchemy.
- sqlalchemy_dtype(dialect: sa.Dialect) sa_TypeEngine[source]¶
The sqlalchemy dtype equivalent of this column data type.
- validate_dtype(dtype: DataType | DataTypeClass) bool[source]¶
Validate if the polars data type satisfies the column definition.
- Args:
dtype: The dtype to validate.
- Returns:
Whether the dtype is valid.
- validation_rules(expr: Expr) dict[str, Expr][source]¶
A set of rules evaluating whether a data frame column satisfies the column’s constraints.
- Args:
- expr: An expression referencing the column of the data frame, i.e. an expression created by calling polars.col().
- Returns:
A mapping from validation rule names to expressions that provide exactly one boolean value per column item indicating whether validation with respect to the rule is successful. A value of False indicates invalid data, i.e. unsuccessful validation.
- class dataframely.Float64(*, nullable: bool | None = None, primary_key: bool = False, allow_inf_nan: bool = False, min: float | None = None, min_exclusive: float | None = None, max: float | None = None, max_exclusive: float | None = None, check: Callable[[Expr], Expr] | Sequence[Callable[[Expr], Expr]] | Mapping[str, Callable[[Expr], Expr]] | None = None, alias: str | None = None, metadata: dict[str, Any] | None = None)[source]¶
Bases: _BaseFloat
A column of float64 (“double”) values.
- Attributes:
col: Obtain a Polars column expression for the column.
dtype: The polars dtype equivalent of this column definition’s data type.
name: Get the name of the column in a schema.
pyarrow_dtype: The pyarrow dtype equivalent of this column data type.
Methods
as_dict(expr): Turn the column definition into a dictionary.
from_dict(data): Read the column definition from a dictionary.
matches(other, expr): Check whether this column semantically matches another column.
pyarrow_field(name): Obtain the pyarrow field of this column definition.
sample(generator[, n]): Sample random elements adhering to the constraints of this column.
sqlalchemy_column(name, dialect): Obtain the SQL column specification of this column definition.
sqlalchemy_dtype(dialect): The sqlalchemy dtype equivalent of this column data type.
validate_dtype(dtype): Validate if the polars data type satisfies the column definition.
validation_rules(expr): A set of rules evaluating whether a data frame column satisfies the column's constraints.
- as_dict(expr: Expr) dict[str, Any][source]¶
Turn the column definition into a dictionary.
If the column definition references other column definitions, they will be turned into dictionaries recursively.
- Args:
expr: An expression referencing the column to turn into a dictionary. This is required to properly encode custom checks.
- Returns:
The column definition as dictionary.
- Note:
This method stores custom checks as expressions rather than callables to allow for serialization.
- Note:
Do NOT use the returned object to evaluate semantic equality of two columns. It may yield different results than matches().
- Attention:
This method is only intended for internal use.
- property col: Expr¶
Obtain a Polars column expression for the column.
- property dtype: DataType¶
The polars dtype equivalent of this column definition’s data type.
This is primarily used for creating empty data frames with an appropriate schema. Thus, it should describe the default dtype equivalent if this data type encompasses multiple underlying data types.
- classmethod from_dict(data: dict[str, Any]) Self[source]¶
Read the column definition from a dictionary.
- Args:
data: The dictionary that was created via as_dict().
- Returns:
The column definition read from the dictionary.
- Attention:
This method is only intended for internal use.
- matches(other: Column, expr: Expr) bool[source]¶
Check whether this column semantically matches another column.
- Args:
other: The column to compare with.
expr: An expression referencing the column. This is required to properly evaluate the equivalence of custom checks.
- Returns:
Whether the columns are semantically equal.
- max_value = 1.7976931348623157e+308¶
- min_value = -1.7976931348623157e+308¶
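The two class attributes above are the largest and the most negative finite values an IEEE 754 double can hold. As a quick sanity check in plain Python (standard library only, no dataframely call involved; the constant names are illustrative), `max_value` coincides with `sys.float_info.max`, and `min_value` is its negation rather than `sys.float_info.min`, which is the smallest positive normalized double:

```python
import math
import sys

# Values copied from the Float64 class attributes documented above.
FLOAT64_MAX = 1.7976931348623157e+308
FLOAT64_MIN = -1.7976931348623157e+308

# The largest finite double is exactly sys.float_info.max.
assert FLOAT64_MAX == sys.float_info.max
# min_value is the most negative finite double, i.e. the negated maximum.
assert FLOAT64_MIN == -sys.float_info.max
# sys.float_info.min is something else: the smallest positive normalized double.
assert 0.0 < sys.float_info.min < 1.0

# Stepping one representable value beyond the maximum yields infinity
# (math.nextafter requires Python 3.9 or later).
beyond_max = math.nextafter(FLOAT64_MAX, math.inf)
assert math.isinf(beyond_max)
```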
- property name: str¶
Get the name of the column in a schema.
- property pyarrow_dtype: pa.DataType¶
The pyarrow dtype equivalent of this column data type.
- pyarrow_field(name: str) pa.Field[source]¶
Obtain the pyarrow field of this column definition.
- Args:
name: The name of the column.
- Returns:
The pyarrow field definition.
- sample(generator: Generator, n: int = 1) Series[source]¶
Sample random elements adhering to the constraints of this column.
- Args:
generator: The generator to use for sampling elements.
n: The number of elements to sample.
- Returns:
A series with the predefined number of elements. All elements are guaranteed to adhere to the column’s constraints.
- Raises:
ValueError: If this column has a custom check. In this case, random values cannot be guaranteed to adhere to the column’s constraints while providing any guarantees on the computational complexity.
- sqlalchemy_column(name: str, dialect: sa.Dialect) sa.Column[source]¶
Obtain the SQL column specification of this column definition.
- Args:
name: The name of the column.
dialect: The SQL dialect for which to generate the column specification.
- Returns:
The column as specified in sqlalchemy.
- sqlalchemy_dtype(dialect: sa.Dialect) sa_TypeEngine[source]¶
The sqlalchemy dtype equivalent of this column data type.
- validate_dtype(dtype: DataType | DataTypeClass) bool[source]¶
Validate if the polars data type satisfies the column definition.
- Args:
dtype: The dtype to validate.
- Returns:
Whether the dtype is valid.
- validation_rules(expr: Expr) dict[str, Expr][source]¶
A set of rules evaluating whether a data frame column satisfies the column’s constraints.
- Args:
expr: An expression referencing the column of the data frame, i.e. an expression created by calling polars.col().
- Returns:
A mapping from validation rule names to expressions that provide exactly one boolean value per column item indicating whether validation with respect to the rule is successful. A value of False indicates invalid data, i.e. unsuccessful validation.
- class dataframely.Int16(*, nullable: bool | None = None, primary_key: bool = False, min: int | None = None, min_exclusive: int | None = None, max: int | None = None, max_exclusive: int | None = None, is_in: Sequence[int] | None = None, check: Callable[[Expr], Expr] | Sequence[Callable[[Expr], Expr]] | Mapping[str, Callable[[Expr], Expr]] | None = None, alias: str | None = None, metadata: dict[str, Any] | None = None)[source]¶
Bases: _BaseInteger
A column of int16 values.
- Attributes:
col: Obtain a Polars column expression for the column.
dtype: The polars dtype equivalent of this column definition’s data type.
name: Get the name of the column in a schema.
pyarrow_dtype: The pyarrow dtype equivalent of this column data type.
Methods
as_dict(expr): Turn the column definition into a dictionary.
from_dict(data): Read the column definition from a dictionary.
matches(other, expr): Check whether this column semantically matches another column.
pyarrow_field(name): Obtain the pyarrow field of this column definition.
sample(generator[, n]): Sample random elements adhering to the constraints of this column.
sqlalchemy_column(name, dialect): Obtain the SQL column specification of this column definition.
sqlalchemy_dtype(dialect): The sqlalchemy dtype equivalent of this column data type.
validate_dtype(dtype): Validate if the polars data type satisfies the column definition.
validation_rules(expr): A set of rules evaluating whether a data frame column satisfies the column’s constraints.
- as_dict(expr: Expr) dict[str, Any][source]¶
Turn the column definition into a dictionary.
If the column definition references other column definitions, they will be turned into dictionaries recursively.
- Args:
expr: An expression referencing the column to turn into a dictionary. This is required to properly encode custom checks.
- Returns:
The column definition as dictionary.
- Note:
This method stores custom checks as expressions rather than callables to allow for serialization.
- Note:
Do NOT use the returned object to evaluate semantic equality of two columns. It may yield different results than matches().
- Attention:
This method is only intended for internal use.
- property col: Expr¶
Obtain a Polars column expression for the column.
- property dtype: DataType¶
The polars dtype equivalent of this column definition’s data type.
This is primarily used for creating empty data frames with an appropriate schema. Thus, it should describe the default dtype equivalent if this data type encompasses multiple underlying data types.
- classmethod from_dict(data: dict[str, Any]) Self[source]¶
Read the column definition from a dictionary.
- Args:
data: The dictionary that was created via as_dict().
- Returns:
The column definition read from the dictionary.
- Attention:
This method is only intended for internal use.
- is_unsigned = False¶
- matches(other: Column, expr: Expr) bool[source]¶
Check whether this column semantically matches another column.
- Args:
other: The column to compare with.
expr: An expression referencing the column. This is required to properly evaluate the equivalence of custom checks.
- Returns:
Whether the columns are semantically equal.
- max_value = 32767¶
- min_value = -32768¶
- property name: str¶
Get the name of the column in a schema.
- num_bytes = 2¶
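The `min_value`, `max_value`, and `num_bytes` attributes of the signed integer columns are related by two's-complement arithmetic. The following plain-Python sketch (it does not call dataframely; `signed_bounds` is a hypothetical helper introduced only for illustration) recomputes the documented bounds from the byte width:

```python
def signed_bounds(num_bytes: int) -> tuple[int, int]:
    """Return (min, max) of a two's-complement signed integer of the given width."""
    bits = 8 * num_bytes
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


# Byte widths as documented for Int8, Int16, Int32 and Int64.
for name, num_bytes in [("Int8", 1), ("Int16", 2), ("Int32", 4), ("Int64", 8)]:
    lo, hi = signed_bounds(num_bytes)
    print(f"{name}: min_value={lo}, max_value={hi}")

# The Int16 result matches the attributes listed above.
assert signed_bounds(2) == (-32768, 32767)
```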
- property pyarrow_dtype: pa.DataType¶
The pyarrow dtype equivalent of this column data type.
- pyarrow_field(name: str) pa.Field[source]¶
Obtain the pyarrow field of this column definition.
- Args:
name: The name of the column.
- Returns:
The pyarrow field definition.
- sample(generator: Generator, n: int = 1) Series[source]¶
Sample random elements adhering to the constraints of this column.
- Args:
generator: The generator to use for sampling elements.
n: The number of elements to sample.
- Returns:
A series with the predefined number of elements. All elements are guaranteed to adhere to the column’s constraints.
- Raises:
ValueError: If this column has a custom check. In this case, random values cannot be guaranteed to adhere to the column’s constraints while providing any guarantees on the computational complexity.
- sqlalchemy_column(name: str, dialect: sa.Dialect) sa.Column[source]¶
Obtain the SQL column specification of this column definition.
- Args:
name: The name of the column.
dialect: The SQL dialect for which to generate the column specification.
- Returns:
The column as specified in sqlalchemy.
- sqlalchemy_dtype(dialect: sa.Dialect) sa_TypeEngine[source]¶
The sqlalchemy dtype equivalent of this column data type.
- validate_dtype(dtype: DataType | DataTypeClass) bool[source]¶
Validate if the polars data type satisfies the column definition.
- Args:
dtype: The dtype to validate.
- Returns:
Whether the dtype is valid.
- validation_rules(expr: Expr) dict[str, Expr][source]¶
A set of rules evaluating whether a data frame column satisfies the column’s constraints.
- Args:
expr: An expression referencing the column of the data frame, i.e. an expression created by calling polars.col().
- Returns:
A mapping from validation rule names to expressions that provide exactly one boolean value per column item indicating whether validation with respect to the rule is successful. A value of False indicates invalid data, i.e. unsuccessful validation.
- class dataframely.Int32(*, nullable: bool | None = None, primary_key: bool = False, min: int | None = None, min_exclusive: int | None = None, max: int | None = None, max_exclusive: int | None = None, is_in: Sequence[int] | None = None, check: Callable[[Expr], Expr] | Sequence[Callable[[Expr], Expr]] | Mapping[str, Callable[[Expr], Expr]] | None = None, alias: str | None = None, metadata: dict[str, Any] | None = None)[source]¶
Bases: _BaseInteger
A column of int32 values.
- Attributes:
col: Obtain a Polars column expression for the column.
dtype: The polars dtype equivalent of this column definition’s data type.
name: Get the name of the column in a schema.
pyarrow_dtype: The pyarrow dtype equivalent of this column data type.
Methods
as_dict(expr): Turn the column definition into a dictionary.
from_dict(data): Read the column definition from a dictionary.
matches(other, expr): Check whether this column semantically matches another column.
pyarrow_field(name): Obtain the pyarrow field of this column definition.
sample(generator[, n]): Sample random elements adhering to the constraints of this column.
sqlalchemy_column(name, dialect): Obtain the SQL column specification of this column definition.
sqlalchemy_dtype(dialect): The sqlalchemy dtype equivalent of this column data type.
validate_dtype(dtype): Validate if the polars data type satisfies the column definition.
validation_rules(expr): A set of rules evaluating whether a data frame column satisfies the column’s constraints.
- as_dict(expr: Expr) dict[str, Any][source]¶
Turn the column definition into a dictionary.
If the column definition references other column definitions, they will be turned into dictionaries recursively.
- Args:
expr: An expression referencing the column to turn into a dictionary. This is required to properly encode custom checks.
- Returns:
The column definition as dictionary.
- Note:
This method stores custom checks as expressions rather than callables to allow for serialization.
- Note:
Do NOT use the returned object to evaluate semantic equality of two columns. It may yield different results than matches().
- Attention:
This method is only intended for internal use.
- property col: Expr¶
Obtain a Polars column expression for the column.
- property dtype: DataType¶
The polars dtype equivalent of this column definition’s data type.
This is primarily used for creating empty data frames with an appropriate schema. Thus, it should describe the default dtype equivalent if this data type encompasses multiple underlying data types.
- classmethod from_dict(data: dict[str, Any]) Self[source]¶
Read the column definition from a dictionary.
- Args:
data: The dictionary that was created via as_dict().
- Returns:
The column definition read from the dictionary.
- Attention:
This method is only intended for internal use.
- is_unsigned = False¶
- matches(other: Column, expr: Expr) bool[source]¶
Check whether this column semantically matches another column.
- Args:
other: The column to compare with.
expr: An expression referencing the column. This is required to properly evaluate the equivalence of custom checks.
- Returns:
Whether the columns are semantically equal.
- max_value = 2147483647¶
- min_value = -2147483648¶
- property name: str¶
Get the name of the column in a schema.
- num_bytes = 4¶
- property pyarrow_dtype: pa.DataType¶
The pyarrow dtype equivalent of this column data type.
- pyarrow_field(name: str) pa.Field[source]¶
Obtain the pyarrow field of this column definition.
- Args:
name: The name of the column.
- Returns:
The pyarrow field definition.
- sample(generator: Generator, n: int = 1) Series[source]¶
Sample random elements adhering to the constraints of this column.
- Args:
generator: The generator to use for sampling elements.
n: The number of elements to sample.
- Returns:
A series with the predefined number of elements. All elements are guaranteed to adhere to the column’s constraints.
- Raises:
ValueError: If this column has a custom check. In this case, random values cannot be guaranteed to adhere to the column’s constraints while providing any guarantees on the computational complexity.
- sqlalchemy_column(name: str, dialect: sa.Dialect) sa.Column[source]¶
Obtain the SQL column specification of this column definition.
- Args:
name: The name of the column.
dialect: The SQL dialect for which to generate the column specification.
- Returns:
The column as specified in sqlalchemy.
- sqlalchemy_dtype(dialect: sa.Dialect) sa_TypeEngine[source]¶
The sqlalchemy dtype equivalent of this column data type.
- validate_dtype(dtype: DataType | DataTypeClass) bool[source]¶
Validate if the polars data type satisfies the column definition.
- Args:
dtype: The dtype to validate.
- Returns:
Whether the dtype is valid.
- validation_rules(expr: Expr) dict[str, Expr][source]¶
A set of rules evaluating whether a data frame column satisfies the column’s constraints.
- Args:
expr: An expression referencing the column of the data frame, i.e. an expression created by calling polars.col().
- Returns:
A mapping from validation rule names to expressions that provide exactly one boolean value per column item indicating whether validation with respect to the rule is successful. A value of False indicates invalid data, i.e. unsuccessful validation.
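To make the shape of this return value concrete, here is a plain-Python model of the idea: each rule name maps to one boolean per item, and `False` marks an invalid item. It deliberately uses lists instead of polars expressions. The helper `bound_rules`, the rule names it produces, and the reading of `min`/`max` as inclusive and `min_exclusive`/`max_exclusive` as exclusive bounds are assumptions made for illustration, not dataframely's actual implementation or naming:

```python
from typing import Optional


def bound_rules(
    values: list[int],
    *,
    min: Optional[int] = None,
    min_exclusive: Optional[int] = None,
    max: Optional[int] = None,
    max_exclusive: Optional[int] = None,
) -> dict[str, list[bool]]:
    """Model of per-rule validation: one boolean per item for every active rule."""
    rules: dict[str, list[bool]] = {}
    if min is not None:
        rules["min"] = [v >= min for v in values]
    if min_exclusive is not None:
        rules["min_exclusive"] = [v > min_exclusive for v in values]
    if max is not None:
        rules["max"] = [v <= max for v in values]
    if max_exclusive is not None:
        rules["max_exclusive"] = [v < max_exclusive for v in values]
    return rules


result = bound_rules([-1, 0, 5, 10], min=0, max_exclusive=10)
# result == {"min": [False, True, True, True],
#            "max_exclusive": [True, True, True, False]}

# An item is valid only if every rule holds for it.
valid = [all(flags) for flags in zip(*result.values())]
# valid == [False, True, True, False]
```

Keeping one entry per rule, rather than a single combined boolean, is what makes it possible to report which constraint a given item violates.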
- class dataframely.Int64(*, nullable: bool | None = None, primary_key: bool = False, min: int | None = None, min_exclusive: int | None = None, max: int | None = None, max_exclusive: int | None = None, is_in: Sequence[int] | None = None, check: Callable[[Expr], Expr] | Sequence[Callable[[Expr], Expr]] | Mapping[str, Callable[[Expr], Expr]] | None = None, alias: str | None = None, metadata: dict[str, Any] | None = None)[source]¶
Bases: _BaseInteger
A column of int64 values.
- Attributes:
col: Obtain a Polars column expression for the column.
dtype: The polars dtype equivalent of this column definition’s data type.
name: Get the name of the column in a schema.
pyarrow_dtype: The pyarrow dtype equivalent of this column data type.
Methods
as_dict(expr): Turn the column definition into a dictionary.
from_dict(data): Read the column definition from a dictionary.
matches(other, expr): Check whether this column semantically matches another column.
pyarrow_field(name): Obtain the pyarrow field of this column definition.
sample(generator[, n]): Sample random elements adhering to the constraints of this column.
sqlalchemy_column(name, dialect): Obtain the SQL column specification of this column definition.
sqlalchemy_dtype(dialect): The sqlalchemy dtype equivalent of this column data type.
validate_dtype(dtype): Validate if the polars data type satisfies the column definition.
validation_rules(expr): A set of rules evaluating whether a data frame column satisfies the column’s constraints.
- as_dict(expr: Expr) dict[str, Any][source]¶
Turn the column definition into a dictionary.
If the column definition references other column definitions, they will be turned into dictionaries recursively.
- Args:
expr: An expression referencing the column to turn into a dictionary. This is required to properly encode custom checks.
- Returns:
The column definition as dictionary.
- Note:
This method stores custom checks as expressions rather than callables to allow for serialization.
- Note:
Do NOT use the returned object to evaluate semantic equality of two columns. It may yield different results than matches().
- Attention:
This method is only intended for internal use.
- property col: Expr¶
Obtain a Polars column expression for the column.
- property dtype: DataType¶
The polars dtype equivalent of this column definition’s data type.
This is primarily used for creating empty data frames with an appropriate schema. Thus, it should describe the default dtype equivalent if this data type encompasses multiple underlying data types.
- classmethod from_dict(data: dict[str, Any]) Self[source]¶
Read the column definition from a dictionary.
- Args:
data: The dictionary that was created via as_dict().
- Returns:
The column definition read from the dictionary.
- Attention:
This method is only intended for internal use.
- is_unsigned = False¶
- matches(other: Column, expr: Expr) bool[source]¶
Check whether this column semantically matches another column.
- Args:
other: The column to compare with.
expr: An expression referencing the column. This is required to properly evaluate the equivalence of custom checks.
- Returns:
Whether the columns are semantically equal.
- max_value = 9223372036854775807¶
- min_value = -9223372036854775808¶
- property name: str¶
Get the name of the column in a schema.
- num_bytes = 8¶
- property pyarrow_dtype: pa.DataType¶
The pyarrow dtype equivalent of this column data type.
- pyarrow_field(name: str) pa.Field[source]¶
Obtain the pyarrow field of this column definition.
- Args:
name: The name of the column.
- Returns:
The pyarrow field definition.
- sample(generator: Generator, n: int = 1) Series[source]¶
Sample random elements adhering to the constraints of this column.
- Args:
generator: The generator to use for sampling elements.
n: The number of elements to sample.
- Returns:
A series with the predefined number of elements. All elements are guaranteed to adhere to the column’s constraints.
- Raises:
ValueError: If this column has a custom check. In this case, random values cannot be guaranteed to adhere to the column’s constraints while providing any guarantees on the computational complexity.
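The reason for this restriction can be illustrated without dataframely. With only bound constraints, valid values can be drawn directly in a fixed number of steps; with an arbitrary predicate, the only generic strategy is rejection sampling, whose number of attempts has no upper bound. The sketch below uses the standard library's `random.Random`; the function names and the `max_attempts` cap are hypothetical and not part of dataframely's API:

```python
import random
from typing import Callable


def sample_in_bounds(rng: random.Random, n: int, lo: int, hi: int) -> list[int]:
    """Draw n values from [lo, hi]; exactly n draws are needed."""
    return [rng.randint(lo, hi) for _ in range(n)]


def sample_with_check(
    rng: random.Random,
    n: int,
    lo: int,
    hi: int,
    check: Callable[[int], bool],
    max_attempts: int,
) -> list[int]:
    """Rejection sampling: keep drawing until n values pass the custom check."""
    accepted: list[int] = []
    for _ in range(max_attempts):
        candidate = rng.randint(lo, hi)
        if check(candidate):
            accepted.append(candidate)
            if len(accepted) == n:
                return accepted
    raise ValueError("could not satisfy the custom check within max_attempts")


rng = random.Random(42)
values = sample_in_bounds(rng, 5, lo=0, hi=100)

# A check that no value in range can satisfy exhausts every attempt.
try:
    sample_with_check(rng, 5, 0, 100, check=lambda v: v > 1000, max_attempts=1_000)
    exhausted = False
except ValueError:
    exhausted = True
```

Without a cap such as `max_attempts`, the second function could loop forever, which is the kind of unbounded cost the `ValueError` described above avoids.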
- sqlalchemy_column(name: str, dialect: sa.Dialect) sa.Column[source]¶
Obtain the SQL column specification of this column definition.
- Args:
name: The name of the column.
dialect: The SQL dialect for which to generate the column specification.
- Returns:
The column as specified in sqlalchemy.
- sqlalchemy_dtype(dialect: sa.Dialect) sa_TypeEngine[source]¶
The sqlalchemy dtype equivalent of this column data type.
- validate_dtype(dtype: DataType | DataTypeClass) bool[source]¶
Validate if the polars data type satisfies the column definition.
- Args:
dtype: The dtype to validate.
- Returns:
Whether the dtype is valid.
- validation_rules(expr: Expr) dict[str, Expr][source]¶
A set of rules evaluating whether a data frame column satisfies the column’s constraints.
- Args:
expr: An expression referencing the column of the data frame, i.e. an expression created by calling polars.col().
- Returns:
A mapping from validation rule names to expressions that provide exactly one boolean value per column item indicating whether validation with respect to the rule is successful. A value of False indicates invalid data, i.e. unsuccessful validation.
- class dataframely.Int8(*, nullable: bool | None = None, primary_key: bool = False, min: int | None = None, min_exclusive: int | None = None, max: int | None = None, max_exclusive: int | None = None, is_in: Sequence[int] | None = None, check: Callable[[Expr], Expr] | Sequence[Callable[[Expr], Expr]] | Mapping[str, Callable[[Expr], Expr]] | None = None, alias: str | None = None, metadata: dict[str, Any] | None = None)[source]¶
Bases: _BaseInteger
A column of int8 values.
- Attributes:
col: Obtain a Polars column expression for the column.
dtype: The polars dtype equivalent of this column definition’s data type.
name: Get the name of the column in a schema.
pyarrow_dtype: The pyarrow dtype equivalent of this column data type.
Methods
as_dict(expr): Turn the column definition into a dictionary.
from_dict(data): Read the column definition from a dictionary.
matches(other, expr): Check whether this column semantically matches another column.
pyarrow_field(name): Obtain the pyarrow field of this column definition.
sample(generator[, n]): Sample random elements adhering to the constraints of this column.
sqlalchemy_column(name, dialect): Obtain the SQL column specification of this column definition.
sqlalchemy_dtype(dialect): The sqlalchemy dtype equivalent of this column data type.
validate_dtype(dtype): Validate if the polars data type satisfies the column definition.
validation_rules(expr): A set of rules evaluating whether a data frame column satisfies the column’s constraints.
- as_dict(expr: Expr) dict[str, Any][source]¶
Turn the column definition into a dictionary.
If the column definition references other column definitions, they will be turned into dictionaries recursively.
- Args:
expr: An expression referencing the column to turn into a dictionary. This is required to properly encode custom checks.
- Returns:
The column definition as dictionary.
- Note:
This method stores custom checks as expressions rather than callables to allow for serialization.
- Note:
Do NOT use the returned object to evaluate semantic equality of two columns. It may yield different results than matches().
- Attention:
This method is only intended for internal use.
- property col: Expr¶
Obtain a Polars column expression for the column.
- property dtype: DataType¶
The polars dtype equivalent of this column definition’s data type.
This is primarily used for creating empty data frames with an appropriate schema. Thus, it should describe the default dtype equivalent if this data type encompasses multiple underlying data types.
- classmethod from_dict(data: dict[str, Any]) Self[source]¶
Read the column definition from a dictionary.
- Args:
data: The dictionary that was created via as_dict().
- Returns:
The column definition read from the dictionary.
- Attention:
This method is only intended for internal use.
- is_unsigned = False¶
- matches(other: Column, expr: Expr) bool[source]¶
Check whether this column semantically matches another column.
- Args:
other: The column to compare with.
expr: An expression referencing the column. This is required to properly evaluate the equivalence of custom checks.
- Returns:
Whether the columns are semantically equal.
- max_value = 127¶
- min_value = -128¶
- property name: str¶
Get the name of the column in a schema.
- num_bytes = 1¶
- property pyarrow_dtype: pa.DataType¶
The pyarrow dtype equivalent of this column data type.
- pyarrow_field(name: str) pa.Field[source]¶
Obtain the pyarrow field of this column definition.
- Args:
name: The name of the column.
- Returns:
The pyarrow field definition.
- sample(generator: Generator, n: int = 1) Series[source]¶
Sample random elements adhering to the constraints of this column.
- Args:
generator: The generator to use for sampling elements.
n: The number of elements to sample.
- Returns:
A series with the predefined number of elements. All elements are guaranteed to adhere to the column’s constraints.
- Raises:
ValueError: If this column has a custom check. In this case, random values cannot be guaranteed to adhere to the column’s constraints while providing any guarantees on the computational complexity.
- sqlalchemy_column(name: str, dialect: sa.Dialect) sa.Column[source]¶
Obtain the SQL column specification of this column definition.
- Args:
name: The name of the column.
dialect: The SQL dialect for which to generate the column specification.
- Returns:
The column as specified in sqlalchemy.
- sqlalchemy_dtype(dialect: sa.Dialect) sa_TypeEngine[source]¶
The sqlalchemy dtype equivalent of this column data type.
- validate_dtype(dtype: DataType | DataTypeClass) bool[source]¶
Validate if the polars data type satisfies the column definition.
- Args:
dtype: The dtype to validate.
- Returns:
Whether the dtype is valid.
- validation_rules(expr: Expr) dict[str, Expr][source]¶
A set of rules evaluating whether a data frame column satisfies the column’s constraints.
- Args:
expr: An expression referencing the column of the data frame, i.e. an expression created by calling polars.col().
- Returns:
A mapping from validation rule names to expressions that provide exactly one boolean value per column item indicating whether validation with respect to the rule is successful. A value of False indicates invalid data, i.e. unsuccessful validation.
- class dataframely.Integer(*, nullable: bool | None = None, primary_key: bool = False, min: int | None = None, min_exclusive: int | None = None, max: int | None = None, max_exclusive: int | None = None, is_in: Sequence[int] | None = None, check: Callable[[Expr], Expr] | Sequence[Callable[[Expr], Expr]] | Mapping[str, Callable[[Expr], Expr]] | None = None, alias: str | None = None, metadata: dict[str, Any] | None = None)[source]¶
Bases: _BaseInteger
A column of integers (with any number of bytes).
- Attributes:
col: Obtain a Polars column expression for the column.
dtype: The polars dtype equivalent of this column definition’s data type.
name: Get the name of the column in a schema.
pyarrow_dtype: The pyarrow dtype equivalent of this column data type.
Methods
as_dict(expr): Turn the column definition into a dictionary.
from_dict(data): Read the column definition from a dictionary.
matches(other, expr): Check whether this column semantically matches another column.
pyarrow_field(name): Obtain the pyarrow field of this column definition.
sample(generator[, n]): Sample random elements adhering to the constraints of this column.
sqlalchemy_column(name, dialect): Obtain the SQL column specification of this column definition.
sqlalchemy_dtype(dialect): The sqlalchemy dtype equivalent of this column data type.
validate_dtype(dtype): Validate if the polars data type satisfies the column definition.
validation_rules(expr): A set of rules evaluating whether a data frame column satisfies the column’s constraints.
- as_dict(expr: Expr) dict[str, Any][source]¶
Turn the column definition into a dictionary.
If the column definition references other column definitions, they will be turned into dictionaries recursively.
- Args:
expr: An expression referencing the column to turn into a dictionary. This is required to properly encode custom checks.
- Returns:
The column definition as dictionary.
- Note:
This method stores custom checks as expressions rather than callables to allow for serialization.
- Note:
Do NOT use the returned object to evaluate semantic equality of two columns. It may yield different results than matches().
- Attention:
This method is only intended for internal use.
- property col: Expr¶
Obtain a Polars column expression for the column.
- property dtype: DataType¶
The polars dtype equivalent of this column definition’s data type.
This is primarily used for creating empty data frames with an appropriate schema. Thus, it should describe the default dtype equivalent if this data type encompasses multiple underlying data types.
- classmethod from_dict(data: dict[str, Any]) Self[source]¶
Read the column definition from a dictionary.
- Args:
data: The dictionary that was created via as_dict().
- Returns:
The column definition read from the dictionary.
- Attention:
This method is only intended for internal use.
- is_unsigned = False¶
- matches(other: Column, expr: Expr) bool[source]¶
Check whether this column semantically matches another column.
- Args:
other: The column to compare with.
expr: An expression referencing the column. This is required to properly evaluate the equivalence of custom checks.
- Returns:
Whether the columns are semantically equal.
- max_value = 9223372036854775807¶
- min_value = -9223372036854775808¶
- property name: str¶
Get the name of the column in a schema.
- num_bytes = 8¶
- property pyarrow_dtype: pa.DataType¶
The pyarrow dtype equivalent of this column data type.
- pyarrow_field(name: str) pa.Field[source]¶
Obtain the pyarrow field of this column definition.
- Args:
name: The name of the column.
- Returns:
The pyarrow field definition.
- sample(generator: Generator, n: int = 1) Series[source]¶
Sample random elements adhering to the constraints of this column.
- Args:
generator: The generator to use for sampling elements.
n: The number of elements to sample.
- Returns:
A series with the predefined number of elements. All elements are guaranteed to adhere to the column’s constraints.
- Raises:
ValueError: If this column has a custom check. In this case, random values cannot be guaranteed to adhere to the column’s constraints while providing any guarantees on the computational complexity.
- sqlalchemy_column(name: str, dialect: sa.Dialect) sa.Column[source]¶
Obtain the SQL column specification of this column definition.
- Args:
name: The name of the column.
dialect: The SQL dialect for which to generate the column specification.
- Returns:
The column as specified in sqlalchemy.
- sqlalchemy_dtype(dialect: sa.Dialect) sa_TypeEngine[source]¶
The sqlalchemy dtype equivalent of this column data type.
- validate_dtype(dtype: DataType | DataTypeClass) bool[source]¶
Validate if the polars data type satisfies the column definition.
- Args:
dtype: The dtype to validate.
- Returns:
Whether the dtype is valid.
- validation_rules(expr: Expr) dict[str, Expr][source]¶
A set of rules evaluating whether a data frame column satisfies the column’s constraints.
- Args:
expr: An expression referencing the column of the data frame, i.e. an expression created by calling polars.col().
- Returns:
A mapping from validation rule names to expressions that provide exactly one boolean value per column item indicating whether validation with respect to the rule is successful. A value of False indicates invalid data, i.e. unsuccessful validation.
- class dataframely.LazyFrame(data: FrameInitTypes | None = None, schema: SchemaDefinition | None = None, *, schema_overrides: SchemaDict | None = None, strict: bool = True, orient: Orientation | None = None, infer_schema_length: int | None = 100, nan_to_null: bool = False)[source]¶
Bases: LazyFrame, Generic[S]
Generic wrapper around a polars.LazyFrame to attach schema information.
This class is merely used for the type system and never actually instantiated. This means that no instances of it exist at runtime and isinstance(<var>, LazyFrame) will always evaluate to False. Accordingly, users should not try to create instances of this class.
- Attributes:
Methods
Approximate count of unique values.
bottom_k(k, *, by[, reverse])Return the k smallest rows.
cast(dtypes, *[, strict])Cast LazyFrame column(s) to the specified dtype(s).
collect(*args, **kwargs)Materialize this LazyFrame into a DataFrame.
Collect DataFrame asynchronously in thread pool.
Resolve the schema of this LazyFrame.
count()Return the number of non-null elements for each column.
describe([percentiles, interpolation])Creates a summary of statistics for a LazyFrame, returning a DataFrame.
deserialize(source, *[, format])Read a logical plan from a file to construct a LazyFrame.
drop(*columns[, strict])Remove columns from the DataFrame.
drop_nans([subset])Drop all rows that contain one or more NaN values.
drop_nulls([subset])Drop all rows that contain one or more null values.
explain(*[, format, optimized, ...])Create a string representation of the query plan.
explode(columns, *more_columns)Explode the DataFrame to long format by exploding the given columns.
fetch([n_rows])Collect a small number of rows for debugging purposes.
fill_nan(value)Fill floating point NaN values.
fill_null([value, strategy, limit, ...])Fill null values using the specified value or strategy.
filter(*predicates, **constraints)Filter rows in the LazyFrame based on a predicate expression.
first()Get the first row of the DataFrame.
gather_every(n[, offset])Take every nth row in the LazyFrame and return as a new LazyFrame.
group_by(*by[, maintain_order])Start a group by operation.
group_by_dynamic(index_column, *, every[, ...])Group based on a time value (or index value of type Int32, Int64).
head([n])Get the first n rows.
inspect([fmt])Inspect a node in the computation graph.
interpolate()Interpolate intermediate values.
join(other[, on, how, left_on, right_on, ...])Add a join operation to the Logical Plan.
join_asof(other, *[, left_on, right_on, on, ...])Perform an asof join.
join_where(other, *predicates[, suffix])Perform a join based on one or multiple (in)equality predicates.
last()Get the last row of the DataFrame.
limit([n])Get the first n rows.
map_batches(function, *[, ...])Apply a custom function.
match_to_schema(schema, *[, ...])Match or evolve the schema of a LazyFrame into a specific schema.
max()Aggregate the columns in the LazyFrame to their maximum value.
mean()Aggregate the columns in the LazyFrame to their mean value.
median()Aggregate the columns in the LazyFrame to their median value.
melt([id_vars, value_vars, variable_name, ...])Unpivot a DataFrame from wide to long format.
merge_sorted(other, key)Take two sorted DataFrames and merge them by the sorted key.
min()Aggregate the columns in the LazyFrame to their minimum value.
null_count()Aggregate the columns in the LazyFrame as the sum of their null value count.
pipe(function, *args, **kwargs)Offers a structured way to apply a sequence of user-defined functions (UDFs).
pipe_with_schema(function)Allows to alter the lazy frame during the plan stage with the resolved schema.
profile(*[, type_coercion, ...])Profile a LazyFrame.
quantile(quantile[, interpolation])Aggregate the columns in the LazyFrame to their quantile value.
remote([context, plan_type])Run a query remotely on Polars Cloud.
remove(*predicates, **constraints)Remove rows, dropping those that match the given predicate expression(s).
rename(mapping, *[, strict])Rename column names.
reverse()Reverse the DataFrame.
rolling(index_column, *, period[, offset, ...])Create rolling groups based on a temporal or integer column.
select(*exprs, **named_exprs)Select columns from this LazyFrame.
select_seq(*exprs, **named_exprs)Select columns from this LazyFrame.
serialize([file, format])Serialize the logical plan of this LazyFrame to a file or string in JSON format.
shift([n, fill_value])Shift values by the given number of indices.
show_graph(*[, optimized, show, ...])Show a plot of the query plan.
sink_csv()Evaluate the query in streaming mode and write to a CSV file.
sink_ipc()Evaluate the query in streaming mode and write to an IPC file.
sink_ndjson()Evaluate the query in streaming mode and write to an NDJSON file.
sink_parquet()Evaluate the query in streaming mode and write to a Parquet file.
slice(offset[, length])Get a slice of this DataFrame.
sort(by, *more_by[, descending, nulls_last, ...])Sort the LazyFrame by the given columns.
sql(query, *[, table_name])Execute a SQL query against the LazyFrame.
std([ddof])Aggregate the columns in the LazyFrame to their standard deviation value.
sum()Aggregate the columns in the LazyFrame to their sum value.
tail([n])Get the last n rows.
top_k(k, *, by[, reverse])Return the k largest rows.
unique([subset, keep, maintain_order])Drop duplicate rows from this DataFrame.
unnest(columns, *more_columns)Decompose struct columns into separate columns for each of their fields.
unpivot([on, index, variable_name, ...])Unpivot a DataFrame from wide to long format.
update(other[, on, how, left_on, right_on, ...])Update the values in this LazyFrame with the values in other.
var([ddof])Aggregate the columns in the LazyFrame to their variance value.
with_columns(*exprs, **named_exprs)Add columns to this LazyFrame.
with_columns_seq(*exprs, **named_exprs)Add columns to this LazyFrame.
with_context(other)Add an external context to the computation graph.
with_row_count([name, offset])Add a column at index 0 that counts the rows.
with_row_index([name, offset])Add a row index as the first column in the LazyFrame.
- approx_n_unique() LazyFrame[source]¶
Approximate count of unique values.
Deprecated since version 0.20.11: Use select(pl.all().approx_n_unique()) instead.
This is done using the HyperLogLog++ algorithm for cardinality estimation.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.approx_n_unique().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞═════╪═════╡
│ 4   ┆ 2   │
└─────┴─────┘
- bottom_k(k: int, *, by: IntoExpr | Iterable[IntoExpr], reverse: bool | Sequence[bool] = False) LazyFrame[source]¶
Return the k smallest rows.
Non-null elements are always preferred over null elements, regardless of the value of reverse. The output is not guaranteed to be in any particular order; call sort() after this function if you wish the output to be sorted.
Changed in version 1.0.0: The descending parameter was renamed reverse.
- Parameters:
- k
Number of rows to return.
- by
Column(s) used to determine the bottom rows. Accepts expression input. Strings are parsed as column names.
- reverse
Consider the k largest elements of the by column(s) (instead of the k smallest). This can be specified per column by passing a sequence of booleans.
See also
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [2, 1, 1, 3, 2, 1],
...     }
... )
Get the rows which contain the 4 smallest values in column b.
>>> lf.bottom_k(4, by="b").collect()
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ b   ┆ 1   │
│ a   ┆ 1   │
│ c   ┆ 1   │
│ a   ┆ 2   │
└─────┴─────┘
Get the rows which contain the 4 smallest values when sorting on column a and b.
>>> lf.bottom_k(4, by=["a", "b"]).collect()
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a   ┆ 1   │
│ a   ┆ 2   │
│ b   ┆ 1   │
│ b   ┆ 2   │
└─────┴─────┘
- cache = None¶
- cast(dtypes: Mapping[ColumnNameOrSelector | PolarsDataType, PolarsDataType | PythonDataType] | PolarsDataType | pl.DataTypeExpr, *, strict: bool = True) LazyFrame[source]¶
Cast LazyFrame column(s) to the specified dtype(s).
- Parameters:
- dtypes
Mapping of column names (or selector) to dtypes, or a single dtype to which all columns will be cast.
- strict
Throw an error if a cast could not be done (for instance, due to an overflow).
Examples
>>> from datetime import date
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": [date(2020, 1, 2), date(2021, 3, 4), date(2022, 5, 6)],
...     }
... )
Cast specific frame columns to the specified dtypes:
>>> lf.cast({"foo": pl.Float32, "bar": pl.UInt8}).collect()
shape: (3, 3)
┌─────┬─────┬────────────┐
│ foo ┆ bar ┆ ham        │
│ --- ┆ --- ┆ ---        │
│ f32 ┆ u8  ┆ date       │
╞═════╪═════╪════════════╡
│ 1.0 ┆ 6   ┆ 2020-01-02 │
│ 2.0 ┆ 7   ┆ 2021-03-04 │
│ 3.0 ┆ 8   ┆ 2022-05-06 │
└─────┴─────┴────────────┘
Cast all frame columns matching one dtype (or dtype group) to another dtype:
>>> lf.cast({pl.Date: pl.Datetime}).collect()
shape: (3, 3)
┌─────┬─────┬─────────────────────┐
│ foo ┆ bar ┆ ham                 │
│ --- ┆ --- ┆ ---                 │
│ i64 ┆ f64 ┆ datetime[μs]        │
╞═════╪═════╪═════════════════════╡
│ 1   ┆ 6.0 ┆ 2020-01-02 00:00:00 │
│ 2   ┆ 7.0 ┆ 2021-03-04 00:00:00 │
│ 3   ┆ 8.0 ┆ 2022-05-06 00:00:00 │
└─────┴─────┴─────────────────────┘
Use selectors to define the columns being cast:
>>> import polars.selectors as cs
>>> lf.cast({cs.numeric(): pl.UInt32, cs.temporal(): pl.String}).collect()
shape: (3, 3)
┌─────┬─────┬────────────┐
│ foo ┆ bar ┆ ham        │
│ --- ┆ --- ┆ ---        │
│ u32 ┆ u32 ┆ str        │
╞═════╪═════╪════════════╡
│ 1   ┆ 6   ┆ 2020-01-02 │
│ 2   ┆ 7   ┆ 2021-03-04 │
│ 3   ┆ 8   ┆ 2022-05-06 │
└─────┴─────┴────────────┘
Cast all frame columns to the specified dtype:
>>> lf.cast(pl.String).collect().to_dict(as_series=False)
{'foo': ['1', '2', '3'],
 'bar': ['6.0', '7.0', '8.0'],
 'ham': ['2020-01-02', '2021-03-04', '2022-05-06']}
- clear = None¶
- clone = None¶
- collect(*args: Any, **kwargs: Any) DataFrame[S][source]¶
Materialize this LazyFrame into a DataFrame.
By default, all query optimizations are enabled. Individual optimizations may be disabled by setting the corresponding parameter to False.
- Parameters:
- type_coercion
Do type coercion optimization.
Deprecated since version 1.30.0: Use the optimizations parameters.
- predicate_pushdown
Do predicate pushdown optimization.
Deprecated since version 1.30.0: Use the optimizations parameters.
- projection_pushdown
Do projection pushdown optimization.
Deprecated since version 1.30.0: Use the optimizations parameters.
- simplify_expression
Run simplify expressions optimization.
Deprecated since version 1.30.0: Use the optimizations parameters.
- slice_pushdown
Slice pushdown optimization.
Deprecated since version 1.30.0: Use the optimizations parameters.
- comm_subplan_elim
Will try to cache branching subplans that occur on self-joins or unions.
Deprecated since version 1.30.0: Use the optimizations parameters.
- comm_subexpr_elim
Common subexpressions will be cached and reused.
Deprecated since version 1.30.0: Use the optimizations parameters.
- cluster_with_columns
Combine sequential independent calls to with_columns
Deprecated since version 1.30.0: Use the optimizations parameters.
- collapse_joins
Collapse a join and filters into a faster join
Deprecated since version 1.30.0: Use the optimizations parameters.
- no_optimization
Turn off (certain) optimizations.
Deprecated since version 1.30.0: Use the optimizations parameters.
- engine
Select the engine used to process the query, optional. At the moment, if set to “auto” (default), the query is run using the polars in-memory engine. Polars will also attempt to use the engine set by the POLARS_ENGINE_AFFINITY environment variable. If it cannot run the query using the selected engine, the query is run using the polars in-memory engine. If set to “gpu”, the GPU engine is used. Fine-grained control over the GPU engine, for example which device to use on a system with multiple devices, is possible by providing a GPUEngine object with configuration options.
Note
GPU mode is considered unstable. Not all queries will run successfully on the GPU, however, they should fall back transparently to the default engine if execution is not supported.
Running with POLARS_VERBOSE=1 will provide information if a query falls back (and why).
Note
The GPU engine does not support streaming or running in the background. If either is enabled, GPU execution is switched off.
- background
Run the query in the background and get a handle to the query. This handle can be used to fetch the result or cancel the query.
Warning
Background mode is considered unstable. It may be changed at any point without it being considered a breaking change.
- optimizations
The optimization passes done during query optimization.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- Returns:
- DataFrame
See also
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...         "c": [6, 5, 4, 3, 2, 1],
...     }
... )
>>> lf.group_by("a").agg(pl.all().sum()).collect()
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ a   ┆ 4   ┆ 10  │
│ b   ┆ 11  ┆ 10  │
│ c   ┆ 6   ┆ 1   │
└─────┴─────┴─────┘
Collect in streaming mode
>>> lf.group_by("a").agg(pl.all().sum()).collect(
...     engine="streaming"
... )
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ a   ┆ 4   ┆ 10  │
│ b   ┆ 11  ┆ 10  │
│ c   ┆ 6   ┆ 1   │
└─────┴─────┴─────┘
Collect in GPU mode
>>> lf.group_by("a").agg(pl.all().sum()).collect(engine="gpu")
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ b   ┆ 11  ┆ 10  │
│ a   ┆ 4   ┆ 10  │
│ c   ┆ 6   ┆ 1   │
└─────┴─────┴─────┘
With control over the device used
>>> lf.group_by("a").agg(pl.all().sum()).collect(
...     engine=pl.GPUEngine(device=1)
... )
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ b   ┆ 11  ┆ 10  │
│ a   ┆ 4   ┆ 10  │
│ c   ┆ 6   ┆ 1   │
└─────┴─────┴─────┘
- collect_async(*, gevent: bool = False, engine: EngineType = 'auto', optimizations: QueryOptFlags = <polars.lazyframe.opt_flags.QueryOptFlags object>) Awaitable[DataFrame] | _GeventDataFrameResult[DataFrame][source]¶
Collect DataFrame asynchronously in thread pool.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
Collects into a DataFrame (like collect()) but, instead of returning a DataFrame directly, it is scheduled to be collected inside a thread pool, while this method returns almost instantly.
This can be useful if you use gevent or asyncio and want to release control to other greenlets/tasks while LazyFrames are being collected.
- Parameters:
- gevent
Return wrapper to gevent.event.AsyncResult instead of Awaitable
- engine
Select the engine used to process the query, optional. At the moment, if set to “auto” (default), the query is run using the polars in-memory engine. Polars will also attempt to use the engine set by the POLARS_ENGINE_AFFINITY environment variable. If it cannot run the query using the selected engine, the query is run using the polars in-memory engine.
Note
The GPU engine does not support async or running in the background. If either is enabled, GPU execution is switched off.
- optimizations
The optimization passes done during query optimization.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- Returns:
- If gevent=False (default) then returns an awaitable.
- If gevent=True then returns a wrapper that has a .get(block=True, timeout=None) method.
See also
polars.collect_allCollect multiple LazyFrames at the same time.
polars.collect_all_asyncCollect multiple LazyFrames at the same time lazily.
Notes
If an error occurs, set_exception is called on the asyncio.Future/gevent.event.AsyncResult, which then re-raises the exception.
Examples
>>> import asyncio
>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...         "c": [6, 5, 4, 3, 2, 1],
...     }
... )
>>> async def main():
...     return await (
...         lf.group_by("a", maintain_order=True)
...         .agg(pl.all().sum())
...         .collect_async()
...     )
>>> asyncio.run(main())
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ a   ┆ 4   ┆ 10  │
│ b   ┆ 11  ┆ 10  │
│ c   ┆ 6   ┆ 1   │
└─────┴─────┴─────┘
- collect_schema() Schema[source]¶
Resolve the schema of this LazyFrame.
Examples
Determine the schema.
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> lf.collect_schema()
Schema({'foo': Int64, 'bar': Float64, 'ham': String})
Access various properties of the schema.
>>> schema = lf.collect_schema()
>>> schema["bar"]
Float64
>>> schema.names()
['foo', 'bar', 'ham']
>>> schema.dtypes()
[Int64, Float64, String]
>>> schema.len()
3
- property columns: list[str]¶
Get the column names.
- Returns:
- list of str
A list containing the name of each column in order.
Warning
Determining the column names of a LazyFrame requires resolving its schema, which is a potentially expensive operation. Using collect_schema() is the idiomatic way of resolving the schema. This property exists only for symmetry with the DataFrame class.
See also
collect_schema
Schema.names
Examples
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... ).select("foo", "bar")
>>> lf.columns
['foo', 'bar']
- count() LazyFrame[source]¶
Return the number of non-null elements for each column.
Examples
>>> lf = pl.LazyFrame(
...     {"a": [1, 2, 3, 4], "b": [1, 2, 1, None], "c": [None, None, None, None]}
... )
>>> lf.count().collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 │
╞═════╪═════╪═════╡
│ 4   ┆ 3   ┆ 0   │
└─────┴─────┴─────┘
- describe(percentiles: Sequence[float] | float | None = (0.25, 0.5, 0.75), *, interpolation: QuantileMethod = 'nearest') DataFrame[source]¶
Creates a summary of statistics for a LazyFrame, returning a DataFrame.
- Parameters:
- percentiles
One or more percentiles to include in the summary statistics. All values must be in the range [0, 1].
- interpolation{‘nearest’, ‘higher’, ‘lower’, ‘midpoint’, ‘linear’, ‘equiprobable’}
Interpolation method used when calculating percentiles.
- Returns:
- DataFrame
Warning
This method does not maintain the laziness of the frame, and will collect the final result. This could potentially be an expensive operation.
We do not guarantee the output of describe to be stable. It will show statistics that we deem informative, and may be updated in the future. Using describe programmatically (versus interactive exploration) is not recommended for this reason.
Notes
The median is included by default as the 50% percentile.
Examples
>>> from datetime import date, time
>>> lf = pl.LazyFrame(
...     {
...         "float": [1.0, 2.8, 3.0],
...         "int": [40, 50, None],
...         "bool": [True, False, True],
...         "str": ["zz", "xx", "yy"],
...         "date": [date(2020, 1, 1), date(2021, 7, 5), date(2022, 12, 31)],
...         "time": [time(10, 20, 30), time(14, 45, 50), time(23, 15, 10)],
...     }
... )
Show default frame statistics:
>>> lf.describe()
shape: (9, 7)
┌────────────┬──────────┬──────────┬──────────┬──────┬─────────────────────┬──────────┐
│ statistic  ┆ float    ┆ int      ┆ bool     ┆ str  ┆ date                ┆ time     │
│ ---        ┆ ---      ┆ ---      ┆ ---      ┆ ---  ┆ ---                 ┆ ---      │
│ str        ┆ f64      ┆ f64      ┆ f64      ┆ str  ┆ str                 ┆ str      │
╞════════════╪══════════╪══════════╪══════════╪══════╪═════════════════════╪══════════╡
│ count      ┆ 3.0      ┆ 2.0      ┆ 3.0      ┆ 3    ┆ 3                   ┆ 3        │
│ null_count ┆ 0.0      ┆ 1.0      ┆ 0.0      ┆ 0    ┆ 0                   ┆ 0        │
│ mean       ┆ 2.266667 ┆ 45.0     ┆ 0.666667 ┆ null ┆ 2021-07-02 16:00:00 ┆ 16:07:10 │
│ std        ┆ 1.101514 ┆ 7.071068 ┆ null     ┆ null ┆ null                ┆ null     │
│ min        ┆ 1.0      ┆ 40.0     ┆ 0.0      ┆ xx   ┆ 2020-01-01          ┆ 10:20:30 │
│ 25%        ┆ 2.8      ┆ 40.0     ┆ null     ┆ null ┆ 2021-07-05          ┆ 14:45:50 │
│ 50%        ┆ 2.8      ┆ 50.0     ┆ null     ┆ null ┆ 2021-07-05          ┆ 14:45:50 │
│ 75%        ┆ 3.0      ┆ 50.0     ┆ null     ┆ null ┆ 2022-12-31          ┆ 23:15:10 │
│ max        ┆ 3.0      ┆ 50.0     ┆ 1.0      ┆ zz   ┆ 2022-12-31          ┆ 23:15:10 │
└────────────┴──────────┴──────────┴──────────┴──────┴─────────────────────┴──────────┘
Customize which percentiles are displayed, applying linear interpolation:
>>> with pl.Config(tbl_rows=12):
...     lf.describe(
...         percentiles=[0.1, 0.3, 0.5, 0.7, 0.9],
...         interpolation="linear",
...     )
shape: (11, 7)
┌────────────┬──────────┬──────────┬──────────┬──────┬─────────────────────┬──────────┐
│ statistic  ┆ float    ┆ int      ┆ bool     ┆ str  ┆ date                ┆ time     │
│ ---        ┆ ---      ┆ ---      ┆ ---      ┆ ---  ┆ ---                 ┆ ---      │
│ str        ┆ f64      ┆ f64      ┆ f64      ┆ str  ┆ str                 ┆ str      │
╞════════════╪══════════╪══════════╪══════════╪══════╪═════════════════════╪══════════╡
│ count      ┆ 3.0      ┆ 2.0      ┆ 3.0      ┆ 3    ┆ 3                   ┆ 3        │
│ null_count ┆ 0.0      ┆ 1.0      ┆ 0.0      ┆ 0    ┆ 0                   ┆ 0        │
│ mean       ┆ 2.266667 ┆ 45.0     ┆ 0.666667 ┆ null ┆ 2021-07-02 16:00:00 ┆ 16:07:10 │
│ std        ┆ 1.101514 ┆ 7.071068 ┆ null     ┆ null ┆ null                ┆ null     │
│ min        ┆ 1.0      ┆ 40.0     ┆ 0.0      ┆ xx   ┆ 2020-01-01          ┆ 10:20:30 │
│ 10%        ┆ 1.36     ┆ 41.0     ┆ null     ┆ null ┆ 2020-04-20          ┆ 11:13:34 │
│ 30%        ┆ 2.08     ┆ 43.0     ┆ null     ┆ null ┆ 2020-11-26          ┆ 12:59:42 │
│ 50%        ┆ 2.8      ┆ 45.0     ┆ null     ┆ null ┆ 2021-07-05          ┆ 14:45:50 │
│ 70%        ┆ 2.88     ┆ 47.0     ┆ null     ┆ null ┆ 2022-02-07          ┆ 18:09:34 │
│ 90%        ┆ 2.96     ┆ 49.0     ┆ null     ┆ null ┆ 2022-09-13          ┆ 21:33:18 │
│ max        ┆ 3.0      ┆ 50.0     ┆ 1.0      ┆ zz   ┆ 2022-12-31          ┆ 23:15:10 │
└────────────┴──────────┴──────────┴──────────┴──────┴─────────────────────┴──────────┘
- classmethod deserialize(source: str | Path | IOBase, *, format: SerializationFormat = 'binary') LazyFrame[source]¶
Read a logical plan from a file to construct a LazyFrame.
- Parameters:
- source
Path to a file or a file-like object (by file-like object, we refer to objects that have a read() method, such as a file handler (e.g. via builtin open function) or BytesIO).
- format
The format with which the LazyFrame was serialized. Options:
“binary”: Deserialize from binary format (bytes). This is the default.
“json”: Deserialize from JSON format (string).
Warning
This function uses pickle if the logical plan contains Python UDFs, and as such inherits the security implications. Deserializing can execute arbitrary code, so it should only be attempted on trusted data.
See also
Notes
Serialization is not stable across Polars versions: a LazyFrame serialized in one Polars version may not be deserializable in another Polars version.
Examples
>>> import io
>>> lf = pl.LazyFrame({"a": [1, 2, 3]}).sum()
>>> bytes = lf.serialize()
>>> pl.LazyFrame.deserialize(io.BytesIO(bytes)).collect()
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 6   │
└─────┘
- drop(*columns: ColumnNameOrSelector | Iterable[ColumnNameOrSelector], strict: bool = True) LazyFrame[source]¶
Remove columns from the DataFrame.
- Parameters:
- *columns
Names of the columns that should be removed from the dataframe. Accepts column selector input.
- strict
Validate that all column names exist in the current schema, and throw an exception if any do not.
Examples
Drop a single column by passing the name of that column.
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> lf.drop("ham").collect()
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 1   ┆ 6.0 │
│ 2   ┆ 7.0 │
│ 3   ┆ 8.0 │
└─────┴─────┘
Drop multiple columns by passing a selector.
>>> import polars.selectors as cs
>>> lf.drop(cs.numeric()).collect()
shape: (3, 1)
┌─────┐
│ ham │
│ --- │
│ str │
╞═════╡
│ a   │
│ b   │
│ c   │
└─────┘
Use positional arguments to drop multiple columns.
>>> lf.drop("foo", "ham").collect()
shape: (3, 1)
┌─────┐
│ bar │
│ --- │
│ f64 │
╞═════╡
│ 6.0 │
│ 7.0 │
│ 8.0 │
└─────┘
- drop_nans(subset: ColumnNameOrSelector | Collection[ColumnNameOrSelector] | None = None) LazyFrame[source]¶
Drop all rows that contain one or more NaN values.
The original order of the remaining rows is preserved.
- Parameters:
- subset
Column name(s) for which NaN values are considered; if set to None (default), use all columns (note that only floating-point columns can contain NaNs).
See also
Notes
A NaN value is not the same as a null value. To drop null values, use drop_nulls().
Examples
>>> lf = pl.LazyFrame(
...     {
...         "foo": [-20.5, float("nan"), 80.0],
...         "bar": [float("nan"), 110.0, 25.5],
...         "ham": ["xxx", "yyy", None],
...     }
... )
The default behavior of this method is to drop rows where any single value in the row is NaN:
>>> lf.drop_nans().collect()
shape: (1, 3)
┌──────┬──────┬──────┐
│ foo  ┆ bar  ┆ ham  │
│ ---  ┆ ---  ┆ ---  │
│ f64  ┆ f64  ┆ str  │
╞══════╪══════╪══════╡
│ 80.0 ┆ 25.5 ┆ null │
└──────┴──────┴──────┘
This behaviour can be constrained to consider only a subset of columns, as defined by name, or with a selector. For example, dropping rows only if there is a NaN in the “bar” column:
>>> lf.drop_nans(subset=["bar"]).collect()
shape: (2, 3)
┌──────┬───────┬──────┐
│ foo  ┆ bar   ┆ ham  │
│ ---  ┆ ---   ┆ ---  │
│ f64  ┆ f64   ┆ str  │
╞══════╪═══════╪══════╡
│ NaN  ┆ 110.0 ┆ yyy  │
│ 80.0 ┆ 25.5  ┆ null │
└──────┴───────┴──────┘
Dropping a row only if all values are NaN requires a different formulation:
>>> lf = pl.LazyFrame(
...     {
...         "a": [float("nan"), float("nan"), float("nan"), float("nan")],
...         "b": [10.0, 2.5, float("nan"), 5.25],
...         "c": [65.75, float("nan"), float("nan"), 10.5],
...     }
... )
>>> lf.filter(~pl.all_horizontal(pl.all().is_nan())).collect()
shape: (3, 3)
┌─────┬──────┬───────┐
│ a   ┆ b    ┆ c     │
│ --- ┆ ---  ┆ ---   │
│ f64 ┆ f64  ┆ f64   │
╞═════╪══════╪═══════╡
│ NaN ┆ 10.0 ┆ 65.75 │
│ NaN ┆ 2.5  ┆ NaN   │
│ NaN ┆ 5.25 ┆ 10.5  │
└─────┴──────┴───────┘
- drop_nulls(subset: ColumnNameOrSelector | Collection[ColumnNameOrSelector] | None = None) LazyFrame[source]¶
Drop all rows that contain one or more null values.
The original order of the remaining rows is preserved.
See also
Notes
A null value is not the same as a NaN value. To drop NaN values, use drop_nans().
Examples
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, None, 8],
...         "ham": ["a", "b", None],
...     }
... )
The default behavior of this method is to drop rows where any single value in the row is null:
>>> lf.drop_nulls().collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
└─────┴─────┴─────┘
This behaviour can be constrained to consider only a subset of columns, as defined by name or with a selector. For example, dropping rows if there is a null in any of the integer columns:
>>> import polars.selectors as cs
>>> lf.drop_nulls(subset=cs.integer()).collect()
shape: (2, 3)
┌─────┬─────┬──────┐
│ foo ┆ bar ┆ ham  │
│ --- ┆ --- ┆ ---  │
│ i64 ┆ i64 ┆ str  │
╞═════╪═════╪══════╡
│ 1   ┆ 6   ┆ a    │
│ 3   ┆ 8   ┆ null │
└─────┴─────┴──────┘
Dropping a row only if all values are null requires a different formulation:
>>> lf = pl.LazyFrame(
...     {
...         "a": [None, None, None, None],
...         "b": [1, 2, None, 1],
...         "c": [1, None, None, 1],
...     }
... )
>>> lf.filter(~pl.all_horizontal(pl.all().is_null())).collect()
shape: (3, 3)
┌──────┬─────┬──────┐
│ a    ┆ b   ┆ c    │
│ ---  ┆ --- ┆ ---  │
│ null ┆ i64 ┆ i64  │
╞══════╪═════╪══════╡
│ null ┆ 1   ┆ 1    │
│ null ┆ 2   ┆ null │
│ null ┆ 1   ┆ 1    │
└──────┴─────┴──────┘
- property dtypes: list[DataType]¶
Get the column data types.
- Returns:
- list of DataType
A list containing the data type of each column in order.
Warning
Determining the data types of a LazyFrame requires resolving its schema, which is a potentially expensive operation. Using collect_schema() is the idiomatic way to resolve the schema. This property exists only for symmetry with the DataFrame class.
See also
collect_schema
Schema.dtypes
Examples
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> lf.dtypes
[Int64, Float64, String]
- explain(*, format: ExplainFormat = 'plain', optimized: bool = True, type_coercion: bool = True, predicate_pushdown: bool = True, projection_pushdown: bool = True, simplify_expression: bool = True, slice_pushdown: bool = True, comm_subplan_elim: bool = True, comm_subexpr_elim: bool = True, cluster_with_columns: bool = True, collapse_joins: bool = True, streaming: bool = False, engine: EngineType = 'auto', tree_format: bool | None = None, optimizations: QueryOptFlags = <polars.lazyframe.opt_flags.QueryOptFlags object>) str[source]¶
Create a string representation of the query plan.
Different optimizations can be turned on or off.
- Parameters:
- format{‘plain’, ‘tree’}
The format to use for displaying the logical plan.
- optimized
Return an optimized query plan. Defaults to True. If this is set to True the subsequent optimization flags control which optimizations run.
- type_coercion
Do type coercion optimization.
Deprecated since version 1.30.0: Use the optimizations parameters.
- predicate_pushdown
Do predicate pushdown optimization.
Deprecated since version 1.30.0: Use the optimizations parameters.
- projection_pushdown
Do projection pushdown optimization.
Deprecated since version 1.30.0: Use the optimizations parameters.
- simplify_expression
Run simplify expressions optimization.
Deprecated since version 1.30.0: Use the optimizations parameters.
- slice_pushdown
Slice pushdown optimization.
Deprecated since version 1.30.0: Use the optimizations parameters.
- comm_subplan_elim
Will try to cache branching subplans that occur on self-joins or unions.
Deprecated since version 1.30.0: Use the optimizations parameters.
- comm_subexpr_elim
Common subexpressions will be cached and reused.
Deprecated since version 1.30.0: Use the optimizations parameters.
- cluster_with_columns
Combine sequential independent calls to with_columns
Deprecated since version 1.30.0: Use the optimizations parameters.
- collapse_joins
Collapse a join and filters into a faster join
Deprecated since version 1.30.0: Use the optimizations parameters.
- engine
Select the engine used to process the query, optional. At the moment, if set to “auto” (default), the query is run using the polars in-memory engine. Polars will also attempt to use the engine set by the POLARS_ENGINE_AFFINITY environment variable. If it cannot run the query using the selected engine, the query is run using the polars in-memory engine. If set to “gpu”, the GPU engine is used. Fine-grained control over the GPU engine, for example which device to use on a system with multiple devices, is possible by providing a GPUEngine object with configuration options.
Note
GPU mode is considered unstable. Not all queries will run successfully on the GPU, however, they should fall back transparently to the default engine if execution is not supported.
Running with POLARS_VERBOSE=1 will provide information if a query falls back (and why).
Note
The GPU engine does not support streaming. If streaming is enabled, GPU execution is switched off.
- optimizations
The optimization passes done during query optimization.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- tree_format
Format the output as a tree.
Deprecated since version 0.20.30: Use format="tree" instead.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...         "c": [6, 5, 4, 3, 2, 1],
...     }
... )
>>> lf.group_by("a", maintain_order=True).agg(pl.all().sum()).sort(
...     "a"
... ).explain()
- explode(columns: ColumnNameOrSelector | Iterable[ColumnNameOrSelector], *more_columns: ColumnNameOrSelector) LazyFrame[source]¶
Explode the DataFrame to long format by exploding the given columns.
- Parameters:
- columns
Column names, expressions, or a selector defining them. The underlying columns being exploded must be of the List or Array data type.
- *more_columns
Additional names of columns to explode, specified as positional arguments.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "letters": ["a", "a", "b", "c"],
...         "numbers": [[1], [2, 3], [4, 5], [6, 7, 8]],
...     }
... )
>>> lf.explode("numbers").collect()
shape: (8, 2)
┌─────────┬─────────┐
│ letters ┆ numbers │
│ ---     ┆ ---     │
│ str     ┆ i64     │
╞═════════╪═════════╡
│ a       ┆ 1       │
│ a       ┆ 2       │
│ a       ┆ 3       │
│ b       ┆ 4       │
│ b       ┆ 5       │
│ c       ┆ 6       │
│ c       ┆ 7       │
│ c       ┆ 8       │
└─────────┴─────────┘
- fetch(n_rows: int = 500, **kwargs: Any) DataFrame[source]¶
Collect a small number of rows for debugging purposes.
Warning
This is strictly a utility function that can help to debug queries using a smaller number of rows, and should not be used in production code.
Notes
This is similar to a collect() operation, but it overwrites the number of rows read by every scan operation. Be aware that fetch does not guarantee the final number of rows in the DataFrame. Filters, join operations and fewer rows being available in the scanned data will all influence the final number of rows (joins are especially susceptible to this, and may return no data at all if n_rows is too small as the join keys may not be present).
- fill_nan(value: int | float | Expr | None) LazyFrame[source]¶
Fill floating point NaN values.
- Parameters:
- value
Value used to fill NaN values.
See also
Notes
A NaN value is not the same as a null value. To fill null values, use fill_null().
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1.5, 2, float("nan"), 4],
...         "b": [0.5, 4, float("nan"), 13],
...     }
... )
>>> lf.fill_nan(99).collect()
shape: (4, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ f64  ┆ f64  │
╞══════╪══════╡
│ 1.5  ┆ 0.5  │
│ 2.0  ┆ 4.0  │
│ 99.0 ┆ 99.0 │
│ 4.0  ┆ 13.0 │
└──────┴──────┘
- fill_null(value: Any | Expr | None = None, strategy: FillNullStrategy | None = None, limit: int | None = None, *, matches_supertype: bool = True) LazyFrame[source]¶
Fill null values using the specified value or strategy.
- Parameters:
- value
Value used to fill null values.
- strategy{None, ‘forward’, ‘backward’, ‘min’, ‘max’, ‘mean’, ‘zero’, ‘one’}
Strategy used to fill null values.
- limit
Number of consecutive null values to fill when using the ‘forward’ or ‘backward’ strategy.
- matches_supertype
Fill all matching supertypes of the fill value literal.
Notes
A null value is not the same as a NaN value. To fill NaN values, use fill_nan().
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, None, 4],
...         "b": [0.5, 4, None, 13],
...     }
... )
>>> lf.fill_null(99).collect()
shape: (4, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ 0.5  │
│ 2   ┆ 4.0  │
│ 99  ┆ 99.0 │
│ 4   ┆ 13.0 │
└─────┴──────┘
>>> lf.fill_null(strategy="forward").collect()
shape: (4, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ 0.5  │
│ 2   ┆ 4.0  │
│ 2   ┆ 4.0  │
│ 4   ┆ 13.0 │
└─────┴──────┘
>>> lf.fill_null(strategy="max").collect()
shape: (4, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ 0.5  │
│ 2   ┆ 4.0  │
│ 4   ┆ 13.0 │
│ 4   ┆ 13.0 │
└─────┴──────┘
>>> lf.fill_null(strategy="zero").collect()
shape: (4, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ 0.5  │
│ 2   ┆ 4.0  │
│ 0   ┆ 0.0  │
│ 4   ┆ 13.0 │
└─────┴──────┘
- filter(*predicates: IntoExprColumn | Iterable[IntoExprColumn] | bool | list[bool] | np.ndarray[Any, Any], **constraints: Any) LazyFrame[source]¶
Filter rows in the LazyFrame based on a predicate expression.
The original order of the remaining rows is preserved.
Rows where the filter predicate does not evaluate to True are discarded (this includes rows where the predicate evaluates as null).
- Parameters:
- predicates
Expression that evaluates to a boolean Series.
- constraints
Column filters; use name = value to filter columns using the supplied value. Each constraint behaves the same as pl.col(name).eq(value), and is implicitly joined with the other filter conditions using &.
Notes
If you are transitioning from Pandas, and performing filter operations based on the comparison of two or more columns, please note that in Polars any comparison involving null values will result in a null result, not boolean True or False. As a result, these rows will not be retained. Ensure that null values are handled appropriately to avoid unexpected behaviour (see examples below).
Examples
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3, None, 4, None, 0],
...         "bar": [6, 7, 8, None, None, 9, 0],
...         "ham": ["a", "b", "c", None, "d", "e", "f"],
...     }
... )
Filter on one condition:
>>> lf.filter(pl.col("foo") > 1).collect()
shape: (3, 3)
┌─────┬──────┬─────┐
│ foo ┆ bar  ┆ ham │
│ --- ┆ ---  ┆ --- │
│ i64 ┆ i64  ┆ str │
╞═════╪══════╪═════╡
│ 2   ┆ 7    ┆ b   │
│ 3   ┆ 8    ┆ c   │
│ 4   ┆ null ┆ d   │
└─────┴──────┴─────┘
Filter on multiple conditions:
>>> lf.filter((pl.col("foo") < 3) & (pl.col("ham") == "a")).collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
└─────┴─────┴─────┘
Provide multiple filters using *args syntax:
>>> lf.filter(
...     pl.col("foo") == 1,
...     pl.col("ham") == "a",
... ).collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
└─────┴─────┴─────┘
Provide multiple filters using **kwargs syntax:
>>> lf.filter(foo=1, ham="a").collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
└─────┴─────┴─────┘
Filter on an OR condition:
>>> lf.filter(
...     (pl.col("foo") == 1) | (pl.col("ham") == "c"),
... ).collect()
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
│ 3   ┆ 8   ┆ c   │
└─────┴─────┴─────┘
Filter by comparing two columns against each other:
>>> lf.filter(
...     pl.col("foo") == pl.col("bar"),
... ).collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 0   ┆ 0   ┆ f   │
└─────┴─────┴─────┘
>>> lf.filter(
...     pl.col("foo") != pl.col("bar"),
... ).collect()
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
│ 2   ┆ 7   ┆ b   │
│ 3   ┆ 8   ┆ c   │
└─────┴─────┴─────┘
Notice how the row with None values is filtered out; using ne_missing ensures that null values compare equal, and we get similar behaviour to Pandas:
>>> lf.filter(
...     pl.col("foo").ne_missing(pl.col("bar")),
... ).collect()
shape: (5, 3)
┌──────┬──────┬─────┐
│ foo  ┆ bar  ┆ ham │
│ ---  ┆ ---  ┆ --- │
│ i64  ┆ i64  ┆ str │
╞══════╪══════╪═════╡
│ 1    ┆ 6    ┆ a   │
│ 2    ┆ 7    ┆ b   │
│ 3    ┆ 8    ┆ c   │
│ 4    ┆ null ┆ d   │
│ null ┆ 9    ┆ e   │
└──────┴──────┴─────┘
- first() LazyFrame[source]¶
Get the first row of the DataFrame.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 3, 5],
...         "b": [2, 4, 6],
...     }
... )
>>> lf.first().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 2   │
└─────┴─────┘
- gather_every(n: int, offset: int = 0) LazyFrame[source]¶
Take every nth row in the LazyFrame and return as a new LazyFrame.
- Parameters:
- n
Gather every n-th row.
- offset
Starting index.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [5, 6, 7, 8],
...     }
... )
>>> lf.gather_every(2).collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 5   │
│ 3   ┆ 7   │
└─────┴─────┘
>>> lf.gather_every(2, offset=1).collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 6   │
│ 4   ┆ 8   │
└─────┴─────┘
- group_by(*by: IntoExpr | Iterable[IntoExpr], maintain_order: bool = False, **named_by: IntoExpr) LazyGroupBy[source]¶
Start a group by operation.
- Parameters:
- *by
Column(s) to group by. Accepts expression input. Strings are parsed as column names.
- maintain_order
Ensure that the order of the groups is consistent with the input data. This is slower than a default group by. Setting this to True blocks the possibility to run on the streaming engine.
- **named_by
Additional columns to group by, specified as keyword arguments. The columns will be renamed to the keyword used.
Examples
Group by one column and call agg to compute the grouped sum of another column.
>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "c"],
...         "b": [1, 2, 1, 3, 3],
...         "c": [5, 4, 3, 2, 1],
...     }
... )
>>> lf.group_by("a").agg(pl.col("b").sum()).collect()
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a   ┆ 2   │
│ b   ┆ 5   │
│ c   ┆ 3   │
└─────┴─────┘
Set maintain_order=True to ensure the order of the groups is consistent with the input.
>>> lf.group_by("a", maintain_order=True).agg(pl.col("c")).collect()
shape: (3, 2)
┌─────┬───────────┐
│ a   ┆ c         │
│ --- ┆ ---       │
│ str ┆ list[i64] │
╞═════╪═══════════╡
│ a   ┆ [5, 3]    │
│ b   ┆ [4, 2]    │
│ c   ┆ [1]       │
└─────┴───────────┘
Group by multiple columns by passing a list of column names.
>>> lf.group_by(["a", "b"]).agg(pl.max("c")).collect()
shape: (4, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ a   ┆ 1   ┆ 5   │
│ b   ┆ 2   ┆ 4   │
│ b   ┆ 3   ┆ 2   │
│ c   ┆ 3   ┆ 1   │
└─────┴─────┴─────┘
Or use positional arguments to group by multiple columns in the same way. Expressions are also accepted.
>>> lf.group_by("a", pl.col("b") // 2).agg(
...     pl.col("c").mean()
... ).collect()
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 │
╞═════╪═════╪═════╡
│ a   ┆ 0   ┆ 4.0 │
│ b   ┆ 1   ┆ 3.0 │
│ c   ┆ 1   ┆ 1.0 │
└─────┴─────┴─────┘
- group_by_dynamic(index_column: IntoExpr, *, every: str | timedelta, period: str | timedelta | None = None, offset: str | timedelta | None = None, include_boundaries: bool = False, closed: ClosedInterval = 'left', label: Label = 'left', group_by: IntoExpr | Iterable[IntoExpr] | None = None, start_by: StartBy = 'window') LazyGroupBy[source]¶
Group based on a time value (or index value of type Int32, Int64).
Time windows are calculated and rows are assigned to windows. Unlike a normal group by, a row can be a member of multiple groups. By default, the windows look like:
[start, start + period)
[start + every, start + every + period)
[start + 2*every, start + 2*every + period)
…
where start is determined by start_by, offset, every, and the earliest datapoint. See the start_by argument description for details.
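To make the default window layout concrete, the following pure-Python sketch enumerates the first few windows for integer values and reports which windows contain a given value. It is an illustration of the formula above only, not how Polars computes windows internally: default_windows and windows_containing are hypothetical helper names, start is taken as given rather than derived from start_by and offset, and closed='left' is assumed.

```python
def default_windows(start: int, every: int, period: int, count: int) -> list[tuple[int, int]]:
    """Return the first `count` windows [start + k*every, start + k*every + period)."""
    return [(start + k * every, start + k * every + period) for k in range(count)]


def windows_containing(value: int, windows: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Return every left-closed window [lower, upper) that contains `value`."""
    return [(lower, upper) for lower, upper in windows if lower <= value < upper]


# With period > every, consecutive windows overlap.
windows = default_windows(start=0, every=2, period=3, count=3)
print(windows)

# The value 2 falls into two windows, so a row with this index value
# would be a member of two groups.
print(windows_containing(2, windows))
```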
Warning
The index column must be sorted in ascending order. If group_by is passed, then the index column must be sorted in ascending order within each group.
Changed in version 0.20.14: The by parameter was renamed group_by.
- Parameters:
- index_column
Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order (or, if group_by is specified, then it must be sorted in ascending order within each group).
In case of a dynamic group by on indices, dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.
- every
interval of the window
- period
length of the window, if None it will equal ‘every’
- offset
offset of the window, does not take effect if start_by is ‘datapoint’. Defaults to zero.
- include_boundaries
Add the lower and upper bound of the window to the “_lower_boundary” and “_upper_boundary” columns. This will impact performance because it is harder to parallelize.
- closed{‘left’, ‘right’, ‘both’, ‘none’}
Define which sides of the temporal interval are closed (inclusive).
- label{‘left’, ‘right’, ‘datapoint’}
Define which label to use for the window:
‘left’: lower boundary of the window
‘right’: upper boundary of the window
‘datapoint’: the first value of the index column in the given window. If you don’t need the label to be at one of the boundaries, choose this option for maximum performance
- group_by
Also group by this column/these columns
- start_by{‘window’, ‘datapoint’, ‘monday’, ‘tuesday’, ‘wednesday’, ‘thursday’, ‘friday’, ‘saturday’, ‘sunday’}
The strategy to determine the start of the first window by.
‘window’: Start by taking the earliest timestamp, truncating it with every, and then adding offset. Note that weekly windows start on Monday.
‘datapoint’: Start from the first encountered data point.
a day of the week (only takes effect if every contains ‘w’):
‘monday’: Start the window on the Monday before the first data point.
‘tuesday’: Start the window on the Tuesday before the first data point.
…
‘sunday’: Start the window on the Sunday before the first data point.
The resulting window is then shifted back until the earliest datapoint is in or in front of it.
- Returns:
- LazyGroupBy
Object you can call .agg on to aggregate by groups, the result of which will be sorted by index_column (but note that if group_by columns are passed, it will only be sorted within each group).
Notes
If you’re coming from pandas, then
# polars
df.group_by_dynamic("ts", every="1d").agg(pl.col("value").sum())
is equivalent to
# pandas
df.set_index("ts").resample("D")["value"].sum().reset_index()
though note that, unlike pandas, polars doesn’t add extra rows for empty windows. If you need index_column to be evenly spaced, then please combine with DataFrame.upsample().
The every, period and offset arguments are created with the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
1i (1 index count)
Or combine them (except in every): “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds
By “calendar day”, we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for “calendar week”, “calendar month”, “calendar quarter”, and “calendar year”.
In case of a group_by_dynamic on an integer column, the windows are defined by:
“1i” # length 1
“10i” # length 10
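As an aid to reading the string language above, the sketch below splits a combined duration string into (count, unit) pairs. split_duration is a hypothetical helper written for illustration; it is not part of Polars, it does not handle negative durations, and Polars' own parsing may differ in detail.

```python
import re

# Unit spellings from the list above. Longer spellings come first so that
# "ms" and "mo" are not read as "m", and "ns" and "us" are not read as "s".
_TOKEN = re.compile(r"(\d+)(ns|us|ms|mo|s|m|h|d|w|q|y|i)")


def split_duration(text: str) -> list[tuple[int, str]]:
    """Split a string such as '3d12h4m25s' into [(3, 'd'), (12, 'h'), ...]."""
    tokens = _TOKEN.findall(text)
    # Reject strings that contain anything other than count/unit pairs.
    if not tokens or "".join(count + unit for count, unit in tokens) != text:
        raise ValueError(f"not a recognised duration string: {text!r}")
    return [(int(count), unit) for count, unit in tokens]


parts = split_duration("3d12h4m25s")
print(parts)
```

Note that "1m" denotes a minute while "1mo" denotes a calendar month, which is why the order of alternatives in the pattern matters.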
Examples
>>> from datetime import datetime
>>> lf = pl.LazyFrame(
...     {
...         "time": pl.datetime_range(
...             start=datetime(2021, 12, 16),
...             end=datetime(2021, 12, 16, 3),
...             interval="30m",
...             eager=True,
...         ),
...         "n": range(7),
...     }
... )
>>> lf.collect()
shape: (7, 2)
┌─────────────────────┬─────┐
│ time                ┆ n   │
│ ---                 ┆ --- │
│ datetime[μs]        ┆ i64 │
╞═════════════════════╪═════╡
│ 2021-12-16 00:00:00 ┆ 0   │
│ 2021-12-16 00:30:00 ┆ 1   │
│ 2021-12-16 01:00:00 ┆ 2   │
│ 2021-12-16 01:30:00 ┆ 3   │
│ 2021-12-16 02:00:00 ┆ 4   │
│ 2021-12-16 02:30:00 ┆ 5   │
│ 2021-12-16 03:00:00 ┆ 6   │
└─────────────────────┴─────┘
Group by windows of 1 hour.
>>> lf.group_by_dynamic("time", every="1h", closed="right").agg(
...     pl.col("n")
... ).collect()
shape: (4, 2)
┌─────────────────────┬───────────┐
│ time                ┆ n         │
│ ---                 ┆ ---       │
│ datetime[μs]        ┆ list[i64] │
╞═════════════════════╪═══════════╡
│ 2021-12-15 23:00:00 ┆ [0]       │
│ 2021-12-16 00:00:00 ┆ [1, 2]    │
│ 2021-12-16 01:00:00 ┆ [3, 4]    │
│ 2021-12-16 02:00:00 ┆ [5, 6]    │
└─────────────────────┴───────────┘
The window boundaries can also be added to the aggregation result:
>>> lf.group_by_dynamic(
...     "time", every="1h", include_boundaries=True, closed="right"
... ).agg(pl.col("n").mean()).collect()
shape: (4, 4)
┌─────────────────────┬─────────────────────┬─────────────────────┬─────┐
│ _lower_boundary     ┆ _upper_boundary     ┆ time                ┆ n   │
│ ---                 ┆ ---                 ┆ ---                 ┆ --- │
│ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ f64 │
╞═════════════════════╪═════════════════════╪═════════════════════╪═════╡
│ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ 0.0 │
│ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 1.5 │
│ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 3.5 │
│ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 5.5 │
└─────────────────────┴─────────────────────┴─────────────────────┴─────┘
When closed=”left”, the window excludes the right end of interval: [lower_bound, upper_bound)
>>> lf.group_by_dynamic("time", every="1h", closed="left").agg(
...     pl.col("n")
... ).collect()
shape: (4, 2)
┌─────────────────────┬───────────┐
│ time                ┆ n         │
│ ---                 ┆ ---       │
│ datetime[μs]        ┆ list[i64] │
╞═════════════════════╪═══════════╡
│ 2021-12-16 00:00:00 ┆ [0, 1]    │
│ 2021-12-16 01:00:00 ┆ [2, 3]    │
│ 2021-12-16 02:00:00 ┆ [4, 5]    │
│ 2021-12-16 03:00:00 ┆ [6]       │
└─────────────────────┴───────────┘
When closed=”both” the time values at the window boundaries belong to 2 groups.
>>> lf.group_by_dynamic("time", every="1h", closed="both").agg(
...     pl.col("n")
... ).collect()
shape: (4, 2)
┌─────────────────────┬───────────┐
│ time                ┆ n         │
│ ---                 ┆ ---       │
│ datetime[μs]        ┆ list[i64] │
╞═════════════════════╪═══════════╡
│ 2021-12-16 00:00:00 ┆ [0, 1, 2] │
│ 2021-12-16 01:00:00 ┆ [2, 3, 4] │
│ 2021-12-16 02:00:00 ┆ [4, 5, 6] │
│ 2021-12-16 03:00:00 ┆ [6]       │
└─────────────────────┴───────────┘
Dynamic group bys can also be combined with grouping on normal keys:
>>> lf = lf.with_columns(groups=pl.Series(["a", "a", "a", "b", "b", "a", "a"]))
>>> lf.collect()
shape: (7, 3)
┌─────────────────────┬─────┬────────┐
│ time                ┆ n   ┆ groups │
│ ---                 ┆ --- ┆ ---    │
│ datetime[μs]        ┆ i64 ┆ str    │
╞═════════════════════╪═════╪════════╡
│ 2021-12-16 00:00:00 ┆ 0   ┆ a      │
│ 2021-12-16 00:30:00 ┆ 1   ┆ a      │
│ 2021-12-16 01:00:00 ┆ 2   ┆ a      │
│ 2021-12-16 01:30:00 ┆ 3   ┆ b      │
│ 2021-12-16 02:00:00 ┆ 4   ┆ b      │
│ 2021-12-16 02:30:00 ┆ 5   ┆ a      │
│ 2021-12-16 03:00:00 ┆ 6   ┆ a      │
└─────────────────────┴─────┴────────┘
>>> lf.group_by_dynamic(
...     "time",
...     every="1h",
...     closed="both",
...     group_by="groups",
...     include_boundaries=True,
... ).agg(pl.col("n")).collect()
shape: (6, 5)
┌────────┬─────────────────────┬─────────────────────┬─────────────────────┬───────────┐
│ groups ┆ _lower_boundary     ┆ _upper_boundary     ┆ time                ┆ n         │
│ ---    ┆ ---                 ┆ ---                 ┆ ---                 ┆ ---       │
│ str    ┆ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ list[i64] │
╞════════╪═════════════════════╪═════════════════════╪═════════════════════╪═══════════╡
│ a      ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ [0, 1, 2] │
│ a      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ [2]       │
│ a      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ [5, 6]    │
│ a      ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 04:00:00 ┆ 2021-12-16 03:00:00 ┆ [6]       │
│ b      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ [3, 4]    │
│ b      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ [4]       │
└────────┴─────────────────────┴─────────────────────┴─────────────────────┴───────────┘
Dynamic group by on an index column:
>>> lf = pl.LazyFrame(
...     {
...         "idx": pl.int_range(0, 6, eager=True),
...         "A": ["A", "A", "B", "B", "B", "C"],
...     }
... )
>>> lf.group_by_dynamic(
...     "idx",
...     every="2i",
...     period="3i",
...     include_boundaries=True,
...     closed="right",
... ).agg(pl.col("A").alias("A_agg_list")).collect()
shape: (4, 4)
┌─────────────────┬─────────────────┬─────┬─────────────────┐
│ _lower_boundary ┆ _upper_boundary ┆ idx ┆ A_agg_list      │
│ ---             ┆ ---             ┆ --- ┆ ---             │
│ i64             ┆ i64             ┆ i64 ┆ list[str]       │
╞═════════════════╪═════════════════╪═════╪═════════════════╡
│ -2              ┆ 1               ┆ -2  ┆ ["A", "A"]      │
│ 0               ┆ 3               ┆ 0   ┆ ["A", "B", "B"] │
│ 2               ┆ 5               ┆ 2   ┆ ["B", "B", "C"] │
│ 4               ┆ 7               ┆ 4   ┆ ["C"]           │
└─────────────────┴─────────────────┴─────┴─────────────────┘
- head(n: int = 5) LazyFrame[source]¶
Get the first n rows.
- Parameters:
- n
Number of rows to return.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4, 5, 6],
...         "b": [7, 8, 9, 10, 11, 12],
...     }
... )
>>> lf.head().collect()
shape: (5, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 7   │
│ 2   ┆ 8   │
│ 3   ┆ 9   │
│ 4   ┆ 10  │
│ 5   ┆ 11  │
└─────┴─────┘
>>> lf.head(2).collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 7   │
│ 2   ┆ 8   │
└─────┴─────┘
- inspect(fmt: str = '{}') LazyFrame[source]¶
Inspect a node in the computation graph.
Print the value that this node in the computation graph evaluates to and pass on the value.
Examples
>>> lf = pl.LazyFrame({"foo": [1, 1, -2, 3]})
>>> (
...     lf.with_columns(pl.col("foo").cum_sum().alias("bar"))
...     .inspect()  # print the node before the filter
...     .filter(pl.col("bar") == pl.col("foo"))
... )
<LazyFrame at ...>
- interpolate() LazyFrame[source]¶
Interpolate intermediate values. The interpolation method is linear.
Nulls at the beginning and end of the series remain null.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, None, 9, 10],
...         "bar": [6, 7, 9, None],
...         "baz": [1, None, None, 9],
...     }
... )
>>> lf.interpolate().collect()
shape: (4, 3)
┌──────┬──────┬──────────┐
│ foo  ┆ bar  ┆ baz      │
│ ---  ┆ ---  ┆ ---      │
│ f64  ┆ f64  ┆ f64      │
╞══════╪══════╪══════════╡
│ 1.0  ┆ 6.0  ┆ 1.0      │
│ 5.0  ┆ 7.0  ┆ 3.666667 │
│ 9.0  ┆ 9.0  ┆ 6.333333 │
│ 10.0 ┆ null ┆ 9.0      │
└──────┴──────┴──────────┘
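The arithmetic behind the example output can be reproduced with a short pure-Python sketch of linear interpolation over a list. interpolate_linear is a hypothetical helper written for illustration, not the Polars implementation; it treats positions as evenly spaced and leaves leading and trailing None values untouched, matching the behaviour described above.

```python
from typing import Optional


def interpolate_linear(values: list[Optional[float]]) -> list[Optional[float]]:
    """Fill interior None values on a straight line between known neighbours."""
    result = [None if v is None else float(v) for v in values]
    known = [i for i, v in enumerate(values) if v is not None]
    # Walk over each pair of consecutive known positions and fill the gap.
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = values[left] + step * (i - left)
    return result


baz = interpolate_linear([1, None, None, 9])
bar = interpolate_linear([6, 7, 9, None])
print(baz)
print(bar)
```

For baz, the gap between 1 and 9 spans three steps, so each step adds 8 / 3 ≈ 2.667, giving roughly 3.667 and 6.333 as in the table above.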
- join(other: LazyFrame, on: str | Expr | Sequence[str | Expr] | None = None, how: JoinStrategy = 'inner', *, left_on: str | Expr | Sequence[str | Expr] | None = None, right_on: str | Expr | Sequence[str | Expr] | None = None, suffix: str = '_right', validate: JoinValidation = 'm:m', nulls_equal: bool = False, coalesce: bool | None = None, maintain_order: MaintainOrderJoin | None = None, allow_parallel: bool = True, force_parallel: bool = False) LazyFrame[source]¶
Add a join operation to the Logical Plan.
Changed in version 1.24: The join_nulls parameter was renamed nulls_equal.
- Parameters:
- other
Lazy DataFrame to join with.
- on
Name(s) of the join columns in both DataFrames. If set, left_on and right_on should be None. This should not be specified if how=’cross’.
- how{‘inner’, ‘left’, ‘right’, ‘full’, ‘semi’, ‘anti’, ‘cross’}
Join strategy.
inner
(Default) Returns rows that have matching values in both tables.
left
Returns all rows from the left table, and the matched rows from the right table.
full
Returns all rows when there is a match in either left or right.
cross
Returns the Cartesian product of rows from both tables
semi
Returns rows from the left table that have a match in the right table.
anti
Returns rows from the left table that have no match in the right table.
- left_on
Join column of the left DataFrame.
- right_on
Join column of the right DataFrame.
- suffix
Suffix to append to columns with a duplicate name.
- validate: {‘m:m’, ‘m:1’, ‘1:m’, ‘1:1’}
Checks if join is of specified type.
m:m
(Default) Many-to-many. Does not result in checks.
1:1
One-to-one. Checks if join keys are unique in both left and right datasets.
1:m
One-to-many. Checks if join keys are unique in left dataset.
m:1
Many-to-one. Check if join keys are unique in right dataset.
Note
This is currently not supported by the streaming engine.
- nulls_equal
Join on null values. By default null values will never produce matches.
- coalesce
Coalescing behavior (merging of join columns).
None
(Default) Coalesce unless how=’full’ is specified.
True
Always coalesce join columns.
False
Never coalesce join columns.
Note
Joining on any other expressions than col will turn off coalescing.
- maintain_order{‘none’, ‘left’, ‘right’, ‘left_right’, ‘right_left’}
Which DataFrame row order to preserve, if any. Do not rely on any observed ordering without explicitly setting this parameter, as your code may break in a future release. Not specifying any ordering can improve performance. Supported for inner, left, right and full joins.
none
(Default) No specific ordering is desired. The ordering might differ across Polars versions or even between different runs.
left
Preserves the order of the left DataFrame.
right
Preserves the order of the right DataFrame.
left_right
First preserves the order of the left DataFrame, then the right.
right_left
First preserves the order of the right DataFrame, then the left.
- allow_parallel
Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.
- force_parallel
Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> other_lf = pl.LazyFrame(
...     {
...         "apple": ["x", "y", "z"],
...         "ham": ["a", "b", "d"],
...     }
... )
>>> lf.join(other_lf, on="ham").collect()
shape: (2, 4)
┌─────┬─────┬─────┬───────┐
│ foo ┆ bar ┆ ham ┆ apple │
│ --- ┆ --- ┆ --- ┆ ---   │
│ i64 ┆ f64 ┆ str ┆ str   │
╞═════╪═════╪═════╪═══════╡
│ 1   ┆ 6.0 ┆ a   ┆ x     │
│ 2   ┆ 7.0 ┆ b   ┆ y     │
└─────┴─────┴─────┴───────┘
>>> lf.join(other_lf, on="ham", how="full").collect()
shape: (4, 5)
┌──────┬──────┬──────┬───────┬───────────┐
│ foo  ┆ bar  ┆ ham  ┆ apple ┆ ham_right │
│ ---  ┆ ---  ┆ ---  ┆ ---   ┆ ---       │
│ i64  ┆ f64  ┆ str  ┆ str   ┆ str       │
╞══════╪══════╪══════╪═══════╪═══════════╡
│ 1    ┆ 6.0  ┆ a    ┆ x     ┆ a         │
│ 2    ┆ 7.0  ┆ b    ┆ y     ┆ b         │
│ null ┆ null ┆ null ┆ z     ┆ d         │
│ 3    ┆ 8.0  ┆ c    ┆ null  ┆ null      │
└──────┴──────┴──────┴───────┴───────────┘
>>> lf.join(other_lf, on="ham", how="left", coalesce=True).collect()
shape: (3, 4)
┌─────┬─────┬─────┬───────┐
│ foo ┆ bar ┆ ham ┆ apple │
│ --- ┆ --- ┆ --- ┆ ---   │
│ i64 ┆ f64 ┆ str ┆ str   │
╞═════╪═════╪═════╪═══════╡
│ 1   ┆ 6.0 ┆ a   ┆ x     │
│ 2   ┆ 7.0 ┆ b   ┆ y     │
│ 3   ┆ 8.0 ┆ c   ┆ null  │
└─────┴─────┴─────┴───────┘
>>> lf.join(other_lf, on="ham", how="semi").collect()
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6.0 ┆ a   │
│ 2   ┆ 7.0 ┆ b   │
└─────┴─────┴─────┘
>>> lf.join(other_lf, on="ham", how="anti").collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 3   ┆ 8.0 ┆ c   │
└─────┴─────┴─────┘
>>> lf.join(other_lf, how="cross").collect()
shape: (9, 5)
┌─────┬─────┬─────┬───────┬───────────┐
│ foo ┆ bar ┆ ham ┆ apple ┆ ham_right │
│ --- ┆ --- ┆ --- ┆ ---   ┆ ---       │
│ i64 ┆ f64 ┆ str ┆ str   ┆ str       │
╞═════╪═════╪═════╪═══════╪═══════════╡
│ 1   ┆ 6.0 ┆ a   ┆ x     ┆ a         │
│ 1   ┆ 6.0 ┆ a   ┆ y     ┆ b         │
│ 1   ┆ 6.0 ┆ a   ┆ z     ┆ d         │
│ 2   ┆ 7.0 ┆ b   ┆ x     ┆ a         │
│ 2   ┆ 7.0 ┆ b   ┆ y     ┆ b         │
│ 2   ┆ 7.0 ┆ b   ┆ z     ┆ d         │
│ 3   ┆ 8.0 ┆ c   ┆ x     ┆ a         │
│ 3   ┆ 8.0 ┆ c   ┆ y     ┆ b         │
│ 3   ┆ 8.0 ┆ c   ┆ z     ┆ d         │
└─────┴─────┴─────┴───────┴───────────┘
- join_asof(other: LazyFrame, *, left_on: str | None | Expr = None, right_on: str | None | Expr = None, on: str | None | Expr = None, by_left: str | Sequence[str] | None = None, by_right: str | Sequence[str] | None = None, by: str | Sequence[str] | None = None, strategy: AsofJoinStrategy = 'backward', suffix: str = '_right', tolerance: str | int | float | timedelta | None = None, allow_parallel: bool = True, force_parallel: bool = False, coalesce: bool = True, allow_exact_matches: bool = True, check_sortedness: bool = True) LazyFrame[source]¶
Perform an asof join.
This is similar to a left-join except that we match on nearest key rather than equal keys.
Both DataFrames must be sorted by the on key (within each by group, if specified).
For each row in the left DataFrame:
A “backward” search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key.
A “forward” search selects the first row in the right DataFrame whose ‘on’ key is greater than or equal to the left’s key.
A “nearest” search selects the last row in the right DataFrame whose value is nearest to the left’s key. String keys are not currently supported for a nearest search.
The default is “backward”.
- Parameters:
- other
Lazy DataFrame to join with.
- left_on
Join column of the left DataFrame.
- right_on
Join column of the right DataFrame.
- on
Join column of both DataFrames. If set, left_on and right_on should be None.
- by
Join on these columns before doing asof join.
- by_left
Join on these columns before doing asof join.
- by_right
Join on these columns before doing asof join.
- strategy{‘backward’, ‘forward’, ‘nearest’}
Join strategy.
- suffix
Suffix to append to columns with a duplicate name.
- tolerance
Numeric tolerance. By setting this the join will only be done if the near keys are within this distance. If an asof join is done on columns of dtype “Date”, “Datetime”, “Duration” or “Time”, use either a datetime.timedelta object or the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
Or combine them: “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds
By “calendar day”, we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for “calendar week”, “calendar month”, “calendar quarter”, and “calendar year”.
- allow_parallel
Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.
- force_parallel
Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.
- coalesce
Coalescing behavior (merging of on / left_on / right_on columns):
True: -> Always coalesce join columns.
False: -> Never coalesce join columns.
Note that joining on any other expressions than col will turn off coalescing.
- allow_exact_matches
Whether exact matches are valid join predicates.
If True, allow matching with the same on value (i.e. less-than-or-equal-to / greater-than-or-equal-to).
If False, don’t match the same on value (i.e. strictly less-than / strictly greater-than).
- check_sortedness
Check the sortedness of the asof keys. If the keys are not sorted Polars will error. Currently, sortedness cannot be checked if ‘by’ groups are provided.
Examples
>>> from datetime import date
>>> gdp = pl.LazyFrame(
...     {
...         "date": pl.date_range(
...             date(2016, 1, 1),
...             date(2020, 1, 1),
...             "1y",
...             eager=True,
...         ),
...         "gdp": [4164, 4411, 4566, 4696, 4827],
...     }
... )
>>> gdp.collect()
shape: (5, 2)
┌────────────┬──────┐
│ date       ┆ gdp  │
│ ---        ┆ ---  │
│ date       ┆ i64  │
╞════════════╪══════╡
│ 2016-01-01 ┆ 4164 │
│ 2017-01-01 ┆ 4411 │
│ 2018-01-01 ┆ 4566 │
│ 2019-01-01 ┆ 4696 │
│ 2020-01-01 ┆ 4827 │
└────────────┴──────┘
>>> population = pl.LazyFrame(
...     {
...         "date": [date(2016, 3, 1), date(2018, 8, 1), date(2019, 1, 1)],
...         "population": [82.19, 82.66, 83.12],
...     }
... ).sort("date")
>>> population.collect()
shape: (3, 2)
┌────────────┬────────────┐
│ date       ┆ population │
│ ---        ┆ ---        │
│ date       ┆ f64        │
╞════════════╪════════════╡
│ 2016-03-01 ┆ 82.19      │
│ 2018-08-01 ┆ 82.66      │
│ 2019-01-01 ┆ 83.12      │
└────────────┴────────────┘
Note how the dates don’t quite match. If we join them using join_asof and strategy=’backward’, then each date from population which doesn’t have an exact match is matched with the closest earlier date from gdp:
>>> population.join_asof(gdp, on="date", strategy="backward").collect()
shape: (3, 3)
┌────────────┬────────────┬──────┐
│ date       ┆ population ┆ gdp  │
│ ---        ┆ ---        ┆ ---  │
│ date       ┆ f64        ┆ i64  │
╞════════════╪════════════╪══════╡
│ 2016-03-01 ┆ 82.19      ┆ 4164 │
│ 2018-08-01 ┆ 82.66      ┆ 4566 │
│ 2019-01-01 ┆ 83.12      ┆ 4696 │
└────────────┴────────────┴──────┘
Note how:
date 2016-03-01 from population is matched with 2016-01-01 from gdp;
date 2018-08-01 from population is matched with 2018-01-01 from gdp.
You can verify this by passing coalesce=False:
>>> population.join_asof(
...     gdp, on="date", strategy="backward", coalesce=False
... ).collect()
shape: (3, 4)
┌────────────┬────────────┬────────────┬──────┐
│ date       ┆ population ┆ date_right ┆ gdp  │
│ ---        ┆ ---        ┆ ---        ┆ ---  │
│ date       ┆ f64        ┆ date       ┆ i64  │
╞════════════╪════════════╪════════════╪══════╡
│ 2016-03-01 ┆ 82.19      ┆ 2016-01-01 ┆ 4164 │
│ 2018-08-01 ┆ 82.66      ┆ 2018-01-01 ┆ 4566 │
│ 2019-01-01 ┆ 83.12      ┆ 2019-01-01 ┆ 4696 │
└────────────┴────────────┴────────────┴──────┘
If we instead use strategy=’forward’, then each date from population which doesn’t have an exact match is matched with the closest later date from gdp:
>>> population.join_asof(gdp, on="date", strategy="forward").collect()
shape: (3, 3)
┌────────────┬────────────┬──────┐
│ date       ┆ population ┆ gdp  │
│ ---        ┆ ---        ┆ ---  │
│ date       ┆ f64        ┆ i64  │
╞════════════╪════════════╪══════╡
│ 2016-03-01 ┆ 82.19      ┆ 4411 │
│ 2018-08-01 ┆ 82.66      ┆ 4696 │
│ 2019-01-01 ┆ 83.12      ┆ 4696 │
└────────────┴────────────┴──────┘
Note how:
date 2016-03-01 from population is matched with 2017-01-01 from gdp;
date 2018-08-01 from population is matched with 2019-01-01 from gdp.
Finally, strategy=’nearest’ gives us a mix of the two results above, as each date from population which doesn’t have an exact match is matched with the closest date from gdp, regardless of whether it’s earlier or later:
>>> population.join_asof(gdp, on="date", strategy="nearest").collect()
shape: (3, 3)
┌────────────┬────────────┬──────┐
│ date       ┆ population ┆ gdp  │
│ ---        ┆ ---        ┆ ---  │
│ date       ┆ f64        ┆ i64  │
╞════════════╪════════════╪══════╡
│ 2016-03-01 ┆ 82.19      ┆ 4164 │
│ 2018-08-01 ┆ 82.66      ┆ 4696 │
│ 2019-01-01 ┆ 83.12      ┆ 4696 │
└────────────┴────────────┴──────┘
Note how:
date 2016-03-01 from population is matched with 2016-01-01 from gdp;
date 2018-08-01 from population is matched with 2019-01-01 from gdp.
The by argument allows joining on another column first, before the asof join. In this example we join by country first, then asof join by date, as above.
>>> gdp_dates = pl.date_range(  # fmt: skip
...     date(2016, 1, 1), date(2020, 1, 1), "1y", eager=True
... )
>>> gdp2 = pl.LazyFrame(
...     {
...         "country": ["Germany"] * 5 + ["Netherlands"] * 5,
...         "date": pl.concat([gdp_dates, gdp_dates]),
...         "gdp": [4164, 4411, 4566, 4696, 4827, 784, 833, 914, 910, 909],
...     }
... ).sort("country", "date")
>>>
>>> gdp2.collect()
shape: (10, 3)
┌─────────────┬────────────┬──────┐
│ country     ┆ date       ┆ gdp  │
│ ---         ┆ ---        ┆ ---  │
│ str         ┆ date       ┆ i64  │
╞═════════════╪════════════╪══════╡
│ Germany     ┆ 2016-01-01 ┆ 4164 │
│ Germany     ┆ 2017-01-01 ┆ 4411 │
│ Germany     ┆ 2018-01-01 ┆ 4566 │
│ Germany     ┆ 2019-01-01 ┆ 4696 │
│ Germany     ┆ 2020-01-01 ┆ 4827 │
│ Netherlands ┆ 2016-01-01 ┆ 784  │
│ Netherlands ┆ 2017-01-01 ┆ 833  │
│ Netherlands ┆ 2018-01-01 ┆ 914  │
│ Netherlands ┆ 2019-01-01 ┆ 910  │
│ Netherlands ┆ 2020-01-01 ┆ 909  │
└─────────────┴────────────┴──────┘
>>> pop2 = pl.LazyFrame(
...     {
...         "country": ["Germany"] * 3 + ["Netherlands"] * 3,
...         "date": [
...             date(2016, 3, 1),
...             date(2018, 8, 1),
...             date(2019, 1, 1),
...             date(2016, 3, 1),
...             date(2018, 8, 1),
...             date(2019, 1, 1),
...         ],
...         "population": [82.19, 82.66, 83.12, 17.11, 17.32, 17.40],
...     }
... ).sort("country", "date")
>>>
>>> pop2.collect()
shape: (6, 3)
┌─────────────┬────────────┬────────────┐
│ country     ┆ date       ┆ population │
│ ---         ┆ ---        ┆ ---        │
│ str         ┆ date       ┆ f64        │
╞═════════════╪════════════╪════════════╡
│ Germany     ┆ 2016-03-01 ┆ 82.19      │
│ Germany     ┆ 2018-08-01 ┆ 82.66      │
│ Germany     ┆ 2019-01-01 ┆ 83.12      │
│ Netherlands ┆ 2016-03-01 ┆ 17.11      │
│ Netherlands ┆ 2018-08-01 ┆ 17.32      │
│ Netherlands ┆ 2019-01-01 ┆ 17.4       │
└─────────────┴────────────┴────────────┘
>>> pop2.join_asof(gdp2, by="country", on="date", strategy="nearest").collect()
shape: (6, 4)
┌─────────────┬────────────┬────────────┬──────┐
│ country     ┆ date       ┆ population ┆ gdp  │
│ ---         ┆ ---        ┆ ---        ┆ ---  │
│ str         ┆ date       ┆ f64        ┆ i64  │
╞═════════════╪════════════╪════════════╪══════╡
│ Germany     ┆ 2016-03-01 ┆ 82.19      ┆ 4164 │
│ Germany     ┆ 2018-08-01 ┆ 82.66      ┆ 4696 │
│ Germany     ┆ 2019-01-01 ┆ 83.12      ┆ 4696 │
│ Netherlands ┆ 2016-03-01 ┆ 17.11      ┆ 784  │
│ Netherlands ┆ 2018-08-01 ┆ 17.32      ┆ 910  │
│ Netherlands ┆ 2019-01-01 ┆ 17.4       ┆ 910  │
└─────────────┴────────────┴────────────┴──────┘
- join_where(other: LazyFrame, *predicates: Expr | Iterable[Expr], suffix: str = '_right') LazyFrame[source]¶
Perform a join based on one or multiple (in)equality predicates.
This performs an inner join, so only rows where all predicates are true are included in the result, and a row from either DataFrame may be included multiple times in the result.
Note
The row order of the input DataFrames is not preserved.
Warning
This functionality is experimental. It may be changed at any point without it being considered a breaking change.
- Parameters:
- other
DataFrame to join with.
- *predicates
(In)Equality condition to join the two tables on. When a column name occurs in both tables, the proper suffix must be applied in the predicate.
- suffix
Suffix to append to columns with a duplicate name.
Examples
Join two lazyframes together based on two predicates which get AND-ed together.
>>> east = pl.LazyFrame( ... { ... "id": [100, 101, 102], ... "dur": [120, 140, 160], ... "rev": [12, 14, 16], ... "cores": [2, 8, 4], ... } ... ) >>> west = pl.LazyFrame( ... { ... "t_id": [404, 498, 676, 742], ... "time": [90, 130, 150, 170], ... "cost": [9, 13, 15, 16], ... "cores": [4, 2, 1, 4], ... } ... ) >>> east.join_where( ... west, ... pl.col("dur") < pl.col("time"), ... pl.col("rev") < pl.col("cost"), ... ).collect() shape: (5, 8) ┌─────┬─────┬─────┬───────┬──────┬──────┬──────┬─────────────┐ │ id ┆ dur ┆ rev ┆ cores ┆ t_id ┆ time ┆ cost ┆ cores_right │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╪═══════╪══════╪══════╪══════╪═════════════╡ │ 100 ┆ 120 ┆ 12 ┆ 2 ┆ 498 ┆ 130 ┆ 13 ┆ 2 │ │ 100 ┆ 120 ┆ 12 ┆ 2 ┆ 676 ┆ 150 ┆ 15 ┆ 1 │ │ 100 ┆ 120 ┆ 12 ┆ 2 ┆ 742 ┆ 170 ┆ 16 ┆ 4 │ │ 101 ┆ 140 ┆ 14 ┆ 8 ┆ 676 ┆ 150 ┆ 15 ┆ 1 │ │ 101 ┆ 140 ┆ 14 ┆ 8 ┆ 742 ┆ 170 ┆ 16 ┆ 4 │ └─────┴─────┴─────┴───────┴──────┴──────┴──────┴─────────────┘
To OR them together, use a single expression and the | operator.
>>> east.join_where( ... west, ... (pl.col("dur") < pl.col("time")) | (pl.col("rev") < pl.col("cost")), ... ).collect() shape: (6, 8) ┌─────┬─────┬─────┬───────┬──────┬──────┬──────┬─────────────┐ │ id ┆ dur ┆ rev ┆ cores ┆ t_id ┆ time ┆ cost ┆ cores_right │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╪═══════╪══════╪══════╪══════╪═════════════╡ │ 100 ┆ 120 ┆ 12 ┆ 2 ┆ 498 ┆ 130 ┆ 13 ┆ 2 │ │ 100 ┆ 120 ┆ 12 ┆ 2 ┆ 676 ┆ 150 ┆ 15 ┆ 1 │ │ 100 ┆ 120 ┆ 12 ┆ 2 ┆ 742 ┆ 170 ┆ 16 ┆ 4 │ │ 101 ┆ 140 ┆ 14 ┆ 8 ┆ 676 ┆ 150 ┆ 15 ┆ 1 │ │ 101 ┆ 140 ┆ 14 ┆ 8 ┆ 742 ┆ 170 ┆ 16 ┆ 4 │ │ 102 ┆ 160 ┆ 16 ┆ 4 ┆ 742 ┆ 170 ┆ 16 ┆ 4 │ └─────┴─────┴─────┴───────┴──────┴──────┴──────┴─────────────┘
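The row counts in the two examples above (5 rows when the predicates are AND-ed, 6 when they are OR-ed) can be checked with a plain nested-loop model of an inner join on arbitrary predicates. This is a conceptual sketch only: the helper `join_where_model` is hypothetical, the `cores` columns are left out for brevity, and polars does not necessarily evaluate the join this way.

```python
east = [
    {"id": 100, "dur": 120, "rev": 12},
    {"id": 101, "dur": 140, "rev": 14},
    {"id": 102, "dur": 160, "rev": 16},
]
west = [
    {"t_id": 404, "time": 90, "cost": 9},
    {"t_id": 498, "time": 130, "cost": 13},
    {"t_id": 676, "time": 150, "cost": 15},
    {"t_id": 742, "time": 170, "cost": 16},
]


def join_where_model(left, right, predicate):
    """Keep every (left, right) pair for which the predicate holds."""
    return [{**l, **r} for l in left for r in right if predicate(l, r)]


# Both predicates must hold (separate predicates are AND-ed).
and_rows = join_where_model(
    east, west, lambda l, r: l["dur"] < r["time"] and l["rev"] < r["cost"]
)
# Either predicate may hold (a single expression combined with |).
or_rows = join_where_model(
    east, west, lambda l, r: l["dur"] < r["time"] or l["rev"] < r["cost"]
)
# The pair that only the OR variant admits.
extra = [(r["id"], r["t_id"]) for r in or_rows if r not in and_rows]
```

The only pair added by the OR variant is (102, 742): its `dur` is below `time`, but its `rev` equals `cost`, so the second predicate fails.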
- last() LazyFrame[source]¶
Get the last row of the DataFrame.
Examples
>>> lf = pl.LazyFrame( ... { ... "a": [1, 5, 3], ... "b": [2, 4, 6], ... } ... ) >>> lf.last().collect() shape: (1, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 3 ┆ 6 │ └─────┴─────┘
- lazy = None¶
- limit(n: int = 5) LazyFrame[source]¶
Get the first n rows.
Alias for LazyFrame.head().
- Parameters:
- n
Number of rows to return.
Examples
>>> lf = pl.LazyFrame( ... { ... "a": [1, 2, 3, 4, 5, 6], ... "b": [7, 8, 9, 10, 11, 12], ... } ... ) >>> lf.limit().collect() shape: (5, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 7 │ │ 2 ┆ 8 │ │ 3 ┆ 9 │ │ 4 ┆ 10 │ │ 5 ┆ 11 │ └─────┴─────┘ >>> lf.limit(2).collect() shape: (2, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 7 │ │ 2 ┆ 8 │ └─────┴─────┘
- map_batches(function: Callable[[DataFrame], DataFrame], *, predicate_pushdown: bool = True, projection_pushdown: bool = True, slice_pushdown: bool = True, no_optimizations: bool = False, schema: None | SchemaDict = None, validate_output_schema: bool = True, streamable: bool = False) LazyFrame[source]¶
Apply a custom function.
It is important that the function returns a Polars DataFrame.
- Parameters:
- function
Lambda/function to apply.
- predicate_pushdown
Allow predicate pushdown optimization to pass this node.
- projection_pushdown
Allow projection pushdown optimization to pass this node.
- slice_pushdown
Allow slice pushdown optimization to pass this node.
- no_optimizations
Turn off all optimizations past this point.
- schema
Output schema of the function, if set to None we assume that the schema will remain unchanged by the applied function.
- validate_output_schema
It is paramount that polars’ schema is correct. This flag will ensure that the output schema of this function will be checked with the expected schema. Setting this to False will not do this check, but may lead to hard to debug bugs.
- streamable
Whether the given function is eligible to run with the streaming engine. This means that the function must produce the same result whether it is executed in batches or on the full dataset.
Warning
The schema of a LazyFrame must always be correct. It is up to the caller of this function to ensure that this invariant is upheld.
It is important that the optimization flags are correct. If the custom function for instance does an aggregation of a column, predicate_pushdown should not be allowed, as this prunes rows and will influence your aggregation results.
Notes
A UDF passed to map_batches must be pure, meaning that it cannot modify or depend on state other than its arguments.
Examples
>>> lf = ( ... pl.LazyFrame( ... { ... "a": pl.int_range(-100_000, 0, eager=True), ... "b": pl.int_range(0, 100_000, eager=True), ... } ... ) ... .map_batches(lambda x: 2 * x, streamable=True) ... .collect(engine="streaming") ... ) shape: (100_000, 2) ┌─────────┬────────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════════╪════════╡ │ -200000 ┆ 0 │ │ -199998 ┆ 2 │ │ -199996 ┆ 4 │ │ -199994 ┆ 6 │ │ … ┆ … │ │ -8 ┆ 199992 │ │ -6 ┆ 199994 │ │ -4 ┆ 199996 │ │ -2 ┆ 199998 │ └─────────┴────────┘
- match_to_schema(schema: SchemaDict | Schema, *, missing_columns: Literal['insert', 'raise'] | Mapping[str, Literal['insert', 'raise'] | Expr] = 'raise', missing_struct_fields: Literal['insert', 'raise'] | Mapping[str, Literal['insert', 'raise']] = 'raise', extra_columns: Literal['ignore', 'raise'] = 'raise', extra_struct_fields: Literal['ignore', 'raise'] | Mapping[str, Literal['ignore', 'raise']] = 'raise', integer_cast: Literal['upcast', 'forbid'] | Mapping[str, Literal['upcast', 'forbid']] = 'forbid', float_cast: Literal['upcast', 'forbid'] | Mapping[str, Literal['upcast', 'forbid']] = 'forbid') LazyFrame[source]¶
Match or evolve the schema of a LazyFrame into a specific schema.
By default, match_to_schema returns an error if the input schema does not exactly match the target schema. It also allows columns to be freely reordered, with additional coercion rules available through optional parameters.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- Parameters:
- schema
Target schema to match or evolve to.
- missing_columns
Raise or insert missing columns from the input with respect to the schema.
This can also be an expression per column with what to insert if it is missing.
- missing_struct_fields
Raise or insert missing struct fields from the input with respect to the schema.
- extra_columns
Raise or ignore extra columns from the input with respect to the schema.
- extra_struct_fields
Raise or ignore extra struct fields from the input with respect to the schema.
- integer_cast
Forbid or upcast integer columns from the input to the respective column in schema.
- float_cast
Forbid or upcast float columns from the input to the respective column in schema.
Examples
Ensuring the schema matches
>>> lf = pl.LazyFrame({"a": [1, 2, 3], "b": ["A", "B", "C"]}) >>> lf.match_to_schema({"a": pl.Int64, "b": pl.String}).collect() shape: (3, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ str │ ╞═════╪═════╡ │ 1 ┆ A │ │ 2 ┆ B │ │ 3 ┆ C │ └─────┴─────┘ >>> (lf.match_to_schema({"a": pl.Int64}).collect()) polars.exceptions.SchemaError: extra columns in `match_to_schema`: "b"
Adding missing columns
>>> ( ... pl.LazyFrame({"a": [1, 2, 3]}) ... .match_to_schema( ... {"a": pl.Int64, "b": pl.String}, ... missing_columns="insert", ... ) ... .collect() ... ) shape: (3, 2) ┌─────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ str │ ╞═════╪══════╡ │ 1 ┆ null │ │ 2 ┆ null │ │ 3 ┆ null │ └─────┴──────┘ >>> ( ... pl.LazyFrame({"a": [1, 2, 3]}) ... .match_to_schema( ... {"a": pl.Int64, "b": pl.String}, ... missing_columns={"b": pl.col.a.cast(pl.String)}, ... ) ... .collect() ... ) shape: (3, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ str │ ╞═════╪═════╡ │ 1 ┆ 1 │ │ 2 ┆ 2 │ │ 3 ┆ 3 │ └─────┴─────┘
Removing extra columns
>>> ( ... pl.LazyFrame({"a": [1, 2, 3], "b": ["A", "B", "C"]}) ... .match_to_schema( ... {"a": pl.Int64}, ... extra_columns="ignore", ... ) ... .collect() ... ) shape: (3, 1) ┌─────┐ │ a │ │ --- │ │ i64 │ ╞═════╡ │ 1 │ │ 2 │ │ 3 │ └─────┘
Upcasting integers and floats
>>> ( ... pl.LazyFrame( ... {"a": [1, 2, 3], "b": [1.0, 2.0, 3.0]}, ... schema={"a": pl.Int32, "b": pl.Float32}, ... ) ... .match_to_schema( ... {"a": pl.Int64, "b": pl.Float64}, ... integer_cast="upcast", ... float_cast="upcast", ... ) ... .collect() ... ) shape: (3, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ f64 │ ╞═════╪═════╡ │ 1 ┆ 1.0 │ │ 2 ┆ 2.0 │ │ 3 ┆ 3.0 │ └─────┴─────┘
- max() LazyFrame[source]¶
Aggregate the columns in the LazyFrame to their maximum value.
Examples
>>> lf = pl.LazyFrame( ... { ... "a": [1, 2, 3, 4], ... "b": [1, 2, 1, 1], ... } ... ) >>> lf.max().collect() shape: (1, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 4 ┆ 2 │ └─────┴─────┘
- mean() LazyFrame[source]¶
Aggregate the columns in the LazyFrame to their mean value.
Examples
>>> lf = pl.LazyFrame( ... { ... "a": [1, 2, 3, 4], ... "b": [1, 2, 1, 1], ... } ... ) >>> lf.mean().collect() shape: (1, 2) ┌─────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ f64 ┆ f64 │ ╞═════╪══════╡ │ 2.5 ┆ 1.25 │ └─────┴──────┘
- median() LazyFrame[source]¶
Aggregate the columns in the LazyFrame to their median value.
Examples
>>> lf = pl.LazyFrame( ... { ... "a": [1, 2, 3, 4], ... "b": [1, 2, 1, 1], ... } ... ) >>> lf.median().collect() shape: (1, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ f64 ┆ f64 │ ╞═════╪═════╡ │ 2.5 ┆ 1.0 │ └─────┴─────┘
- melt(id_vars: ColumnNameOrSelector | Sequence[ColumnNameOrSelector] | None = None, value_vars: ColumnNameOrSelector | Sequence[ColumnNameOrSelector] | None = None, variable_name: str | None = None, value_name: str | None = None, *, streamable: bool = True) LazyFrame[source]¶
Unpivot a DataFrame from wide to long format.
Optionally leaves identifiers set.
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars) while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis leaving just two non-identifier columns, ‘variable’ and ‘value’.
Deprecated since version 1.0.0: Use the unpivot() method instead.
- Parameters:
- id_vars
Column(s) or selector(s) to use as identifier variables.
- value_vars
Column(s) or selector(s) to use as value variables; if value_vars is empty, all columns that are not in id_vars will be used.
- variable_name
Name to give to the variable column. Defaults to “variable”
- value_name
Name to give to the value column. Defaults to “value”
- streamable
Allow this node to run in the streaming engine. If this runs in streaming, the output of the unpivot operation will not have a stable ordering.
- merge_sorted(other: LazyFrame, key: str) LazyFrame[source]¶
Take two sorted DataFrames and merge them by the sorted key.
The output of this operation will also be sorted. It is the caller's responsibility to ensure that the frames are sorted in ascending order by that key; otherwise the output will not make sense.
The schemas of both LazyFrames must be equal.
- Parameters:
- other
Other DataFrame that must be merged
- key
Key that is sorted.
Notes
No guarantee is given over the output row order when the key is equal between both dataframes.
The key must be sorted in ascending order.
Examples
>>> df0 = pl.LazyFrame( ... {"name": ["steve", "elise", "bob"], "age": [42, 44, 18]} ... ).sort("age") >>> df0.collect() shape: (3, 2) ┌───────┬─────┐ │ name ┆ age │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═══════╪═════╡ │ bob ┆ 18 │ │ steve ┆ 42 │ │ elise ┆ 44 │ └───────┴─────┘ >>> df1 = pl.LazyFrame( ... {"name": ["anna", "megan", "steve", "thomas"], "age": [21, 33, 42, 20]} ... ).sort("age") >>> df1.collect() shape: (4, 2) ┌────────┬─────┐ │ name ┆ age │ │ --- ┆ --- │ │ str ┆ i64 │ ╞════════╪═════╡ │ thomas ┆ 20 │ │ anna ┆ 21 │ │ megan ┆ 33 │ │ steve ┆ 42 │ └────────┴─────┘ >>> df0.merge_sorted(df1, key="age").collect() shape: (7, 2) ┌────────┬─────┐ │ name ┆ age │ │ --- ┆ --- │ │ str ┆ i64 │ ╞════════╪═════╡ │ bob ┆ 18 │ │ thomas ┆ 20 │ │ anna ┆ 21 │ │ megan ┆ 33 │ │ steve ┆ 42 │ │ steve ┆ 42 │ │ elise ┆ 44 │ └────────┴─────┘
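The merge in the example above behaves like a classic two-way merge of sorted sequences. As a conceptual sketch (using the standard library's `heapq.merge` rather than polars, and not a description of the polars implementation), the same rows can be merged as follows; as with merge_sorted, both inputs must already be sorted by the key.

```python
import heapq

# Rows as (name, age) tuples, each list already sorted by age.
df0 = [("bob", 18), ("steve", 42), ("elise", 44)]
df1 = [("thomas", 20), ("anna", 21), ("megan", 33), ("steve", 42)]

# Merge the two sorted inputs by the "age" key.
merged = list(heapq.merge(df0, df1, key=lambda row: row[1]))
ages = [age for _, age in merged]
```

The merged ages come out in ascending order, with the two rows for age 42 adjacent to each other.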
- min() LazyFrame[source]¶
Aggregate the columns in the LazyFrame to their minimum value.
Examples
>>> lf = pl.LazyFrame( ... { ... "a": [1, 2, 3, 4], ... "b": [1, 2, 1, 1], ... } ... ) >>> lf.min().collect() shape: (1, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 1 │ └─────┴─────┘
- null_count() LazyFrame[source]¶
Aggregate the columns in the LazyFrame as the sum of their null value count.
Examples
>>> lf = pl.LazyFrame( ... { ... "foo": [1, None, 3], ... "bar": [6, 7, None], ... "ham": ["a", "b", "c"], ... } ... ) >>> lf.null_count().collect() shape: (1, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ u32 ┆ u32 ┆ u32 │ ╞═════╪═════╪═════╡ │ 1 ┆ 1 ┆ 0 │ └─────┴─────┴─────┘
- pipe(function: ~collections.abc.Callable[[~typing.Concatenate[~dataframely._typing.LazyFrame[~dataframely._typing.S], ~P]], ~dataframely._typing.R], *args: ~typing.~P, **kwargs: ~typing.~P) R[source]¶
Offers a structured way to apply a sequence of user-defined functions (UDFs).
- Parameters:
- function
Callable; will receive the frame as the first parameter, followed by any given args/kwargs.
- *args
Arguments to pass to the UDF.
- **kwargs
Keyword arguments to pass to the UDF.
Examples
>>> def cast_str_to_int(lf: pl.LazyFrame, col_name: str) -> pl.LazyFrame: ... return lf.with_columns(pl.col(col_name).cast(pl.Int64)) >>> lf = pl.LazyFrame( ... { ... "a": [1, 2, 3, 4], ... "b": ["10", "20", "30", "40"], ... } ... ) >>> lf.pipe(cast_str_to_int, col_name="b").collect() shape: (4, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 10 │ │ 2 ┆ 20 │ │ 3 ┆ 30 │ │ 4 ┆ 40 │ └─────┴─────┘
>>> lf = pl.LazyFrame( ... { ... "b": [1, 2], ... "a": [3, 4], ... } ... ) >>> lf.collect() shape: (2, 2) ┌─────┬─────┐ │ b ┆ a │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 3 │ │ 2 ┆ 4 │ └─────┴─────┘ >>> lf.pipe(lambda lf: lf.select(sorted(lf.collect_schema()))).collect() shape: (2, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 3 ┆ 1 │ │ 4 ┆ 2 │ └─────┴─────┘
- pipe_with_schema(function: Callable[[LazyFrame, Schema], LazyFrame]) LazyFrame[source]¶
Allows altering the lazy frame during the plan stage with the resolved schema.
In contrast to pipe, this method does not execute function immediately but only during the plan stage. This allows using the resolved schema of the input to dynamically alter the lazy frame. This also means that any exceptions raised by function will only be emitted during the plan stage.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- Parameters:
- function
Callable; will receive the frame as the first parameter and the resolved schema as the second parameter.
Examples
>>> def cast_to_float_if_necessary( ... lf: pl.LazyFrame, schema: pl.Schema ... ) -> pl.LazyFrame: ... required_casts = [ ... pl.col(name).cast(pl.Float64) ... for name, dtype in schema.items() ... if not dtype.is_float() ... ] ... return lf.with_columns(required_casts) >>> lf = pl.LazyFrame( ... {"a": [1.0, 2.0], "b": ["1.0", "2.5"], "c": [2.0, 3.0]}, ... schema={"a": pl.Float64, "b": pl.String, "c": pl.Float32}, ... ) >>> lf.pipe_with_schema(cast_to_float_if_necessary).collect() shape: (2, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ f32 │ ╞═════╪═════╪═════╡ │ 1.0 ┆ 1.0 ┆ 2.0 │ │ 2.0 ┆ 2.5 ┆ 3.0 │ └─────┴─────┴─────┘
- profile(*, type_coercion: bool = True, predicate_pushdown: bool = True, projection_pushdown: bool = True, simplify_expression: bool = True, no_optimization: bool = False, slice_pushdown: bool = True, comm_subplan_elim: bool = True, comm_subexpr_elim: bool = True, cluster_with_columns: bool = True, collapse_joins: bool = True, show_plot: bool = False, truncate_nodes: int = 0, figsize: tuple[int, int] = (18, 8), engine: EngineType = 'auto', optimizations: QueryOptFlags = <polars.lazyframe.opt_flags.QueryOptFlags object>, **_kwargs: Any) tuple[DataFrame, DataFrame][source]¶
Profile a LazyFrame.
This will run the query and return a tuple containing the materialized DataFrame and a DataFrame that contains profiling information of each node that is executed.
The units of the timings are microseconds.
- Parameters:
- type_coercion
Do type coercion optimization.
Deprecated since version 1.30.0: Use the optimizations parameters.
- predicate_pushdown
Do predicate pushdown optimization.
Deprecated since version 1.30.0: Use the optimizations parameters.
- projection_pushdown
Do projection pushdown optimization.
Deprecated since version 1.30.0: Use the optimizations parameters.
- simplify_expression
Run simplify expressions optimization.
Deprecated since version 1.30.0: Use the optimizations parameters.
- no_optimization
Turn off (certain) optimizations.
Deprecated since version 1.30.0: Use the optimizations parameters.
- slice_pushdown
Slice pushdown optimization.
Deprecated since version 1.30.0: Use the optimizations parameters.
- comm_subplan_elim
Will try to cache branching subplans that occur on self-joins or unions.
Deprecated since version 1.30.0: Use the optimizations parameters.
- comm_subexpr_elim
Common subexpressions will be cached and reused.
Deprecated since version 1.30.0: Use the optimizations parameters.
- cluster_with_columns
Combine sequential independent calls to with_columns
Deprecated since version 1.30.0: Use the optimizations parameters.
- collapse_joins
Collapse a join and filters into a faster join
Deprecated since version 1.30.0: Use the optimizations parameters.
- show_plot
Show a gantt chart of the profiling result
- truncate_nodes
Truncate the label lengths in the gantt chart to this number of characters.
- figsize
matplotlib figsize of the profiling plot
- engine
Select the engine used to process the query, optional. At the moment, if set to “auto” (default), the query is run using the polars in-memory engine. Polars will also attempt to use the engine set by the POLARS_ENGINE_AFFINITY environment variable. If it cannot run the query using the selected engine, the query is run using the polars in-memory engine. If set to “gpu”, the GPU engine is used. Fine-grained control over the GPU engine, for example which device to use on a system with multiple devices, is possible by providing a GPUEngine object with configuration options.
Note
GPU mode is considered unstable. Not all queries will run successfully on the GPU, however, they should fall back transparently to the default engine if execution is not supported.
Running with POLARS_VERBOSE=1 will provide information if a query falls back (and why).
Note
The GPU engine does not support streaming, if streaming is enabled then GPU execution is switched off.
- optimizations
The optimization passes done during query optimization.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
Examples
>>> lf = pl.LazyFrame( ... { ... "a": ["a", "b", "a", "b", "b", "c"], ... "b": [1, 2, 3, 4, 5, 6], ... "c": [6, 5, 4, 3, 2, 1], ... } ... ) >>> lf.group_by("a", maintain_order=True).agg(pl.all().sum()).sort( ... "a" ... ).profile() (shape: (3, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ a ┆ 4 ┆ 10 │ │ b ┆ 11 ┆ 10 │ │ c ┆ 6 ┆ 1 │ └─────┴─────┴─────┘, shape: (3, 3) ┌─────────────────────────┬───────┬──────┐ │ node ┆ start ┆ end │ │ --- ┆ --- ┆ --- │ │ str ┆ u64 ┆ u64 │ ╞═════════════════════════╪═══════╪══════╡ │ optimization ┆ 0 ┆ 5 │ │ group_by_partitioned(a) ┆ 5 ┆ 470 │ │ sort(a) ┆ 475 ┆ 1964 │ └─────────────────────────┴───────┴──────┘)
- quantile(quantile: float | Expr, interpolation: QuantileMethod = 'nearest') LazyFrame[source]¶
Aggregate the columns in the LazyFrame to their quantile value.
- Parameters:
- quantile
Quantile between 0.0 and 1.0.
- interpolation{‘nearest’, ‘higher’, ‘lower’, ‘midpoint’, ‘linear’, ‘equiprobable’}
Interpolation method.
Examples
>>> lf = pl.LazyFrame( ... { ... "a": [1, 2, 3, 4], ... "b": [1, 2, 1, 1], ... } ... ) >>> lf.quantile(0.7).collect() shape: (1, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ f64 ┆ f64 │ ╞═════╪═════╡ │ 3.0 ┆ 1.0 │ └─────┴─────┘
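To see how the interpolation options differ, the sketch below implements common textbook definitions in plain Python. It is an assumption that these definitions line up with polars in every edge case (ties and null handling in particular may differ), the helper `quantile_model` is hypothetical, and the 'equiprobable' method is not modelled.

```python
import math


def quantile_model(values, q, interpolation="nearest"):
    """Quantile of `values` at fraction `q` under a textbook interpolation rule."""
    data = sorted(values)
    pos = q * (len(data) - 1)          # fractional index into the sorted data
    lo, hi = math.floor(pos), math.ceil(pos)
    if interpolation == "lower":
        return float(data[lo])
    if interpolation == "higher":
        return float(data[hi])
    if interpolation == "nearest":
        return float(data[round(pos)])
    if interpolation == "midpoint":
        return (data[lo] + data[hi]) / 2
    if interpolation == "linear":
        return data[lo] + (pos - lo) * (data[hi] - data[lo])
    raise ValueError(f"unsupported interpolation: {interpolation}")


a = [1, 2, 3, 4]
results = {
    m: quantile_model(a, 0.7, m)
    for m in ("nearest", "lower", "higher", "midpoint", "linear")
}
```

For column a and quantile 0.7 the fractional index is about 2.1, so 'nearest' and 'lower' pick 3, 'higher' picks 4, 'midpoint' gives 3.5 and 'linear' gives about 3.1.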
- remote(context: pc.ComputeContext | None = None, plan_type: pc._typing.PlanTypePreference = 'dot') pc.LazyFrameRemote[source]¶
Run a query remotely on Polars Cloud.
This allows you to run Polars remotely on one or more workers via several strategies for distributed compute.
Read more in the Announcement post
- Parameters:
- context
Compute context in which queries are executed. If none given, it will take the default context.
- plan_type: {‘plain’, ‘dot’}
Whether to give a dot diagram or a plain text version of the logical plan.
Examples
Run a query on a cloud instance.
>>> lf = pl.LazyFrame([1, 2, 3]).sum() >>> in_progress = lf.remote().collect() >>> # do some other work >>> in_progress.await_result() shape: (1, 1) ┌──────────┐ │ column_0 │ │ --- │ │ i64 │ ╞══════════╡ │ 6 │ └──────────┘
Run a query distributed.
>>> lf = ( ... pl.scan_parquet("s3://my_bucket/").group_by("key").agg(pl.sum("values")) ... ) >>> in_progress = lf.remote().distributed().collect() >>> in_progress.await_result() shape: (1, 1) ┌──────────┐ │ column_0 │ │ --- │ │ i64 │ ╞══════════╡ │ 6 │ └──────────┘
- remove(*predicates: IntoExprColumn | Iterable[IntoExprColumn] | bool | list[bool] | np.ndarray[Any, Any], **constraints: Any) LazyFrame[source]¶
Remove rows, dropping those that match the given predicate expression(s).
The original order of the remaining rows is preserved.
Rows where the filter predicate does not evaluate to True are retained (this includes rows where the predicate evaluates as null).
- Parameters:
- predicates
Expression that evaluates to a boolean Series.
- constraints
Column filters; use name = value to filter columns using the supplied value. Each constraint behaves the same as pl.col(name).eq(value), and is implicitly joined with the other filter conditions using &.
Notes
If you are transitioning from Pandas, and performing filter operations based on the comparison of two or more columns, please note that in Polars any comparison involving null values will result in a null result, not boolean True or False. As a result, these rows will not be removed. Ensure that null values are handled appropriately to avoid unexpected behaviour (see examples below).
Examples
>>> lf = pl.LazyFrame( ... { ... "foo": [2, 3, None, 4, 0], ... "bar": [5, 6, None, None, 0], ... "ham": ["a", "b", None, "c", "d"], ... } ... )
Remove rows matching a condition:
>>> lf.remove( ... pl.col("bar") >= 5, ... ).collect() shape: (3, 3) ┌──────┬──────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞══════╪══════╪══════╡ │ null ┆ null ┆ null │ │ 4 ┆ null ┆ c │ │ 0 ┆ 0 ┆ d │ └──────┴──────┴──────┘
Discard rows based on multiple conditions, combined with and/or operators:
>>> lf.remove( ... (pl.col("foo") >= 0) & (pl.col("bar") >= 0), ... ).collect() shape: (2, 3) ┌──────┬──────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞══════╪══════╪══════╡ │ null ┆ null ┆ null │ │ 4 ┆ null ┆ c │ └──────┴──────┴──────┘
>>> lf.remove( ... (pl.col("foo") >= 0) | (pl.col("bar") >= 0), ... ).collect() shape: (1, 3) ┌──────┬──────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞══════╪══════╪══════╡ │ null ┆ null ┆ null │ └──────┴──────┴──────┘
Provide multiple constraints using *args syntax:
>>> lf.remove( ... pl.col("ham").is_not_null(), ... pl.col("bar") >= 0, ... ).collect() shape: (2, 3) ┌──────┬──────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞══════╪══════╪══════╡ │ null ┆ null ┆ null │ │ 4 ┆ null ┆ c │ └──────┴──────┴──────┘
Provide constraint(s) using **kwargs syntax:
>>> lf.remove(foo=0, bar=0).collect() shape: (4, 3) ┌──────┬──────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞══════╪══════╪══════╡ │ 2 ┆ 5 ┆ a │ │ 3 ┆ 6 ┆ b │ │ null ┆ null ┆ null │ │ 4 ┆ null ┆ c │ └──────┴──────┴──────┘
Remove rows by comparing two columns against each other; in this case, we remove rows where the two columns are not equal (using ne_missing to ensure that null values compare equal):
>>> lf.remove( ... pl.col("foo").ne_missing(pl.col("bar")), ... ).collect() shape: (2, 3) ┌──────┬──────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞══════╪══════╪══════╡ │ null ┆ null ┆ null │ │ 0 ┆ 0 ┆ d │ └──────┴──────┴──────┘
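The null handling described in the notes means that removing rows by a predicate is not the same as filtering by its negation. The following pure-Python model of three-valued logic illustrates the difference on the example data; it is a sketch of the semantics only, with `None` standing in for null and a hypothetical helper `bar_ge_5`.

```python
rows = [
    {"foo": 2, "bar": 5, "ham": "a"},
    {"foo": 3, "bar": 6, "ham": "b"},
    {"foo": None, "bar": None, "ham": None},
    {"foo": 4, "bar": None, "ham": "c"},
    {"foo": 0, "bar": 0, "ham": "d"},
]


def bar_ge_5(row):
    """Three-valued comparison: a null input yields a null (None) result."""
    return None if row["bar"] is None else row["bar"] >= 5


# remove(pred): drop rows where the predicate is True, keep False and null.
after_remove = [r["ham"] for r in rows if bar_ge_5(r) is not True]

# filter(~pred): keep only rows where the negation is True, i.e. pred is False.
after_filter_not = [r["ham"] for r in rows if bar_ge_5(r) is False]
```

Removing keeps the two rows whose `bar` is null, whereas filtering on the negated predicate would discard them and keep only the row with `ham` equal to "d".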
- rename(mapping: Mapping[str, str] | Callable[[str], str], *, strict: bool = True) LazyFrame[source]¶
Rename column names.
- Parameters:
- mapping
Key value pairs that map from old name to new name, or a function that takes the old name as input and returns the new name.
- strict
Validate that all column names exist in the current schema, and throw an exception if any do not. (Note that this parameter is a no-op when passing a function to mapping).
Notes
If existing names are swapped (e.g. ‘A’ points to ‘B’ and ‘B’ points to ‘A’), polars will block projection and predicate pushdowns at this node.
Examples
>>> lf = pl.LazyFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> lf.rename({"foo": "apple"}).collect() shape: (3, 3) ┌───────┬─────┬─────┐ │ apple ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═══════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ │ 2 ┆ 7 ┆ b │ │ 3 ┆ 8 ┆ c │ └───────┴─────┴─────┘ >>> lf.rename(lambda column_name: "c" + column_name[1:]).collect() shape: (3, 3) ┌─────┬─────┬─────┐ │ coo ┆ car ┆ cam │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ │ 2 ┆ 7 ┆ b │ │ 3 ┆ 8 ┆ c │ └─────┴─────┴─────┘
- reverse() LazyFrame[source]¶
Reverse the DataFrame.
Examples
>>> lf = pl.LazyFrame( ... { ... "key": ["a", "b", "c"], ... "val": [1, 2, 3], ... } ... ) >>> lf.reverse().collect() shape: (3, 2) ┌─────┬─────┐ │ key ┆ val │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ c ┆ 3 │ │ b ┆ 2 │ │ a ┆ 1 │ └─────┴─────┘
- rolling(index_column: IntoExpr, *, period: str | timedelta, offset: str | timedelta | None = None, closed: ClosedInterval = 'right', group_by: IntoExpr | Iterable[IntoExpr] | None = None) LazyGroupBy[source]¶
Create rolling groups based on a temporal or integer column.
Unlike group_by_dynamic, the windows are determined by the individual values rather than by constant intervals. For constant intervals use LazyFrame.group_by_dynamic().
If you have a time series <t_0, t_1, …, t_n>, then by default the windows created will be
(t_0 - period, t_0]
(t_1 - period, t_1]
…
(t_n - period, t_n]
whereas if you pass a non-default offset, then the windows will be
(t_0 + offset, t_0 + offset + period]
(t_1 + offset, t_1 + offset + period]
…
(t_n + offset, t_n + offset + period]
The period and offset arguments are created either from a timedelta, or by using the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 calendar day)
1w (1 calendar week)
1mo (1 calendar month)
1q (1 calendar quarter)
1y (1 calendar year)
1i (1 index count)
Or combine them: “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds
By “calendar day”, we mean the corresponding time on the next day (which may not be 24 hours, due to daylight savings). Similarly for “calendar week”, “calendar month”, “calendar quarter”, and “calendar year”.
Changed in version 0.20.14: The by parameter was renamed group_by.
- Parameters:
- index_column
Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order (or, if group_by is specified, then it must be sorted in ascending order within each group).
In case of a rolling group by on indices, dtype needs to be one of {UInt32, UInt64, Int32, Int64}. Note that the first three get temporarily cast to Int64, so if performance matters use an Int64 column.
- period
Length of the window - must be non-negative.
- offset
Offset of the window. Default is -period.
- closed{‘right’, ‘left’, ‘both’, ‘none’}
Define which sides of the temporal interval are closed (inclusive).
- group_by
Also group by this column/these columns
- Returns:
- LazyGroupBy
Object you can call .agg on to aggregate by groups, the result of which will be sorted by index_column (but note that if group_by columns are passed, it will only be sorted within each group).
Examples
>>> dates = [ ... "2020-01-01 13:45:48", ... "2020-01-01 16:42:13", ... "2020-01-01 16:45:09", ... "2020-01-02 18:12:48", ... "2020-01-03 19:45:32", ... "2020-01-08 23:16:43", ... ] >>> df = pl.LazyFrame({"dt": dates, "a": [3, 7, 5, 9, 2, 1]}).with_columns( ... pl.col("dt").str.strptime(pl.Datetime).set_sorted() ... ) >>> out = ( ... df.rolling(index_column="dt", period="2d") ... .agg( ... pl.sum("a").alias("sum_a"), ... pl.min("a").alias("min_a"), ... pl.max("a").alias("max_a"), ... ) ... .collect() ... ) >>> out shape: (6, 4) ┌─────────────────────┬───────┬───────┬───────┐ │ dt ┆ sum_a ┆ min_a ┆ max_a │ │ --- ┆ --- ┆ --- ┆ --- │ │ datetime[μs] ┆ i64 ┆ i64 ┆ i64 │ ╞═════════════════════╪═══════╪═══════╪═══════╡ │ 2020-01-01 13:45:48 ┆ 3 ┆ 3 ┆ 3 │ │ 2020-01-01 16:42:13 ┆ 10 ┆ 3 ┆ 7 │ │ 2020-01-01 16:45:09 ┆ 15 ┆ 3 ┆ 7 │ │ 2020-01-02 18:12:48 ┆ 24 ┆ 3 ┆ 9 │ │ 2020-01-03 19:45:32 ┆ 11 ┆ 2 ┆ 9 │ │ 2020-01-08 23:16:43 ┆ 1 ┆ 1 ┆ 1 │ └─────────────────────┴───────┴───────┴───────┘
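The sum_a column in the example above follows directly from the default window definition (t_i - period, t_i]. The plain-Python sketch below recomputes it; it models the window rule only, the helper `rolling_sums` is hypothetical, and `timedelta(days=2)` is assumed to be equivalent to the "2d" period for these timezone-naive timestamps.

```python
from datetime import datetime, timedelta

timestamps = [
    datetime(2020, 1, 1, 13, 45, 48),
    datetime(2020, 1, 1, 16, 42, 13),
    datetime(2020, 1, 1, 16, 45, 9),
    datetime(2020, 1, 2, 18, 12, 48),
    datetime(2020, 1, 3, 19, 45, 32),
    datetime(2020, 1, 8, 23, 16, 43),
]
values = [3, 7, 5, 9, 2, 1]


def rolling_sums(ts, vals, period):
    """Sum the values whose timestamp falls in (t - period, t] for each t."""
    sums = []
    for t in ts:
        start = t - period
        # closed="right": the left edge is excluded, the right edge included.
        sums.append(sum(v for u, v in zip(ts, vals) if start < u <= t))
    return sums


sum_a = rolling_sums(timestamps, values, timedelta(days=2))
```

For instance, the window ending at 2020-01-03 19:45:32 starts just after 2020-01-01 19:45:32, so it contains only the last two of the first five rows and sums to 11.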
- property schema: Schema¶
Get an ordered mapping of column names to their data type.
Warning
Resolving the schema of a LazyFrame is a potentially expensive operation. Using collect_schema() is the idiomatic way to resolve the schema. This property exists only for symmetry with the DataFrame class.
Examples
>>> lf = pl.LazyFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6.0, 7.0, 8.0], ... "ham": ["a", "b", "c"], ... } ... ) >>> lf.schema Schema({'foo': Int64, 'bar': Float64, 'ham': String})
- select(*exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr) LazyFrame[source]¶
Select columns from this LazyFrame.
- Parameters:
- *exprs
Column(s) to select, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.
- **named_exprs
Additional columns to select, specified as keyword arguments. The columns will be renamed to the keyword used.
Examples
Pass the name of a column to select that column.
>>> lf = pl.LazyFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> lf.select("foo").collect()
shape: (3, 1)
┌─────┐
│ foo │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘
Multiple columns can be selected by passing a list of column names.
>>> lf.select(["foo", "bar"]).collect()
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 6   │
│ 2   ┆ 7   │
│ 3   ┆ 8   │
└─────┴─────┘
Multiple columns can also be selected using positional arguments instead of a list. Expressions are also accepted.
>>> lf.select(pl.col("foo"), pl.col("bar") + 1).collect()
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 7   │
│ 2   ┆ 8   │
│ 3   ┆ 9   │
└─────┴─────┘
Use keyword arguments to easily name your expression inputs.
>>> lf.select(
...     threshold=pl.when(pl.col("foo") > 2).then(10).otherwise(0)
... ).collect()
shape: (3, 1)
┌───────────┐
│ threshold │
│ ---       │
│ i32       │
╞═══════════╡
│ 0         │
│ 0         │
│ 10        │
└───────────┘
- select_seq(*exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr) LazyFrame[source]¶
Select columns from this LazyFrame.
This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap.
- Parameters:
- *exprs
Column(s) to select, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.
- **named_exprs
Additional columns to select, specified as keyword arguments. The columns will be renamed to the keyword used.
- serialize(file: IOBase | str | Path | None = None, *, format: SerializationFormat = 'binary') bytes | str | None[source]¶
Serialize the logical plan of this LazyFrame to a file, or return it as bytes (binary format) or as a string (JSON format).
- Parameters:
- file
File path to which the result should be written. If set to None (default), the output is returned instead: bytes for the binary format, a string for JSON.
- format
The format in which to serialize. Options:
“binary”: Serialize to binary format (bytes). This is the default.
“json”: Serialize to JSON format (string) (deprecated).
Notes
Serialization is not stable across Polars versions: a LazyFrame serialized in one Polars version may not be deserializable in another Polars version.
Examples
Serialize the logical plan into a binary representation.
>>> lf = pl.LazyFrame({"a": [1, 2, 3]}).sum()
>>> bytes = lf.serialize()
The bytes can later be deserialized back into a LazyFrame.
>>> import io
>>> pl.LazyFrame.deserialize(io.BytesIO(bytes)).collect()
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 6   │
└─────┘
- set_sorted = None¶
- shift(n: int | IntoExprColumn = 1, *, fill_value: IntoExpr | None = None) LazyFrame[source]¶
Shift values by the given number of indices.
- Parameters:
- n
Number of indices to shift forward. If a negative value is passed, values are shifted in the opposite direction instead.
- fill_value
Fill the resulting null values with this value. Accepts scalar expression input. Non-expression inputs are parsed as literals.
Notes
This method is similar to the LAG operation in SQL when the value for n is positive. With a negative value for n, it is similar to LEAD.
Examples
By default, values are shifted forward by one index.
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [5, 6, 7, 8],
...     }
... )
>>> lf.shift().collect()
shape: (4, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ null ┆ null │
│ 1    ┆ 5    │
│ 2    ┆ 6    │
│ 3    ┆ 7    │
└──────┴──────┘
Pass a negative value to shift in the opposite direction instead.
>>> lf.shift(-2).collect()
shape: (4, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 3    ┆ 7    │
│ 4    ┆ 8    │
│ null ┆ null │
│ null ┆ null │
└──────┴──────┘
Specify fill_value to fill the resulting null values.
>>> lf.shift(-2, fill_value=100).collect()
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 3   ┆ 7   │
│ 4   ┆ 8   │
│ 100 ┆ 100 │
│ 100 ┆ 100 │
└─────┴─────┘
- show_graph(*, optimized: bool = True, show: bool = True, output_path: str | Path | None = None, raw_output: bool = False, figsize: tuple[float, float] = (16.0, 12.0), type_coercion: bool = True, _type_check: bool = True, predicate_pushdown: bool = True, projection_pushdown: bool = True, simplify_expression: bool = True, slice_pushdown: bool = True, comm_subplan_elim: bool = True, comm_subexpr_elim: bool = True, cluster_with_columns: bool = True, collapse_joins: bool = True, engine: EngineType = 'auto', plan_stage: PlanStage = 'ir', _check_order: bool = True, optimizations: QueryOptFlags = <polars.lazyframe.opt_flags.QueryOptFlags object>) str | None[source]¶
Show a plot of the query plan.
Note that Graphviz must be installed to render the visualization (if not already present, you can download it here: https://graphviz.org/download).
- Parameters:
- optimized
Optimize the query plan.
- show
Show the figure.
- output_path
Write the figure to disk.
- raw_output
Return dot syntax. This cannot be combined with show and/or output_path.
- figsize
Passed to matplotlib if show == True.
- type_coercion
Do type coercion optimization.
Deprecated since version 1.30.0: Use the optimizations parameters.
- predicate_pushdown
Do predicate pushdown optimization.
Deprecated since version 1.30.0: Use the optimizations parameters.
- projection_pushdown
Do projection pushdown optimization.
Deprecated since version 1.30.0: Use the optimizations parameters.
- simplify_expression
Run simplify expressions optimization.
Deprecated since version 1.30.0: Use the optimizations parameters.
- slice_pushdown
Slice pushdown optimization.
Deprecated since version 1.30.0: Use the optimizations parameters.
- comm_subplan_elim
Will try to cache branching subplans that occur on self-joins or unions.
Deprecated since version 1.30.0: Use the optimizations parameters.
- comm_subexpr_elim
Common subexpressions will be cached and reused.
Deprecated since version 1.30.0: Use the optimizations parameters.
- cluster_with_columns
Combine sequential independent calls to with_columns.
Deprecated since version 1.30.0: Use the optimizations parameters.
- collapse_joins
Collapse a join and filters into a faster join.
Deprecated since version 1.30.0: Use the optimizations parameters.
- engine
Select the engine used to process the query, optional. At the moment, if set to “auto” (default), the query is run using the polars in-memory engine. Polars will also attempt to use the engine set by the POLARS_ENGINE_AFFINITY environment variable. If it cannot run the query using the selected engine, the query is run using the polars in-memory engine. If set to “gpu”, the GPU engine is used. Fine-grained control over the GPU engine, for example which device to use on a system with multiple devices, is possible by providing a GPUEngine object with configuration options.
Note
GPU mode is considered unstable. Not all queries will run successfully on the GPU, however, they should fall back transparently to the default engine if execution is not supported.
Running with POLARS_VERBOSE=1 will provide information if a query falls back (and why).
Note
The GPU engine does not support streaming; if streaming is enabled, GPU execution is switched off.
- plan_stage{‘ir’, ‘physical’}
Select the stage to display. Currently only the streaming engine has a separate physical stage, for the other engines both IR and physical are the same.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...         "c": [6, 5, 4, 3, 2, 1],
...     }
... )
>>> lf.group_by("a", maintain_order=True).agg(pl.all().sum()).sort(
...     "a"
... ).show_graph()
- sink_csv(path: str | Path | IO[bytes] | IO[str] | PartitioningScheme, *, include_bom: bool = False, include_header: bool = True, separator: str = ', ', line_terminator: str = '\n', quote_char: str = '"', batch_size: int = 1024, datetime_format: str | None = None, date_format: str | None = None, time_format: str | None = None, float_scientific: bool | None = None, float_precision: int | None = None, decimal_comma: bool = False, null_value: str | None = None, quote_style: CsvQuoteStyle | None = None, maintain_order: bool = True, storage_options: dict[str, Any] | None = None, credential_provider: CredentialProviderFunction | Literal['auto'] | None = 'auto', retries: int = 2, sync_on_close: SyncOnCloseMethod | None = None, mkdir: bool = False, lazy: bool = False, engine: EngineType = 'auto', optimizations: QueryOptFlags = <polars.lazyframe.opt_flags.QueryOptFlags object>) LazyFrame | None[source]¶
Evaluate the query in streaming mode and write to a CSV file.
This allows streaming results that are larger than RAM to be written to disk.
- Parameters:
- path
File path to which the file should be written.
- include_bom
Whether to include UTF-8 BOM in the CSV output.
- include_header
Whether to include header in the CSV output.
- separator
Separate CSV fields with this symbol.
- line_terminator
String used to end each row.
- quote_char
Byte to use as quoting character.
- batch_size
Number of rows that will be processed per thread.
- datetime_format
A format string, with the specifiers defined by the chrono Rust crate. If no format specified, the default fractional-second precision is inferred from the maximum timeunit found in the frame’s Datetime cols (if any).
- date_format
A format string, with the specifiers defined by the chrono Rust crate.
- time_format
A format string, with the specifiers defined by the chrono Rust crate.
- float_scientific
Whether to use scientific form always (true), never (false), or automatically (None) for Float32 and Float64 datatypes.
- float_precision
Number of decimal places to write, applied to both Float32 and Float64 datatypes.
- decimal_comma
Use a comma as the decimal separator instead of a point. Floats will be encapsulated in quotes if necessary; set the field separator to override.
- null_value
A string representing null values (defaulting to the empty string).
- quote_style{‘necessary’, ‘always’, ‘non_numeric’, ‘never’}
Determines the quoting strategy used.
necessary (default): This puts quotes around fields only when necessary. They are necessary when fields contain a quote, delimiter or record terminator. Quotes are also necessary when writing an empty record (which is indistinguishable from a record with one empty field). This is the default.
always: This puts quotes around every field. Always.
never: This never puts quotes around fields, even if that results in invalid CSV data (e.g.: by not quoting strings containing the separator).
non_numeric: This puts quotes around all fields that are non-numeric. Namely, when writing a field that does not parse as a valid float or integer, then quotes will be used even if they aren't strictly necessary.
- maintain_order
Maintain the order in which data is processed. Setting this to False will be slightly faster.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- storage_options
Options that indicate how to connect to a cloud provider.
The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:
Hugging Face (hf://): Accepts an API key under the token parameter: {‘token’: ‘…’}, or by setting the HF_TOKEN environment variable.
If storage_options is not provided, Polars will try to infer the information from environment variables.
- credential_provider
Provide a function that can be called to provide cloud storage credentials. The function is expected to return a dictionary of credential keys along with an optional credential expiry time.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- retries
Number of retries if accessing a cloud instance fails.
- sync_on_close: { None, ‘data’, ‘all’ }
Sync to disk before closing a file.
None does not sync.
data syncs the file contents.
all syncs the file contents and metadata.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- mkdir: bool
Recursively create all the directories in the path.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- lazy: bool
Wait to start execution until collect is called.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- engine
Select the engine used to process the query, optional. At the moment, if set to “auto” (default), the query is run using the polars streaming engine. Polars will also attempt to use the engine set by the POLARS_ENGINE_AFFINITY environment variable. If it cannot run the query using the selected engine, the query is run using the polars streaming engine.
- optimizations
The optimization passes done during query optimization.
This has no effect if lazy is set to True.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- Returns:
- LazyFrame | None
Examples
>>> lf = pl.scan_csv("/path/to/my_larger_than_ram_file.csv")
>>> lf.sink_csv("out.csv")
- sink_ipc(path: str | Path | IO[bytes] | PartitioningScheme, *, compression: IpcCompression | None = 'uncompressed', compat_level: CompatLevel | None = None, maintain_order: bool = True, storage_options: dict[str, Any] | None = None, credential_provider: CredentialProviderFunction | Literal['auto'] | None = 'auto', retries: int = 2, sync_on_close: SyncOnCloseMethod | None = None, mkdir: bool = False, lazy: bool = False, engine: EngineType = 'auto', optimizations: QueryOptFlags = <polars.lazyframe.opt_flags.QueryOptFlags object>) LazyFrame | None[source]¶
Evaluate the query in streaming mode and write to an IPC file.
This allows streaming results that are larger than RAM to be written to disk.
- Parameters:
- path
File path to which the file should be written.
- compression{‘uncompressed’, ‘lz4’, ‘zstd’}
Choose “zstd” for good compression performance. Choose “lz4” for fast compression/decompression.
- compat_level
Use a specific compatibility level when exporting Polars’ internal data structures.
- maintain_order
Maintain the order in which data is processed. Setting this to False will be slightly faster.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- storage_options
Options that indicate how to connect to a cloud provider.
The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:
Hugging Face (hf://): Accepts an API key under the token parameter: {‘token’: ‘…’}, or by setting the HF_TOKEN environment variable.
If storage_options is not provided, Polars will try to infer the information from environment variables.
- credential_provider
Provide a function that can be called to provide cloud storage credentials. The function is expected to return a dictionary of credential keys along with an optional credential expiry time.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- retries
Number of retries if accessing a cloud instance fails.
- sync_on_close: { None, ‘data’, ‘all’ }
Sync to disk before closing a file.
None does not sync.
data syncs the file contents.
all syncs the file contents and metadata.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- mkdir: bool
Recursively create all the directories in the path.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- lazy: bool
Wait to start execution until collect is called.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- engine
Select the engine used to process the query, optional. At the moment, if set to “auto” (default), the query is run using the polars streaming engine. Polars will also attempt to use the engine set by the POLARS_ENGINE_AFFINITY environment variable. If it cannot run the query using the selected engine, the query is run using the polars streaming engine.
Note
The GPU engine is currently not supported.
- optimizations
The optimization passes done during query optimization.
This has no effect if lazy is set to True.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- Returns:
- LazyFrame | None
Examples
>>> lf = pl.scan_csv("/path/to/my_larger_than_ram_file.csv")
>>> lf.sink_ipc("out.arrow")
- sink_ndjson(path: str | Path | IO[bytes] | IO[str] | PartitioningScheme, *, maintain_order: bool = True, storage_options: dict[str, Any] | None = None, credential_provider: CredentialProviderFunction | Literal['auto'] | None = 'auto', retries: int = 2, sync_on_close: SyncOnCloseMethod | None = None, mkdir: bool = False, lazy: bool = False, engine: EngineType = 'auto', optimizations: QueryOptFlags = <polars.lazyframe.opt_flags.QueryOptFlags object>) LazyFrame | None[source]¶
Evaluate the query in streaming mode and write to an NDJSON file.
This allows streaming results that are larger than RAM to be written to disk.
- Parameters:
- path
File path to which the file should be written.
- maintain_order
Maintain the order in which data is processed. Setting this to False will be slightly faster.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- storage_options
Options that indicate how to connect to a cloud provider.
The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:
Hugging Face (hf://): Accepts an API key under the token parameter: {‘token’: ‘…’}, or by setting the HF_TOKEN environment variable.
If storage_options is not provided, Polars will try to infer the information from environment variables.
- credential_provider
Provide a function that can be called to provide cloud storage credentials. The function is expected to return a dictionary of credential keys along with an optional credential expiry time.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- retries
Number of retries if accessing a cloud instance fails.
- sync_on_close: { None, ‘data’, ‘all’ }
Sync to disk before closing a file.
None does not sync.
data syncs the file contents.
all syncs the file contents and metadata.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- mkdir: bool
Recursively create all the directories in the path.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- lazy: bool
Wait to start execution until collect is called.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- engine
Select the engine used to process the query, optional. At the moment, if set to “auto” (default), the query is run using the polars streaming engine. Polars will also attempt to use the engine set by the POLARS_ENGINE_AFFINITY environment variable. If it cannot run the query using the selected engine, the query is run using the polars streaming engine.
- optimizations
The optimization passes done during query optimization.
This has no effect if lazy is set to True.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- Returns:
- LazyFrame | None
Examples
>>> lf = pl.scan_csv("/path/to/my_larger_than_ram_file.csv")
>>> lf.sink_ndjson("out.ndjson")
- sink_parquet(path: str | Path | IO[bytes] | PartitioningScheme, *, compression: str = 'zstd', compression_level: int | None = None, statistics: bool | str | dict[str, bool] = True, row_group_size: int | None = None, data_page_size: int | None = None, maintain_order: bool = True, storage_options: dict[str, Any] | None = None, credential_provider: CredentialProviderFunction | Literal['auto'] | None = 'auto', retries: int = 2, sync_on_close: SyncOnCloseMethod | None = None, metadata: ParquetMetadata | None = None, mkdir: bool = False, lazy: bool = False, field_overwrites: ParquetFieldOverwrites | Sequence[ParquetFieldOverwrites] | Mapping[str, ParquetFieldOverwrites] | None = None, engine: EngineType = 'auto', optimizations: QueryOptFlags = <polars.lazyframe.opt_flags.QueryOptFlags object>) LazyFrame | None[source]¶
Evaluate the query in streaming mode and write to a Parquet file.
This allows streaming results that are larger than RAM to be written to disk.
- Parameters:
- path
File path to which the file should be written.
- compression{‘lz4’, ‘uncompressed’, ‘snappy’, ‘gzip’, ‘lzo’, ‘brotli’, ‘zstd’}
Choose “zstd” for good compression performance. Choose “lz4” for fast compression/decompression. Choose “snappy” for more backwards compatibility guarantees when you deal with older parquet readers.
- compression_level
The level of compression to use. Higher compression means smaller files on disk.
“gzip” : min-level: 0, max-level: 9.
“brotli” : min-level: 0, max-level: 11.
“zstd” : min-level: 1, max-level: 22.
- statistics
Write statistics to the parquet headers. This is the default behavior.
Possible values:
True: enable default set of statistics (default). Some statistics may be disabled.
False: disable all statistics
“full”: calculate and write all available statistics. Cannot be combined with use_pyarrow.
{ “statistic-key”: True / False, … }. Cannot be combined with use_pyarrow. Available keys:
“min”: column minimum value (default: True)
“max”: column maximum value (default: True)
“distinct_count”: number of unique column values (default: False)
“null_count”: number of null values in column (default: True)
- row_group_size
Size of the row groups in number of rows. If None (default), the chunks of the DataFrame are used. Writing in smaller chunks may reduce memory pressure and improve writing speeds.
- data_page_size
Size limit of individual data pages. If not set, defaults to 1024 * 1024 bytes.
- maintain_order
Maintain the order in which data is processed. Setting this to False will be slightly faster.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- storage_options
Options that indicate how to connect to a cloud provider.
The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:
Hugging Face (hf://): Accepts an API key under the token parameter: {‘token’: ‘…’}, or by setting the HF_TOKEN environment variable.
If storage_options is not provided, Polars will try to infer the information from environment variables.
- credential_provider
Provide a function that can be called to provide cloud storage credentials. The function is expected to return a dictionary of credential keys along with an optional credential expiry time.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- retries
Number of retries if accessing a cloud instance fails.
- sync_on_close: { None, ‘data’, ‘all’ }
Sync to disk before closing a file.
None does not sync.
data syncs the file contents.
all syncs the file contents and metadata.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- metadata
A dictionary or callback to add key-values to the file-level Parquet metadata.
Warning
This functionality is considered experimental. It may be removed or changed at any point without it being considered a breaking change.
- mkdir: bool
Recursively create all the directories in the path.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- lazy: bool
Wait to start execution until collect is called.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- field_overwrites
Property overwrites for individual Parquet fields.
This allows more control over the writing process to the granularity of a Parquet field.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- engine
Select the engine used to process the query, optional. At the moment, if set to “auto” (default), the query is run using the polars streaming engine. Polars will also attempt to use the engine set by the POLARS_ENGINE_AFFINITY environment variable. If it cannot run the query using the selected engine, the query is run using the polars streaming engine.
- optimizations
The optimization passes done during query optimization.
This has no effect if lazy is set to True.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- Returns:
- LazyFrame | None
Examples
>>> lf = pl.scan_csv("/path/to/my_larger_than_ram_file.csv")
>>> lf.sink_parquet("out.parquet")
- slice(offset: int, length: int | None = None) LazyFrame[source]¶
Get a slice of this LazyFrame.
- Parameters:
- offset
Start index. Negative indexing is supported.
- length
Length of the slice. If set to None, all rows starting at the offset will be selected.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": ["x", "y", "z"],
...         "b": [1, 3, 5],
...         "c": [2, 4, 6],
...     }
... )
>>> lf.slice(1, 2).collect()
shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ y   ┆ 3   ┆ 4   │
│ z   ┆ 5   ┆ 6   │
└─────┴─────┴─────┘
- sort(by: IntoExpr | Iterable[IntoExpr], *more_by: IntoExpr, descending: bool | Sequence[bool] = False, nulls_last: bool | Sequence[bool] = False, maintain_order: bool = False, multithreaded: bool = True) LazyFrame[source]¶
Sort the LazyFrame by the given columns.
- Parameters:
- by
Column(s) to sort by. Accepts expression input, including selectors. Strings are parsed as column names.
- *more_by
Additional columns to sort by, specified as positional arguments.
- descending
Sort in descending order. When sorting by multiple columns, can be specified per column by passing a sequence of booleans.
- nulls_last
Place null values last; can specify a single boolean applying to all columns or a sequence of booleans for per-column control.
- maintain_order
Whether the order should be maintained if elements are equal. Note that if True, streaming is not possible and performance might be worse, since this requires a stable sort.
- multithreaded
Sort using multiple threads.
Examples
Pass a single column name to sort by that column.
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, None],
...         "b": [6.0, 5.0, 4.0],
...         "c": ["a", "c", "b"],
...     }
... )
>>> lf.sort("a").collect()
shape: (3, 3)
┌──────┬─────┬─────┐
│ a    ┆ b   ┆ c   │
│ ---  ┆ --- ┆ --- │
│ i64  ┆ f64 ┆ str │
╞══════╪═════╪═════╡
│ null ┆ 4.0 ┆ b   │
│ 1    ┆ 6.0 ┆ a   │
│ 2    ┆ 5.0 ┆ c   │
└──────┴─────┴─────┘
Sorting by expressions is also supported.
>>> lf.sort(pl.col("a") + pl.col("b") * 2, nulls_last=True).collect()
shape: (3, 3)
┌──────┬─────┬─────┐
│ a    ┆ b   ┆ c   │
│ ---  ┆ --- ┆ --- │
│ i64  ┆ f64 ┆ str │
╞══════╪═════╪═════╡
│ 2    ┆ 5.0 ┆ c   │
│ 1    ┆ 6.0 ┆ a   │
│ null ┆ 4.0 ┆ b   │
└──────┴─────┴─────┘
Sort by multiple columns by passing a list of columns.
>>> lf.sort(["c", "a"], descending=True).collect()
shape: (3, 3)
┌──────┬─────┬─────┐
│ a    ┆ b   ┆ c   │
│ ---  ┆ --- ┆ --- │
│ i64  ┆ f64 ┆ str │
╞══════╪═════╪═════╡
│ 2    ┆ 5.0 ┆ c   │
│ null ┆ 4.0 ┆ b   │
│ 1    ┆ 6.0 ┆ a   │
└──────┴─────┴─────┘
Or use positional arguments to sort by multiple columns in the same way.
>>> lf.sort("c", "a", descending=[False, True]).collect()
shape: (3, 3)
┌──────┬─────┬─────┐
│ a    ┆ b   ┆ c   │
│ ---  ┆ --- ┆ --- │
│ i64  ┆ f64 ┆ str │
╞══════╪═════╪═════╡
│ 1    ┆ 6.0 ┆ a   │
│ null ┆ 4.0 ┆ b   │
│ 2    ┆ 5.0 ┆ c   │
└──────┴─────┴─────┘
- sql(query: str, *, table_name: str = 'self') LazyFrame[source]¶
Execute a SQL query against the LazyFrame.
Added in version 0.20.23.
Warning
This functionality is considered unstable, although it is close to being considered stable. It may be changed at any point without it being considered a breaking change.
- Parameters:
- query
SQL query to execute.
- table_name
Optionally provide an explicit name for the table that represents the calling frame (defaults to “self”).
See also
SQLContext
Notes
The calling frame is automatically registered as a table in the SQL context under the name “self”. If you want access to the DataFrames and LazyFrames found in the current globals, use the top-level pl.sql.
More control over registration and execution behaviour is available by using the SQLContext object.
Examples
>>> lf1 = pl.LazyFrame({"a": [1, 2, 3], "b": [6, 7, 8], "c": ["z", "y", "x"]})
>>> lf2 = pl.LazyFrame({"a": [3, 2, 1], "d": [125, -654, 888]})
Query the LazyFrame using SQL:
>>> lf1.sql("SELECT c, b FROM self WHERE a > 1").collect()
shape: (2, 2)
┌─────┬─────┐
│ c   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ y   ┆ 7   │
│ x   ┆ 8   │
└─────┴─────┘
Apply SQL transforms (aliasing “self” to “frame”) then filter natively (you can freely mix SQL and native operations):
>>> lf1.sql(
...     query='''
...         SELECT
...             a,
...             (a % 2 == 0) AS a_is_even,
...             (b::float4 / 2) AS "b/2",
...             CONCAT_WS(':', c, c, c) AS c_c_c
...         FROM frame
...         ORDER BY a
...     ''',
...     table_name="frame",
... ).filter(~pl.col("c_c_c").str.starts_with("x")).collect()
shape: (2, 4)
┌─────┬───────────┬─────┬───────┐
│ a   ┆ a_is_even ┆ b/2 ┆ c_c_c │
│ --- ┆ ---       ┆ --- ┆ ---   │
│ i64 ┆ bool      ┆ f32 ┆ str   │
╞═════╪═══════════╪═════╪═══════╡
│ 1   ┆ false     ┆ 3.0 ┆ z:z:z │
│ 2   ┆ true      ┆ 3.5 ┆ y:y:y │
└─────┴───────────┴─────┴───────┘
- std(ddof: int = 1) LazyFrame[source]¶
Aggregate the columns in the LazyFrame to their standard deviation value.
- Parameters:
- ddof
“Delta Degrees of Freedom”: the divisor used in the calculation is N - ddof, where N represents the number of elements. By default ddof is 1.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.std().collect()
shape: (1, 2)
┌──────────┬─────┐
│ a        ┆ b   │
│ ---      ┆ --- │
│ f64      ┆ f64 │
╞══════════╪═════╡
│ 1.290994 ┆ 0.5 │
└──────────┴─────┘
>>> lf.std(ddof=0).collect()
shape: (1, 2)
┌──────────┬──────────┐
│ a        ┆ b        │
│ ---      ┆ ---      │
│ f64      ┆ f64      │
╞══════════╪══════════╡
│ 1.118034 ┆ 0.433013 │
└──────────┴──────────┘
- sum() LazyFrame[source]¶
Aggregate the columns in the LazyFrame to their sum value.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [1, 2, 1, 1],
...     }
... )
>>> lf.sum().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 10  ┆ 5   │
└─────┴─────┘
- tail(n: int = 5) LazyFrame[source]¶
Get the last n rows.
- Parameters:
- n
Number of rows to return.
Examples
>>> lf = pl.LazyFrame(
...     {
...         "a": [1, 2, 3, 4, 5, 6],
...         "b": [7, 8, 9, 10, 11, 12],
...     }
... )
>>> lf.tail().collect()
shape: (5, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 8   │
│ 3   ┆ 9   │
│ 4   ┆ 10  │
│ 5   ┆ 11  │
│ 6   ┆ 12  │
└─────┴─────┘
>>> lf.tail(2).collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 5   ┆ 11  │
│ 6   ┆ 12  │
└─────┴─────┘
- top_k(k: int, *, by: IntoExpr | Iterable[IntoExpr], reverse: bool | Sequence[bool] = False) LazyFrame[source]¶
Return the k largest rows.
Non-null elements are always preferred over null elements, regardless of the value of reverse. The output is not guaranteed to be in any particular order; call sort() after this function if you wish the output to be sorted.
Changed in version 1.0.0: The descending parameter was renamed reverse.
- Parameters:
- k
Number of rows to return.
- by
Column(s) used to determine the top rows. Accepts expression input. Strings are parsed as column names.
- reverse
Consider the k smallest elements of the by column(s) (instead of the k largest). This can be specified per column by passing a sequence of booleans.
Examples
>>> lf = pl.LazyFrame( ... { ... "a": ["a", "b", "a", "b", "b", "c"], ... "b": [2, 1, 1, 3, 2, 1], ... } ... )
Get the rows which contain the 4 largest values in column b.
>>> lf.top_k(4, by="b").collect() shape: (4, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ b ┆ 3 │ │ a ┆ 2 │ │ b ┆ 2 │ │ b ┆ 1 │ └─────┴─────┘
Get the rows which contain the 4 largest values when sorting on column b and a.
>>> lf.top_k(4, by=["b", "a"]).collect() shape: (4, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ b ┆ 3 │ │ b ┆ 2 │ │ a ┆ 2 │ │ c ┆ 1 │ └─────┴─────┘
- unique(subset: ColumnNameOrSelector | Collection[ColumnNameOrSelector] | None = None, *, keep: UniqueKeepStrategy = 'any', maintain_order: bool = False) LazyFrame[source]¶
Drop duplicate rows from this LazyFrame.
- Parameters:
- subset
Column name(s) or selector(s), to consider when identifying duplicate rows. If set to None (default), use all columns.
- keep{‘first’, ‘last’, ‘any’, ‘none’}
Which of the duplicate rows to keep.
- ‘any’: Does not give any guarantee of which row is kept. This allows more optimizations.
- ‘none’: Don’t keep duplicate rows.
- ‘first’: Keep first unique row.
- ‘last’: Keep last unique row.
- maintain_order
Keep the same order as the original DataFrame. This is more expensive to compute. Setting this to True blocks the possibility to run on the streaming engine.
- Returns:
- LazyFrame
LazyFrame with unique rows.
Warning
This method will fail if there is a column of type List in the DataFrame or subset.
Notes
If you’re coming from pandas, this is similar to pandas.DataFrame.drop_duplicates.
Examples
>>> lf = pl.LazyFrame( ... { ... "foo": [1, 2, 3, 1], ... "bar": ["a", "a", "a", "a"], ... "ham": ["b", "b", "b", "b"], ... } ... ) >>> lf.unique(maintain_order=True).collect() shape: (3, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ a ┆ b │ │ 2 ┆ a ┆ b │ │ 3 ┆ a ┆ b │ └─────┴─────┴─────┘ >>> lf.unique(subset=["bar", "ham"], maintain_order=True).collect() shape: (1, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ a ┆ b │ └─────┴─────┴─────┘ >>> lf.unique(keep="last", maintain_order=True).collect() shape: (3, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ str │ ╞═════╪═════╪═════╡ │ 2 ┆ a ┆ b │ │ 3 ┆ a ┆ b │ │ 1 ┆ a ┆ b │ └─────┴─────┴─────┘
- unnest(columns: ColumnNameOrSelector | Collection[ColumnNameOrSelector], *more_columns: ColumnNameOrSelector) LazyFrame[source]¶
Decompose struct columns into separate columns for each of their fields.
The new columns will be inserted into the DataFrame at the location of the struct column.
- Parameters:
- columns
Name of the struct column(s) that should be unnested.
- *more_columns
Additional columns to unnest, specified as positional arguments.
Examples
>>> df = pl.LazyFrame( ... { ... "before": ["foo", "bar"], ... "t_a": [1, 2], ... "t_b": ["a", "b"], ... "t_c": [True, None], ... "t_d": [[1, 2], [3]], ... "after": ["baz", "womp"], ... } ... ).select("before", pl.struct(pl.col("^t_.$")).alias("t_struct"), "after") >>> df.collect() shape: (2, 3) ┌────────┬─────────────────────┬───────┐ │ before ┆ t_struct ┆ after │ │ --- ┆ --- ┆ --- │ │ str ┆ struct[4] ┆ str │ ╞════════╪═════════════════════╪═══════╡ │ foo ┆ {1,"a",true,[1, 2]} ┆ baz │ │ bar ┆ {2,"b",null,[3]} ┆ womp │ └────────┴─────────────────────┴───────┘ >>> df.unnest("t_struct").collect() shape: (2, 6) ┌────────┬─────┬─────┬──────┬───────────┬───────┐ │ before ┆ t_a ┆ t_b ┆ t_c ┆ t_d ┆ after │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ str ┆ bool ┆ list[i64] ┆ str │ ╞════════╪═════╪═════╪══════╪═══════════╪═══════╡ │ foo ┆ 1 ┆ a ┆ true ┆ [1, 2] ┆ baz │ │ bar ┆ 2 ┆ b ┆ null ┆ [3] ┆ womp │ └────────┴─────┴─────┴──────┴───────────┴───────┘
- unpivot(on: ColumnNameOrSelector | Sequence[ColumnNameOrSelector] | None = None, *, index: ColumnNameOrSelector | Sequence[ColumnNameOrSelector] | None = None, variable_name: str | None = None, value_name: str | None = None, streamable: bool = True) LazyFrame[source]¶
Unpivot a DataFrame from wide to long format.
Optionally leaves identifiers set.
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (index) while all other columns, considered measured variables (on), are “unpivoted” to the row axis leaving just two non-identifier columns, ‘variable’ and ‘value’.
- Parameters:
- on
Column(s) or selector(s) to use as values variables; if on is empty all columns that are not in index will be used.
- index
Column(s) or selector(s) to use as identifier variables.
- variable_name
Name to give to the variable column. Defaults to “variable”
- value_name
Name to give to the value column. Defaults to “value”
- streamable
deprecated
Notes
If you’re coming from pandas, this is similar to pandas.DataFrame.melt, but with index replacing id_vars and on replacing value_vars. In other frameworks, you might know this operation as pivot_longer.
Examples
>>> lf = pl.LazyFrame( ... { ... "a": ["x", "y", "z"], ... "b": [1, 3, 5], ... "c": [2, 4, 6], ... } ... ) >>> import polars.selectors as cs >>> lf.unpivot(cs.numeric(), index="a").collect() shape: (6, 3) ┌─────┬──────────┬───────┐ │ a ┆ variable ┆ value │ │ --- ┆ --- ┆ --- │ │ str ┆ str ┆ i64 │ ╞═════╪══════════╪═══════╡ │ x ┆ b ┆ 1 │ │ y ┆ b ┆ 3 │ │ z ┆ b ┆ 5 │ │ x ┆ c ┆ 2 │ │ y ┆ c ┆ 4 │ │ z ┆ c ┆ 6 │ └─────┴──────────┴───────┘
- update(other: LazyFrame, on: str | Sequence[str] | None = None, how: Literal['left', 'inner', 'full'] = 'left', *, left_on: str | Sequence[str] | None = None, right_on: str | Sequence[str] | None = None, include_nulls: bool = False, maintain_order: MaintainOrderJoin | None = 'left') LazyFrame[source]¶
Update the values in this LazyFrame with the values in other.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- Parameters:
- other
LazyFrame that will be used to update the values
- on
Column names that will be joined on. If set to None (default), the implicit row index of each frame is used as a join key.
- how{‘left’, ‘inner’, ‘full’}
‘left’ will keep all rows from the left table; rows may be duplicated if multiple rows in the right frame match the left row’s key.
‘inner’ keeps only those rows where the key exists in both frames.
‘full’ will update existing rows where the key matches while also adding any new rows contained in the given frame.
- left_on
Join column(s) of the left DataFrame.
- right_on
Join column(s) of the right DataFrame.
- include_nulls
Overwrite values in the left frame with null values from the right frame. If set to False (default), null values in the right frame are ignored.
- maintain_order{‘none’, ‘left’, ‘right’, ‘left_right’, ‘right_left’}
Which order of rows from the inputs to preserve. See join() for details. Unlike join, this function preserves the left order by default.
Notes
This is syntactic sugar for a left/inner join that preserves the order of the left DataFrame by default, with an optional coalesce when include_nulls = False.
Examples
>>> lf = pl.LazyFrame( ... { ... "A": [1, 2, 3, 4], ... "B": [400, 500, 600, 700], ... } ... ) >>> lf.collect() shape: (4, 2) ┌─────┬─────┐ │ A ┆ B │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 400 │ │ 2 ┆ 500 │ │ 3 ┆ 600 │ │ 4 ┆ 700 │ └─────┴─────┘ >>> new_lf = pl.LazyFrame( ... { ... "B": [-66, None, -99], ... "C": [5, 3, 1], ... } ... )
Update lf values with the non-null values in new_lf, by row index:
>>> lf.update(new_lf).collect() shape: (4, 2) ┌─────┬─────┐ │ A ┆ B │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ -66 │ │ 2 ┆ 500 │ │ 3 ┆ -99 │ │ 4 ┆ 700 │ └─────┴─────┘
Update lf values with the non-null values in new_lf, by row index, but only keeping those rows that are common to both frames:
>>> lf.update(new_lf, how="inner").collect() shape: (3, 2) ┌─────┬─────┐ │ A ┆ B │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ -66 │ │ 2 ┆ 500 │ │ 3 ┆ -99 │ └─────┴─────┘
Update lf values with the non-null values in new_lf, using a full outer join strategy that defines explicit join columns in each frame:
>>> lf.update(new_lf, left_on=["A"], right_on=["C"], how="full").collect() shape: (5, 2) ┌─────┬─────┐ │ A ┆ B │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ -99 │ │ 2 ┆ 500 │ │ 3 ┆ 600 │ │ 4 ┆ 700 │ │ 5 ┆ -66 │ └─────┴─────┘
Update lf values including null values in new_lf, using a full outer join strategy that defines explicit join columns in each frame:
>>> lf.update( ... new_lf, left_on="A", right_on="C", how="full", include_nulls=True ... ).collect() shape: (5, 2) ┌─────┬──────┐ │ A ┆ B │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪══════╡ │ 1 ┆ -99 │ │ 2 ┆ 500 │ │ 3 ┆ null │ │ 4 ┆ 700 │ │ 5 ┆ -66 │ └─────┴──────┘
- var(ddof: int = 1) LazyFrame[source]¶
Aggregate the columns in the LazyFrame to their variance value.
- Parameters:
- ddof
“Delta Degrees of Freedom”: the divisor used in the calculation is N - ddof, where N represents the number of elements. By default ddof is 1.
Examples
>>> lf = pl.LazyFrame( ... { ... "a": [1, 2, 3, 4], ... "b": [1, 2, 1, 1], ... } ... ) >>> lf.var().collect() shape: (1, 2) ┌──────────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ f64 ┆ f64 │ ╞══════════╪══════╡ │ 1.666667 ┆ 0.25 │ └──────────┴──────┘ >>> lf.var(ddof=0).collect() shape: (1, 2) ┌──────┬────────┐ │ a ┆ b │ │ --- ┆ --- │ │ f64 ┆ f64 │ ╞══════╪════════╡ │ 1.25 ┆ 0.1875 │ └──────┴────────┘
- property width: int¶
Get the number of columns.
- Returns:
- int
Warning
Determining the width of a LazyFrame requires resolving its schema, which is a potentially expensive operation. Using collect_schema() is the idiomatic way to resolve the schema. This property exists only for symmetry with the DataFrame class.
See also
collect_schema
Schema.len
Examples
>>> lf = pl.LazyFrame( ... { ... "foo": [1, 2, 3], ... "bar": [4, 5, 6], ... } ... ) >>> lf.width 2
- with_columns(*exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr) LazyFrame[source]¶
Add columns to this LazyFrame.
Added columns will replace existing columns with the same name.
- Parameters:
- *exprs
Column(s) to add, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.
- **named_exprs
Additional columns to add, specified as keyword arguments. The columns will be renamed to the keyword used.
- Returns:
- LazyFrame
A new LazyFrame with the columns added.
Notes
Creating a new LazyFrame using this method does not create a new copy of existing data.
Examples
Pass an expression to add it as a new column.
>>> lf = pl.LazyFrame( ... { ... "a": [1, 2, 3, 4], ... "b": [0.5, 4, 10, 13], ... "c": [True, True, False, True], ... } ... ) >>> lf.with_columns((pl.col("a") ** 2).alias("a^2")).collect() shape: (4, 4) ┌─────┬──────┬───────┬─────┐ │ a ┆ b ┆ c ┆ a^2 │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ bool ┆ i64 │ ╞═════╪══════╪═══════╪═════╡ │ 1 ┆ 0.5 ┆ true ┆ 1 │ │ 2 ┆ 4.0 ┆ true ┆ 4 │ │ 3 ┆ 10.0 ┆ false ┆ 9 │ │ 4 ┆ 13.0 ┆ true ┆ 16 │ └─────┴──────┴───────┴─────┘
Added columns will replace existing columns with the same name.
>>> lf.with_columns(pl.col("a").cast(pl.Float64)).collect() shape: (4, 3) ┌─────┬──────┬───────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ bool │ ╞═════╪══════╪═══════╡ │ 1.0 ┆ 0.5 ┆ true │ │ 2.0 ┆ 4.0 ┆ true │ │ 3.0 ┆ 10.0 ┆ false │ │ 4.0 ┆ 13.0 ┆ true │ └─────┴──────┴───────┘
Multiple columns can be added using positional arguments.
>>> lf.with_columns( ... (pl.col("a") ** 2).alias("a^2"), ... (pl.col("b") / 2).alias("b/2"), ... (pl.col("c").not_()).alias("not c"), ... ).collect() shape: (4, 6) ┌─────┬──────┬───────┬─────┬──────┬───────┐ │ a ┆ b ┆ c ┆ a^2 ┆ b/2 ┆ not c │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ bool ┆ i64 ┆ f64 ┆ bool │ ╞═════╪══════╪═══════╪═════╪══════╪═══════╡ │ 1 ┆ 0.5 ┆ true ┆ 1 ┆ 0.25 ┆ false │ │ 2 ┆ 4.0 ┆ true ┆ 4 ┆ 2.0 ┆ false │ │ 3 ┆ 10.0 ┆ false ┆ 9 ┆ 5.0 ┆ true │ │ 4 ┆ 13.0 ┆ true ┆ 16 ┆ 6.5 ┆ false │ └─────┴──────┴───────┴─────┴──────┴───────┘
Multiple columns can also be added by passing a list of expressions.
>>> lf.with_columns( ... [ ... (pl.col("a") ** 2).alias("a^2"), ... (pl.col("b") / 2).alias("b/2"), ... (pl.col("c").not_()).alias("not c"), ... ] ... ).collect() shape: (4, 6) ┌─────┬──────┬───────┬─────┬──────┬───────┐ │ a ┆ b ┆ c ┆ a^2 ┆ b/2 ┆ not c │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ bool ┆ i64 ┆ f64 ┆ bool │ ╞═════╪══════╪═══════╪═════╪══════╪═══════╡ │ 1 ┆ 0.5 ┆ true ┆ 1 ┆ 0.25 ┆ false │ │ 2 ┆ 4.0 ┆ true ┆ 4 ┆ 2.0 ┆ false │ │ 3 ┆ 10.0 ┆ false ┆ 9 ┆ 5.0 ┆ true │ │ 4 ┆ 13.0 ┆ true ┆ 16 ┆ 6.5 ┆ false │ └─────┴──────┴───────┴─────┴──────┴───────┘
Use keyword arguments to easily name your expression inputs.
>>> lf.with_columns( ... ab=pl.col("a") * pl.col("b"), ... not_c=pl.col("c").not_(), ... ).collect() shape: (4, 5) ┌─────┬──────┬───────┬──────┬───────┐ │ a ┆ b ┆ c ┆ ab ┆ not_c │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ bool ┆ f64 ┆ bool │ ╞═════╪══════╪═══════╪══════╪═══════╡ │ 1 ┆ 0.5 ┆ true ┆ 0.5 ┆ false │ │ 2 ┆ 4.0 ┆ true ┆ 8.0 ┆ false │ │ 3 ┆ 10.0 ┆ false ┆ 30.0 ┆ true │ │ 4 ┆ 13.0 ┆ true ┆ 52.0 ┆ false │ └─────┴──────┴───────┴──────┴───────┘
- with_columns_seq(*exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr) LazyFrame[source]¶
Add columns to this LazyFrame.
Added columns will replace existing columns with the same name.
This will run all expressions sequentially instead of in parallel. Use this when the work per expression is cheap.
- Parameters:
- *exprs
Column(s) to add, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.
- **named_exprs
Additional columns to add, specified as keyword arguments. The columns will be renamed to the keyword used.
- Returns:
- LazyFrame
A new LazyFrame with the columns added.
- with_context(other: Self | list[Self]) LazyFrame[source]¶
Add an external context to the computation graph.
Deprecated since version 1.0.0: Use concat() instead, with how=’horizontal’.
This allows expressions to also access columns from DataFrames that are not part of this one.
- Parameters:
- other
Lazy DataFrame to join with.
Examples
>>> lf = pl.LazyFrame({"a": [1, 2, 3], "b": ["a", "c", None]}) >>> lf_other = pl.LazyFrame({"c": ["foo", "ham"]}) >>> lf.with_context(lf_other).select( ... pl.col("b") + pl.col("c").first() ... ).collect() shape: (3, 1) ┌──────┐ │ b │ │ --- │ │ str │ ╞══════╡ │ afoo │ │ cfoo │ │ null │ └──────┘
Fill nulls with the median from another DataFrame:
>>> train_lf = pl.LazyFrame( ... {"feature_0": [-1.0, 0, 1], "feature_1": [-1.0, 0, 1]} ... ) >>> test_lf = pl.LazyFrame( ... {"feature_0": [-1.0, None, 1], "feature_1": [-1.0, 0, 1]} ... ) >>> test_lf.with_context( ... train_lf.select(pl.all().name.suffix("_train")) ... ).select( ... pl.col("feature_0").fill_null(pl.col("feature_0_train").median()) ... ).collect() shape: (3, 1) ┌───────────┐ │ feature_0 │ │ --- │ │ f64 │ ╞═══════════╡ │ -1.0 │ │ 0.0 │ │ 1.0 │ └───────────┘
- with_row_count(name: str = 'row_nr', offset: int = 0) LazyFrame[source]¶
Add a column at index 0 that counts the rows.
Deprecated since version 0.20.4: Use the with_row_index() method instead. Note that the default column name has changed from ‘row_nr’ to ‘index’.
- Parameters:
- name
Name of the column to add.
- offset
Start the row count at this offset.
Warning
This can have a negative effect on query performance. This may, for instance, block predicate pushdown optimization.
Examples
>>> lf = pl.LazyFrame( ... { ... "a": [1, 3, 5], ... "b": [2, 4, 6], ... } ... ) >>> lf.with_row_count().collect() shape: (3, 3) ┌────────┬─────┬─────┐ │ row_nr ┆ a ┆ b │ │ --- ┆ --- ┆ --- │ │ u32 ┆ i64 ┆ i64 │ ╞════════╪═════╪═════╡ │ 0 ┆ 1 ┆ 2 │ │ 1 ┆ 3 ┆ 4 │ │ 2 ┆ 5 ┆ 6 │ └────────┴─────┴─────┘
- with_row_index(name: str = 'index', offset: int = 0) LazyFrame[source]¶
Add a row index as the first column in the LazyFrame.
- Parameters:
- name
Name of the index column.
- offset
Start the index at this offset. Cannot be negative.
Warning
Using this function can have a negative effect on query performance. This may, for instance, block predicate pushdown optimization.
Notes
The resulting column does not have any special properties. It is a regular column of type UInt32 (or UInt64 in polars-u64-idx).
Examples
>>> lf = pl.LazyFrame( ... { ... "a": [1, 3, 5], ... "b": [2, 4, 6], ... } ... ) >>> lf.with_row_index().collect() shape: (3, 3) ┌───────┬─────┬─────┐ │ index ┆ a ┆ b │ │ --- ┆ --- ┆ --- │ │ u32 ┆ i64 ┆ i64 │ ╞═══════╪═════╪═════╡ │ 0 ┆ 1 ┆ 2 │ │ 1 ┆ 3 ┆ 4 │ │ 2 ┆ 5 ┆ 6 │ └───────┴─────┴─────┘ >>> lf.with_row_index("id", offset=1000).collect() shape: (3, 3) ┌──────┬─────┬─────┐ │ id ┆ a ┆ b │ │ --- ┆ --- ┆ --- │ │ u32 ┆ i64 ┆ i64 │ ╞══════╪═════╪═════╡ │ 1000 ┆ 1 ┆ 2 │ │ 1001 ┆ 3 ┆ 4 │ │ 1002 ┆ 5 ┆ 6 │ └──────┴─────┴─────┘
An index column can also be created using the expressions int_range() and len().
>>> lf.select( ... pl.int_range(pl.len(), dtype=pl.UInt32).alias("index"), ... pl.all(), ... ).collect() shape: (3, 3) ┌───────┬─────┬─────┐ │ index ┆ a ┆ b │ │ --- ┆ --- ┆ --- │ │ u32 ┆ i64 ┆ i64 │ ╞═══════╪═════╪═════╡ │ 0 ┆ 1 ┆ 2 │ │ 1 ┆ 3 ┆ 4 │ │ 2 ┆ 5 ┆ 6 │ └───────┴─────┴─────┘
- class dataframely.List(inner: Column, *, nullable: bool | None = None, primary_key: bool = False, check: Callable[[Expr], Expr] | Sequence[Callable[[Expr], Expr]] | Mapping[str, Callable[[Expr], Expr]] | None = None, alias: str | None = None, min_length: int | None = None, max_length: int | None = None, metadata: dict[str, Any] | None = None)[source]¶
Bases: Column
A list column.
- Attributes:
col: Obtain a Polars column expression for the column.
dtype: The polars dtype equivalent of this column definition’s data type.
name: Get the name of the column in a schema.
pyarrow_dtype: The pyarrow dtype equivalent of this column data type.
Methods
as_dict(expr): Turn the column definition into a dictionary.
from_dict(data): Read the column definition from a dictionary.
matches(other, expr): Check whether this column semantically matches another column.
pyarrow_field(name): Obtain the pyarrow field of this column definition.
sample(generator[, n]): Sample random elements adhering to the constraints of this column.
sqlalchemy_column(name, dialect): Obtain the SQL column specification of this column definition.
sqlalchemy_dtype(dialect): The sqlalchemy dtype equivalent of this column data type.
validate_dtype(dtype): Validate if the polars data type satisfies the column definition.
validation_rules(expr): A set of rules evaluating whether a data frame column satisfies the column's constraints.
- as_dict(expr: Expr) dict[str, Any][source]¶
Turn the column definition into a dictionary.
If the column definition references other column definitions, they will be turned into dictionaries recursively.
- Args:
- expr: An expression referencing the column to turn into a dictionary. This
is required to properly encode custom checks.
- Returns:
The column definition as dictionary.
- Note:
This method stores custom checks as expressions rather than callables to allow for serialization.
- Note:
Do NOT use the returned object to evaluate semantic equality of two columns. It may yield different results than matches().
- Attention:
This method is only intended for internal use.
- property col: Expr¶
Obtain a Polars column expression for the column.
- property dtype: DataType¶
The polars dtype equivalent of this column definition’s data type.
This is primarily used for creating empty data frames with an appropriate schema. Thus, it should describe the default dtype equivalent if this data type encompasses multiple underlying data types.
- classmethod from_dict(data: dict[str, Any]) Self[source]¶
Read the column definition from a dictionary.
- Args:
data: The dictionary that was created via as_dict().
- Returns:
The column definition read from the dictionary.
- Attention:
This method is only intended for internal use.
- matches(other: Column, expr: Expr) bool[source]¶
Check whether this column semantically matches another column.
- Args:
other: The column to compare with.
expr: An expression referencing the column. This is required to properly evaluate the equivalence of custom checks.
- Returns:
Whether the columns are semantically equal.
- property name: str¶
Get the name of the column in a schema.
- property pyarrow_dtype: pa.DataType¶
The pyarrow dtype equivalent of this column data type.
- pyarrow_field(name: str) pa.Field[source]¶
Obtain the pyarrow field of this column definition.
- Args:
name: The name of the column.
- Returns:
The pyarrow field definition.
- sample(generator: Generator, n: int = 1) Series[source]¶
Sample random elements adhering to the constraints of this column.
- Args:
generator: The generator to use for sampling elements.
n: The number of elements to sample.
- Returns:
A series with the predefined number of elements. All elements are guaranteed to adhere to the column’s constraints.
- Raises:
- ValueError: If this column has a custom check. In this case, random values
cannot be guaranteed to adhere to the column’s constraints while providing any guarantees on the computational complexity.
- sqlalchemy_column(name: str, dialect: sa.Dialect) sa.Column[source]¶
Obtain the SQL column specification of this column definition.
- Args:
name: The name of the column.
dialect: The SQL dialect for which to generate the column specification.
- Returns:
The column as specified in sqlalchemy.
- sqlalchemy_dtype(dialect: sa.Dialect) sa_TypeEngine[source]¶
The sqlalchemy dtype equivalent of this column data type.
- validate_dtype(dtype: DataType | DataTypeClass) bool[source]¶
Validate if the polars data type satisfies the column definition.
- Args:
dtype: The dtype to validate.
- Returns:
Whether the dtype is valid.
- validation_rules(expr: Expr) dict[str, Expr][source]¶
A set of rules evaluating whether a data frame column satisfies the column’s constraints.
- Args:
- expr: An expression referencing the column of the data frame, i.e. an expression created by calling polars.col().
- Returns:
A mapping from validation rule names to expressions that provide exactly one boolean value per column item indicating whether validation with respect to the rule is successful. A value of False indicates invalid data, i.e. unsuccessful validation.
- class dataframely.Object(*, nullable: bool = True, primary_key: bool = False, check: Callable[[Expr], Expr] | Sequence[Callable[[Expr], Expr]] | Mapping[str, Callable[[Expr], Expr]] | None = None, alias: str | None = None, metadata: dict[str, Any] | None = None)[source]¶
Bases: Column
A Python Object column.
- Attributes:
col: Obtain a Polars column expression for the column.
dtype: The polars dtype equivalent of this column definition’s data type.
name: Get the name of the column in a schema.
pyarrow_dtype: The pyarrow dtype equivalent of this column data type.
Methods
as_dict(expr): Turn the column definition into a dictionary.
from_dict(data): Read the column definition from a dictionary.
matches(other, expr): Check whether this column semantically matches another column.
pyarrow_field(name): Obtain the pyarrow field of this column definition.
sample(generator[, n]): Sample random elements adhering to the constraints of this column.
sqlalchemy_column(name, dialect): Obtain the SQL column specification of this column definition.
sqlalchemy_dtype(dialect): The sqlalchemy dtype equivalent of this column data type.
validate_dtype(dtype): Validate if the polars data type satisfies the column definition.
validation_rules(expr): A set of rules evaluating whether a data frame column satisfies the column's constraints.
- as_dict(expr: Expr) dict[str, Any][source]¶
Turn the column definition into a dictionary.
If the column definition references other column definitions, they will be turned into dictionaries recursively.
- Args:
- expr: An expression referencing the column to turn into a dictionary. This
is required to properly encode custom checks.
- Returns:
The column definition as dictionary.
- Note:
This method stores custom checks as expressions rather than callables to allow for serialization.
- Note:
Do NOT use the returned object to evaluate semantic equality of two columns. It may yield different results than matches().
- Attention:
This method is only intended for internal use.
- property col: Expr¶
Obtain a Polars column expression for the column.
- property dtype: DataType¶
The polars dtype equivalent of this column definition’s data type.
This is primarily used for creating empty data frames with an appropriate schema. Thus, it should describe the default dtype equivalent if this data type encompasses multiple underlying data types.
- classmethod from_dict(data: dict[str, Any]) Self[source]¶
Read the column definition from a dictionary.
- Args:
data: The dictionary that was created via as_dict().
- Returns:
The column definition read from the dictionary.
- Attention:
This method is only intended for internal use.
- matches(other: Column, expr: Expr) bool[source]¶
Check whether this column semantically matches another column.
- Args:
other: The column to compare with.
expr: An expression referencing the column. This is required to properly evaluate the equivalence of custom checks.
- Returns:
Whether the columns are semantically equal.
- property name: str¶
Get the name of the column in a schema.
- property pyarrow_dtype: pa.DataType¶
The pyarrow dtype equivalent of this column data type.
- pyarrow_field(name: str) pa.Field[source]¶
Obtain the pyarrow field of this column definition.
- Args:
name: The name of the column.
- Returns:
The pyarrow field definition.
- sample(generator: Generator, n: int = 1) Series[source]¶
Sample random elements adhering to the constraints of this column.
- Args:
generator: The generator to use for sampling elements.
n: The number of elements to sample.
- Returns:
A series with the predefined number of elements. All elements are guaranteed to adhere to the column’s constraints.
- Raises:
- ValueError: If this column has a custom check. In this case, random values
cannot be guaranteed to adhere to the column’s constraints while providing any guarantees on the computational complexity.
- sqlalchemy_column(name: str, dialect: sa.Dialect) sa.Column[source]¶
Obtain the SQL column specification of this column definition.
- Args:
name: The name of the column.
dialect: The SQL dialect for which to generate the column specification.
- Returns:
The column as specified in sqlalchemy.
- sqlalchemy_dtype(dialect: sa.Dialect) sa_TypeEngine[source]¶
The sqlalchemy dtype equivalent of this column data type.
- validate_dtype(dtype: DataType | DataTypeClass) bool[source]¶
Validate if the polars data type satisfies the column definition.
- Args:
dtype: The dtype to validate.
- Returns:
Whether the dtype is valid.
- validation_rules(expr: Expr) dict[str, Expr][source]¶
A set of rules evaluating whether a data frame column satisfies the column’s constraints.
- Args:
- expr: An expression referencing the column of the data frame, i.e. an expression created by calling polars.col().
- Returns:
A mapping from validation rule names to expressions that provide exactly one boolean value per column item indicating whether validation with respect to the rule is successful. A value of False indicates invalid data, i.e. unsuccessful validation.
- class dataframely.Schema[source]¶
Bases: BaseSchema, ABC
Base class for all custom data frame schema definitions.
A custom schema should only define its columns via simple assignment:
class MySchema(Schema):
    a = dataframely.Int64()
    b = dataframely.String()
All definitions using non-datatype classes are ignored.
Schemas can also be subclassed (arbitrarily deeply): in this case, the columns defined in the subclass are simply appended to the columns in the superclass(es).
Methods
cast(): Cast a data frame to match the schema.
column_names(): The column names of this schema.
columns(): The column definitions of this schema.
create_empty(*[, lazy]): Create an empty data or lazy frame from this schema.
create_empty_if_none(df, *[, lazy]): Impute None input with an empty, schema-compliant lazy or eager data frame or return the input as lazy or eager frame.
filter(df, /, *[, cast]): Filter the data frame by the rules of this schema.
is_valid(df, /, *[, cast]): Utility method to check whether validate() raises an exception.
matches(other): Check whether this schema semantically matches another schema.
polars_schema(): Obtain the polars schema for this schema.
primary_keys(): The primary key columns in this schema (possibly empty).
pyarrow_schema(): Obtain the pyarrow schema for this schema.
read_delta(source, *[, validation]): Read a Delta Lake table into a typed data frame with this schema.
read_parquet(source, *[, validation]): Read a parquet file into a typed data frame with this schema.
sample([num_rows, overrides, generator]): Create a random data frame with a predefined number of rows.
scan_delta(source, *[, validation]): Lazily read a Delta Lake table into a typed data frame with this schema.
scan_parquet(source, *[, validation]): Lazily read a parquet file into a typed data frame with this schema.
serialize(): Serialize this schema to a JSON string.
sink_parquet(lf, /, file, **kwargs): Stream a typed lazy frame with this schema to a parquet file.
sql_schema(dialect): Obtain the SQL schema for a particular dialect for this schema.
validate(df, /, *[, cast]): Validate that a data frame satisfies the schema.
write_delta(df, /, target, **kwargs): Write a typed data frame with this schema to a Delta Lake table.
write_parquet(df, /, file, **kwargs): Write a typed data frame with this schema to a parquet file.
- classmethod cast(df: DataFrame | LazyFrame, /) DataFrame[Self] | LazyFrame[Self][source]¶
Cast a data frame to match the schema.
This method removes superfluous columns and casts all schema columns to the correct dtypes. However, it does not introspect the data frame contents.
Hence, this method should be used with care and validate() should generally be preferred. It is advised to only use this method if df is surely known to adhere to the schema.
- Returns:
The input data frame, wrapped in a generic version of the input’s data frame type to reflect schema adherence.
- Note:
If you only require a generic data frame for the type checker, consider using typing.cast() instead of this method.
- Attention:
For lazy frames, casting is not performed eagerly. This prevents collecting the lazy frame’s schema but also means that a call to collect() further down the line might fail because of the cast and/or missing columns.
- classmethod create_empty(*, lazy: bool = False) DataFrame[Self] | LazyFrame[Self][source]¶
Create an empty data or lazy frame from this schema.
- Args:
- lazy: Whether to create a lazy data frame. If True, returns a lazy frame with this Schema. Otherwise, returns an eager frame.
- Returns:
An instance of polars.DataFrame or polars.LazyFrame with this schema’s defined columns and their data types.
- classmethod create_empty_if_none(df: DataFrame[Self] | LazyFrame[Self] | None, *, lazy: bool = False) DataFrame[Self] | LazyFrame[Self][source]¶
Impute None input with an empty, schema-compliant lazy or eager data frame or return the input as lazy or eager frame.
- Args:
- df: The data frame to check for None. If it is not None, it is returned as lazy or eager frame. Otherwise, a schema-compliant data or lazy frame with no rows is returned.
- lazy: Whether to return a lazy data frame. If True, returns a lazy frame with this Schema. Otherwise, returns an eager frame.
- Returns:
The given data frame
dfas lazy or eager frame, if it is notNone. An instance ofpolars.DataFrameorpolars.LazyFramewith this schema’s defined columns and their data types, but no rows, otherwise.
- classmethod filter(df: DataFrame | LazyFrame, /, *, cast: bool = False) -> tuple[DataFrame[Self], FailureInfo[Self]]
Filter the data frame by the rules of this schema.
This method can be thought of as a “soft alternative” to validate(). While validate() raises an exception when a row does not adhere to the rules defined in the schema, this method simply filters out these rows and succeeds.
- Args:
df: The data frame to filter for valid rows. The data frame is collected within this method, regardless of whether a DataFrame or LazyFrame is passed.
cast: Whether columns with a wrong data type in the input data frame are cast to the schema’s defined data type if possible. Rows for which the cast fails for any column are filtered out.
- Returns:
A tuple of the validated rows in the input data frame (potentially empty) and a simple dataclass carrying information about the rows of the data frame which could not be validated successfully. Just like in polars’ native filter(), the order of rows in the returned data frame is maintained.
- Raises:
ValidationError: If the columns of the input data frame are invalid. This happens only if the data frame is missing a column defined in the schema or a column has an invalid dtype while cast is set to False.
- Note:
This method preserves the ordering of the input data frame.
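The contrast between filter() and validate() can be illustrated without polars. The following is a minimal, library-free sketch: the rule names, partition_rows(), strict_validate(), and the per-rule count dictionary are hypothetical stand-ins for illustration only, not dataframely's API or its FailureInfo implementation.

```python
from typing import Any, Callable

Row = dict[str, Any]
Rule = Callable[[Row], bool]

# Hypothetical rules: each returns True when the row is valid.
RULES: dict[str, Rule] = {
    "id|non_null": lambda row: row["id"] is not None,
    "age|min": lambda row: row["age"] is not None and row["age"] >= 0,
}


def partition_rows(rows: list[Row]) -> tuple[list[Row], dict[str, int]]:
    """Soft variant: keep valid rows in input order, count failures per rule."""
    valid: list[Row] = []
    failure_counts: dict[str, int] = {}
    for row in rows:
        failed = [name for name, rule in RULES.items() if not rule(row)]
        if failed:
            for name in failed:
                failure_counts[name] = failure_counts.get(name, 0) + 1
        else:
            valid.append(row)
    return valid, failure_counts


def strict_validate(rows: list[Row]) -> list[Row]:
    """Hard variant: raise as soon as any row violates any rule."""
    valid, failure_counts = partition_rows(rows)
    if failure_counts:
        raise ValueError(f"validation failed: {failure_counts}")
    return valid


rows = [
    {"id": 1, "age": 30},
    {"id": None, "age": 41},
    {"id": 3, "age": -5},
    {"id": 4, "age": 18},
]
valid_rows, counts = partition_rows(rows)
```

In this sketch the soft variant returns the two valid rows in their original order together with failure counts, while the hard variant raises for the same input.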
- classmethod is_valid(df: DataFrame | LazyFrame, /, *, cast: bool = False) -> bool
Utility method to check whether validate() raises an exception.
- Args:
df: The data frame to check for validity.
allow_extra_columns: Whether to allow the data frame to contain columns that are not defined in the schema.
cast: Whether columns with a wrong data type in the input data frame are cast to the schema’s defined data type before running validation. If set to False, a wrong data type will result in a return value of False.
- Returns:
Whether the provided data frame can be validated with this schema.
- classmethod matches(other: type[Schema]) -> bool
Check whether this schema semantically matches another schema.
This method checks whether the schemas have the same columns (with the same data types and constraints) as well as the same rules.
- Args:
other: The schema to compare with.
- Returns:
Whether the schemas are semantically equal.
- classmethod polars_schema() -> Schema
Obtain the polars schema for this schema.
- Returns:
A polars schema that mirrors the schema defined by this class.
- classmethod primary_keys() -> list[str]
The primary key columns in this schema (possibly empty).
- classmethod pyarrow_schema() -> pa.Schema
Obtain the pyarrow schema for this schema.
- Returns:
A pyarrow schema that mirrors the schema defined by this class.
- classmethod read_delta(source: str | Path | deltalake.DeltaTable, *, validation: Validation = 'warn', **kwargs: Any) -> DataFrame[Self]
Read a Delta Lake table into a typed data frame with this schema.
Compared to polars.read_delta(), this method checks the table’s metadata and runs validation if necessary to ensure that the data matches this schema.
- Args:
source: Path or DeltaTable object from which to read the data.
validation: The strategy for running validation when reading the data:
  - "allow": The method tries to read the parquet file's metadata. If the stored schema matches this schema, the data frame is read without validation. If the stored schema mismatches this schema or no schema information can be found in the metadata, this method automatically runs validate() with cast=True.
  - "warn": The method behaves similarly to "allow". However, it prints a warning if validation is necessary.
  - "forbid": The method never runs validation automatically and only returns if the schema stored in the parquet file’s metadata matches this schema.
  - "skip": The method never runs validation and simply reads the parquet file, entrusting the user that the schema is valid. _Use this option carefully and consider replacing it with polars.read_delta() to convey the purpose better_.
kwargs: Additional keyword arguments passed directly to polars.read_delta().
- Returns:
The data frame with this schema.
- Raises:
ValidationRequiredError: If no schema information can be read from the source and validation is set to "forbid".
- Attention:
Schema metadata is stored as custom commit metadata. Only the schema information from the last commit is used, so any table modifications that are not made through dataframely will result in losing the metadata.
Be aware that appending to an existing table via mode="append" may result in violation of group constraints that dataframely cannot catch without re-validating. Only use appends if you are certain that they do not break your schema.
This method suffers from the same limitations as serialize().
- classmethod read_parquet(source: str | Path | IO[bytes] | bytes | list[str] | list[Path] | list[IO[bytes]] | list[bytes], *, validation: Literal['allow', 'forbid', 'warn', 'skip'] = 'warn', **kwargs: Any) -> DataFrame[Self]
Read a parquet file into a typed data frame with this schema.
Compared to polars.read_parquet(), this method checks the parquet file’s metadata and runs validation if necessary to ensure that the data matches this schema.
- Args:
source: Path, directory, or file-like object from which to read the data.
validation: The strategy for running validation when reading the data:
  - "allow": The method tries to read the parquet file's metadata. If the stored schema matches this schema, the data frame is read without validation. If the stored schema mismatches this schema or no schema information can be found in the metadata, this method automatically runs validate() with cast=True.
  - "warn": The method behaves similarly to "allow". However, it prints a warning if validation is necessary.
  - "forbid": The method never runs validation automatically and only returns if the schema stored in the parquet file’s metadata matches this schema.
  - "skip": The method never runs validation and simply reads the parquet file, entrusting the user that the schema is valid. _Use this option carefully and consider replacing it with polars.read_parquet() to convey the purpose better_.
kwargs: Additional keyword arguments passed directly to polars.read_parquet().
- Returns:
The data frame with this schema.
- Raises:
ValidationRequiredError: If no schema information can be read from the source and validation is set to "forbid".
- Attention:
Be aware that this method suffers from the same limitations as serialize().
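The four validation strategies amount to a small decision procedure. The sketch below is a simplified, library-free illustration of that logic and not dataframely's implementation: needs_validation() is a hypothetical helper, schemas are compared as plain strings rather than semantically, and RuntimeError stands in for ValidationRequiredError.

```python
import warnings
from typing import Literal, Optional

Validation = Literal["allow", "forbid", "warn", "skip"]


def needs_validation(
    stored_schema: Optional[str], expected_schema: str, validation: Validation
) -> bool:
    """Decide whether data must be validated after reading.

    `stored_schema` is the serialized schema found in the file metadata,
    or None if the file carries no schema information.
    """
    if validation == "skip":
        # Trust the caller: never validate.
        return False
    if stored_schema == expected_schema:
        # The metadata indicates the data was written with a matching schema.
        return False
    if validation == "forbid":
        # Stand-in for ValidationRequiredError in this sketch.
        raise RuntimeError("validation required but forbidden")
    if validation == "warn":
        warnings.warn("schema metadata missing or mismatching; validating")
    # "allow" and "warn" both fall back to running validation.
    return True
```

Under these assumptions, "allow" and "warn" differ only in whether a warning is emitted before validation runs, and "forbid" never validates but refuses to return unverified data.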
- classmethod sample(num_rows: int | None = None, *, overrides: Mapping[str, Iterable[Any]] | Sequence[Mapping[str, Any]] | None = None, generator: Generator | None = None) -> DataFrame[Self]
Create a random data frame with a predefined number of rows.
Generally, this method should only be used for testing. If you want to generate _realistic_ test data, you will need to implement your own sampling logic (by making use of the Generator class).
To allow for sampling random data frames in the presence of custom rules and primary key constraints, this method performs fuzzy sampling: it samples in a loop until it finds a data frame of length num_rows that adheres to the schema. The maximum number of sampling rounds is configured via max_sampling_iterations in the Config class. When this setting is fixed to 1, sampling is only reliable for schemas without custom rules and without primary key constraints.
- Args:
num_rows: The (optional) number of rows to sample for creating the random data frame. Must be provided (only) if no overrides are provided. If this is None, the number of rows in the data frame is determined by the length of the values in overrides.
overrides: Fixed values for a subset of the columns of the sampled data frame. Just like when initializing a polars.DataFrame, overrides may either be provided as “column-” or “row-layout”, i.e. via a mapping or a list of mappings, respectively. The number of rows in the result data frame is equal to the length of the values in overrides. If both overrides and num_rows are provided, the length of the values in overrides must be equal to num_rows. The order of the items is guaranteed to match the ordering in the returned data frame. When providing values for a column, no sampling is performed for that column.
generator: The (seeded) generator to use for sampling data. If None, a generator with random seed is automatically created.
- Returns:
A data frame valid under the current schema with a number of rows that matches the length of the values in overrides or num_rows.
- Raises:
ValueError: If num_rows is not equal to the length of the values in overrides.
ValueError: If no valid data frame can be found in the configured maximum number of iterations.
- Attention:
Be aware that, due to sampling in a loop, the runtime of this method can be significant for complex schemas. Consider passing a seeded generator and evaluate whether the runtime impact in the tests is bearable. Alternatively, it can be beneficial to provide custom column overrides for columns associated with complex validation rules.
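The idea of sampling in a bounded loop can be sketched with the standard library alone. The example below is a hypothetical illustration, not dataframely's algorithm: sample_unique_keys() is an invented helper, uniqueness of integer keys stands in for a primary key constraint, and keeping already-valid values between rounds is an assumption of this sketch.

```python
import random


def sample_unique_keys(
    num_rows: int, key_space: int, max_iterations: int, seed: int = 0
) -> list[int]:
    """Draw `num_rows` distinct integer keys from range(key_space)."""
    rng = random.Random(seed)
    keys: list[int] = []
    seen: set[int] = set()
    for _ in range(max_iterations):
        # Only draw as many candidates as are still missing.
        for _ in range(num_rows - len(keys)):
            candidate = rng.randrange(key_space)
            # Uniqueness stands in for a primary key rule.
            if candidate not in seen:
                seen.add(candidate)
                keys.append(candidate)
        if len(keys) == num_rows:
            return keys
    raise ValueError(f"no valid sample found within {max_iterations} iterations")


keys = sample_unique_keys(num_rows=5, key_space=1000, max_iterations=100)
```

With a single iteration, any collision in the first round makes the call fail, which mirrors why a limit of 1 is only dependable when no constraint can reject a freshly sampled row.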
- classmethod scan_delta(source: str | Path | deltalake.DeltaTable, *, validation: Validation = 'warn', **kwargs: Any) -> LazyFrame[Self]
Lazily read a Delta Lake table into a typed data frame with this schema.
Compared to polars.scan_delta(), this method checks the table’s metadata and runs validation if necessary to ensure that the data matches this schema.
- Args:
source: Path or DeltaTable object from which to read the data.
validation: The strategy for running validation when reading the data:
  - "allow": The method tries to read the parquet file's metadata. If the stored schema matches this schema, the data frame is read without validation. If the stored schema mismatches this schema or no schema information can be found in the metadata, this method automatically runs validate() with cast=True.
  - "warn": The method behaves similarly to "allow". However, it prints a warning if validation is necessary.
  - "forbid": The method never runs validation automatically and only returns if the schema stored in the parquet file’s metadata matches this schema.
  - "skip": The method never runs validation and simply reads the parquet file, entrusting the user that the schema is valid. _Use this option carefully and consider replacing it with polars.scan_delta() to convey the purpose better_.
kwargs: Additional keyword arguments passed directly to polars.scan_delta().
- Returns:
The lazy data frame with this schema.
- Raises:
ValidationRequiredError: If no schema information can be read from the source and validation is set to "forbid".
- Attention:
Schema metadata is stored as custom commit metadata. Only the schema information from the last commit is used, so any table modifications that are not made through dataframely will result in losing the metadata.
Be aware that appending to an existing table via mode="append" may result in violation of group constraints that dataframely cannot catch without re-validating. Only use appends if you are certain that they do not break your schema.
This method suffers from the same limitations as serialize().
- classmethod scan_parquet(source: str | Path | IO[bytes] | bytes | list[str] | list[Path] | list[IO[bytes]] | list[bytes], *, validation: Literal['allow', 'forbid', 'warn', 'skip'] = 'warn', **kwargs: Any) -> LazyFrame[Self]
Lazily read a parquet file into a typed data frame with this schema.
Compared to polars.scan_parquet(), this method checks the parquet file’s metadata and runs validation if necessary to ensure that the data matches this schema.
- Args:
source: Path, directory, or file-like object from which to read the data.
validation: The strategy for running validation when reading the data:
  - "allow": The method tries to read the parquet file's metadata. If the stored schema matches this schema, the data frame is read without validation. If the stored schema mismatches this schema or no schema information can be found in the metadata, this method automatically runs validate() with cast=True.
  - "warn": The method behaves similarly to "allow". However, it prints a warning if validation is necessary.
  - "forbid": The method never runs validation automatically and only returns if the schema stored in the parquet file’s metadata matches this schema.
  - "skip": The method never runs validation and simply reads the parquet file, entrusting the user that the schema is valid. _Use this option carefully and consider replacing it with polars.scan_parquet() to convey the purpose better_.
kwargs: Additional keyword arguments passed directly to polars.scan_parquet().
- Returns:
The lazy data frame with this schema.
- Raises:
ValidationRequiredError: If no schema information can be read from the source and validation is set to "forbid".
- Note:
Due to current limitations in dataframely, this method actually reads the parquet file into memory if validation is "warn" or "allow" and validation is required.
- Attention:
Be aware that this method suffers from the same limitations as serialize().
- classmethod serialize() -> str
Serialize this schema to a JSON string.
- Returns:
The serialized schema.
- Note:
Serialization within dataframely itself will remain backwards-compatible at least within a major version. Until further notice, it will also be backwards-compatible across major versions.
- Attention:
Serialization of polars expressions is not guaranteed to be stable across versions of polars. This affects schemas that define custom rules or columns with custom checks: a schema serialized with one version of polars may not be deserializable with another version of polars.
- Attention:
This functionality is considered unstable. It may be changed at any time without it being considered a breaking change.
- Raises:
TypeError: If any column contains metadata that is not JSON-serializable.
ValueError: If any column is not a “native” dataframely column type but a custom subclass.
- classmethod sink_parquet(lf: LazyFrame[Self], /, file: str | Path | IO[bytes] | PartitioningScheme, **kwargs: Any) -> None
Stream a typed lazy frame with this schema to a parquet file.
This method automatically adds a serialization of this schema to the parquet file as metadata. This metadata can be leveraged by read_parquet() and scan_parquet() for more efficient reading, or by external tools.
- Args:
lf: The lazy frame to write to the parquet file.
file: The file path, writable file-like object, or partitioning scheme to which to write the parquet file.
kwargs: Additional keyword arguments passed directly to polars.write_parquet(). metadata may only be provided if it is a dictionary.
- Attention:
Be aware that this method suffers from the same limitations as serialize().
- classmethod sql_schema(dialect: sa.Dialect) -> list[sa.Column]
Obtain the SQL schema for a particular dialect for this schema.
- Args:
dialect: The dialect for which to obtain the SQL schema. Note that column datatypes may differ across dialects.
- Returns:
A list of sqlalchemy columns that can be used to create a table with the schema as defined by this class.
- classmethod validate(df: DataFrame | LazyFrame, /, *, cast: bool = False) -> DataFrame[Self]
Validate that a data frame satisfies the schema.
- Args:
df: The data frame to validate.
cast: Whether columns with a wrong data type in the input data frame are cast to the schema’s defined data type if possible.
- Returns:
The (collected) input data frame, wrapped in a generic version of the input’s data frame type to reflect schema adherence. The data frame is guaranteed to maintain its order.
- Raises:
ValidationError: If the input data frame does not satisfy the schema definition.
- Note:
This method _always_ collects the input data frame in order to raise potential validation errors.
- classmethod write_delta(df: DataFrame[Self], /, target: str | Path | deltalake.DeltaTable, **kwargs: Any) -> None
Write a typed data frame with this schema to a Delta Lake table.
This method automatically adds a serialization of this schema to the Delta Lake table as metadata. The metadata can be leveraged by read_delta() and scan_delta() for efficient reading or by external tools.
- Args:
df: The data frame to write to the Delta Lake table.
target: The path or DeltaTable object to which to write the data.
kwargs: Additional keyword arguments passed directly to polars.write_delta().
- Attention:
This method suffers from the same limitations as serialize().
Schema metadata is stored as custom commit metadata. Only the schema information from the last commit is used, so any table modifications that are not made through dataframely will result in losing the metadata.
Be aware that appending to an existing table via mode="append" may result in violation of group constraints that dataframely cannot catch without re-validating. Only use appends if you are certain that they do not break your schema.
- classmethod write_parquet(df: DataFrame[Self], /, file: str | Path | IO[bytes], **kwargs: Any) -> None
Write a typed data frame with this schema to a parquet file.
This method automatically adds a serialization of this schema to the parquet file as metadata. This metadata can be leveraged by read_parquet() and scan_parquet() for more efficient reading, or by external tools.
- Args:
df: The data frame to write to the parquet file.
file: The file path or writable file-like object to which to write the parquet file. This should be a path to a directory if writing a partitioned dataset.
kwargs: Additional keyword arguments passed directly to polars.write_parquet(). metadata may only be provided if it is a dictionary.
- Attention:
Be aware that this method suffers from the same limitations as serialize().
- class dataframely.String(*, nullable: bool | None = None, primary_key: bool = False, min_length: int | None = None, max_length: int | None = None, regex: str | None = None, check: Callable[[Expr], Expr] | Sequence[Callable[[Expr], Expr]] | Mapping[str, Callable[[Expr], Expr]] | None = None, alias: str | None = None, metadata: dict[str, Any] | None = None)
Bases: Column
A column of strings.
- Attributes:
col: Obtain a Polars column expression for the column.
dtype: The polars dtype equivalent of this column definition’s data type.
name: Get the name of the column in a schema.
pyarrow_dtype: The pyarrow dtype equivalent of this column data type.
Methods
as_dict(expr): Turn the column definition into a dictionary.
from_dict(data): Read the column definition from a dictionary.
matches(other, expr): Check whether this column semantically matches another column.
pyarrow_field(name): Obtain the pyarrow field of this column definition.
sample(generator[, n]): Sample random elements adhering to the constraints of this column.
sqlalchemy_column(name, dialect): Obtain the SQL column specification of this column definition.
sqlalchemy_dtype(dialect): The sqlalchemy dtype equivalent of this column data type.
validate_dtype(dtype): Validate if the polars data type satisfies the column definition.
validation_rules(expr): A set of rules evaluating whether a data frame column satisfies the column's constraints.
- as_dict(expr: Expr) -> dict[str, Any]
Turn the column definition into a dictionary.
If the column definition references other column definitions, they will be turned into dictionaries recursively.
- Args:
expr: An expression referencing the column to turn into a dictionary. This is required to properly encode custom checks.
- Returns:
The column definition as dictionary.
- Note:
This method stores custom checks as expressions rather than callables to allow for serialization.
- Note:
Do NOT use the returned object to evaluate semantic equality of two columns. It may yield different results than matches().
- Attention:
This method is only intended for internal use.
- property col: Expr
Obtain a Polars column expression for the column.
- property dtype: DataType
The polars dtype equivalent of this column definition’s data type.
This is primarily used for creating empty data frames with an appropriate schema. Thus, it should describe the default dtype equivalent if this data type encompasses multiple underlying data types.
- classmethod from_dict(data: dict[str, Any]) -> Self
Read the column definition from a dictionary.
- Args:
data: The dictionary that was created via as_dict().
- Returns:
The column definition read from the dictionary.
- Attention:
This method is only intended for internal use.
- matches(other: Column, expr: Expr) -> bool
Check whether this column semantically matches another column.
- Args:
other: The column to compare with.
expr: An expression referencing the column. This is required to properly evaluate the equivalence of custom checks.
- Returns:
Whether the columns are semantically equal.
- property name: str
Get the name of the column in a schema.
- property pyarrow_dtype: pa.DataType
The pyarrow dtype equivalent of this column data type.
- pyarrow_field(name: str) -> pa.Field
Obtain the pyarrow field of this column definition.
- Args:
name: The name of the column.
- Returns:
The pyarrow field definition.
- sample(generator: Generator, n: int = 1) -> Series
Sample random elements adhering to the constraints of this column.
- Args:
generator: The generator to use for sampling elements.
n: The number of elements to sample.
- Returns:
A series with the predefined number of elements. All elements are guaranteed to adhere to the column’s constraints.
- Raises:
ValueError: If this column has a custom check. In this case, random values cannot be guaranteed to adhere to the column’s constraints while providing any guarantees on the computational complexity.
- sqlalchemy_column(name: str, dialect: sa.Dialect) -> sa.Column
Obtain the SQL column specification of this column definition.
- Args:
name: The name of the column.
dialect: The SQL dialect for which to generate the column specification.
- Returns:
The column as specified in sqlalchemy.
- sqlalchemy_dtype(dialect: sa.Dialect) -> sa_TypeEngine
The sqlalchemy dtype equivalent of this column data type.
- validate_dtype(dtype: DataType | DataTypeClass) -> bool
Validate if the polars data type satisfies the column definition.
- Args:
dtype: The dtype to validate.
- Returns:
Whether the dtype is valid.
- validation_rules(expr: Expr) -> dict[str, Expr]
A set of rules evaluating whether a data frame column satisfies the column’s constraints.
- Args:
expr: An expression referencing the column of the data frame, i.e. an expression created by calling polars.col().
- Returns:
A mapping from validation rule names to expressions that provide exactly one boolean value per column item indicating whether validation with respect to the rule is successful. A value of False indicates invalid data, i.e. unsuccessful validation.
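The shape of the validation_rules() result, a mapping from rule names to one boolean per item, can be mimicked for the String constraints in plain Python. This is a rough sketch under stated assumptions, not dataframely's implementation: string_rules() and the rule names are hypothetical, null values are assumed to pass these rules (nullability being a separate concern), and the regex is assumed to be applied with search semantics.

```python
import re
from typing import Optional


def string_rules(
    values: list[Optional[str]],
    *,
    min_length: Optional[int] = None,
    max_length: Optional[int] = None,
    regex: Optional[str] = None,
) -> dict[str, list[bool]]:
    """Return one list of booleans per active rule; False marks invalid items."""
    rules: dict[str, list[bool]] = {}
    if min_length is not None:
        rules["min_length"] = [v is None or len(v) >= min_length for v in values]
    if max_length is not None:
        rules["max_length"] = [v is None or len(v) <= max_length for v in values]
    if regex is not None:
        pattern = re.compile(regex)
        rules["regex"] = [v is None or pattern.search(v) is not None for v in values]
    return rules


values = ["ab", "abcd", None, "x1"]
rules = string_rules(values, min_length=2, max_length=3, regex=r"^[a-z]+$")
# An item is valid only if every rule holds for it.
row_is_valid = [all(flags) for flags in zip(*rules.values())]
```

Keeping one boolean list per rule, rather than a single combined flag, is what makes it possible to report which constraint an item violated.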
- class dataframely.Struct(inner: dict[str, Column], *, nullable: bool | None = None, primary_key: bool = False, check: Callable[[Expr], Expr] | Sequence[Callable[[Expr], Expr]] | Mapping[str, Callable[[Expr], Expr]] | None = None, alias: str | None = None, metadata: dict[str, Any] | None = None)
Bases: Column
A struct column.
- Attributes:
col: Obtain a Polars column expression for the column.
dtype: The polars dtype equivalent of this column definition’s data type.
name: Get the name of the column in a schema.
pyarrow_dtype: The pyarrow dtype equivalent of this column data type.
Methods
as_dict(expr): Turn the column definition into a dictionary.
from_dict(data): Read the column definition from a dictionary.
matches(other, expr): Check whether this column semantically matches another column.
pyarrow_field(name): Obtain the pyarrow field of this column definition.
sample(generator[, n]): Sample random elements adhering to the constraints of this column.
sqlalchemy_column(name, dialect): Obtain the SQL column specification of this column definition.
sqlalchemy_dtype(dialect): The sqlalchemy dtype equivalent of this column data type.
validate_dtype(dtype): Validate if the polars data type satisfies the column definition.
validation_rules(expr): A set of rules evaluating whether a data frame column satisfies the column's constraints.
- as_dict(expr: Expr) -> dict[str, Any]
Turn the column definition into a dictionary.
If the column definition references other column definitions, they will be turned into dictionaries recursively.
- Args:
expr: An expression referencing the column to turn into a dictionary. This is required to properly encode custom checks.
- Returns:
The column definition as dictionary.
- Note:
This method stores custom checks as expressions rather than callables to allow for serialization.
- Note:
Do NOT use the returned object to evaluate semantic equality of two columns. It may yield different results than matches().
- Attention:
This method is only intended for internal use.
- property col: Expr
Obtain a Polars column expression for the column.
- property dtype: DataType
The polars dtype equivalent of this column definition’s data type.
This is primarily used for creating empty data frames with an appropriate schema. Thus, it should describe the default dtype equivalent if this data type encompasses multiple underlying data types.
- classmethod from_dict(data: dict[str, Any]) -> Self
Read the column definition from a dictionary.
- Args:
data: The dictionary that was created via as_dict().
- Returns:
The column definition read from the dictionary.
- Attention:
This method is only intended for internal use.
- matches(other: Column, expr: Expr) -> bool
Check whether this column semantically matches another column.
- Args:
other: The column to compare with.
expr: An expression referencing the column. This is required to properly evaluate the equivalence of custom checks.
- Returns:
Whether the columns are semantically equal.
- property name: str
Get the name of the column in a schema.
- property pyarrow_dtype: pa.DataType
The pyarrow dtype equivalent of this column data type.
- pyarrow_field(name: str) -> pa.Field
Obtain the pyarrow field of this column definition.
- Args:
name: The name of the column.
- Returns:
The pyarrow field definition.
- sample(generator: Generator, n: int = 1) -> Series
Sample random elements adhering to the constraints of this column.
- Args:
generator: The generator to use for sampling elements.
n: The number of elements to sample.
- Returns:
A series with the predefined number of elements. All elements are guaranteed to adhere to the column’s constraints.
- Raises:
ValueError: If this column has a custom check. In this case, random values cannot be guaranteed to adhere to the column’s constraints while providing any guarantees on the computational complexity.
- sqlalchemy_column(name: str, dialect: sa.Dialect) -> sa.Column
Obtain the SQL column specification of this column definition.
- Args:
name: The name of the column.
dialect: The SQL dialect for which to generate the column specification.
- Returns:
The column as specified in sqlalchemy.
- sqlalchemy_dtype(dialect: sa.Dialect) -> sa_TypeEngine
The sqlalchemy dtype equivalent of this column data type.
- validate_dtype(dtype: DataType | DataTypeClass) -> bool
Validate if the polars data type satisfies the column definition.
- Args:
dtype: The dtype to validate.
- Returns:
Whether the dtype is valid.
- validation_rules(expr: Expr) -> dict[str, Expr]
A set of rules evaluating whether a data frame column satisfies the column’s constraints.
- Args:
expr: An expression referencing the column of the data frame, i.e. an expression created by calling polars.col().
- Returns:
A mapping from validation rule names to expressions that provide exactly one boolean value per column item indicating whether validation with respect to the rule is successful. A value of False indicates invalid data, i.e. unsuccessful validation.
- class dataframely.Time(*, nullable: bool | None = None, primary_key: bool = False, min: time | None = None, min_exclusive: time | None = None, max: time | None = None, max_exclusive: time | None = None, resolution: str | None = None, check: Callable[[Expr], Expr] | Sequence[Callable[[Expr], Expr]] | Mapping[str, Callable[[Expr], Expr]] | None = None, alias: str | None = None, metadata: dict[str, Any] | None = None)
Bases: OrdinalMixin[time], Column
A column of times (without date).
- Attributes:
col: Obtain a Polars column expression for the column.
dtype: The polars dtype equivalent of this column definition’s data type.
name: Get the name of the column in a schema.
pyarrow_dtype: The pyarrow dtype equivalent of this column data type.
Methods
as_dict(expr): Turn the column definition into a dictionary.
from_dict(data): Read the column definition from a dictionary.
matches(other, expr): Check whether this column semantically matches another column.
pyarrow_field(name): Obtain the pyarrow field of this column definition.
sample(generator[, n]): Sample random elements adhering to the constraints of this column.
sqlalchemy_column(name, dialect): Obtain the SQL column specification of this column definition.
sqlalchemy_dtype(dialect): The sqlalchemy dtype equivalent of this column data type.
validate_dtype(dtype): Validate if the polars data type satisfies the column definition.
validation_rules(expr): A set of rules evaluating whether a data frame column satisfies the column's constraints.
- as_dict(expr: Expr) dict[str, Any][source]¶
Turn the column definition into a dictionary.
If the column definition references other column definitions, they will be turned into dictionaries recursively.
- Args:
- expr: An expression referencing the column to turn into a dictionary. This
is required to properly encode custom checks.
- Returns:
The column definition as dictionary.
- Note:
This method stores custom checks as expressions rather than callables to allow for serialization.
- Note:
Do NOT use the returned object to evaluate semantic equality of two columns. It may yield different results than
matches().- Attention:
This method is only intended for internal use.
- property col: Expr¶
Obtain a Polars column expression for the column.
- property dtype: DataType¶
The
polarsdtype equivalent of this column definition’s data type.This is primarily used for creating empty data frames with an appropriate schema. Thus, it should describe the default dtype equivalent if this data type encompasses multiple underlying data types.
- classmethod from_dict(data: dict[str, Any]) Self[source]¶
Read the column definition from a dictionary.
- Args:
data: The dictionary that was created via
as_dict().- Returns:
The column definition read from the dictionary.
- Attention:
This method is only intended for internal use.
- matches(other: Column, expr: Expr) bool[source]¶
Check whether this column semantically matches another column.
- Args:
other: The column to compare with. expr: An expression referencing the column. This is required to properly
evaluate the equivalence of custom checks.
- Returns:
Whether the columns are semantically equal.
- property name: str¶
Get the name of the column in a schema.
- property pyarrow_dtype: pa.DataType¶
The
pyarrowdtype equivalent of this column data type.
- pyarrow_field(name: str) pa.Field[source]¶
Obtain the pyarrow field of this column definition.
- Args:
name: The name of the column.
- Returns:
The
pyarrowfield definition.
- sample(generator: Generator, n: int = 1) Series[source]¶
Sample random elements adhering to the constraints of this column.
- Args:
generator: The generator to use for sampling elements. n: The number of elements to sample.
- Returns:
A series with the predefined number of elements. All elements are guaranteed to adhere to the column’s constraints.
- Raises:
- ValueError: If this column has a custom check. In this case, random values
cannot be guaranteed to adhere to the column’s constraints while providing any guarantees on the computational complexity.
- sqlalchemy_column(name: str, dialect: sa.Dialect) sa.Column[source]¶
Obtain the SQL column specification of this column definition.
- Args:
name: The name of the column.
dialect: The SQL dialect for which to generate the column specification.
- Returns:
The column as specified in sqlalchemy.
- sqlalchemy_dtype(dialect: sa.Dialect) sa_TypeEngine[source]¶
The sqlalchemy dtype equivalent of this column data type.
- validate_dtype(dtype: DataType | DataTypeClass) bool[source]¶
Validate if the polars data type satisfies the column definition.
- Args:
dtype: The dtype to validate.
- Returns:
Whether the dtype is valid.
- validation_rules(expr: Expr) dict[str, Expr][source]¶
A set of rules evaluating whether a data frame column satisfies the column’s constraints.
- Args:
- expr: An expression referencing the column of the data frame, i.e. an expression created by calling polars.col().
- Returns:
A mapping from validation rule names to expressions that provide exactly one boolean value per column item indicating whether validation with respect to the rule is successful. A value of False indicates invalid data, i.e. unsuccessful validation.
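The shape of the value returned by validation_rules() can be pictured as one boolean column per rule. The following is a minimal plain-Python sketch of that shape, not dataframely's implementation: lists of booleans stand in for polars expressions, and the rule names "min" and "max", the sample values, and the helper valid_rows are all hypothetical.

```python
# Plain-Python model of the mapping returned by validation_rules():
# each rule name maps to one boolean per column item.
# Rule names and values are made up for illustration.
values = [3, 12, 7, -1]

rule_results = {
    "min": [v >= 0 for v in values],   # False marks invalid data
    "max": [v <= 10 for v in values],
}


def valid_rows(results: dict[str, list[bool]]) -> list[bool]:
    """An item is valid only if every rule evaluates to True for it."""
    return [all(flags) for flags in zip(*results.values())]


row_is_valid = valid_rows(rule_results)
print(row_is_valid)  # [True, False, True, False]
```

Keeping the results keyed by rule name is what makes it possible to report which rule an item failed, rather than only that it failed.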
- class dataframely.UInt16(*, nullable: bool | None = None, primary_key: bool = False, min: int | None = None, min_exclusive: int | None = None, max: int | None = None, max_exclusive: int | None = None, is_in: Sequence[int] | None = None, check: Callable[[Expr], Expr] | Sequence[Callable[[Expr], Expr]] | Mapping[str, Callable[[Expr], Expr]] | None = None, alias: str | None = None, metadata: dict[str, Any] | None = None)[source]¶
Bases: _BaseInteger
A column of uint16 values.
- Attributes:
col: Obtain a Polars column expression for the column.
dtype: The polars dtype equivalent of this column definition’s data type.
name: Get the name of the column in a schema.
pyarrow_dtype: The pyarrow dtype equivalent of this column data type.
Methods
as_dict(expr): Turn the column definition into a dictionary.
from_dict(data): Read the column definition from a dictionary.
matches(other, expr): Check whether this column semantically matches another column.
pyarrow_field(name): Obtain the pyarrow field of this column definition.
sample(generator[, n]): Sample random elements adhering to the constraints of this column.
sqlalchemy_column(name, dialect): Obtain the SQL column specification of this column definition.
sqlalchemy_dtype(dialect): The sqlalchemy dtype equivalent of this column data type.
validate_dtype(dtype): Validate if the polars data type satisfies the column definition.
validation_rules
- as_dict(expr: Expr) dict[str, Any][source]¶
Turn the column definition into a dictionary.
If the column definition references other column definitions, they will be turned into dictionaries recursively.
- Args:
- expr: An expression referencing the column to turn into a dictionary. This
is required to properly encode custom checks.
- Returns:
The column definition as dictionary.
- Note:
This method stores custom checks as expressions rather than callables to allow for serialization.
- Note:
Do NOT use the returned object to evaluate semantic equality of two columns. It may yield different results than matches().
- Attention:
This method is only intended for internal use.
- property col: Expr¶
Obtain a Polars column expression for the column.
- property dtype: DataType¶
The polars dtype equivalent of this column definition’s data type.
This is primarily used for creating empty data frames with an appropriate schema. Thus, it should describe the default dtype equivalent if this data type encompasses multiple underlying data types.
- classmethod from_dict(data: dict[str, Any]) Self[source]¶
Read the column definition from a dictionary.
- Args:
data: The dictionary that was created via as_dict().
- Returns:
The column definition read from the dictionary.
- Attention:
This method is only intended for internal use.
- is_unsigned = True¶
- matches(other: Column, expr: Expr) bool[source]¶
Check whether this column semantically matches another column.
- Args:
other: The column to compare with.
expr: An expression referencing the column. This is required to properly evaluate the equivalence of custom checks.
- Returns:
Whether the columns are semantically equal.
- max_value = 65535¶
- min_value = 0¶
- property name: str¶
Get the name of the column in a schema.
- num_bytes = 2¶
- property pyarrow_dtype: pa.DataType¶
The pyarrow dtype equivalent of this column data type.
- pyarrow_field(name: str) pa.Field[source]¶
Obtain the pyarrow field of this column definition.
- Args:
name: The name of the column.
- Returns:
The pyarrow field definition.
- sample(generator: Generator, n: int = 1) Series[source]¶
Sample random elements adhering to the constraints of this column.
- Args:
generator: The generator to use for sampling elements.
n: The number of elements to sample.
- Returns:
A series with the predefined number of elements. All elements are guaranteed to adhere to the column’s constraints.
- Raises:
- ValueError: If this column has a custom check. In this case, random values
cannot be guaranteed to adhere to the column’s constraints while providing any guarantees on the computational complexity.
- sqlalchemy_column(name: str, dialect: sa.Dialect) sa.Column[source]¶
Obtain the SQL column specification of this column definition.
- Args:
name: The name of the column.
dialect: The SQL dialect for which to generate the column specification.
- Returns:
The column as specified in sqlalchemy.
- sqlalchemy_dtype(dialect: sa.Dialect) sa_TypeEngine[source]¶
The sqlalchemy dtype equivalent of this column data type.
- validate_dtype(dtype: DataType | DataTypeClass) bool[source]¶
Validate if the polars data type satisfies the column definition.
- Args:
dtype: The dtype to validate.
- Returns:
Whether the dtype is valid.
- validation_rules(expr: Expr) dict[str, Expr][source]¶
A set of rules evaluating whether a data frame column satisfies the column’s constraints.
- Args:
- expr: An expression referencing the column of the data frame, i.e. an expression created by calling polars.col().
- Returns:
A mapping from validation rule names to expressions that provide exactly one boolean value per column item indicating whether validation with respect to the rule is successful. A value of False indicates invalid data, i.e. unsuccessful validation.
- class dataframely.UInt32(*, nullable: bool | None = None, primary_key: bool = False, min: int | None = None, min_exclusive: int | None = None, max: int | None = None, max_exclusive: int | None = None, is_in: Sequence[int] | None = None, check: Callable[[Expr], Expr] | Sequence[Callable[[Expr], Expr]] | Mapping[str, Callable[[Expr], Expr]] | None = None, alias: str | None = None, metadata: dict[str, Any] | None = None)[source]¶
Bases: _BaseInteger
A column of uint32 values.
- Attributes:
col: Obtain a Polars column expression for the column.
dtype: The polars dtype equivalent of this column definition’s data type.
name: Get the name of the column in a schema.
pyarrow_dtype: The pyarrow dtype equivalent of this column data type.
Methods
as_dict(expr): Turn the column definition into a dictionary.
from_dict(data): Read the column definition from a dictionary.
matches(other, expr): Check whether this column semantically matches another column.
pyarrow_field(name): Obtain the pyarrow field of this column definition.
sample(generator[, n]): Sample random elements adhering to the constraints of this column.
sqlalchemy_column(name, dialect): Obtain the SQL column specification of this column definition.
sqlalchemy_dtype(dialect): The sqlalchemy dtype equivalent of this column data type.
validate_dtype(dtype): Validate if the polars data type satisfies the column definition.
validation_rules
- as_dict(expr: Expr) dict[str, Any][source]¶
Turn the column definition into a dictionary.
If the column definition references other column definitions, they will be turned into dictionaries recursively.
- Args:
- expr: An expression referencing the column to turn into a dictionary. This
is required to properly encode custom checks.
- Returns:
The column definition as dictionary.
- Note:
This method stores custom checks as expressions rather than callables to allow for serialization.
- Note:
Do NOT use the returned object to evaluate semantic equality of two columns. It may yield different results than matches().
- Attention:
This method is only intended for internal use.
- property col: Expr¶
Obtain a Polars column expression for the column.
- property dtype: DataType¶
The polars dtype equivalent of this column definition’s data type.
This is primarily used for creating empty data frames with an appropriate schema. Thus, it should describe the default dtype equivalent if this data type encompasses multiple underlying data types.
- classmethod from_dict(data: dict[str, Any]) Self[source]¶
Read the column definition from a dictionary.
- Args:
data: The dictionary that was created via as_dict().
- Returns:
The column definition read from the dictionary.
- Attention:
This method is only intended for internal use.
- is_unsigned = True¶
- matches(other: Column, expr: Expr) bool[source]¶
Check whether this column semantically matches another column.
- Args:
other: The column to compare with.
expr: An expression referencing the column. This is required to properly evaluate the equivalence of custom checks.
- Returns:
Whether the columns are semantically equal.
- max_value = 4294967295¶
- min_value = 0¶
- property name: str¶
Get the name of the column in a schema.
- num_bytes = 4¶
- property pyarrow_dtype: pa.DataType¶
The pyarrow dtype equivalent of this column data type.
- pyarrow_field(name: str) pa.Field[source]¶
Obtain the pyarrow field of this column definition.
- Args:
name: The name of the column.
- Returns:
The pyarrow field definition.
- sample(generator: Generator, n: int = 1) Series[source]¶
Sample random elements adhering to the constraints of this column.
- Args:
generator: The generator to use for sampling elements.
n: The number of elements to sample.
- Returns:
A series with the predefined number of elements. All elements are guaranteed to adhere to the column’s constraints.
- Raises:
- ValueError: If this column has a custom check. In this case, random values
cannot be guaranteed to adhere to the column’s constraints while providing any guarantees on the computational complexity.
- sqlalchemy_column(name: str, dialect: sa.Dialect) sa.Column[source]¶
Obtain the SQL column specification of this column definition.
- Args:
name: The name of the column.
dialect: The SQL dialect for which to generate the column specification.
- Returns:
The column as specified in sqlalchemy.
- sqlalchemy_dtype(dialect: sa.Dialect) sa_TypeEngine[source]¶
The sqlalchemy dtype equivalent of this column data type.
- validate_dtype(dtype: DataType | DataTypeClass) bool[source]¶
Validate if the polars data type satisfies the column definition.
- Args:
dtype: The dtype to validate.
- Returns:
Whether the dtype is valid.
- validation_rules(expr: Expr) dict[str, Expr][source]¶
A set of rules evaluating whether a data frame column satisfies the column’s constraints.
- Args:
- expr: An expression referencing the column of the data frame, i.e. an expression created by calling polars.col().
- Returns:
A mapping from validation rule names to expressions that provide exactly one boolean value per column item indicating whether validation with respect to the rule is successful. A value of False indicates invalid data, i.e. unsuccessful validation.
- class dataframely.UInt64(*, nullable: bool | None = None, primary_key: bool = False, min: int | None = None, min_exclusive: int | None = None, max: int | None = None, max_exclusive: int | None = None, is_in: Sequence[int] | None = None, check: Callable[[Expr], Expr] | Sequence[Callable[[Expr], Expr]] | Mapping[str, Callable[[Expr], Expr]] | None = None, alias: str | None = None, metadata: dict[str, Any] | None = None)[source]¶
Bases: _BaseInteger
A column of uint64 values.
- Attributes:
col: Obtain a Polars column expression for the column.
dtype: The polars dtype equivalent of this column definition’s data type.
name: Get the name of the column in a schema.
pyarrow_dtype: The pyarrow dtype equivalent of this column data type.
Methods
as_dict(expr): Turn the column definition into a dictionary.
from_dict(data): Read the column definition from a dictionary.
matches(other, expr): Check whether this column semantically matches another column.
pyarrow_field(name): Obtain the pyarrow field of this column definition.
sample(generator[, n]): Sample random elements adhering to the constraints of this column.
sqlalchemy_column(name, dialect): Obtain the SQL column specification of this column definition.
sqlalchemy_dtype(dialect): The sqlalchemy dtype equivalent of this column data type.
validate_dtype(dtype): Validate if the polars data type satisfies the column definition.
validation_rules
- as_dict(expr: Expr) dict[str, Any][source]¶
Turn the column definition into a dictionary.
If the column definition references other column definitions, they will be turned into dictionaries recursively.
- Args:
- expr: An expression referencing the column to turn into a dictionary. This
is required to properly encode custom checks.
- Returns:
The column definition as dictionary.
- Note:
This method stores custom checks as expressions rather than callables to allow for serialization.
- Note:
Do NOT use the returned object to evaluate semantic equality of two columns. It may yield different results than matches().
- Attention:
This method is only intended for internal use.
- property col: Expr¶
Obtain a Polars column expression for the column.
- property dtype: DataType¶
The polars dtype equivalent of this column definition’s data type.
This is primarily used for creating empty data frames with an appropriate schema. Thus, it should describe the default dtype equivalent if this data type encompasses multiple underlying data types.
- classmethod from_dict(data: dict[str, Any]) Self[source]¶
Read the column definition from a dictionary.
- Args:
data: The dictionary that was created via as_dict().
- Returns:
The column definition read from the dictionary.
- Attention:
This method is only intended for internal use.
- is_unsigned = True¶
- matches(other: Column, expr: Expr) bool[source]¶
Check whether this column semantically matches another column.
- Args:
other: The column to compare with.
expr: An expression referencing the column. This is required to properly evaluate the equivalence of custom checks.
- Returns:
Whether the columns are semantically equal.
- max_value = 18446744073709551615¶
- min_value = 0¶
- property name: str¶
Get the name of the column in a schema.
- num_bytes = 8¶
- property pyarrow_dtype: pa.DataType¶
The pyarrow dtype equivalent of this column data type.
- pyarrow_field(name: str) pa.Field[source]¶
Obtain the pyarrow field of this column definition.
- Args:
name: The name of the column.
- Returns:
The pyarrow field definition.
- sample(generator: Generator, n: int = 1) Series[source]¶
Sample random elements adhering to the constraints of this column.
- Args:
generator: The generator to use for sampling elements.
n: The number of elements to sample.
- Returns:
A series with the predefined number of elements. All elements are guaranteed to adhere to the column’s constraints.
- Raises:
- ValueError: If this column has a custom check. In this case, random values
cannot be guaranteed to adhere to the column’s constraints while providing any guarantees on the computational complexity.
- sqlalchemy_column(name: str, dialect: sa.Dialect) sa.Column[source]¶
Obtain the SQL column specification of this column definition.
- Args:
name: The name of the column.
dialect: The SQL dialect for which to generate the column specification.
- Returns:
The column as specified in sqlalchemy.
- sqlalchemy_dtype(dialect: sa.Dialect) sa_TypeEngine[source]¶
The sqlalchemy dtype equivalent of this column data type.
- validate_dtype(dtype: DataType | DataTypeClass) bool[source]¶
Validate if the polars data type satisfies the column definition.
- Args:
dtype: The dtype to validate.
- Returns:
Whether the dtype is valid.
- validation_rules(expr: Expr) dict[str, Expr][source]¶
A set of rules evaluating whether a data frame column satisfies the column’s constraints.
- Args:
- expr: An expression referencing the column of the data frame, i.e. an expression created by calling polars.col().
- Returns:
A mapping from validation rule names to expressions that provide exactly one boolean value per column item indicating whether validation with respect to the rule is successful. A value of False indicates invalid data, i.e. unsuccessful validation.
- class dataframely.UInt8(*, nullable: bool | None = None, primary_key: bool = False, min: int | None = None, min_exclusive: int | None = None, max: int | None = None, max_exclusive: int | None = None, is_in: Sequence[int] | None = None, check: Callable[[Expr], Expr] | Sequence[Callable[[Expr], Expr]] | Mapping[str, Callable[[Expr], Expr]] | None = None, alias: str | None = None, metadata: dict[str, Any] | None = None)[source]¶
Bases: _BaseInteger
A column of uint8 values.
- Attributes:
col: Obtain a Polars column expression for the column.
dtype: The polars dtype equivalent of this column definition’s data type.
name: Get the name of the column in a schema.
pyarrow_dtype: The pyarrow dtype equivalent of this column data type.
Methods
as_dict(expr): Turn the column definition into a dictionary.
from_dict(data): Read the column definition from a dictionary.
matches(other, expr): Check whether this column semantically matches another column.
pyarrow_field(name): Obtain the pyarrow field of this column definition.
sample(generator[, n]): Sample random elements adhering to the constraints of this column.
sqlalchemy_column(name, dialect): Obtain the SQL column specification of this column definition.
sqlalchemy_dtype(dialect): The sqlalchemy dtype equivalent of this column data type.
validate_dtype(dtype): Validate if the polars data type satisfies the column definition.
validation_rules
- as_dict(expr: Expr) dict[str, Any][source]¶
Turn the column definition into a dictionary.
If the column definition references other column definitions, they will be turned into dictionaries recursively.
- Args:
- expr: An expression referencing the column to turn into a dictionary. This
is required to properly encode custom checks.
- Returns:
The column definition as dictionary.
- Note:
This method stores custom checks as expressions rather than callables to allow for serialization.
- Note:
Do NOT use the returned object to evaluate semantic equality of two columns. It may yield different results than matches().
- Attention:
This method is only intended for internal use.
- property col: Expr¶
Obtain a Polars column expression for the column.
- property dtype: DataType¶
The polars dtype equivalent of this column definition’s data type.
This is primarily used for creating empty data frames with an appropriate schema. Thus, it should describe the default dtype equivalent if this data type encompasses multiple underlying data types.
- classmethod from_dict(data: dict[str, Any]) Self[source]¶
Read the column definition from a dictionary.
- Args:
data: The dictionary that was created via as_dict().
- Returns:
The column definition read from the dictionary.
- Attention:
This method is only intended for internal use.
- is_unsigned = True¶
- matches(other: Column, expr: Expr) bool[source]¶
Check whether this column semantically matches another column.
- Args:
other: The column to compare with.
expr: An expression referencing the column. This is required to properly evaluate the equivalence of custom checks.
- Returns:
Whether the columns are semantically equal.
- max_value = 255¶
- min_value = 0¶
- property name: str¶
Get the name of the column in a schema.
- num_bytes = 1¶
- property pyarrow_dtype: pa.DataType¶
The pyarrow dtype equivalent of this column data type.
- pyarrow_field(name: str) pa.Field[source]¶
Obtain the pyarrow field of this column definition.
- Args:
name: The name of the column.
- Returns:
The pyarrow field definition.
- sample(generator: Generator, n: int = 1) Series[source]¶
Sample random elements adhering to the constraints of this column.
- Args:
generator: The generator to use for sampling elements.
n: The number of elements to sample.
- Returns:
A series with the predefined number of elements. All elements are guaranteed to adhere to the column’s constraints.
- Raises:
- ValueError: If this column has a custom check. In this case, random values
cannot be guaranteed to adhere to the column’s constraints while providing any guarantees on the computational complexity.
- sqlalchemy_column(name: str, dialect: sa.Dialect) sa.Column[source]¶
Obtain the SQL column specification of this column definition.
- Args:
name: The name of the column.
dialect: The SQL dialect for which to generate the column specification.
- Returns:
The column as specified in sqlalchemy.
- sqlalchemy_dtype(dialect: sa.Dialect) sa_TypeEngine[source]¶
The sqlalchemy dtype equivalent of this column data type.
- validate_dtype(dtype: DataType | DataTypeClass) bool[source]¶
Validate if the polars data type satisfies the column definition.
- Args:
dtype: The dtype to validate.
- Returns:
Whether the dtype is valid.
- validation_rules(expr: Expr) dict[str, Expr][source]¶
A set of rules evaluating whether a data frame column satisfies the column’s constraints.
- Args:
- expr: An expression referencing the column of the data frame, i.e. an expression created by calling polars.col().
- Returns:
A mapping from validation rule names to expressions that provide exactly one boolean value per column item indicating whether validation with respect to the rule is successful. A value of False indicates invalid data, i.e. unsuccessful validation.
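The min_value, max_value and num_bytes attributes documented for UInt8, UInt16, UInt32 and UInt64 follow the usual fixed-width unsigned integer arithmetic. The short sketch below is independent of dataframely and only reproduces the documented numbers; the helper unsigned_bounds is a hypothetical name introduced for illustration.

```python
def unsigned_bounds(num_bytes: int) -> tuple[int, int]:
    """Return (min_value, max_value) of an unsigned integer of the given width."""
    return 0, 2 ** (8 * num_bytes) - 1


# Byte widths as documented in the num_bytes attribute of each column class.
widths = {"UInt8": 1, "UInt16": 2, "UInt32": 4, "UInt64": 8}
bounds = {name: unsigned_bounds(n) for name, n in widths.items()}

for name, (low, high) in bounds.items():
    print(f"{name}: min_value={low}, max_value={high}")
```

These bounds are the widest range a min or max constraint can meaningfully express for the corresponding column type.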
- dataframely.concat_collection_members(collections: Sequence[C], /) dict[str, LazyFrame][source]¶
Concatenate the members of collections with the same type.
- Args:
- collections: The collections whose members to concatenate. Optional members
are concatenated only from the collections that provide them.
- Returns:
A mapping from member names to a lazy concatenation of data frames. All keys are guaranteed to be valid members of the collection.
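The handling of optional members described for concat_collection_members() can be modelled without any data frame library. The sketch below is an assumption-laden illustration, not the library's code: plain dictionaries stand in for collections, lists stand in for lazy frames, None marks an optional member that a collection does not provide, and concat_members as well as the member names are hypothetical.

```python
from typing import Optional


def concat_members(
    collections: list[dict[str, Optional[list]]],
) -> dict[str, list]:
    """Concatenate same-named members; skip collections lacking an optional member."""
    merged: dict[str, list] = {}
    for collection in collections:
        for member, rows in collection.items():
            if rows is None:  # optional member not provided by this collection
                continue
            merged.setdefault(member, []).extend(rows)
    return merged


first = {"invoices": [1, 2], "refunds": [10]}
second = {"invoices": [3], "refunds": None}  # optional member absent

merged = concat_members([first, second])
print(merged)  # {'invoices': [1, 2, 3], 'refunds': [10]}
```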
- dataframely.deserialize_collection(data: str) type[Collection][source]¶
Deserialize a collection from a JSON string.
This method allows a collection to be loaded dynamically from its serialization, without knowing in advance which collection to load.
- Args:
data: The JSON string created via Collection.serialize().
- Returns:
The collection loaded from the JSON data.
- Raises:
ValueError: If the schema format version is not supported.
- Attention:
The returned collection cannot be used to create instances of the collection as filters cannot be correctly recovered from the serialized format as of polars 1.31. Thus, you should only use static information from the returned collection.
- Attention:
This functionality is considered unstable. It may be changed at any time without it being considered a breaking change.
- See also:
Collection.serialize() for additional information on serialization.
- dataframely.deserialize_schema(data: str, strict: bool = True) type[Schema] | None[source]¶
Deserialize a schema from a JSON string.
This method allows a schema to be loaded dynamically from its serialization, without knowing in advance which schema to load.
- Args:
data: The JSON string created via Schema.serialize().
strict: Whether to raise an exception if the schema cannot be deserialized.
- Returns:
The schema loaded from the JSON data.
- Raises:
ValueError: If the schema format version is not supported and strict=True.
- Attention:
This functionality is considered unstable. It may be changed at any time without it being considered a breaking change.
- See also:
Schema.serialize() for additional information on serialization.
- dataframely.filter() Callable[[Callable[[C], LazyFrame]], Filter[C]][source]¶
Mark a function as a filter for rows in the members of a collection.
The name of the function will be used as the name of the filter. The name must not clash with the name of any column in the member schemas or rules defined on the member schemas.
A filter receives a collection as input and must return a data frame that satisfies the following:
- The columns must be a superset of the common primary keys across all members.
- The rows must provide the primary keys which ought to be kept across the members. The filter results in the removal of rows which are lost as the result of inner-joining members onto the return value of this function.
- Attention:
Make sure to provide unique combinations of the primary keys or the filters might introduce duplicate rows.
- Attention:
The filter logic should return a lazy frame with a static computational graph. Other implementations using arbitrary Python logic work for filtering and validation, but may lead to wrong results in Collection comparisons and (de-)serialization.
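The inner-join semantics of a filter can be illustrated with a small plain-Python model. This is a sketch under stated assumptions rather than dataframely's implementation: lists of dictionaries stand in for lazy frames, "id" is an assumed common primary key, and the functions keep_shared_ids and apply_filter as well as the data are hypothetical.

```python
# Two collection members sharing the primary key "id" (made-up data).
members = {
    "orders": [{"id": 1, "total": 20}, {"id": 2, "total": 35}, {"id": 3, "total": 5}],
    "customers": [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 4, "name": "d"}],
}


def keep_shared_ids(collection: dict[str, list[dict]]) -> list[dict]:
    """Hypothetical filter: return the primary keys present in every member."""
    key_sets = [{row["id"] for row in rows} for rows in collection.values()]
    return [{"id": key} for key in sorted(set.intersection(*key_sets))]


def apply_filter(collection: dict[str, list[dict]], filter_result: list[dict], on: str = "id"):
    """Model of inner-joining each member onto the filter result."""
    kept = {row[on] for row in filter_result}
    return {
        name: [row for row in rows if row[on] in kept]
        for name, rows in collection.items()
    }


filtered = apply_filter(members, keep_shared_ids(members))
print(filtered["orders"])     # rows with id 1 and 2 remain
print(filtered["customers"])  # rows with id 1 and 2 remain
```

The set lookup in this model hides one effect of a real join: if the filter result contained the same primary key twice, an inner join would duplicate the matching rows, which is why the key combinations returned by a filter need to be unique.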
- dataframely.filter_relationship_one_to_at_least_one(lhs: LazyFrame[S] | LazyFrame, rhs: LazyFrame[T] | LazyFrame, /, on: str | list[str]) LazyFrame[source]¶
Express a 1:{1,N} mapping between data frames for a collection filter.
- Args:
lhs: The data frame with exactly one occurrence for a set of key columns.
rhs: The data frame with at least one occurrence for a set of key columns.
on: The columns to join the data frames on. If not provided, the join columns are inferred from the joint primary keys of the provided data frames.
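One way to read the 1:{1,N} relationship is as a set of keys: a key is kept when it occurs in lhs and has at least one matching row in rhs. The sketch below models only that reading in plain Python and makes no claim about how dataframely computes it; one_to_at_least_one, the column name "dept", and the data are hypothetical.

```python
from collections import Counter


def one_to_at_least_one(lhs: list[dict], rhs: list[dict], on: str) -> list[dict]:
    """Keys that occur in lhs and at least once in rhs (a model, not library code)."""
    rhs_counts = Counter(row[on] for row in rhs)
    return [{on: row[on]} for row in lhs if rhs_counts[row[on]] >= 1]


departments = [{"dept": "A"}, {"dept": "B"}, {"dept": "C"}]
employees = [
    {"dept": "A", "name": "x"},
    {"dept": "A", "name": "y"},
    {"dept": "C", "name": "z"},
]

kept = one_to_at_least_one(departments, employees, on="dept")
print(kept)  # [{'dept': 'A'}, {'dept': 'C'}]
```

Used as the return value of a collection filter, a result like this would drop department "B", which has no employees, and would likewise drop any employee row whose department is missing from the left-hand side.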
- dataframely.filter_relationship_one_to_one(lhs: LazyFrame[S] | LazyFrame, rhs: LazyFrame[T] | LazyFrame, /, on: str | list[str]) LazyFrame[source]¶
Express a 1:1 mapping between data frames for a collection filter.
- Args:
lhs: The first data frame in the 1:1 mapping.
rhs: The second data frame in the 1:1 mapping.
on: The columns to join the data frames on. If not provided, the join columns are inferred from the mutual primary keys of the provided data frames.
- dataframely.read_parquet_metadata_collection(source: str | Path | IO[bytes] | bytes) type[Collection] | None[source]¶
Read a dataframely Collection type from the metadata of a parquet file.
- Args:
source: Path to a parquet file or a file-like object that contains the metadata.
- Returns:
The collection that was serialized to the metadata.
None if no collection metadata is found or the deserialization fails.
- dataframely.read_parquet_metadata_schema(source: str | Path | IO[bytes] | bytes) type[Schema] | None[source]¶
Read a dataframely schema from the metadata of a parquet file.
- Args:
source: Path to a parquet file or a file-like object that contains the metadata.
- Returns:
The schema that was serialized to the metadata.
None if no schema metadata is found or the deserialization fails.
- dataframely.rule(*, group_by: list[str] | None = None) Callable[[Callable[[], Expr]], Rule][source]¶
Mark a function as a rule to evaluate during validation.
The name of the function will be used as the name of the rule. The function should return an expression providing a boolean value indicating whether a row is valid with respect to the rule. A value of true indicates validity.
Rules should be used only in the following two circumstances:
- Validation requires accessing multiple columns (e.g. if valid values of column A depend on the value in column B).
- Validation must be performed on groups of rows (e.g. if a column A must not contain any duplicate values among rows with the same value in column B).
In all other instances, column-level validation rules should be preferred, as they aid readability and improve error messages.
- Args:
- group_by: An optional list of columns to group by for rules operating on groups of rows. If this list is provided, the returned expression must return a single boolean value, i.e. some kind of aggregation function must be used (e.g. sum, any, …).
- Note:
You’ll need to explicitly handle null values in your columns when defining rules. By default, any rule that evaluates to null because one of the columns used in the rule is null is interpreted as true, i.e. the row is assumed to be valid.
- Attention:
The rule logic should return a static result. Other implementations using arbitrary Python logic work for filtering and validation, but may lead to wrong results in Schema comparisons and (de-)serialization.
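Two points from the description of rule() lend themselves to a small model: a rule result of null counts as valid, and a rule with group_by must reduce each group to a single boolean. The plain-Python sketch below illustrates both under stated assumptions. It does not use dataframely or polars, None stands in for null, and the functions row_rule, interpret and unique_a_per_b as well as the columns "a" and "b" are hypothetical.

```python
from collections import defaultdict
from typing import Optional

rows = [
    {"a": 1, "b": "x"},
    {"a": 1, "b": "x"},   # duplicate "a" within the group b == "x"
    {"a": 2, "b": "y"},
    {"a": None, "b": "y"},
]


def row_rule(row: dict) -> Optional[bool]:
    """Row-level rule 'a is positive'; evaluates to None when 'a' is null."""
    return None if row["a"] is None else row["a"] > 0


def interpret(result: Optional[bool]) -> bool:
    """A null rule result is treated as valid, mirroring the documented default."""
    return True if result is None else result


def unique_a_per_b(data: list[dict]) -> dict[str, bool]:
    """Group-level rule: one boolean per group of 'b', obtained by aggregation."""
    groups: dict[str, list] = defaultdict(list)
    for row in data:
        groups[row["b"]].append(row["a"])
    return {key: len(values) == len(set(values)) for key, values in groups.items()}


row_valid = [interpret(row_rule(row)) for row in rows]
group_valid = unique_a_per_b(rows)
print(row_valid)    # [True, True, True, True]
print(group_valid)  # {'x': False, 'y': True}
```

The last row passes the row-level rule only because its null result is read as valid; a rule that should reject nulls has to test for them explicitly.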
Subpackages¶
- dataframely.columns package
Any
Array
Binary
Bool
Categorical
Column
Date
Datetime
Decimal
Duration
Enum
Float
Float32
Float64
Int16
Int32
Int64
Int8
Integer
List
Object
String
Struct
Time
UInt16
UInt32
UInt64
UInt8
column_from_dict()
- Submodules
- dataframely.columns.any module
- dataframely.columns.array module
- dataframely.columns.binary module
- dataframely.columns.bool module
- dataframely.columns.categorical module
CategoricalCategorical.as_dict()Categorical.colCategorical.dtypeCategorical.from_dict()Categorical.matches()Categorical.nameCategorical.pyarrow_dtypeCategorical.pyarrow_field()Categorical.sample()Categorical.sqlalchemy_column()Categorical.sqlalchemy_dtype()Categorical.validate_dtype()Categorical.validation_rules()
- dataframely.columns.datetime module
- dataframely.columns.decimal module
- dataframely.columns.enum module
- dataframely.columns.float module
FloatFloat32Float32.as_dict()Float32.colFloat32.dtypeFloat32.from_dict()Float32.matches()Float32.max_valueFloat32.min_valueFloat32.nameFloat32.pyarrow_dtypeFloat32.pyarrow_field()Float32.sample()Float32.sqlalchemy_column()Float32.sqlalchemy_dtype()Float32.validate_dtype()Float32.validation_rules()
Float64Float64.as_dict()Float64.colFloat64.dtypeFloat64.from_dict()Float64.matches()Float64.max_valueFloat64.min_valueFloat64.nameFloat64.pyarrow_dtypeFloat64.pyarrow_field()Float64.sample()Float64.sqlalchemy_column()Float64.sqlalchemy_dtype()Float64.validate_dtype()Float64.validation_rules()
- dataframely.columns.integer module
Int16Int16.as_dict()Int16.colInt16.dtypeInt16.from_dict()Int16.is_unsignedInt16.matches()Int16.max_valueInt16.min_valueInt16.nameInt16.num_bytesInt16.pyarrow_dtypeInt16.pyarrow_field()Int16.sample()Int16.sqlalchemy_column()Int16.sqlalchemy_dtype()Int16.validate_dtype()Int16.validation_rules()
Int32Int32.as_dict()Int32.colInt32.dtypeInt32.from_dict()Int32.is_unsignedInt32.matches()Int32.max_valueInt32.min_valueInt32.nameInt32.num_bytesInt32.pyarrow_dtypeInt32.pyarrow_field()Int32.sample()Int32.sqlalchemy_column()Int32.sqlalchemy_dtype()Int32.validate_dtype()Int32.validation_rules()
Int64Int64.as_dict()Int64.colInt64.dtypeInt64.from_dict()Int64.is_unsignedInt64.matches()Int64.max_valueInt64.min_valueInt64.nameInt64.num_bytesInt64.pyarrow_dtypeInt64.pyarrow_field()Int64.sample()Int64.sqlalchemy_column()Int64.sqlalchemy_dtype()Int64.validate_dtype()Int64.validation_rules()
Int8IntegerInteger.as_dict()Integer.colInteger.dtypeInteger.from_dict()Integer.is_unsignedInteger.matches()Integer.max_valueInteger.min_valueInteger.nameInteger.num_bytesInteger.pyarrow_dtypeInteger.pyarrow_field()Integer.sample()Integer.sqlalchemy_column()Integer.sqlalchemy_dtype()Integer.validate_dtype()Integer.validation_rules()
UInt16UInt16.as_dict()UInt16.colUInt16.dtypeUInt16.from_dict()UInt16.is_unsignedUInt16.matches()UInt16.max_valueUInt16.min_valueUInt16.nameUInt16.num_bytesUInt16.pyarrow_dtypeUInt16.pyarrow_field()UInt16.sample()UInt16.sqlalchemy_column()UInt16.sqlalchemy_dtype()UInt16.validate_dtype()UInt16.validation_rules()
UInt32UInt32.as_dict()UInt32.colUInt32.dtypeUInt32.from_dict()UInt32.is_unsignedUInt32.matches()UInt32.max_valueUInt32.min_valueUInt32.nameUInt32.num_bytesUInt32.pyarrow_dtypeUInt32.pyarrow_field()UInt32.sample()UInt32.sqlalchemy_column()UInt32.sqlalchemy_dtype()UInt32.validate_dtype()UInt32.validation_rules()
UInt64UInt64.as_dict()UInt64.colUInt64.dtypeUInt64.from_dict()UInt64.is_unsignedUInt64.matches()UInt64.max_valueUInt64.min_valueUInt64.nameUInt64.num_bytesUInt64.pyarrow_dtypeUInt64.pyarrow_field()UInt64.sample()UInt64.sqlalchemy_column()UInt64.sqlalchemy_dtype()UInt64.validate_dtype()UInt64.validation_rules()
UInt8UInt8.as_dict()UInt8.colUInt8.dtypeUInt8.from_dict()UInt8.is_unsignedUInt8.matches()UInt8.max_valueUInt8.min_valueUInt8.nameUInt8.num_bytesUInt8.pyarrow_dtypeUInt8.pyarrow_field()UInt8.sample()UInt8.sqlalchemy_column()UInt8.sqlalchemy_dtype()UInt8.validate_dtype()UInt8.validation_rules()
- dataframely.columns.list module
- dataframely.columns.object module
- dataframely.columns.string module
- dataframely.columns.struct module
- dataframely.testing package
create_collection()create_collection_raw()create_schema()evaluate_rules()rules_from_exprs()validation_mask()- Submodules
- dataframely.testing.const module
- dataframely.testing.factory module
- dataframely.testing.mask module
- dataframely.testing.rules module
- dataframely.testing.storage module
- dataframely.testing.typing module
MyImportedBaseSchemaMyImportedBaseSchema.aMyImportedBaseSchema.cast()MyImportedBaseSchema.column_names()MyImportedBaseSchema.columns()MyImportedBaseSchema.create_empty()MyImportedBaseSchema.create_empty_if_none()MyImportedBaseSchema.filter()MyImportedBaseSchema.is_valid()MyImportedBaseSchema.matches()MyImportedBaseSchema.polars_schema()MyImportedBaseSchema.primary_keys()MyImportedBaseSchema.pyarrow_schema()MyImportedBaseSchema.read_delta()MyImportedBaseSchema.read_parquet()MyImportedBaseSchema.sample()MyImportedBaseSchema.scan_delta()MyImportedBaseSchema.scan_parquet()MyImportedBaseSchema.serialize()MyImportedBaseSchema.sink_parquet()MyImportedBaseSchema.sql_schema()MyImportedBaseSchema.validate()MyImportedBaseSchema.write_delta()MyImportedBaseSchema.write_parquet()
MyImportedSchemaMyImportedSchema.aMyImportedSchema.bMyImportedSchema.cMyImportedSchema.cast()MyImportedSchema.column_names()MyImportedSchema.columns()MyImportedSchema.create_empty()MyImportedSchema.create_empty_if_none()MyImportedSchema.dMyImportedSchema.eMyImportedSchema.fMyImportedSchema.filter()MyImportedSchema.gMyImportedSchema.hMyImportedSchema.is_valid()MyImportedSchema.matches()MyImportedSchema.polars_schema()MyImportedSchema.primary_keys()MyImportedSchema.pyarrow_schema()MyImportedSchema.read_delta()MyImportedSchema.read_parquet()MyImportedSchema.sample()MyImportedSchema.scan_delta()MyImportedSchema.scan_parquet()MyImportedSchema.serialize()MyImportedSchema.sink_parquet()MyImportedSchema.some_decimalMyImportedSchema.sql_schema()MyImportedSchema.validate()MyImportedSchema.write_delta()MyImportedSchema.write_parquet()
Submodules¶
dataframely.collection module¶
- class dataframely.collection.Collection[source]¶
Bases:
BaseCollection, ABC
Base class for all collections of data frames with a predefined schema.
A collection comprises a set of members which are collectively “consistent”, meaning that the collection ensures that invariants hold across members. This is different from dataframely schemas, which only ensure invariants within individual members.
In order to ensure that invariants hold across members, members must have a “common primary key”, i.e. there must be an overlap of at least one primary key column across all members. Consequently, a collection is typically used to represent “semantic objects” which cannot be represented in a single data frame due to 1-N relationships that are managed in separate data frames.
A collection must only have type annotations for LazyFrames with known schema:
class MyCollection(dy.Collection):
    first_member: dy.LazyFrame[MyFirstSchema]
    second_member: dy.LazyFrame[MySecondSchema]
Besides, it may define filters (cf. filter()) and arbitrary methods.
- Note:
The dataframely mypy plugin ensures that the dictionaries passed to class methods contain exactly the required keys.
- Attention:
Do NOT use this class in combination with from __future__ import annotations as it requires the proper schema definitions to ensure that the collection is implemented correctly.
Methods
cast(data, /): Initialize a collection by casting all members into their correct schemas.
collect_all(): Collect all members of the collection.
common_primary_keys(): The primary keys shared by non ignored members of the collection.
create_empty(): Create an empty collection without any data.
filter(data, /, *[, cast]): Filter the member data frames by their schemas and the collection's filters.
ignored_members(): The names of all members of the collection that are ignored in filters.
is_valid(data, /, *[, cast]): Utility method to check whether validate() raises an exception.
join(primary_keys[, how, maintain_order]): Filter the collection by joining onto a data frame containing entries for the common primary key columns whose respective rows should be kept or removed in the collection members.
matches(other): Check whether this collection semantically matches another.
member_schemas(): The schemas of all members of the collection.
members(): Information about the members of the collection.
non_ignored_members(): The names of all members of the collection that are not ignored in filters (default).
optional_members(): The names of all optional members of the collection.
read_delta(source, *[, validation]): Read all collection members from Delta Lake tables.
read_parquet(directory, *[, validation]): Read all collection members from parquet files in a directory.
required_members(): The names of all required members of the collection.
sample([num_rows, overrides, generator]): Create a random sample from the members of this collection.
scan_delta(source, *[, validation]): Lazily read all collection members from Delta Lake tables.
scan_parquet(directory, *[, validation]): Lazily read all collection members from parquet files in a directory.
serialize(): Serialize this collection to a JSON string.
sink_parquet(directory, **kwargs): Stream the members of this collection into parquet files in a directory.
to_dict(): Return a dictionary representation of this collection.
validate(data, /, *[, cast]): Validate that a set of data frames satisfy the collection's invariants.
write_delta(target, **kwargs): Write the members of this collection to Delta Lake tables.
write_parquet(directory, **kwargs): Write the members of this collection to parquet files in a directory.
- classmethod cast(data: Mapping[str, FrameType], /) Self[source]¶
Initialize a collection by casting all members into their correct schemas.
This method calls cast() on every member, thus removing superfluous columns and casting to the correct dtypes for all input data frames.
You should typically use validate() or filter() to obtain instances of the collection, as this method does not guarantee that the returned collection upholds any invariants. Nonetheless, it may be useful in cases where it is known that the provided data adheres to the collection’s invariants.
- Args:
- data: The data for all members. The dictionary must contain exactly one
entry per member with the name of the member as key.
- Returns:
The initialized collection.
- Raises:
- ValueError: If an insufficient set of input data frames is provided, i.e. if
any required member of this collection is missing in the input.
- Attention:
For lazy frames, casting is not performed eagerly. This prevents collecting the lazy frames’ schemas but also means that a call to collect() further down the line might fail because of the cast and/or missing columns.
- collect_all() Self[source]¶
Collect all members of the collection.
This method collects all members in parallel for maximum efficiency. It is particularly useful when filter() is called with lazy frame inputs.
- Returns:
The same collection with all members collected once.
- Note:
As all collection members are required to be lazy frames, the returned collection’s members are still “lazy”. However, they are “shallow-lazy”, meaning they are obtained by calling .collect().lazy().
- classmethod common_primary_keys() list[str][source]¶
The primary keys shared by non ignored members of the collection.
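The notion of a common primary key can be sketched in plain Python as a set intersection over the primary key columns of all non-ignored members. This is only a model of the concept, not dataframely's implementation: the helper `common_primary_keys`, the member names, and the column names are hypothetical, and sorting the result is an assumption made to keep the output deterministic.

```python
from typing import Dict, Iterable, List


def common_primary_keys(
    member_primary_keys: Dict[str, List[str]],
    ignored: Iterable[str] = (),
) -> List[str]:
    """Intersect the primary key columns of all non-ignored members."""
    ignored_set = set(ignored)
    key_sets = [
        set(keys)
        for member, keys in member_primary_keys.items()
        if member not in ignored_set
    ]
    if not key_sets:
        return []
    # Sorting is an assumption of this sketch, used only for determinism.
    return sorted(set.intersection(*key_sets))


# Hypothetical members: one row per invoice, many line items per invoice.
members = {
    "invoices": ["invoice_id"],
    "line_items": ["invoice_id", "item_id"],
    "audit_log": ["event_id"],  # imagine this member is ignored in filters
}
shared = common_primary_keys(members, ignored={"audit_log"})
print(shared)
```

If `audit_log` were not ignored, the intersection would be empty, which corresponds to the situation the class description rules out: members without any overlapping primary key column.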
- classmethod create_empty() Self[source]¶
Create an empty collection without any data.
This method simply calls create_empty on all member schemas, including non-optional ones.
- Returns:
An instance of this collection.
- classmethod filter(data: Mapping[str, FrameType], /, *, cast: bool = False) tuple[Self, dict[str, FailureInfo]][source]¶
Filter the member data frames by their schemas and the collection’s filters.
- Args:
- data: The members of the collection which ought to be filtered. The
dictionary must contain exactly one entry per member with the name of the member as key, except for optional members which may be missing. All data frames passed here will be eagerly collected within the method, regardless of whether they are a DataFrame or LazyFrame.
- cast: Whether columns with a wrong data type in the member data frame are
cast to their schemas’ defined data types if possible.
- Returns:
A tuple of two items:
- An instance of the collection which contains a subset of each of the input data frames with the rows which passed member-wise validation and were not filtered out by any of the collection’s filters. While collection members are always instances of LazyFrame, the members of the returned collection are essentially eager as they are constructed by calling .lazy() on eager data frames. Just like in polars’ native filter(), the order of rows is maintained in all returned data frames.
- A mapping from member name to a FailureInfo object which provides details on why individual rows were removed. Optional members are only included in this dictionary if they were provided in the input.
- Raises:
- ValueError: If an insufficient set of input data frames is provided, i.e. if
any required member of this collection is missing in the input.
- ValidationError: If the columns of any of the input data frames are invalid.
This happens only if a data frame misses a column defined in its schema or a column has an invalid dtype while cast is set to False.
- classmethod ignored_members() set[str][source]¶
The names of all members of the collection that are ignored in filters.
- classmethod is_valid(data: Mapping[str, FrameType], /, *, cast: bool = False) bool[source]¶
Utility method to check whether validate() raises an exception.
- Args:
- data: The members of the collection which ought to be validated. The
dictionary must contain exactly one entry per member with the name of the member as key. The existence of all keys is checked via the dataframely mypy plugin.
- cast: Whether columns with a wrong data type in the member data frame are
cast to their schemas’ defined data types if possible.
- Returns:
Whether the provided members satisfy the invariants of the collection.
- Raises:
- ValueError: If an insufficient set of input data frames is provided, i.e. if
any required member of this collection is missing in the input.
- join(primary_keys: LazyFrame, how: Literal['semi', 'anti'] = 'semi', maintain_order: Literal['none', 'left'] = 'none') Self[source]¶
Filter the collection by joining onto a data frame containing entries for the common primary key columns whose respective rows should be kept or removed in the collection members.
- Args:
- primary_keys: The data frame to join on. Must contain the common primary key
columns of the collection.
- how: The join strategy to use. Like in polars, semi will keep all rows
that can be found in primary_keys, anti will remove them.
maintain_order: The maintain_order option to use for the polars join.
- Returns:
The collection, with members potentially reduced in length.
- Raises:
- ValueError: If the collection contains any member that is annotated with
ignored_in_filters=True.
- Attention:
This method does not validate the resulting collection. Only use it if the resulting collection still satisfies the filters of the collection. The joins are not evaluated eagerly. Therefore, a downstream call to collect() might fail, especially if primary_keys does not contain all columns for all common primary keys.
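The effect of the two join strategies on a single member can be modelled in plain Python. The sketch below only illustrates the semi/anti semantics described above on lists of dictionaries; it does not use polars or dataframely, and the helper `filter_by_keys` as well as the sample rows are hypothetical.

```python
from typing import Dict, List, Sequence

Row = Dict[str, object]


def filter_by_keys(
    rows: List[Row],
    primary_keys: List[Row],
    key_columns: Sequence[str],
    how: str = "semi",
) -> List[Row]:
    """Keep ("semi") or remove ("anti") rows whose key appears in primary_keys."""
    wanted = {tuple(pk[col] for col in key_columns) for pk in primary_keys}

    def has_match(row: Row) -> bool:
        return tuple(row[col] for col in key_columns) in wanted

    if how == "semi":
        return [row for row in rows if has_match(row)]
    if how == "anti":
        return [row for row in rows if not has_match(row)]
    raise ValueError(f"unsupported join strategy: {how!r}")


# Hypothetical member rows and the keys to join onto.
line_items = [
    {"invoice_id": 1, "item_id": 1},
    {"invoice_id": 1, "item_id": 2},
    {"invoice_id": 2, "item_id": 1},
    {"invoice_id": 3, "item_id": 1},
]
keys = [{"invoice_id": 1}, {"invoice_id": 3}]

semi_rows = filter_by_keys(line_items, keys, ["invoice_id"], how="semi")
anti_rows = filter_by_keys(line_items, keys, ["invoice_id"], how="anti")
print(len(semi_rows), len(anti_rows))
```

In this model every member would be filtered with the same key set, which is why the join needs the common primary key columns to be present in primary_keys.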
- classmethod matches(other: type[Collection]) bool[source]¶
Check whether this collection semantically matches another.
- Args:
other: The collection to compare with.
- Returns:
Whether the two collections are semantically equal.
- Attention:
For custom filters, reliable comparison results are only guaranteed if the filter always returns a static polars expression. Otherwise, this function may falsely indicate a match.
- classmethod member_schemas() dict[str, type[Schema]][source]¶
The schemas of all members of the collection.
- classmethod members() dict[str, MemberInfo][source]¶
Information about the members of the collection.
- classmethod non_ignored_members() set[str][source]¶
The names of all members of the collection that are not ignored in filters (default).
- classmethod optional_members() set[str][source]¶
The names of all optional members of the collection.
- classmethod read_delta(source: str | Path | deltalake.DeltaTable, *, validation: Validation = 'warn', **kwargs: Any) Self[source]¶
Read all collection members from Delta Lake tables.
This method reads each member from a Delta Lake table at the provided source location. The source can be a path, URI, or an existing DeltaTable object. Optional members are only read if present.
- Args:
- source: The location or DeltaTable to read from.
- validation: The strategy for running validation when reading the data:
  - "allow": The method tries to read the schema data from the parquet files. If the stored collection schema matches this collection schema, the collection is read without validation. If the stored schema mismatches this schema, no metadata can be found in the parquets, or the files have conflicting metadata, this method automatically runs validate() with cast=True.
  - "warn": The method behaves similarly to "allow". However, it prints a warning if validation is necessary.
  - "forbid": The method never runs validation automatically and only returns if the metadata stores a collection schema that matches this collection.
  - "skip": The method never runs validation and simply reads the data, entrusting the user that the schema is valid. Use this option carefully.
- kwargs: Additional keyword arguments passed directly to polars.read_delta().
- Returns:
The initialized collection.
- Raises:
- ValidationRequiredError: If no collection schema can be read from the source and validation is set to "forbid".
- ValueError: If the provided source does not contain Delta tables for all required members.
- ValidationError: If the collection cannot be validated.
- Attention:
Schema metadata is stored as custom commit metadata. Only the schema information from the last commit is used, so any table modifications that are not through dataframely will result in losing the metadata.
Be aware that appending to an existing table via mode="append" may result in violation of group constraints that dataframely cannot catch without re-validating. Only use appends if you are certain that they do not break your schema.
Be aware that this method suffers from the same limitations as serialize().
- classmethod read_parquet(directory: str | Path, *, validation: Literal['allow', 'forbid', 'warn', 'skip'] = 'warn', **kwargs: Any) Self[source]¶
Read all collection members from parquet files in a directory.
This method searches for files named <member>.parquet in the provided directory for all required and optional members of the collection.
- Args:
- directory: The directory where the Parquet files should be read from.
Parquet files may have been written with Hive partitioning.
- validation: The strategy for running validation when reading the data:
  - "allow": The method tries to read the schema data from the parquet files. If the stored collection schema matches this collection schema, the collection is read without validation. If the stored schema mismatches this schema, no metadata can be found in the parquets, or the files have conflicting metadata, this method automatically runs validate() with cast=True.
  - "warn": The method behaves similarly to "allow". However, it prints a warning if validation is necessary.
  - "forbid": The method never runs validation automatically and only returns if the metadata stores a collection schema that matches this collection.
  - "skip": The method never runs validation and simply reads the data, entrusting the user that the schema is valid. Use this option carefully.
- kwargs: Additional keyword arguments passed directly to polars.read_parquet().
- Returns:
The initialized collection.
- Raises:
- ValidationRequiredError: If no collection schema can be read from the
directory and validation is set to "forbid".
- ValueError: If the provided directory does not contain parquet files for
all required members.
- ValidationError: If the collection cannot be validated.
- Note:
This method is backward compatible with older versions of dataframely in which the schema metadata was saved to schema.json files instead of being encoded into the parquet files.
- Attention:
Be aware that this method suffers from the same limitations as serialize().
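The four validation strategies amount to a small decision table. The following sketch models that table in plain Python, assuming a single boolean that is false whenever the stored schema mismatches, is missing, or is conflicting. The function `plan_read` and the local `ValidationRequiredError` class are hypothetical stand-ins for illustration; they are not part of dataframely.

```python
import warnings


class ValidationRequiredError(Exception):
    """Stand-in for the error of the same name mentioned above."""


def plan_read(validation: str, stored_schema_matches: bool) -> str:
    """Return "read" (no validation) or "validate" (validate with cast=True)."""
    if validation not in {"allow", "warn", "forbid", "skip"}:
        raise ValueError(f"unknown validation strategy: {validation!r}")
    if validation == "skip":
        # Never validate; the caller vouches for the data.
        return "read"
    if stored_schema_matches:
        # The stored collection schema matches: no validation needed.
        return "read"
    if validation == "forbid":
        raise ValidationRequiredError("stored schema is missing or does not match")
    if validation == "warn":
        warnings.warn("validation is required for the stored data")
    return "validate"


print(plan_read("allow", stored_schema_matches=True))
print(plan_read("allow", stored_schema_matches=False))
```

Under this model, "warn" (the default in the signature above) and "allow" differ only in whether a warning is emitted before validation runs.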
- classmethod required_members() set[str][source]¶
The names of all required members of the collection.
- classmethod sample(num_rows: int | None = None, *, overrides: Sequence[Mapping[str, Any]] | None = None, generator: Generator | None = None) Self[source]¶
Create a random sample from the members of this collection.
Just like sampling for schemas, this method should only be used for testing. Contrary to sampling for schemas, the core difficulty when sampling related data frames is that they must share primary keys and individual members may have a different number of rows. For this reason, overrides passed to this function must be “row-oriented” (or “sample-oriented”).
- Args:
- num_rows: The number of rows to sample for each member. If this is set to
None, the number of rows is inferred from the length of the overrides.
- overrides: The overrides to set values in member schemas. The overrides must
be provided as a list of samples. The structure of the samples must be as follows:
{
    "<primary_key_1>": <value>,
    "<primary_key_2>": <value>,
    "<member_with_common_primary_key>": {
        "<column_1>": <value>,
        ...
    },
    "<member_with_superkey_of_primary_key>": [
        {
            "<column_1>": <value>,
            ...
        }
    ],
    ...
}
Any member/value can be left out and will be sampled automatically. Note that overrides for columns of members that are annotated with inline_for_sampling=True can be supplied on the top-level instead of in a nested dictionary.
- generator: The (seeded) generator to use for sampling data. If
None, a generator with random seed is automatically created.
- Returns:
A collection where all members (including optional ones) have been sampled according to the input parameters.
- Attention:
In case the collection has members with a common primary key, the _preprocess_sample method must return distinct primary key values for each sample. The default implementation does this on a best-effort basis but may cause primary key violations. Hence, it is recommended to override this method and ensure that all primary key columns are set.
- Raises:
- ValueError: If the _preprocess_sample() method does not return all common primary key columns for all samples.
- ValidationError: If the sampled members violate any of the collection
filters. If the collection does not have filters, this error is never raised. To prevent validation errors, override the _preprocess_sample() method appropriately.
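To make the “row-oriented” override structure more concrete, the sketch below expands a list of samples into one list of rows per member, attaching the shared primary key to every row. It is a simplified plain-Python model and not how dataframely samples data: the helper `expand_overrides` and all member and column names are hypothetical, and members left out of a sample are simply skipped here, whereas dataframely would sample them automatically.

```python
from typing import Any, Dict, List, Mapping, Sequence


def expand_overrides(
    overrides: Sequence[Mapping[str, Any]],
    primary_key: Sequence[str],
    members: Sequence[str],
) -> Dict[str, List[Dict[str, Any]]]:
    """Turn row-oriented samples into one list of rows per member."""
    per_member: Dict[str, List[Dict[str, Any]]] = {m: [] for m in members}
    for sample in overrides:
        keys = {col: sample[col] for col in primary_key if col in sample}
        for member in members:
            if member not in sample:
                continue  # simplification: nothing is sampled in this sketch
            value = sample[member]
            # A dict describes exactly one row, a list describes 1-N rows.
            rows = value if isinstance(value, list) else [value]
            for row in rows:
                per_member[member].append({**keys, **row})
    return per_member


overrides = [
    {
        "invoice_id": 1,
        "invoices": {"amount": 10.0},
        "line_items": [
            {"item_id": 1, "price": 4.0},
            {"item_id": 2, "price": 6.0},
        ],
    },
    {
        "invoice_id": 2,
        "invoices": {"amount": 5.0},
        "line_items": [{"item_id": 1, "price": 5.0}],
    },
]
expanded = expand_overrides(overrides, ["invoice_id"], ["invoices", "line_items"])
print({member: len(rows) for member, rows in expanded.items()})
```

Two samples yield two rows for the member keyed by the common primary key but three rows for the 1-N member, which is the reason overrides cannot be given column by column.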
- classmethod scan_delta(source: str | Path | deltalake.DeltaTable, *, validation: Validation = 'warn', **kwargs: Any) Self[source]¶
Lazily read all collection members from Delta Lake tables.
This method reads each member from a Delta Lake table at the provided source location. The source can be a path, URI, or an existing DeltaTable object. Optional members are only read if present.
- Args:
- source: The location or DeltaTable to read from.
- validation: The strategy for running validation when reading the data:
  - "allow": The method tries to read the schema data from the parquet files. If the stored collection schema matches this collection schema, the collection is read without validation. If the stored schema mismatches this schema, no metadata can be found in the parquets, or the files have conflicting metadata, this method automatically runs validate() with cast=True.
  - "warn": The method behaves similarly to "allow". However, it prints a warning if validation is necessary.
  - "forbid": The method never runs validation automatically and only returns if the metadata stores a collection schema that matches this collection.
  - "skip": The method never runs validation and simply reads the data, entrusting the user that the schema is valid. Use this option carefully.
- kwargs: Additional keyword arguments passed directly to polars.scan_delta().
- Returns:
The initialized collection.
- Raises:
- ValidationRequiredError: If no collection schema can be read from the source and validation is set to "forbid".
- ValueError: If the provided source does not contain Delta tables for all required members.
- Note:
Due to current limitations in dataframely, this method may read the Delta table into memory if validation is "warn" or "allow" and validation is required.
- Attention:
Schema metadata is stored as custom commit metadata. Only the schema information from the last commit is used, so any table modifications that are not through dataframely will result in losing the metadata.
Be aware that appending to an existing table via mode="append" may result in violation of group constraints that dataframely cannot catch without re-validating. Only use appends if you are certain that they do not break your schema.
Be aware that this method suffers from the same limitations as serialize().
- classmethod scan_parquet(directory: str | Path, *, validation: Literal['allow', 'forbid', 'warn', 'skip'] = 'warn', **kwargs: Any) Self[source]¶
Lazily read all collection members from parquet files in a directory.
This method searches for files named <member>.parquet in the provided directory for all required and optional members of the collection.
- Args:
- directory: The directory where the Parquet files should be read from.
Parquet files may have been written with Hive partitioning.
- validation: The strategy for running validation when reading the data:
  - "allow": The method tries to read the schema data from the parquet files. If the stored collection schema matches this collection schema, the collection is read without validation. If the stored schema mismatches this schema, no metadata can be found in the parquets, or the files have conflicting metadata, this method automatically runs validate() with cast=True.
  - "warn": The method behaves similarly to "allow". However, it prints a warning if validation is necessary.
  - "forbid": The method never runs validation automatically and only returns if the metadata stores a collection schema that matches this collection.
  - "skip": The method never runs validation and simply reads the data, entrusting the user that the schema is valid. Use this option carefully.
- kwargs: Additional keyword arguments passed directly to polars.scan_parquet() for all members.
- Returns:
The initialized collection.
- Raises:
- ValidationRequiredError: If no collection schema can be read from the
directory and validation is set to "forbid".
- ValueError: If the provided directory does not contain parquet files for
all required members.
- Note:
Due to current limitations in dataframely, this method actually reads the parquet file into memory if validation is "warn" or "allow" and validation is required.
- Note:
This method is backward compatible with older versions of dataframely in which the schema metadata was saved to schema.json files instead of being encoded into the parquet files.
- Attention:
Be aware that this method suffers from the same limitations as serialize().
- classmethod serialize() str[source]¶
Serialize this collection to a JSON string.
This method does NOT serialize any data frames, but only the structure of the collection, similar to Schema.serialize().
- Returns:
The serialized collection.
- Note:
Serialization within dataframely itself will remain backwards-compatible at least within a major version. Until further notice, it will also be backwards-compatible across major versions.
- Attention:
Serialization of polars expressions and lazy frames is not guaranteed to be stable across versions of polars. This affects collections with filters or members that define custom rules or columns with custom checks: a collection serialized with one version of polars may not be deserializable with another version of polars.
- Attention:
This functionality is considered unstable. It may be changed at any time without it being considered a breaking change.
- Raises:
- TypeError: If a column of any member contains metadata that is not
JSON-serializable.
- ValueError: If a column of any member is not a “native” dataframely column
type but a custom subclass.
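The TypeError case follows from how JSON encoding works in Python in general. The sketch below uses only the standard library `json` module to show which kind of column metadata can and cannot be encoded. How dataframely encodes metadata internally is not shown in this reference, so the helper `serialize_column_metadata` and the metadata values are purely hypothetical.

```python
import json
from typing import Any, Dict


def serialize_column_metadata(metadata: Dict[str, Any]) -> str:
    """Encode metadata as JSON; raises TypeError for unsupported values."""
    return json.dumps(metadata, sort_keys=True)


# Strings, numbers, booleans, None, lists and dicts are JSON-serializable.
encoded = serialize_column_metadata({"unit": "EUR", "tags": ["billing"]})
print(encoded)

# A set has no JSON representation, so encoding fails with a TypeError.
try:
    serialize_column_metadata({"tags": {"billing"}})
    failed = False
except TypeError:
    failed = True
print(failed)
```

Keeping column metadata restricted to plain JSON types avoids this error when a collection is serialized.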
- sink_parquet(directory: str | Path, **kwargs: Any) None[source]¶
Stream the members of this collection into parquet files in a directory.
This method writes one parquet file per member into the provided directory. Each parquet file is named <member>.parquet. No file is written for optional members which are not provided in the current collection.
- Args:
- directory: The directory where the Parquet files should be written to. If
the directory does not exist, it is created automatically, including all of its parents.
- kwargs: Additional keyword arguments passed directly to polars.sink_parquet() of all members. metadata may only be provided if it is a dictionary.
- Attention:
This method suffers from the same limitations as Schema.serialize().
- classmethod validate(data: Mapping[str, FrameType], /, *, cast: bool = False) Self[source]¶
Validate that a set of data frames satisfy the collection’s invariants.
- Args:
- data: The members of the collection which ought to be validated. The
dictionary must contain exactly one entry per member with the name of the member as key.
- cast: Whether columns with a wrong data type in the member data frame are
cast to their schemas’ defined data types if possible.
- Raises:
- ValueError: If an insufficient set of input data frames is provided, i.e. if
any required member of this collection is missing in the input.
- ValidationError: If any of the input data frames does not satisfy its schema
definition or the filters on this collection result in the removal of at least one row across any of the input data frames.
- Returns:
An instance of the collection. All members of the collection are guaranteed to be valid with respect to their respective schemas and the filters on this collection did not remove rows from any member. The input order of each member is maintained.
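The relationship between filter(), validate() and is_valid() can be modelled on a single list of rows: filtering splits rows into kept and removed ones, validation raises as soon as anything would be removed, and the validity check reports whether validation raises. The sketch below is a plain-Python illustration of that relationship only. The functions, the local `ValidationError` class, and the rule are hypothetical and do not reflect dataframely's implementation.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]
Rule = Callable[[Row], bool]


class ValidationError(Exception):
    """Stand-in for the error of the same name mentioned above."""


def filter_rows(rows: List[Row], rule: Rule) -> Tuple[List[Row], List[Row]]:
    """Split rows into those that pass the rule and those that fail it."""
    kept = [row for row in rows if rule(row)]
    removed = [row for row in rows if not rule(row)]
    return kept, removed


def validate_rows(rows: List[Row], rule: Rule) -> List[Row]:
    """Return the rows unchanged, or raise if filtering would remove any."""
    kept, removed = filter_rows(rows, rule)
    if removed:
        raise ValidationError(f"{len(removed)} row(s) violate the rule")
    return kept


def is_valid_rows(rows: List[Row], rule: Rule) -> bool:
    """Report whether validate_rows() succeeds without raising."""
    try:
        validate_rows(rows, rule)
    except ValidationError:
        return False
    return True


def non_negative(row: Row) -> bool:
    return row["amount"] >= 0


rows = [{"amount": 10.0}, {"amount": -1.0}]
kept, removed = filter_rows(rows, non_negative)
print(len(kept), len(removed), is_valid_rows(rows, non_negative))
```

In this model, data that has already been filtered always validates, since the second pass has nothing left to remove.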
- write_delta(target: str | Path | deltalake.DeltaTable, **kwargs: Any) None[source]¶
Write the members of this collection to Delta Lake tables.
This method writes each member to a Delta Lake table at the provided target location. The target can be a path, URI, or an existing DeltaTable object. No table is written for optional members which are not provided in the current collection.
- Args:
- target: The location or DeltaTable where the data should be written.
If the location does not exist, it is created automatically, including all of its parents.
- kwargs: Additional keyword arguments passed directly to polars.write_delta().
- Attention:
Schema metadata is stored as custom commit metadata. Only the schema information from the last commit is used, so any table modifications that are not through dataframely will result in losing the metadata.
Be aware that appending to an existing table via mode="append" may result in violation of group constraints that dataframely cannot catch without re-validating. Only use appends if you are certain that they do not break your schema.
This method suffers from the same limitations as Schema.serialize().
- write_parquet(directory: str | Path, **kwargs: Any) None[source]¶
Write the members of this collection to parquet files in a directory.
This method writes one parquet file per member into the provided directory. Each parquet file is named <member>.parquet. No file is written for optional members which are not provided in the current collection.
- Args:
- directory: The directory where the Parquet files should be written to. If
the directory does not exist, it is created automatically, including all of its parents.
- kwargs: Additional keyword arguments passed directly to polars.write_parquet() of all members. metadata may only be provided if it is a dictionary.
- Attention:
This method suffers from the same limitations as Schema.serialize().
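The resulting directory layout can be sketched with `pathlib` alone. The helper `member_paths`, the directory name, and the member names below are hypothetical; the sketch only computes the `<member>.parquet` file names described above and does not write any files.

```python
from pathlib import Path
from typing import Dict, Iterable


def member_paths(directory: str, provided_members: Iterable[str]) -> Dict[str, Path]:
    """Map each provided member to its `<member>.parquet` file in directory."""
    base = Path(directory)
    return {member: base / f"{member}.parquet" for member in sorted(provided_members)}


# Hypothetical collection with an optional member "notes" that is not provided,
# so no path is produced for it.
paths = member_paths("out/invoices_2024", ["line_items", "invoices"])
for member, path in paths.items():
    print(member, path.name)
```

The same naming scheme is what read_parquet() and scan_parquet() look for when they search the directory for required and optional members.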
- dataframely.collection.deserialize_collection(data: str) type[Collection][source]¶
Deserialize a collection from a JSON string.
This function makes it possible to load a collection dynamically from its serialization, without knowing in advance which collection to load.
- Args:
data: The JSON string created via Collection.serialize().
- Returns:
The collection loaded from the JSON data.
- Raises:
ValueError: If the schema format version is not supported.
- Attention:
The returned collection cannot be used to create instances of the collection as filters cannot be correctly recovered from the serialized format as of polars 1.31. Thus, you should only use static information from the returned collection.
- Attention:
This functionality is considered unstable. It may be changed at any time without it being considered a breaking change.
- See also:
Collection.serialize() for additional information on serialization.
- dataframely.collection.read_parquet_metadata_collection(source: str | Path | IO[bytes] | bytes) type[Collection] | None[source]¶
Read a dataframely Collection type from the metadata of a parquet file.
- Args:
source: Path to a parquet file or a file-like object that contains the metadata.
- Returns:
The collection that was serialized to the metadata.
None if no collection metadata is found or the deserialization fails.
dataframely.config module¶
- class dataframely.config.Config(**options: Unpack[Options])[source]¶
Bases:
ContextDecorator
An object to track global configuration for operations in dataframely.
Methods
__call__(func): Call self as a function.
Restore the defaults of the configuration.
set_max_sampling_iterations(iterations): Set the maximum number of sampling iterations to use on Schema.sample().
- class dataframely.config.Options[source]¶
Bases:
TypedDict
Methods
clear(/): Remove all items from the dict.
copy(/): Return a shallow copy of the dict.
fromkeys(iterable[, value]): Create a new dictionary with keys from iterable and values set to value.
get(key[, default]): Return the value for key if key is in the dictionary, else default.
items(/): Return a set-like object providing a view on the dict's items.
keys(/): Return a set-like object providing a view on the dict's keys.
pop(key[, default]): If the key is not found, return the default if given; otherwise, raise a KeyError.
popitem(/): Remove and return a (key, value) pair as a 2-tuple.
setdefault(key[, default]): Insert key with a value of default if key is not in the dictionary.
update([E, ]**F): If E is present and has a .keys() method, then does: for k in E.keys(): D[k] = E[k]. If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v. In either case, this is followed by: for k in F: D[k] = F[k].
values(/): Return an object providing a view on the dict's values.
- clear(/)¶
Remove all items from the dict.
- copy(/)¶
Return a shallow copy of the dict.
- classmethod fromkeys(iterable, value=None, /)¶
Create a new dictionary with keys from iterable and values set to value.
- get(key, default=None, /)¶
Return the value for key if key is in the dictionary, else default.
- items(/)¶
Return a set-like object providing a view on the dict’s items.
- keys(/)¶
Return a set-like object providing a view on the dict’s keys.
- max_sampling_iterations: int¶
The maximum number of iterations to use for “fuzzy” sampling.
- pop(key, default=<unrepresentable>, /)¶
If the key is not found, return the default if given; otherwise, raise a KeyError.
- popitem(/)¶
Remove and return a (key, value) pair as a 2-tuple.
Pairs are returned in LIFO (last-in, first-out) order. Raises KeyError if the dict is empty.
- setdefault(key, default=None, /)¶
Insert key with a value of default if key is not in the dictionary.
Return the value for key if key is in the dictionary, else default.
- update([E, ]**F) None. Update D from mapping/iterable E and F.¶
If E is present and has a .keys() method, then does: for k in E.keys(): D[k] = E[k] If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]
- values(/)¶
Return an object providing a view on the dict’s values.
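Since Config derives from ContextDecorator and takes the keys of Options as keyword arguments, it can presumably be used both as a context manager and as a function decorator. The sketch below shows that general pattern with the standard library only. `SketchOptions`, `SketchConfig`, `current_options` and the default value of 100 are all hypothetical; this is not dataframely's implementation, and the real default is not stated in this section.

```python
from contextlib import ContextDecorator
from typing import TypedDict


class SketchOptions(TypedDict, total=False):
    """Hypothetical stand-in for the Options TypedDict."""

    max_sampling_iterations: int


# Hypothetical default value, chosen only for this illustration.
DEFAULTS: SketchOptions = {"max_sampling_iterations": 100}
current_options: SketchOptions = dict(DEFAULTS)


class SketchConfig(ContextDecorator):
    """Temporarily override global options in a `with` block or a function."""

    def __init__(self, **options):
        self._overrides = options
        self._saved = []  # stack of snapshots so nested use restores correctly

    def __enter__(self):
        self._saved.append(dict(current_options))
        current_options.update(self._overrides)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        previous = self._saved.pop()
        current_options.clear()
        current_options.update(previous)
        return False  # never swallow exceptions


# Used as a context manager: the override applies inside the block only.
with SketchConfig(max_sampling_iterations=1):
    inside_block = current_options["max_sampling_iterations"]


# Used as a decorator: the override applies while the function runs.
@SketchConfig(max_sampling_iterations=5)
def read_option():
    return current_options["max_sampling_iterations"]


inside_function = read_option()
after = current_options["max_sampling_iterations"]
print(inside_block, inside_function, after)  # 1 5 100
```

Restoring a snapshot on exit, rather than resetting to the defaults, keeps nested overrides well behaved.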
dataframely.exc module¶
- exception dataframely.exc.AnnotationImplementationError(attr: str, kls: type)[source]¶
Bases:
ImplementationError
Error raised when the annotations of a collection are invalid.
- add_note(object, /)¶
Exception.add_note(note) – add a note to the exception
- args¶
- with_traceback(object, /)¶
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- exception dataframely.exc.DtypeValidationError(errors: dict[str, tuple[DataType | DataTypeClass, DataType | DataTypeClass]])[source]¶
Bases:
ValidationError
Validation error raised when column dtypes are wrong.
- add_note(object, /)¶
Exception.add_note(note) – add a note to the exception
- args¶
- with_traceback(object, /)¶
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- exception dataframely.exc.ImplementationError[source]¶
Bases:
Exception
Error raised when a schema is implemented incorrectly.
- add_note(object, /)¶
Exception.add_note(note) – add a note to the exception
- args¶
- with_traceback(object, /)¶
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- exception dataframely.exc.MemberValidationError(errors: dict[str, ValidationError])[source]¶
Bases:
ValidationError
Validation error raised when multiple members of a collection fail validation.
- add_note(object, /)¶
Exception.add_note(note) – add a note to the exception
- args¶
- with_traceback(object, /)¶
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- exception dataframely.exc.RuleValidationError(errors: dict[str, int])[source]¶
Bases:
ValidationError
Complex validation error raised when rule validation fails.
- add_note(object, /)¶
Exception.add_note(note) – add a note to the exception
- args¶
- with_traceback(object, /)¶
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- exception dataframely.exc.ValidationError(message: str)[source]¶
Bases:
Exception
Error raised when dataframely validation encounters an issue.
- add_note(object, /)¶
Exception.add_note(note) – add a note to the exception
- args¶
- with_traceback(object, /)¶
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- exception dataframely.exc.ValidationRequiredError[source]¶
Bases:
Exception
Error raised when validation is required when reading a parquet file.
- add_note(object, /)¶
Exception.add_note(note) – add a note to the exception
- args¶
- with_traceback(object, /)¶
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
dataframely.failure module¶
- class dataframely.failure.FailureInfo(lf: LazyFrame, rule_columns: list[str], schema: type[S])[source]¶
Bases:
Generic[S]
A container carrying information about rows failing validation in Schema.filter().
Methods
cooccurrence_counts(): The number of validation failures per co-occurring rule validation failure.
counts(): The number of validation failures for each individual rule.
invalid(): The rows of the original data frame containing the invalid rows.
read_delta(source, **kwargs): Read a delta lake table with the failure info.
read_parquet(source, **kwargs): Read a parquet file with the failure info.
scan_delta(source, **kwargs): Lazily read a delta lake table with the failure info.
scan_parquet(source, **kwargs): Lazily read a parquet file with the failure info.
sink_parquet(file, **kwargs): Stream the failure info to a parquet file.
write_delta(target, **kwargs): Write the failure info to a delta lake table.
write_parquet(file, **kwargs): Write the failure info to a parquet file.
- cooccurrence_counts() dict[frozenset[str], int][source]¶
The number of validation failures per co-occurring rule validation failure.
In contrast to counts(), this method provides additional information on whether a rule often fails because of another rule failing.
- Returns:
A mapping from (1) each set of co-occurring rule validation failures to (2) the count of such failures.
- Attention:
This method should primarily be used for debugging as it is much slower than
counts().
- counts() dict[str, int][source]¶
The number of validation failures for each individual rule.
- Returns:
A mapping from rule name to counts. If a rule’s failure count is 0, it is not included here.
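The difference between the two return types, dict[str, int] and dict[frozenset[str], int], is easiest to see on a small example. The sketch below reproduces the two shapes in plain Python, assuming each invalid row is represented by the set of rule names it violated. The rule names, the input list and both helper functions are hypothetical and do not reflect how FailureInfo computes its results.

```python
from collections import Counter

# Hypothetical input: for each invalid row, the names of the rules it failed.
failed_rules_per_row = [
    {"primary_key"},
    {"amount_positive", "currency_known"},
    {"amount_positive", "currency_known"},
    {"amount_positive"},
]


def sketch_counts(rows):
    """Count failures per individual rule (shape of counts())."""
    return dict(Counter(rule for rules in rows for rule in rules))


def sketch_cooccurrence_counts(rows):
    """Count failures per set of rules failing together on the same row."""
    return dict(Counter(frozenset(rules) for rules in rows))


per_rule = sketch_counts(failed_rules_per_row)
per_combination = sketch_cooccurrence_counts(failed_rules_per_row)

print(per_rule["amount_positive"])  # 3
print(per_combination[frozenset({"amount_positive", "currency_known"})])  # 2
print(per_combination[frozenset({"amount_positive"})])  # 1
```

Here the per-rule view reports three failures of `amount_positive`, while the co-occurrence view shows that two of them happened together with `currency_known`.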
- classmethod read_delta(source: str | Path | deltalake.DeltaTable, **kwargs: Any) FailureInfo[Schema][source]¶
Read a delta lake table with the failure info.
- Args:
source: Path or DeltaTable from which to read the data.
kwargs: Additional keyword arguments passed directly to polars.read_delta().
- Returns:
The failure info object.
- Raises:
ValueError: If no appropriate metadata can be found.
- Attention:
Be aware that this method suffers from the same limitations as
Schema.serialize().
- classmethod read_parquet(source: str | Path | IO[bytes], **kwargs: Any) FailureInfo[Schema][source]¶
Read a parquet file with the failure info.
- Args:
source: Path, directory, or file-like object from which to read the data.
kwargs: Additional keyword arguments passed directly to polars.read_parquet().
- Returns:
The failure info object.
- Raises:
ValueError: If no appropriate metadata can be found.
- Attention:
Be aware that this method suffers from the same limitations as
Schema.serialize().
- classmethod scan_delta(source: str | Path | deltalake.DeltaTable, **kwargs: Any) FailureInfo[Schema][source]¶
Lazily read a delta lake table with the failure info.
- Args:
source: Path or DeltaTable from which to read the data.
kwargs: Additional keyword arguments passed directly to polars.scan_delta().
- Returns:
The failure info object.
- Raises:
ValueError: If no appropriate metadata can be found.
- Attention:
Be aware that this method suffers from the same limitations as
Schema.serialize().
- classmethod scan_parquet(source: str | Path | IO[bytes], **kwargs: Any) FailureInfo[Schema][source]¶
Lazily read a parquet file with the failure info.
- Args:
source: Path, directory, or file-like object from which to read the data.
- Returns:
The failure info object.
- Raises:
ValueError: If no appropriate metadata can be found.
- Attention:
Be aware that this method suffers from the same limitations as
Schema.serialize().
- schema: type[S]¶
The schema used to create the input data frame.
- sink_parquet(file: str | Path | IO[bytes] | PartitioningScheme, **kwargs: Any) None[source]¶
Stream the failure info to a parquet file.
- Args:
- file: The file path or writable file-like object to which to write the
parquet file. This should be a path to a directory if writing a partitioned dataset.
- kwargs: Additional keyword arguments passed directly to
polars.sink_parquet(). metadata may only be provided if it is a dictionary.
- Attention:
Be aware that this method suffers from the same limitations as
Schema.serialize().
- write_delta(target: str | Path | deltalake.DeltaTable, **kwargs: Any) None[source]¶
Write the failure info to a delta lake table.
- Args:
target: The file path or DeltaTable to which to write the delta lake data.
kwargs: Additional keyword arguments passed directly to polars.write_delta().
- Attention:
Be aware that this method suffers from the same limitations as
Schema.serialize().
- write_parquet(file: str | Path | IO[bytes], **kwargs: Any) None[source]¶
Write the failure info to a parquet file.
- Args:
- file: The file path or writable file-like object to which to write the
parquet file. This should be a path to a directory if writing a partitioned dataset.
- kwargs: Additional keyword arguments passed directly to
polars.write_parquet(). metadata may only be provided if it is a dictionary.
- Attention:
Be aware that this method suffers from the same limitations as
Schema.serialize().
dataframely.functional module¶
- dataframely.functional.concat_collection_members(collections: Sequence[C], /) dict[str, LazyFrame][source]¶
Concatenate the members of collections with the same type.
- Args:
- collections: The collections whose members to concatenate. Optional members
are concatenated only from the collections that provide them.
- Returns:
A mapping from member names to a lazy concatenation of data frames. All keys are guaranteed to be valid members of the collection.
- dataframely.functional.filter_relationship_one_to_at_least_one(lhs: LazyFrame[S] | LazyFrame, rhs: LazyFrame[T] | LazyFrame, /, on: str | list[str]) LazyFrame[source]¶
Express a 1:{1,N} mapping between data frames for a collection filter.
- Args:
lhs: The data frame with exactly one occurrence for a set of key columns.
rhs: The data frame with at least one occurrence for a set of key columns.
on: The columns to join the data frames on. If not provided, the join columns are inferred from the joint primary keys of the provided data frames.
- dataframely.functional.filter_relationship_one_to_one(lhs: LazyFrame[S] | LazyFrame, rhs: LazyFrame[T] | LazyFrame, /, on: str | list[str]) LazyFrame[source]¶
Express a 1:1 mapping between data frames for a collection filter.
- Args:
lhs: The first data frame in the 1:1 mapping.
rhs: The second data frame in the 1:1 mapping.
on: The columns to join the data frames on. If not provided, the join columns are inferred from the mutual primary keys of the provided data frames.
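To make the two relationship types concrete, the sketch below models only which key values satisfy a 1:1 and a 1:{1,N} mapping, using plain lists of keys. It assumes that "exactly one occurrence" and "at least one occurrence" refer to how often a key value appears on each side. The helper names and sample keys are hypothetical, and the contents of the LazyFrame returned by the library functions are not modelled.

```python
from collections import Counter


def keys_one_to_one(lhs_keys, rhs_keys):
    """Keys that occur exactly once on both sides."""
    lhs_counts, rhs_counts = Counter(lhs_keys), Counter(rhs_keys)
    return {
        key
        for key, count in lhs_counts.items()
        if count == 1 and rhs_counts.get(key, 0) == 1
    }


def keys_one_to_at_least_one(lhs_keys, rhs_keys):
    """Keys that occur exactly once on the left and at least once on the right."""
    lhs_counts, rhs_counts = Counter(lhs_keys), Counter(rhs_keys)
    return {
        key
        for key, count in lhs_counts.items()
        if count == 1 and rhs_counts.get(key, 0) >= 1
    }


# Hypothetical keys: invoice ids (left) and the invoice id of each line item (right).
invoice_ids = [1, 2, 3, 4, 6, 6]
line_item_invoice_ids = [1, 1, 2, 4, 4, 4, 5, 6]

one_to_one = keys_one_to_one(invoice_ids, line_item_invoice_ids)
one_to_many = keys_one_to_at_least_one(invoice_ids, line_item_invoice_ids)

print(sorted(one_to_one))   # [2]
print(sorted(one_to_many))  # [1, 2, 4]
```

Invoice 3 has no line items, line item 5 has no invoice, and invoice 6 is duplicated on the left, so none of them satisfies either relationship in this model.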
dataframely.mypy module¶
- class dataframely.mypy.DataframelyPlugin(options: Options)[source]¶
Bases:
Plugin
- Attributes:
- options
- python_version
Methods
get_additional_deps
get_attribute_hook
get_base_class_hook
get_class_attribute_hook
get_class_decorator_hook
get_class_decorator_hook_2
get_customize_class_mro_hook
get_dynamic_class_hook
get_function_hook
get_function_signature_hook
get_metaclass_hook
get_method_hook
get_method_signature_hook
get_type_analyze_hook
lookup_fully_qualified
report_config_data
set_modules
- get_additional_deps(file)¶
- get_attribute_hook(fullname)¶
- get_class_attribute_hook(fullname)¶
- get_class_decorator_hook(fullname)¶
- get_class_decorator_hook_2(fullname)¶
- get_customize_class_mro_hook(fullname)¶
- get_dynamic_class_hook(fullname)¶
- get_function_hook(fullname)¶
- get_function_signature_hook(fullname)¶
- get_metaclass_hook(fullname)¶
- get_type_analyze_hook(fullname)¶
- lookup_fully_qualified(fullname)¶
- options¶
- python_version¶
- report_config_data(ctx)¶
- set_modules(modules)¶
- dataframely.mypy.alter_collection_filter_return_type(ctx: MethodSigContext) FunctionLike[source]¶
Alter the return type for dy.Collection.filter to a TypedDict for the failure info.
- dataframely.mypy.alter_dataframe_iter_rows_return_type(ctx: MethodContext, schema_registry: dict[str, TypedDictType]) Type[source]¶
Alter the return type for dy.DataFrame.iter_rows to a TypedDict if named=True.
dataframely.random module¶
- class dataframely.random.Generator(seed: int | None = None)[source]¶
Bases:
object
Type that allows sampling primitive types using a random number generator.
All generator methods are called sample_<type> and, if applicable, allow specifying a lower (inclusive) and an upper (exclusive) bound for the type to be sampled.
These methods can be used to sample higher-level types. To this end, users may also directly access the underlying numpy_generator to reuse the generator’s seeding.
Methods
sample_binary([n, null_probability]): Sample a list of binary values in the specified length range.
sample_bool([n, null_probability, p_true]): Sample a list of booleans in the specified range.
sample_choice([n, null_probability, weights]): Sample a list of elements from a list of choices with replacement.
sample_date([n, resolution, null_probability]): Sample a list of dates in the provided range.
sample_datetime([n, resolution, time_zone, ...]): Sample a list of datetimes in the provided range.
sample_duration([n, resolution, ...]): Sample a list of durations in the provided range.
sample_float([n, null_probability, ...]): Sample a list of floating point numbers in the specified range.
sample_int([n, null_probability]): Sample a list of integers in the specified range.
sample_seed(): Sample a single integer that can be used as a seed for other RNGs.
sample_string([n, null_probability]): Sample a list of strings adhering to the provided regex.
sample_time([n, resolution, null_probability]): Sample a list of times in the provided range.
- sample_binary(n: int = 1, *, min_bytes: int, max_bytes: int, null_probability: float = 0.0) Series[source]¶
Sample a list of binary values in the specified length range.
- Args:
n: The number of binary values to sample.
min_bytes: The minimum number of bytes for each value.
max_bytes: The maximum number of bytes for each value.
null_probability: The probability of an element being null.
- Returns:
A series with n elements of dtype Binary.
- sample_bool(n: int = 1, *, null_probability: float = 0.0, p_true: float | None = None) Series[source]¶
Sample a list of booleans in the specified range.
- Args:
n: The number of booleans to sample.
null_probability: The probability of an element being null.
p_true: Sampling probability for True within non-null samples. Default: 0.5 (uniform sampling).
- Returns:
A series with n elements of dtype Boolean.
- sample_choice(n: int = 1, *, choices: Sequence[T], null_probability: float = 0.0, weights: Sequence[float] | None = None) Series[source]¶
Sample a list of elements from a list of choices with replacement.
- Args:
n: The number of elements to sample.
choices: The choices to sample from.
null_probability: The probability of an element being null.
weights: An ordered weight vector for the different choices.
- Returns:
A series with n elements of auto-inferred dtype.
- sample_date(n: int = 1, *, min: date, max: date | None, resolution: str | None = None, null_probability: float = 0.0) Series[source]¶
Sample a list of dates in the provided range.
- Args:
n: The number of dates to sample.
min: The minimum date to sample (inclusive).
max: The maximum date to sample (exclusive). ‘10000-01-01’ when None.
resolution: The resolution that dates in the column must have. This uses the formatting language used by the polars datetime round method.
null_probability: The probability of an element being null.
- Returns:
A series with n elements of dtype Date.
- sample_datetime(n: int = 1, *, min: datetime, max: datetime | None, resolution: str | None = None, time_zone: str | tzinfo | None = None, time_unit: Literal['ns', 'us', 'ms'] = 'us', null_probability: float = 0.0) Series[source]¶
Sample a list of datetimes in the provided range.
- Args:
n: The number of datetimes to sample.
min: The minimum datetime to sample (inclusive).
max: The maximum datetime to sample (exclusive). ‘10000-01-01’ when None.
resolution: The resolution that datetimes in the column must have. This uses the formatting language used by the polars datetime round method.
time_unit: The time unit of the datetime column. Defaults to us (microseconds).
time_zone: The time zone that datetimes in the column must have. The time zone must use a valid IANA time zone name identifier, e.g. Etc/UTC or America/New_York.
null_probability: The probability of an element being null.
- Returns:
A series with n elements of dtype Datetime.
- sample_duration(n: int = 1, *, min: timedelta, max: timedelta, resolution: str | None = None, null_probability: float = 0.0) Series[source]¶
Sample a list of durations in the provided range.
- Args:
n: The number of durations to sample.
min: The minimum duration to sample (inclusive).
max: The maximum duration to sample (exclusive).
resolution: The resolution that durations in the column must have. This uses the formatting language used by the polars datetime round method.
null_probability: The probability of an element being null.
- Returns:
A series with n elements of dtype Duration.
- sample_float(n: int = 1, *, min: float, max: float, null_probability: float = 0.0, nan_probability: float = 0.0, inf_probability: float = 0.0) Series[source]¶
Sample a list of floating point numbers in the specified range.
- Args:
n: The number of floats to sample.
min: The minimum float to sample (inclusive).
max: The maximum float to sample (exclusive).
null_probability: The probability of an element being null.
nan_probability: The probability of an element being nan.
inf_probability: The probability of an element being inf.
- Returns:
A series with n elements of dtype Float64.
- sample_int(n: int = 1, *, min: int, max: int, null_probability: float = 0.0) Series[source]¶
Sample a list of integers in the specified range.
- Args:
n: The number of integers to sample.
min: The minimum integer to sample (inclusive).
max: The maximum integer to sample (exclusive).
null_probability: The probability of an element being null.
- Returns:
A series with n elements of dtype Int64.
- sample_seed() int[source]¶
Sample a single integer that can be used as a seed for other RNGs.
- Returns:
A seed of type uint32.
- sample_string(n: int = 1, *, regex: str, null_probability: float = 0.0) Series[source]¶
Sample a list of strings adhering to the provided regex.
- Args:
n: The number of strings to sample.
regex: The regex that all elements have to adhere to.
null_probability: The probability of an element being null.
- Returns:
A series with n elements of dtype String.
- sample_time(n: int = 1, *, min: time, max: time | None, resolution: str | None = None, null_probability: float = 0.0) Series[source]¶
Sample a list of times in the provided range.
- Args:
n: The number of times to sample.
min: The minimum time to sample (inclusive).
max: The maximum time to sample (exclusive). Midnight when None.
resolution: The resolution that times in the column must have. This uses the formatting language used by the polars datetime round method.
null_probability: The probability of an element being null.
- Returns:
A series with n elements of dtype Time.
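The sampling methods above share two conventions: the lower bound is inclusive and the upper bound exclusive, and each element is independently replaced by null with probability null_probability. The sketch below illustrates these conventions with Python's standard random module. `sample_int_sketch` is a hypothetical helper that returns a plain list; it is not the Generator implementation, which returns a polars Series and is seeded through numpy.

```python
import random


def sample_int_sketch(rng, n=1, *, min, max, null_probability=0.0):
    """Sample n integers from [min, max), each replaced by None with the given probability.

    The parameter names mirror the documented signature and therefore shadow
    the built-ins min and max inside this function.
    """
    values = []
    for _ in range(n):
        if rng.random() < null_probability:
            values.append(None)  # stands in for a null element
        else:
            values.append(rng.randrange(min, max))  # min inclusive, max exclusive
    return values


# The same seed yields the same sample, which keeps tests reproducible.
first = sample_int_sketch(random.Random(42), 1000, min=0, max=5, null_probability=0.2)
second = sample_int_sketch(random.Random(42), 1000, min=0, max=5, null_probability=0.2)
non_null = [value for value in first if value is not None]

print(first == second)
print(all(0 <= value < 5 for value in non_null))
```

Drawing the null decision per element keeps the expected share of nulls at null_probability regardless of n.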
dataframely.schema module¶
- class dataframely.schema.Schema[source]¶
Bases:
BaseSchema, ABC
Base class for all custom data frame schema definitions.
A custom schema should only define its columns via simple assignment:
    class MySchema(Schema):
        a = dataframely.Int64()
        b = dataframely.String()
All definitions using non-datatype classes are ignored.
Schemas can also be subclassed (arbitrarily deeply): in this case, the columns defined in the subclass are simply appended to the columns in the superclass(es).
Methods
cast(): Cast a data frame to match the schema.
The column names of this schema.
columns(): The column definitions of this schema.
Create an empty data or lazy frame from this schema.
Impute None input with an empty, schema-compliant lazy or eager data frame or return the input as lazy or eager frame.
filter(df, /, *[, cast]): Filter the data frame by the rules of this schema.
is_valid(df, /, *[, cast]): Utility method to check whether validate() raises an exception.
matches(other): Check whether this schema semantically matches another schema.
polars_schema(): Obtain the polars schema for this schema.
primary_keys(): The primary key columns in this schema (possibly empty).
pyarrow_schema(): Obtain the pyarrow schema for this schema.
read_delta(source, *[, validation]): Read a Delta Lake table into a typed data frame with this schema.
read_parquet(source, *[, validation]): Read a parquet file into a typed data frame with this schema.
sample([num_rows, overrides, generator]): Create a random data frame with a predefined number of rows.
scan_delta(source, *[, validation]): Lazily read a Delta Lake table into a typed data frame with this schema.
scan_parquet(source, *[, validation]): Lazily read a parquet file into a typed data frame with this schema.
Serialize this schema to a JSON string.
sink_parquet(lf, /, file, **kwargs): Stream a typed lazy frame with this schema to a parquet file.
sql_schema(dialect): Obtain the SQL schema for a particular dialect for this schema.
validate(df, /, *[, cast]): Validate that a data frame satisfies the schema.
write_delta(df, /, target, **kwargs): Write a typed data frame with this schema to a Delta Lake table.
write_parquet(df, /, file, **kwargs): Write a typed data frame with this schema to a parquet file.
- classmethod cast(df: DataFrame, /) DataFrame[Self][source]¶
- classmethod cast(df: LazyFrame, /) LazyFrame[Self]
Cast a data frame to match the schema.
This method removes superfluous columns and casts all schema columns to the correct dtypes. However, it does not introspect the data frame contents.
Hence, this method should be used with care and validate() should generally be preferred. It is advised to only use this method if df is surely known to adhere to the schema.
- Returns:
The input data frame, wrapped in a generic version of the input’s data frame type to reflect schema adherence.
- Note:
If you only require a generic data frame for the type checker, consider using typing.cast() instead of this method.
- Attention:
For lazy frames, casting is not performed eagerly. This prevents collecting the lazy frame’s schema but also means that a call to collect() further down the line might fail because of the cast and/or missing columns.
- classmethod create_empty(*, lazy: Literal[False] = False) DataFrame[Self][source]¶
- classmethod create_empty(*, lazy: Literal[True]) LazyFrame[Self]
- classmethod create_empty(*, lazy: bool) DataFrame[Self] | LazyFrame[Self]
Create an empty data or lazy frame from this schema.
- Args:
- lazy: Whether to create a lazy data frame. If True, returns a lazy frame with this Schema. Otherwise, returns an eager frame.
- Returns:
An instance of polars.DataFrame or polars.LazyFrame with this schema’s defined columns and their data types.
- classmethod create_empty_if_none(df: DataFrame[Self] | None, *, lazy: Literal[False] = False) DataFrame[Self][source]¶
- classmethod create_empty_if_none(df: LazyFrame[Self] | None, *, lazy: Literal[True]) LazyFrame[Self]
- classmethod create_empty_if_none(df: DataFrame[Self] | LazyFrame[Self] | None, *, lazy: bool) DataFrame[Self] | LazyFrame[Self]
Impute None input with an empty, schema-compliant lazy or eager data frame or return the input as lazy or eager frame.
- Args:
- df: The data frame to check for None. If it is not None, it is returned as lazy or eager frame. Otherwise, a schema-compliant data or lazy frame with no rows is returned.
- lazy: Whether to return a lazy data frame. If True, returns a lazy frame with this Schema. Otherwise, returns an eager frame.
- Returns:
The given data frame df as lazy or eager frame, if it is not None. An instance of polars.DataFrame or polars.LazyFrame with this schema’s defined columns and their data types, but no rows, otherwise.
- classmethod filter(df: DataFrame | LazyFrame, /, *, cast: bool = False) tuple[DataFrame[Self], FailureInfo[Self]][source]¶
Filter the data frame by the rules of this schema.
This method can be thought of as a “soft alternative” to validate(). While validate() raises an exception when a row does not adhere to the rules defined in the schema, this method simply filters out these rows and succeeds.
- Args:
- df: The data frame to filter for valid rows. The data frame is collected within this method, regardless of whether a DataFrame or LazyFrame is passed.
- cast: Whether columns with a wrong data type in the input data frame are cast to the schema’s defined data type if possible. Rows for which the cast fails for any column are filtered out.
- Returns:
A tuple of the validated rows in the input data frame (potentially empty) and a simple dataclass carrying information about the rows of the data frame which could not be validated successfully. Just like in polars’ native filter(), the order of rows in the returned data frame is maintained.
- Raises:
- ValidationError: If the columns of the input data frame are invalid. This happens only if the data frame misses a column defined in the schema or a column has an invalid dtype while cast is set to False.
- Note:
This method preserves the ordering of the input data frame.
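The contrast between validate() and filter() described above can be sketched on plain Python rows. The example assumes rules are simple per-row predicates; `filter_rows`, `validate_rows`, the rule name and the sample rows are hypothetical, and the real methods operate on polars data frames and return a FailureInfo object instead of a list.

```python
def filter_rows(rows, rules):
    """Split rows into valid rows and failures, preserving input order."""
    valid = []
    failures = []  # pairs of (row, names of the rules the row failed)
    for row in rows:
        failed = [name for name, check in rules.items() if not check(row)]
        if failed:
            failures.append((row, failed))
        else:
            valid.append(row)
    return valid, failures


def validate_rows(rows, rules):
    """Strict counterpart: raise as soon as any row fails a rule."""
    valid, failures = filter_rows(rows, rules)
    if failures:
        raise ValueError(f"{len(failures)} row(s) failed validation")
    return valid


# Hypothetical rule and data.
rules = {"amount_positive": lambda row: row["amount"] > 0}
rows = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": -5},
    {"id": 3, "amount": 7},
]

valid, failures = filter_rows(rows, rules)
print([row["id"] for row in valid])  # [1, 3]
print(failures)  # [({'id': 2, 'amount': -5}, ['amount_positive'])]
```

The soft variant succeeds and reports what it dropped, while the strict variant refuses the whole input.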
- classmethod is_valid(df: DataFrame | LazyFrame, /, *, cast: bool = False) bool[source]¶
Utility method to check whether validate() raises an exception.
- Args:
- df: The data frame to check for validity.
- allow_extra_columns: Whether to allow the data frame to contain columns that are not defined in the schema.
- cast: Whether columns with a wrong data type in the input data frame are cast to the schema’s defined data type before running validation. If set to False, a wrong data type will result in a return value of False.
- Returns:
Whether the provided dataframe can be validated with this schema.
- classmethod matches(other: type[Schema]) bool[source]¶
Check whether this schema semantically matches another schema.
This method checks whether the schemas have the same columns (with the same data types and constraints) as well as the same rules.
- Args:
other: The schema to compare with.
- Returns:
Whether the schemas are semantically equal.
- classmethod polars_schema() Schema[source]¶
Obtain the polars schema for this schema.
- Returns:
A polars schema that mirrors the schema defined by this class.
- classmethod primary_keys() list[str][source]¶
The primary key columns in this schema (possibly empty).
- classmethod pyarrow_schema() pa.Schema[source]¶
Obtain the pyarrow schema for this schema.
- Returns:
A pyarrow schema that mirrors the schema defined by this class.
- classmethod read_delta(source: str | Path | deltalake.DeltaTable, *, validation: Validation = 'warn', **kwargs: Any) DataFrame[Self][source]¶
Read a Delta Lake table into a typed data frame with this schema.
Compared to polars.read_delta(), this method checks the table’s metadata and runs validation if necessary to ensure that the data matches this schema.
- Args:
- source: Path or DeltaTable object from which to read the data.
- validation: The strategy for running validation when reading the data:
  - "allow": The method tries to read the table’s metadata. If the stored schema matches this schema, the data frame is read without validation. If the stored schema mismatches this schema or no schema information can be found in the metadata, this method automatically runs validate() with cast=True.
  - "warn": The method behaves similarly to "allow". However, it prints a warning if validation is necessary.
  - "forbid": The method never runs validation automatically and only returns if the schema stored in the table’s metadata matches this schema.
  - "skip": The method never runs validation and simply reads the table, entrusting the user that the schema is valid. Use this option carefully and consider replacing it with polars.read_delta() to convey the purpose better.
- kwargs: Additional keyword arguments passed directly to polars.read_delta().
- Returns:
The data frame with this schema.
- Raises:
ValidationRequiredError: If no schema information can be read from the source and validation is set to "forbid".
- Attention:
Schema metadata is stored as custom commit metadata. Only the schema information from the last commit is used, so any table modifications that are not through dataframely will result in losing the metadata.
Be aware that appending to an existing table via mode="append" may violate group constraints in a way that dataframely cannot catch without re-validating. Only use appends if you are certain that they do not break your schema.
This method suffers from the same limitations as
serialize().
- classmethod read_parquet(source: str | Path | IO[bytes] | bytes | list[str] | list[Path] | list[IO[bytes]] | list[bytes], *, validation: Literal['allow', 'forbid', 'warn', 'skip'] = 'warn', **kwargs: Any) DataFrame[Self][source]¶
Read a parquet file into a typed data frame with this schema.
Compared to polars.read_parquet(), this method checks the parquet file’s metadata and runs validation if necessary to ensure that the data matches this schema.
- Args:
- source: Path, directory, or file-like object from which to read the data.
- validation: The strategy for running validation when reading the data:
  - "allow": The method tries to read the parquet file's metadata. If the stored schema matches this schema, the data frame is read without validation. If the stored schema does not match this schema or no schema information can be found in the metadata, this method automatically runs validate() with cast=True.
  - "warn": The method behaves similarly to "allow". However, it prints a warning if validation is necessary.
  - "forbid": The method never runs validation automatically and only returns if the schema stored in the parquet file’s metadata matches this schema.
  - "skip": The method never runs validation and simply reads the parquet file, trusting the user that the schema is valid. _Use this option carefully and consider replacing it with polars.read_parquet() to convey the purpose better_.
- kwargs: Additional keyword arguments passed directly to polars.read_parquet().
- Returns:
The data frame with this schema.
- Raises:
- ValidationRequiredError: If no schema information can be read from the source and validation is set to "forbid".
- Attention:
Be aware that this method suffers from the same limitations as serialize().
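The four validation strategies described above can be summarized as a small decision function. The following is a minimal sketch under stated assumptions, not dataframely's implementation: `plan_read` and `ValidationRequired` are hypothetical names, schemas are compared as plain strings, and "forbid" is assumed to raise for a mismatching schema as well as for a missing one.

```python
import warnings
from typing import Optional


class ValidationRequired(Exception):
    """Hypothetical stand-in for dataframely's ValidationRequiredError."""


def plan_read(strategy: str, stored: Optional[str], expected: str) -> str:
    """Return "read" if the data can be returned as-is, "validate" otherwise.

    `stored` is the serialized schema found in the file metadata (None if
    absent); `expected` is the serialized schema of the reading class.
    """
    if strategy == "skip":
        # Trust the caller: never inspect metadata, never validate.
        return "read"
    if stored is not None and stored == expected:
        # The stored schema matches, so no strategy needs validation.
        return "read"
    if strategy == "forbid":
        # Assumption: a missing and a mismatching schema are both rejected.
        raise ValidationRequired("stored schema is missing or does not match")
    if strategy == "warn":
        warnings.warn("validation is required before returning the data")
    # "allow" and "warn" fall back to running validation.
    return "validate"
```

In this sketch, "allow" and "warn" differ only in whether a warning is emitted before validation runs, which mirrors the description of the two options above.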
- classmethod sample(num_rows: int | None = None, *, overrides: Mapping[str, Iterable[Any]] | Sequence[Mapping[str, Any]] | None = None, generator: Generator | None = None) DataFrame[Self][source]¶
Create a random data frame with a predefined number of rows.
Generally, this method should only be used for testing. If you want to generate _realistic_ test data, you will need to implement your own sampling logic (by making use of the Generator class).
In order to allow for sampling random data frames in the presence of custom rules and primary key constraints, this method performs fuzzy sampling: it samples in a loop until it finds a data frame of length num_rows that adheres to the schema. The maximum number of sampling rounds is configured via max_sampling_iterations in the Config class. If this setting is fixed to 1, it is only possible to reliably sample from schemas without custom rules and without primary key constraints.
- Args:
- num_rows: The (optional) number of rows to sample for creating the random data frame. Must be provided (only) if no overrides are provided. If this is None, the number of rows in the data frame is determined by the length of the values in overrides.
- overrides: Fixed values for a subset of the columns of the sampled data frame. Just like when initializing a polars.DataFrame, overrides may be provided in either "column" or "row" layout, i.e. via a mapping or a list of mappings, respectively. The number of rows in the result data frame is equal to the length of the values in overrides. If both overrides and num_rows are provided, the length of the values in overrides must be equal to num_rows. The order of the items is guaranteed to match the ordering in the returned data frame. When providing values for a column, no sampling is performed for that column.
- generator: The (seeded) generator to use for sampling data. If None, a generator with random seed is automatically created.
- Returns:
A data frame valid under the current schema with a number of rows that matches the length of the values in overrides or num_rows.
- Raises:
- ValueError: If num_rows is not equal to the length of the values in overrides.
- ValueError: If no valid data frame can be found in the configured maximum number of iterations.
- Attention:
Be aware that, due to sampling in a loop, the runtime of this method can be significant for complex schemas. Consider passing a seeded generator and evaluate whether the runtime impact in the tests is bearable. Alternatively, it can be beneficial to provide custom column overrides for columns associated with complex validation rules.
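The fuzzy sampling loop described for sample() can be pictured as rejection sampling with a bounded number of rounds. The sketch below is a simplified illustration, not dataframely's implementation: `fuzzy_sample`, `draw`, and `row_ok` are hypothetical names, rows are reduced to single integers, and a uniqueness check stands in for a primary key constraint.

```python
import random
from typing import Callable, List, Optional


def fuzzy_sample(
    num_rows: int,
    draw: Callable[[random.Random], int],
    row_ok: Callable[[int], bool],
    max_iterations: int = 100,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Sample `num_rows` unique values that all satisfy `row_ok`."""
    rng = rng or random.Random()
    accepted: List[int] = []
    seen = set()
    for _ in range(max_iterations):
        # Draw only as many candidates as are still missing.
        for _ in range(num_rows - len(accepted)):
            value = draw(rng)
            # Reject candidates that break the rule or repeat a key.
            if row_ok(value) and value not in seen:
                seen.add(value)
                accepted.append(value)
        if len(accepted) == num_rows:
            return accepted
    raise ValueError("no valid sample found within the maximum number of iterations")
```

With `max_iterations=1`, this sketch only succeeds if every candidate of the first round is accepted, which illustrates why a single round is only reliable when no rule or key constraint can reject a row.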
- classmethod scan_delta(source: str | Path | deltalake.DeltaTable, *, validation: Validation = 'warn', **kwargs: Any) LazyFrame[Self][source]¶
Lazily read a Delta Lake table into a typed data frame with this schema.
Compared to polars.scan_delta(), this method checks the table’s metadata and runs validation if necessary to ensure that the data matches this schema.
- Args:
- source: Path or DeltaTable object from which to read the data.
- validation: The strategy for running validation when reading the data:
  - "allow": The method tries to read the table's metadata. If the stored schema matches this schema, the data frame is read without validation. If the stored schema does not match this schema or no schema information can be found in the metadata, this method automatically runs validate() with cast=True.
  - "warn": The method behaves similarly to "allow". However, it prints a warning if validation is necessary.
  - "forbid": The method never runs validation automatically and only returns if the schema stored in the table’s metadata matches this schema.
  - "skip": The method never runs validation and simply reads the table, trusting the user that the schema is valid. _Use this option carefully and consider replacing it with polars.scan_delta() to convey the purpose better_.
- kwargs: Additional keyword arguments passed directly to polars.scan_delta().
- Returns:
The lazy data frame with this schema.
- Raises:
ValidationRequiredError: If no schema information can be read from the source and validation is set to "forbid".
- Attention:
Schema metadata is stored as custom commit metadata. Only the schema information from the last commit is used, so any table modifications that are not through dataframely will result in losing the metadata.
Be aware that appending to an existing table via mode="append" may result in violation of group constraints that dataframely cannot catch without re-validating. Only use appends if you are certain that they do not break your schema.
This method suffers from the same limitations as serialize().
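The note about commit metadata can be illustrated with a toy model of a table history. This is only a sketch of the "last commit wins" behavior described above: the list-of-dicts history, the `SCHEMA_KEY` name, and `schema_from_history` are hypothetical and do not reflect how Delta Lake or dataframely store commit metadata.

```python
from typing import Dict, List, Optional

# Hypothetical metadata key; the key actually used is not stated here.
SCHEMA_KEY = "schema"


def schema_from_history(commits: List[Dict[str, str]]) -> Optional[str]:
    """Return the schema stored in the most recent commit, if any."""
    if not commits:
        return None
    # Only the last commit is consulted; earlier commits are ignored.
    return commits[-1].get(SCHEMA_KEY)


# A commit written together with a serialized schema.
history = [{SCHEMA_KEY: "serialized-schema"}]
before = schema_from_history(history)

# A later commit from another tool carries no schema metadata.
history.append({})
after = schema_from_history(history)
```

In this model, `before` holds the serialized schema while `after` is `None`, so a subsequent read would have to fall back to validation or fail, depending on the chosen strategy.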
- classmethod scan_parquet(source: str | Path | IO[bytes] | bytes | list[str] | list[Path] | list[IO[bytes]] | list[bytes], *, validation: Literal['allow', 'forbid', 'warn', 'skip'] = 'warn', **kwargs: Any) LazyFrame[Self][source]¶
Lazily read a parquet file into a typed data frame with this schema.
Compared to polars.scan_parquet(), this method checks the parquet file’s metadata and runs validation if necessary to ensure that the data matches this schema.
- Args:
- source: Path, directory, or file-like object from which to read the data.
- validation: The strategy for running validation when reading the data:
  - "allow": The method tries to read the parquet file's metadata. If the stored schema matches this schema, the data frame is read without validation. If the stored schema does not match this schema or no schema information can be found in the metadata, this method automatically runs validate() with cast=True.
  - "warn": The method behaves similarly to "allow". However, it prints a warning if validation is necessary.
  - "forbid": The method never runs validation automatically and only returns if the schema stored in the parquet file’s metadata matches this schema.
  - "skip": The method never runs validation and simply reads the parquet file, trusting the user that the schema is valid. _Use this option carefully and consider replacing it with polars.scan_parquet() to convey the purpose better_.
- kwargs: Additional keyword arguments passed directly to polars.scan_parquet().
- Returns:
The lazy data frame with this schema.
- Raises:
- ValidationRequiredError: If no schema information can be read from the source and validation is set to "forbid".
- Note:
Due to current limitations in dataframely, this method actually reads the parquet file into memory if validation is "warn" or "allow" and validation is required.
- Attention:
Be aware that this method suffers from the same limitations as serialize().
- classmethod serialize() str[source]¶
Serialize this schema to a JSON string.
- Returns:
The serialized schema.
- Note:
Serialization within dataframely itself will remain backwards-compatible at least within a major version. Until further notice, it will also be backwards-compatible across major versions.
- Attention:
Serialization of polars expressions is not guaranteed to be stable across versions of polars. This affects schemas that define custom rules or columns with custom checks: a schema serialized with one version of polars may not be deserializable with another version of polars.
- Attention:
This functionality is considered unstable. It may be changed at any time without it being considered a breaking change.
- Raises:
- TypeError: If any column contains metadata that is not JSON-serializable.
- ValueError: If any column is not a “native” dataframely column type but a custom subclass.
- classmethod sink_parquet(lf: LazyFrame[Self], /, file: str | Path | IO[bytes] | PartitioningScheme, **kwargs: Any) None[source]¶
Stream a typed lazy frame with this schema to a parquet file.
This method automatically adds a serialization of this schema to the parquet file as metadata. This metadata can be leveraged by read_parquet() and scan_parquet() for more efficient reading, or by external tools.
- Args:
- lf: The lazy frame to write to the parquet file.
- file: The file path, writable file-like object, or partitioning scheme to which to write the parquet file.
- kwargs: Additional keyword arguments passed directly to polars.write_parquet(). metadata may only be provided if it is a dictionary.
- Attention:
Be aware that this method suffers from the same limitations as
serialize().
- classmethod sql_schema(dialect: sa.Dialect) list[sa.Column][source]¶
Obtain the SQL schema for a particular dialect for this schema.
- Args:
- dialect: The dialect for which to obtain the SQL schema. Note that column
datatypes may differ across dialects.
- Returns:
A list of sqlalchemy columns that can be used to create a table with the schema as defined by this class.
- classmethod validate(df: DataFrame | LazyFrame, /, *, cast: bool = False) DataFrame[Self][source]¶
Validate that a data frame satisfies the schema.
- Args:
- df: The data frame to validate.
- cast: Whether columns with a wrong data type in the input data frame are cast to the schema’s defined data type if possible.
- Returns:
The (collected) input data frame, wrapped in a generic version of the input’s data frame type to reflect schema adherence. The data frame is guaranteed to maintain its order.
- Raises:
- ValidationError: If the input data frame does not satisfy the schema
definition.
- Note:
This method _always_ collects the input data frame in order to raise potential validation errors.
- classmethod write_delta(df: DataFrame[Self], /, target: str | Path | deltalake.DeltaTable, **kwargs: Any) None[source]¶
Write a typed data frame with this schema to a Delta Lake table.
This method automatically adds a serialization of this schema to the Delta Lake table as metadata. The metadata can be leveraged by read_delta() and scan_delta() for efficient reading or by external tools.
- Args:
- df: The data frame to write to the Delta Lake table.
- target: The path or DeltaTable object to which to write the data.
- kwargs: Additional keyword arguments passed directly to polars.write_delta().
- Attention:
This method suffers from the same limitations as serialize().
Schema metadata is stored as custom commit metadata. Only the schema information from the last commit is used, so any table modifications that are not through dataframely will result in losing the metadata.
Be aware that appending to an existing table via mode="append" may result in violation of group constraints that dataframely cannot catch without re-validating. Only use appends if you are certain that they do not break your schema.
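The caveat about appends can be made concrete with a constraint that spans several rows. The sketch below uses plain dictionaries and a uniqueness check on a key as an assumed example of such a constraint; `duplicated_keys` is a hypothetical helper and not part of dataframely.

```python
from collections import Counter
from typing import Dict, List


def duplicated_keys(rows: List[Dict[str, int]], key: str) -> List[int]:
    """Return the sorted key values that occur more than once in `rows`."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, count in counts.items() if count > 1)


# Rows already stored in the table and rows about to be appended.
existing = [{"id": 1}, {"id": 2}]
appended = [{"id": 2}, {"id": 3}]

# Each batch is free of duplicates when checked in isolation.
violations_existing = duplicated_keys(existing, "id")
violations_appended = duplicated_keys(appended, "id")

# Only a check over the combined rows reveals the shared key.
violations_combined = duplicated_keys(existing + appended, "id")
```

Both batches pass on their own, yet the combined rows contain the key 2 twice. A check that only sees the appended batch cannot detect this, which is why the table has to be re-validated as a whole after appending.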
- classmethod write_parquet(df: DataFrame[Self], /, file: str | Path | IO[bytes], **kwargs: Any) None[source]¶
Write a typed data frame with this schema to a parquet file.
This method automatically adds a serialization of this schema to the parquet file as metadata. This metadata can be leveraged by read_parquet() and scan_parquet() for more efficient reading, or by external tools.
- Args:
- df: The data frame to write to the parquet file.
- file: The file path or writable file-like object to which to write the parquet file. This should be a path to a directory if writing a partitioned dataset.
- kwargs: Additional keyword arguments passed directly to polars.write_parquet(). metadata may only be provided if it is a dictionary.
- Attention:
Be aware that this method suffers from the same limitations as
serialize().
- dataframely.schema.deserialize_schema(data: str, strict: Literal[True] = True) type[Schema][source]¶
- dataframely.schema.deserialize_schema(data: str, strict: Literal[False]) type[Schema] | None
Deserialize a schema from a JSON string.
This function makes it possible to dynamically load a schema from its serialization, without having to know the schema to load in advance.
- Args:
- data: The JSON string created via Schema.serialize().
- strict: Whether to raise an exception if the schema cannot be deserialized.
- Returns:
The schema loaded from the JSON data.
- Raises:
- ValueError: If the schema format version is not supported and strict=True.
- Attention:
This functionality is considered unstable. It may be changed at any time without it being considered a breaking change.
- See also:
Schema.serialize() for additional information on serialization.
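The two overloads above differ only in how failures are reported: with strict=True an exception is raised, with strict=False the result is None. The sketch below shows this contract on a made-up JSON layout. The `version` field, `SUPPORTED_VERSIONS`, and `load_definition` are hypothetical and do not describe dataframely's serialization format.

```python
import json
from typing import Optional

# Hypothetical set of format versions this sketch understands.
SUPPORTED_VERSIONS = {"1"}


def load_definition(data: str, strict: bool = True) -> Optional[dict]:
    """Parse a serialized definition, raising or returning None on failure."""
    try:
        payload = json.loads(data)  # raises a ValueError subclass on bad JSON
        if not isinstance(payload, dict):
            raise ValueError("expected a JSON object")
        version = payload.get("version")
        if version not in SUPPORTED_VERSIONS:
            raise ValueError(f"unsupported format version: {version!r}")
        return payload
    except ValueError:
        if strict:
            # Strict mode surfaces the problem to the caller.
            raise
        # Lenient mode signals failure through the return value.
        return None
```

Callers that use the lenient mode need to handle the `None` case explicitly, which is what the `type[Schema] | None` return annotation of the second overload expresses.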
- dataframely.schema.read_parquet_metadata_schema(source: str | Path | IO[bytes] | bytes) type[Schema] | None[source]¶
Read a dataframely schema from the metadata of a parquet file.
- Args:
source: Path to a parquet file or a file-like object that contains the metadata.
- Returns:
The schema that was serialized to the metadata, or None if no schema metadata is found or the deserialization fails.