Generating SQL schema definitions#

It is often useful to store data in a SQL-based database server. dataframely aims to make this easy by providing a simple mechanism for translating your dataframely schemas to SQL table definitions.

There are many different flavors of SQL syntax. To avoid reinventing the wheel, we use sqlalchemy as an abstraction layer between python and SQL.

Individual tables#

The main functionality dataframely offers is that it converts your dy.Schema to a collection of sqlalchemy.Column:

import dataframely as dy
import sqlalchemy as sa


class MySchema(dy.Schema):
    x = dy.Int64(primary_key=True)
    y = dy.String(nullable=False)


engine = sa.create_engine(...)
columns: list[sa.Column] = MySchema.to_sqlalchemy_columns(engine.dialect)

You can then do with the columns what you please. Most likely, you want to create a table with them:

my_table = sa.Table("myTable", sa.MetaData(), *columns)
my_table.create(engine)

You can also inspect the SQL code that sqlalchemy would execute:

from sqlalchemy.schema import CreateTable

print(CreateTable(my_table).compile())

In the example case, this renders to:

CREATE TABLE "myTable"
(
    x BIGINT  NOT NULL,
    y VARCHAR NOT NULL,
    PRIMARY KEY (x)
)

Uploading data can then be handled by polars.DataFrame.write_database():

df: dy.DataFrame[MySchema]

df.write_database(
    connection=engine,
    table_name=my_table.name,
    if_table_exists="append"
)

Note

Why do you need to pass in the SQL dialect? Even though sqlalchemy handles most dialect dependencies, we sometimes still need to intervene. For example, when using Microsoft SQL Server, sqlalchemy will render the sqlalchemy.Date type into a raw SQL DATETIME, while we think that DATE would be more appropriate.

Note

Implementation: The choice of sqlalchemy type is implemented in sqlalchemy_dtype(), which is overwritten by each of the subtypes of Column. For example, the implementation for Date is sqlalchemy_dtype().

Note

Constraints: The nullability and primary key constraints you define in dataframely are translated to SQL. Custom filters and rules are not.

Note

Length of string columns: For string columns, dataframely will attempt to pass information about the maximal length into the SQL definition. This is trivial if max_length is set. Otherwise, if a regex is provided, the maximal length of the string is inferred from the regular expression if possible. Note that having inferable maximal lengths can be particularly important for primary key columns. Some database systems, such as Microsoft SQL Server, do not allow VARCHAR(max) columns (unbounded strings) to be used as primary keys.

Native SQL enums#

By default, Enum maps to sa.CHAR or sa.String columns so stored values remain plain strings. You may set sqlalchemy_use_enum=True to instead generate native enums:

from enum import Enum, auto

import sqlalchemy as sa
import dataframely as dy
from sqlalchemy.dialects.postgresql import dialect as pg_dialect
from sqlalchemy.dialects.mssql import dialect as mssql_dialect


class Status(str, Enum):
    PENDING = auto()
    APPROVED = auto()


class Staged(dy.Schema):
    status = dy.Enum(Status, sqlalchemy_use_enum=True)

This will translate the ~dataframely.Enum to a ~sqlalchemy.Enum:

>>> Staged.to_sqlalchemy_columns(pg_dialect())
[Column('status', Enum('1', '2', name='status'), table=None, nullable=False)]

Depending on the database dialect you use, sqlalchemy will render this accordingly. For example, postgresql supports native enums, and sqlalchemy will create a native enum column, while in MSSQL, where this is not supported, it will fall back to VARCHAR.

When categories is a Python enum.Enum subclass, sqlalchemy uses the enum class name (lowercased) as the database enum type name. For string category lists, the SQL column name is used by default; override it with sqlalchemy_enum_name if needed.

Collections of multiple tables#

If you have an entire dy.Collection, it’s also easy to generate one table for each member table of the collection. sqlalchemy.MetaData is a commonly used container in such scenarios:

MyCollection: dy.Collection
meta = sa.MetaData()
for name, dy_schema in MyCollection.member_schemas().items():
    sa.Table(
        name,
        meta,
        *dy_schema.to_sqlalchemy_columns(dialect=engine.dialect),
    )
meta.create_all()