Learn how to go beyond simple schema changes and manage real data transformations with Alembic migrations. This post covers practical techniques for updating enums, normalizing legacy data, and handling complex data updates.
In most backend projects, database migrations are treated as purely structural: add a column here, rename a table there, drop an index, and so on. These are the kinds of changes a migration tool like Alembic (for SQLAlchemy) or Django's migration system handles really well. But what happens when changing the schema alone isn't enough?
What if the data itself needs to change along with the schema?
What if you're adding a non-nullable column but need to backfill historical values?
Or updating enum types?
Or normalizing denormalized legacy fields?
This is where data-backed migrations come in.
In this post, we’ll cover:

- What a data-backed migration is and when you need one
- Backfilling a new column inside an Alembic migration
- Safely updating PostgreSQL enum values with a reusable helper
- Refactoring JSON fields and normalizing legacy data
- Best practices: batching and idempotency

Let’s get into it.
A data-backed migration is a database migration that modifies existing data as part of a structural schema change. Most schema changes, like an `ALTER TABLE`, don’t touch any rows; they just change how the table is defined. But sometimes, after you make a schema change, your existing data no longer fits the new structure, or you want to prepare it for the new structure. That’s where you write custom logic to update the data inside your migration script, usually using SQL or your ORM.
Here are some common use cases:
| Use Case | Example |
|---|---|
| New column with historical values | Add `is_active` column and set it to true for all current users |
| Enum updates | Rename enum values in existing rows |
| JSON refactoring | Change nested keys in JSON fields |
| Data normalization | Move comma-separated values into a new association table |
| Soft delete | Convert a `deleted_at IS NULL` pattern into a proper `is_deleted` boolean |
You might be thinking: “Can’t I just run a one-time Python or SQL script to fix the data?” Yes, but that means you're bypassing your migration history. The main advantage of doing it inside a migration is versioning and repeatability. Everyone on your team and your CI/CD pipeline can apply the same transformation consistently. Plus, one-off scripts often get lost. Migrations don’t.
Let’s say we want to add a new column called `is_premium` to a `users` table and set it to `True` for any user with more than 10 orders.
In an Alembic migration:
```python
op.add_column('users', sa.Column('is_premium', sa.Boolean(), nullable=True))
```
If we stop here, the new column will exist, but it’ll be empty (`NULL`) for all existing users.
Here’s how we update the data inside the same migration:
```python
from alembic import op
import sqlalchemy as sa


def upgrade():
    # Step 1: Add the column (nullable for now)
    op.add_column('users', sa.Column('is_premium', sa.Boolean(), nullable=True))

    # Step 2: Fill in existing data
    op.execute("""
        UPDATE users
        SET is_premium = TRUE
        WHERE id IN (
            SELECT user_id FROM orders
            GROUP BY user_id
            HAVING COUNT(*) > 10
        )
    """)

    # Step 3: Make it NOT NULL (optional).
    # Backfill the remaining rows first, otherwise SET NOT NULL fails
    # on the rows that are still NULL.
    op.execute("UPDATE users SET is_premium = FALSE WHERE is_premium IS NULL")
    op.alter_column('users', 'is_premium', nullable=False, server_default='FALSE')
```
Now when someone runs this migration, the schema and the data change together. Future developers won’t be confused. Everyone stays in sync.
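For completeness, the matching `downgrade()` is trivial here; a minimal sketch, keeping in mind that dropping the column discards the backfilled values:

```python
def downgrade():
    # Dropping the column also removes the backfilled flags
    op.drop_column('users', 'is_premium')
```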
Dealing with PostgreSQL enums during migrations is often a headache; I'm speaking from experience here. Unlike standard columns, enums are part of the database type system, which means:

- You can't drop a value from an existing enum type.
- Renaming a value requires `ALTER TYPE ... RENAME VALUE`, which only exists on PostgreSQL 10+.
- Adding values with `ALTER TYPE ... ADD VALUE` has transaction-block restrictions on older PostgreSQL versions.

Because of this, it's important to handle enum migrations safely and repeatably.
Let’s say we want to change the values of an existing PostgreSQL enum field called `status`, used in a `tasks` table, like below.
Original Enum: `'todo'`, `'in_progress'`, `'done'`

New Enum: `'pending'`, `'active'`, `'complete'`

Mapping:

- `'todo'` → `'pending'`
- `'in_progress'` → `'active'`
- `'done'` → `'complete'`
The Hard Way (Manually): You can do everything by hand, but this becomes error-prone if you do it often or across multiple tables and columns. On PostgreSQL 10+, renaming a single value is a one-liner:
```sql
ALTER TYPE status RENAME VALUE 'todo' TO 'pending';
```
And if that fails on Postgres < 10, you might need to:

- Create a temporary enum type that contains both the old and the new values
- Alter the column to use the temporary type
- `UPDATE` the rows, mapping old values to new ones
- Drop the old type, create the new one, and switch the column over
- Drop the temporary type

Yes, it’s messy, but that's real life. And you should keep it versioned in your migrations.
So instead...
Reusable `upgrade_enum` Helper: We can abstract all of these steps into a reusable function like the one below.
```python
import sqlalchemy as sa


def upgrade_enum(
    op,
    table: str,
    column: str,
    enum_name: str,
    old_options: list[str],
    new_options: list[str],
    old_to_new_mapping: dict[str, str] | None = None,
):
    tmp_name = f"tmp_{enum_name}"
    tmp_options = set([*old_options, *new_options])

    old_type = sa.Enum(*old_options, name=enum_name)
    new_type = sa.Enum(*new_options, name=enum_name)
    tmp_type = sa.Enum(*tmp_options, name=tmp_name)

    # Step 1: Create a temporary enum type with all options
    tmp_type.create(op.get_bind(), checkfirst=False)

    # Step 2: Alter the column to use the temporary enum
    op.execute(
        f"ALTER TABLE {table} ALTER COLUMN {column} TYPE {tmp_name} USING {column}::text::{tmp_name}"
    )

    # Step 3: Optionally convert old values to new values
    if old_to_new_mapping:
        for old, new in old_to_new_mapping.items():
            op.execute(f"UPDATE {table} SET {column}='{new}' WHERE {column}='{old}'")

    # Step 4: Drop the old enum type and create the new one
    old_type.drop(op.get_bind(), checkfirst=False)
    new_type.create(op.get_bind(), checkfirst=False)

    # Step 5: Migrate from the temporary enum to the new enum
    op.execute(
        f"ALTER TABLE {table} ALTER COLUMN {column} TYPE {enum_name} USING {column}::text::{enum_name}"
    )

    # Step 6: Drop the temporary enum type
    tmp_type.drop(op.get_bind(), checkfirst=False)
```
Don’t forget: you also need a matching `downgrade_enum` function.
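A minimal sketch of what that could look like, assuming you simply reuse `upgrade_enum` with the old and new options swapped and the value mapping inverted:

```python
def downgrade_enum(
    op,
    table: str,
    column: str,
    enum_name: str,
    old_options: list[str],
    new_options: list[str],
    old_to_new_mapping: dict[str, str] | None = None,
):
    # Going back down is the same dance in reverse:
    # swap old/new options and invert the value mapping
    upgrade_enum(
        op=op,
        table=table,
        column=column,
        enum_name=enum_name,
        old_options=new_options,
        new_options=old_options,
        old_to_new_mapping=(
            {new: old for old, new in old_to_new_mapping.items()}
            if old_to_new_mapping
            else None
        ),
    )
```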
Now we can change our enums easily!
```python
from alembic import op
import sqlalchemy as sa

from myapp.utils import upgrade_enum


def upgrade():
    upgrade_enum(
        op=op,
        table="tasks",
        column="status",
        enum_name="status",
        old_options=["todo", "in_progress", "done"],
        new_options=["pending", "active", "complete"],
        old_to_new_mapping={
            "todo": "pending",
            "in_progress": "active",
            "done": "complete",
        },
    )
```
Suppose you have a `settings` JSON field in the `users` table like this:
```json
{
  "email_notifications": true,
  "dark_mode": false
}
```
And now you want to split these into dedicated columns: `email_notifications` and `dark_mode`.
In your migration:
```python
op.add_column('users', sa.Column('email_notifications', sa.Boolean()))
op.add_column('users', sa.Column('dark_mode', sa.Boolean()))

op.execute("""
    UPDATE users
    SET email_notifications = (settings->>'email_notifications')::boolean,
        dark_mode = (settings->>'dark_mode')::boolean
""")
```
You just extracted structured data from semi-structured JSON and migrated it into proper schema. That’s exactly what data-backed migrations are for.
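If you ever need to reverse it, the downgrade can fold the columns back into the JSON field. A minimal sketch, assuming `settings` is stored as `jsonb` (the `||` merge operator won't work on plain `json`):

```python
def downgrade():
    # Write the dedicated columns back into the jsonb settings field
    op.execute("""
        UPDATE users
        SET settings = settings || jsonb_build_object(
            'email_notifications', email_notifications,
            'dark_mode', dark_mode
        )
    """)
    op.drop_column('users', 'dark_mode')
    op.drop_column('users', 'email_notifications')
```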
Let’s say you used to store tags as a comma-separated string:
```python
blog_posts.tags = "python,backend,orm"
```
Now you want to normalize them into a `post_tags` association table.
You’d:

- Create a `tags` table and a `post_tags` association table
- Backfill the association table from the existing comma-separated strings
- Drop the old `tags` column once the data has moved

That backfill logic needs custom Python (not just SQL). You might write a helper in your migration that:

- Reads each row's `tags` string
- Splits it on commas and strips whitespace
- Inserts rows into `post_tags` with proper FK references

Yes, it’s slower than raw SQL, but it ensures consistency.
Never scan or rewrite millions of rows in a single transaction during a migration. You risk:

- Long-held locks that block application queries
- One giant transaction that can time out or roll back right at the end
- Significant table and WAL bloat from rewriting many rows at once

Use batching if needed, e.g. loop over ~10k rows per commit.
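A minimal batching sketch, reusing the `users.is_premium` backfill from earlier. Note that Alembic typically wraps the whole migration in one transaction, so for truly huge tables you may want to run a loop like this outside the migration transaction.

```python
from alembic import op
import sqlalchemy as sa


def backfill_in_batches(batch_size: int = 10_000):
    conn = op.get_bind()

    while True:
        # Each pass updates one batch of not-yet-processed rows;
        # the WHERE clause makes it pick up where the last pass stopped
        result = conn.execute(
            sa.text("""
                UPDATE users
                SET is_premium = FALSE
                WHERE id IN (
                    SELECT id FROM users
                    WHERE is_premium IS NULL
                    LIMIT :batch_size
                )
            """),
            {"batch_size": batch_size},
        )
        # Stop once an entire pass touches no rows
        if result.rowcount == 0:
            break
```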
If your migration is re-run, will it break things? If yes, that’s risky.
For example:
```sql
UPDATE users SET role = 'admin' WHERE role = 'superuser';
```
What happens if you run this again? Nothing bad: the second run finds no matching rows, because the condition limits the update to rows that still need it.
Patterns that help keep data migrations idempotent:

- Guarded, self-limiting `WHERE` clauses
- `INSERT ... ON CONFLICT DO NOTHING` for backfill inserts
Data-backed migrations are the real workhorses of a mature backend system.
They're:

- Versioned alongside your schema changes
- Repeatable across every environment and CI/CD pipeline
- Visible to the whole team instead of hiding in one-off scripts
They require more thought and care, but they give you consistency, traceability, and confidence in return. So the next time you're adding a column, updating enums, or reworking relationships, ask yourself: "Does the data need to evolve, too?" If yes, it’s time for a data-backed migration.