Learn how to go beyond simple schema changes and manage real data transformations with Alembic migrations. This post covers practical techniques for updating enums, normalizing legacy data, and handling complex data updates.
In most backend projects, database migrations are treated as purely structural: add a column here, rename a table there, drop an index, and so on. These are the kinds of changes a migration tool like Alembic (for SQLAlchemy) or Django's migration system handles really well. But what happens when changing the schema alone isn't enough?
What if the data itself needs to change along with the schema?
What if you're adding a non-nullable column but need to backfill historical values?
Or updating enum types?
Or normalizing denormalized legacy fields?
This is where data-backed migrations come in.
In this post, we’ll cover:

- What a data-backed migration is and when you need one
- Backfilling a new column inside an Alembic migration
- Safely updating PostgreSQL enum values with a reusable helper
- Refactoring JSON fields and normalizing legacy data
- Best practices: batching and idempotency

Let’s get into it.
A data-backed migration is a database migration that modifies existing data as part of a structural schema change. Most schema changes, like an `ALTER TABLE`, don’t touch any rows; they just change how the table is defined. But sometimes, after you make a schema change, your existing data no longer fits the new structure, or you want to prepare it for the new structure. That’s where you write custom logic to update the data inside your migration script, usually using SQL or your ORM.
Here are some common use cases:
| Use Case | Example |
|---|---|
| New column with historical values | Add `is_active` column and set it to true for all current users |
| Enum updates | Rename enum values in existing rows |
| JSON refactoring | Change nested keys in JSON fields |
| Data normalization | Move comma-separated values into a new association table |
| Soft delete | Convert a `deleted_at IS NULL` pattern into a proper `is_deleted` boolean |
You might be thinking: “Can’t I just run a one-time Python or SQL script to fix the data?” Yes, but that means you're bypassing your migration history. The main advantage of doing it inside a migration is versioning and repeatability. Everyone on your team and your CI/CD pipeline can apply the same transformation consistently. Plus, one-off scripts often get lost. Migrations don’t.
Let’s say we want to add a new column called `is_premium` to a `users` table and set it to `True` for any user with more than 10 orders.
In an Alembic migration:
```python
op.add_column('users', sa.Column('is_premium', sa.Boolean(), nullable=True))
```
If we stop here, the new column will exist, but it’ll be empty (`NULL`) for all existing users.
Here’s how we update the data inside the same migration:
```python
from alembic import op
import sqlalchemy as sa


def upgrade():
    # Step 1: Add the column (nullable for now)
    op.add_column('users', sa.Column('is_premium', sa.Boolean(), nullable=True))

    # Step 2: Fill in existing data
    op.execute("""
        UPDATE users
        SET is_premium = TRUE
        WHERE id IN (
            SELECT user_id FROM orders
            GROUP BY user_id
            HAVING COUNT(*) > 10
        )
    """)

    # Step 3: Make it NOT NULL (optional).
    # Backfill the remaining rows first, otherwise SET NOT NULL fails
    # on the rows that are still NULL.
    op.execute("UPDATE users SET is_premium = FALSE WHERE is_premium IS NULL")
    op.alter_column('users', 'is_premium', nullable=False, server_default='FALSE')
```
Now when someone runs this migration, the schema and the data change together. Future developers won’t be confused. Everyone stays in sync.
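For completeness, the matching `downgrade()` is trivial here; a minimal sketch, keeping in mind that dropping the column discards the backfilled values:

```python
def downgrade():
    # Dropping the column also removes the backfilled flags
    op.drop_column('users', 'is_premium')
```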
Dealing with PostgreSQL enums during migrations is often a headache; I'm speaking from experience here. Unlike standard columns, enums are part of the database type system, which means:

- You can't drop a value from an existing enum type.
- Renaming a value requires `ALTER TYPE ... RENAME VALUE`, which only exists on PostgreSQL 10+.
- Adding values with `ALTER TYPE ... ADD VALUE` has transaction-block restrictions on older PostgreSQL versions.

Because of this, it's important to handle enum migrations safely and repeatably.
Let’s say we want to change the values of an existing PostgreSQL enum field called `status`, used in a `tasks` table, like below.
Original Enum: `'todo'`, `'in_progress'`, `'done'`

New Enum: `'pending'`, `'active'`, `'complete'`

Mapping:

- `'todo'` → `'pending'`
- `'in_progress'` → `'active'`
- `'done'` → `'complete'`
The Hard Way (Manually): You can do everything by hand, but this becomes error-prone if you do it often or across multiple tables and columns. On PostgreSQL 10+, renaming a single value is a one-liner:
```sql
ALTER TYPE status RENAME VALUE 'todo' TO 'pending';
```
And if that fails on Postgres < 10, you might need to:

- Create a temporary enum type that contains both the old and the new values
- Alter the column to use the temporary type
- `UPDATE` the rows, mapping old values to new ones
- Drop the old type, create the new one, and switch the column over
- Drop the temporary type

Yes, it’s messy, but that's real life. And you should keep it versioned in your migrations.
So instead...
Reusable `upgrade_enum` Helper: We can abstract all of these steps into a reusable function like the one below.
```python
import sqlalchemy as sa


def upgrade_enum(
    op,
    table: str,
    column: str,
    enum_name: str,
    old_options: list[str],
    new_options: list[str],
    old_to_new_mapping: dict[str, str] | None = None,
):
    tmp_name = f"tmp_{enum_name}"
    tmp_options = set([*old_options, *new_options])

    old_type = sa.Enum(*old_options, name=enum_name)
    new_type = sa.Enum(*new_options, name=enum_name)
    tmp_type = sa.Enum(*tmp_options, name=tmp_name)

    # Step 1: Create a temporary enum type with all options
    tmp_type.create(op.get_bind(), checkfirst=False)

    # Step 2: Alter the column to use the temporary enum
    op.execute(
        f"ALTER TABLE {table} ALTER COLUMN {column} TYPE {tmp_name} USING {column}::text::{tmp_name}"
    )

    # Step 3: Optionally convert old values to new values
    if old_to_new_mapping:
        for old, new in old_to_new_mapping.items():
            op.execute(f"UPDATE {table} SET {column}='{new}' WHERE {column}='{old}'")

    # Step 4: Drop the old enum type and create the new one
    old_type.drop(op.get_bind(), checkfirst=False)
    new_type.create(op.get_bind(), checkfirst=False)

    # Step 5: Migrate from the temporary enum to the new enum
    op.execute(
        f"ALTER TABLE {table} ALTER COLUMN {column} TYPE {enum_name} USING {column}::text::{enum_name}"
    )

    # Step 6: Drop the temporary enum type
    tmp_type.drop(op.get_bind(), checkfirst=False)
```
Don’t forget: you also need a matching `downgrade_enum` function.
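A minimal sketch of what that could look like, assuming you simply reuse `upgrade_enum` with the old and new options swapped and the value mapping inverted:

```python
def downgrade_enum(
    op,
    table: str,
    column: str,
    enum_name: str,
    old_options: list[str],
    new_options: list[str],
    old_to_new_mapping: dict[str, str] | None = None,
):
    # Going back down is the same dance in reverse:
    # swap old/new options and invert the value mapping
    upgrade_enum(
        op=op,
        table=table,
        column=column,
        enum_name=enum_name,
        old_options=new_options,
        new_options=old_options,
        old_to_new_mapping=(
            {new: old for old, new in old_to_new_mapping.items()}
            if old_to_new_mapping
            else None
        ),
    )
```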
Now we can change our enums easily!
```python
from alembic import op
import sqlalchemy as sa

from myapp.utils import upgrade_enum


def upgrade():
    upgrade_enum(
        op=op,
        table="tasks",
        column="status",
        enum_name="status",
        old_options=["todo", "in_progress", "done"],
        new_options=["pending", "active", "complete"],
        old_to_new_mapping={
            "todo": "pending",
            "in_progress": "active",
            "done": "complete",
        },
    )
```
Suppose you have a `settings` JSON field in the `users` table like this:
```json
{
  "email_notifications": true,
  "dark_mode": false
}
```
And now you want to split these into dedicated columns: `email_notifications` and `dark_mode`.
In your migration:
```python
op.add_column('users', sa.Column('email_notifications', sa.Boolean()))
op.add_column('users', sa.Column('dark_mode', sa.Boolean()))

op.execute("""
    UPDATE users
    SET email_notifications = (settings->>'email_notifications')::boolean,
        dark_mode = (settings->>'dark_mode')::boolean
""")
```
You just extracted structured data from semi-structured JSON and migrated it into proper schema. That’s exactly what data-backed migrations are for.
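If you ever need to reverse it, the downgrade can fold the columns back into the JSON field. A minimal sketch, assuming `settings` is stored as `jsonb` (the `||` merge operator won't work on plain `json`):

```python
def downgrade():
    # Write the dedicated columns back into the jsonb settings field
    op.execute("""
        UPDATE users
        SET settings = settings || jsonb_build_object(
            'email_notifications', email_notifications,
            'dark_mode', dark_mode
        )
    """)
    op.drop_column('users', 'dark_mode')
    op.drop_column('users', 'email_notifications')
```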
Let’s say you used to store tags as a comma-separated string:
```python
blog_posts.tags = "python,backend,orm"
```
Now you want to normalize them into a `post_tags` association table.
You’d:

- Create a `tags` table and a `post_tags` association table
- Backfill the association table from the existing comma-separated strings
- Drop the old `tags` column once the data has moved

That backfill logic needs custom Python (not just SQL). You might write a helper in your migration that:

- Reads each row's `tags` string
- Splits it on commas and strips whitespace
- Inserts rows into `post_tags` with proper FK references

Yes, it’s slower than raw SQL, but it ensures consistency.
Never scan or rewrite millions of rows in a single transaction during a migration. You risk:

- Long-held locks that block application queries
- One giant transaction that can time out or roll back right at the end
- Significant table and WAL bloat from rewriting many rows at once

Use batching if needed, e.g. loop over ~10k rows per commit.
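A minimal batching sketch, reusing the `users.is_premium` backfill from earlier. Note that Alembic typically wraps the whole migration in one transaction, so for truly huge tables you may want to run a loop like this outside the migration transaction.

```python
from alembic import op
import sqlalchemy as sa


def backfill_in_batches(batch_size: int = 10_000):
    conn = op.get_bind()

    while True:
        # Each pass updates one batch of not-yet-processed rows;
        # the WHERE clause makes it pick up where the last pass stopped
        result = conn.execute(
            sa.text("""
                UPDATE users
                SET is_premium = FALSE
                WHERE id IN (
                    SELECT id FROM users
                    WHERE is_premium IS NULL
                    LIMIT :batch_size
                )
            """),
            {"batch_size": batch_size},
        )
        # Stop once an entire pass touches no rows
        if result.rowcount == 0:
            break
```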
If your migration is re-run, will it break things? If yes, that’s risky.
For example:
```sql
UPDATE users SET role = 'admin' WHERE role = 'superuser';
```
What happens if you run this again? Nothing bad: the second run finds no matching rows, because the condition limits the update to rows that still need it.
Patterns that help keep data migrations idempotent:

- Guarded, self-limiting `WHERE` clauses
- `INSERT ... ON CONFLICT DO NOTHING` for backfill inserts
Data-backed migrations are the real workhorses of a mature backend system.
They're:

- Versioned alongside your schema changes
- Repeatable across every environment and CI/CD pipeline
- Visible to the whole team instead of hiding in one-off scripts
They require more thought and care, but they give you consistency, traceability, and confidence in return. So the next time you're adding a column, updating enums, or reworking relationships, ask yourself: "Does the data need to evolve, too?" If yes, it’s time for a data-backed migration.