Struggling with slow searches in your application? Learn how to implement optimized search functionality using PostgreSQL's TSVector and SQLAlchemy. Includes array filtering, JSON queries, and proper indexing strategies.
Ah, search and filter, the bread and butter of every backend developer's existence. Whether you're building the next IMDb or just trying to find that one obscure 80s sci-fi movie with the talking dolphin, efficient search is crucial.
In the project I was working on, I dealt with product-based data, and we wanted to offer users a comprehensive search and filtering system. Given our complex, highly relational data, this wasn’t exactly a walk in the park. We were handling search and filter queries somehow, but the tangled web of relationships was murdering performance. I’ve lost count of how many queries I’ve inspected with EXPLAIN ANALYZE
to diagnose and fix performance issues. All I remember is my brain overheating like a cheap laptop and me eventually resigning to my fate as a certified idiot by the end of the day. So, I needed to build a more efficient search and filtering system.
I started by creating a dedicated search table stripping away unnecessary data and populating only what was essential for search and filtering. Of course, this was just one piece of the puzzle. Maybe in another post, I’ll break down the entire system, but for now, I’ll use a simplified example to show you how to implement full-text search and filtering.
In this guide, we'll implement a PostgreSQL-backed search system using SQLAlchemy, covering:
Let’s roll!
FilmSearchIndex
TableFirst, we need a table to store our movie data. It’s a movie search index table that optimized for search & filter functionality, not normalized for a classic relational model. It includes titles, genres, names, text blobs, and even a tsvector
field for full-text search.
from sqlalchemy import Column, Integer, String, Float, Text, ARRAY
from sqlalchemy.dialects.postgresql import TSVECTOR
from sqlalchemy.ext.declarative import declarative_base
Base = declarative_base()
class FilmSearchIndex(Base):
id = Column(Integer, primary_key=True)
name = Column(String, nullable=False)
alternate_names = Column(ARRAY(String)) # ["Alternative Title", "International Title"]
director = Column(String)
writer = Column(String)
producer = Column(String)
release_year = Column(Integer)
plotline = Column(Text)
genres = Column(ARRAY(String)) # ["Sci-Fi", "Comedy", "Horror"]
actors = Column(ARRAY(String)) # ["Bruce Willis", "Gary Oldman"]
tags = Column(ARRAY(String)) # ["Cult Classic", "Oscar Winner"]
budget = Column(Float)
rating = Column(Float)
search_text = Column(TSVECTOR) # For full-text search
At this stage, properly indexing your data fields is absolutely critical for query performance. Luckily, PostgreSQL offers multiple index types. Here's a battle-tested strategy you can follow:
Field | Index Type | Why? |
---|---|---|
id |
Primary Key (B-tree) | Default, fast lookups |
name , director , writer , producer |
B-tree | Exact matches, sorting |
release_year , budget , rating |
B-tree | Range queries (e.g., WHERE year > 1990 ) |
genres , actors |
GIN | Optimized for array operations (@> , && ) |
search_text |
GIN | Full-text search acceleration |
from sqlalchemy import Index
class FilmSearchIndex(Base):
id = Column(Integer, primary_key=True)
name = Column(String, nullable=False)
alternate_names = Column(ARRAY(String)) # ["Alternative Title", "International Title"]
director = Column(String)
writer = Column(String)
producer = Column(String)
release_year = Column(Integer)
plotline = Column(Text)
genres = Column(ARRAY(String)) # ["Sci-Fi", "Comedy", "Horror"]
actors = Column(ARRAY(String)) # ["Bruce Willis", "Gary Oldman"]
tags = Column(ARRAY(String)) # ["Cult Classic", "Oscar Winner"]
budget = Column(Float)
rating = Column(Float)
search_text = Column(TSVECTOR) # we'll update for full-text search soon
__table_args__ = (
Index("ix_film_search_index_name", name),
Index("ix_film_search_index_director", director),
Index("ix_film_search_index_writer", writer),
Index("ix_film_search_index_producer", producer),
Index("ix_film_search_index_release_year", release_year),
Index("ix_film_search_index_rating", rating),
Index("ix_film_search_index_budget", budget),
Index("ix_film_search_index_genres", genres, postgresql_using='gin'),
Index("ix_film_search_index_actors", actors, postgresql_using='gin'),
Index("ix_film_search_index_tags", tags, postgresql_using='gin'),
Index("ix_film_search_index_search_text", "search_text", postgresql_using='gin'),
)
Before dumping data into search_text
, we need to clean up arrays like cleaning out NULL
s, removing duplicates, lowercase strings, etc. Because array fields work just fine for filtering queries, we need to handle them carefully when incorporating them into full-text search. There are two key steps to this:
-- sanitize
CREATE OR REPLACE FUNCTION sanitize_array(anyelement) RETURNS anyelement AS $$
DECLARE
x text;
new_arr text[];
BEGIN
IF pg_typeof($1) = 'varchar[]'::regtype THEN
FOREACH x IN ARRAY $1
LOOP
new_arr := array_append(new_arr, regexp_replace(x::text, '[[:punct:]]', '', 'g'));
END LOOP;
RETURN new_arr;
ELSEIF pg_typeof($1) = 'varchar'::regtype THEN
return regexp_replace($1::text, '[[:punct:]]', ' ', 'g');
END IF;
return false;
END;
$$ LANGUAGE plpgsql IMMUTABLE;
-- make string
CREATE OR REPLACE FUNCTION immutable_array_to_string(text[], text)
RETURNS text as $$ SELECT array_to_string($1, $2);
$$ LANGUAGE sql IMMUTABLE
search_text
with TSVectorPostgreSQL’s TSVector allows fast full-text search. We’ll concatenate all searchable fields into one search_text
column. Now let’s update our field.
from sqlalchemy import Index
class FilmSearchIndex(Base):
...
# before
search_text = Column(TSVECTOR)
# updated
search_text = Column(
TSVector(),
Computed(
"""to_tsvector('english',
search_sanitizer(coalesce(name, '')) || ' '
|| search_sanitizer(coalesce(director, '')) || ' '
|| search_sanitizer(coalesce(writer, '')) || ' '
|| search_sanitizer(coalesce(producer, '')) || ' '
|| immutable_array_to_string(coalesce(search_sanitizer(coalesce(genres, array['{}'])), '{}'), ' ') || ' '
|| immutable_array_to_string(coalesce(search_sanitizer(coalesce(actors, array['{}'])), '{}'), ' ') || ' '
|| immutable_array_to_string(coalesce(search_sanitizer(coalesce(tags, array['{}'])), '{}'), ' ') || ' '
)""",
persisted=True,
),
nullable=True,
index=True,
)
We used persisted=True
here because it stores the computed TSVector on disk instead of recalculating it every query and your database isn’t a treadmill. Trade-off? Slightly slower writes, but lightning-fast searches.
A little warning! If you're using the alembic-pydantic duo with the autogenerate command, don't forget to check your migration file. They're not particularly good at catching changes in computed fields. In such cases, you might need to manually add these changes to your migration script.
We handled the hard part. Let’s add some classic films to test.
films = [
FilmSearchIndex(
name="The Matrix",
alternate_names=["Matrix"],
director="Lana Wachowski",
genres=["Sci-Fi", "Action"],
actors=["Keanu Reeves", "Laurence Fishburne"],
release_year=1999,
plotline="A hacker discovers a dystopian reality controlled by machines.",
rating=8.7,
),
FilmSearchIndex(
name="Inception",
director="Christopher Nolan",
genres=["Sci-Fi", "Thriller"],
actors=["Leonardo DiCaprio", "Tom Hardy"],
release_year=2010,
plotline="A thief enters people's dreams to steal secrets.",
rating=8.8,
),
FilmSearchIndex(
name="Sharknado",
director="Anthony C. Ferrante",
genres=["Disaster", "Comedy", "Horror"],
actors=["Ian Ziering", "Tara Reid"],
release_year=2013,
plotline="Tornadoes fling sharks onto Los Angeles. Chaos ensues.",
rating=3.3,
),
]
session.add_all(films)
session.commit()
Before querying something let’s first look at PostgreSQL search and filter operators for when you need to find data without losing your sanity.
Operator: @@
“The 'Google' of PostgreSQL, if Google hated typos”
-- SQL
SELECT * FROM FilmSearchIndex
WHERE search_text @@ to_tsquery('english', 'dolphin & tornado');
python
# SQLAlchemy
session.query(FilmSearchIndex).filter(
FilmSearchIndex.search_text.op("@@")(func.to_tsquery('english', 'dolphin & tornado'))
)
Operator: ANY
“Is Keanu Reeves in this garbage fire of a movie?”
-- SQL
SELECT * FROM FilmSearchIndex
WHERE 'Keanu Reeves' = ANY(actors);
# SQLAlchemy
session.query(FilmSearchIndex).filter(FilmSearchIndex.actors.any('Keanu Reeves'))
Operator: @>
(Contains)
“Find films that are both 'Comedy' and ‘Horror’ (aka my life?)”
-- SQL
SELECT * FROM FilmSearchIndex
WHERE genres @> ARRAY['Comedy', 'Horror'];
# SQLAlchemy
session.query(FilmSearchIndex).filter(FilmSearchIndex.genres.contains(['Comedy', 'Horror']))
Operator: >>
“Digging into JSON like it's a Black Friday sale”
-- SQL
SELECT * FROM FilmSearchIndex
WHERE metadata->>'director' = 'Christopher Nolan';
# SQLAlchemy
session.query(FilmSearchIndex).filter(FilmSearchIndex.metadata['director'].astext == 'Christopher Nolan')
Want to try a little bit more madness?
from sqlalchemy import func, text
# JSON path madness
session.query(FilmSearchIndex).filter(func.jsonb_path_exists(FilmSearchIndex.metadata, '$.awards[*].won'))
Operator: ILIKE
“Case-insensitive search for when you can't spell ‘Benedict Cumberbatch’”
-- SQL
SELECT * FROM FilmSearchIndex
WHERE name ILIKE '%matrix%';
# SQLAlchemy
session.query(FilmSearchIndex).filter(FilmSearchIndex.name.ilike('%matrix%'))
Operator: BETWEEN
“Find films made after the 90s but before our collective hope died”
-- SQL
SELECT * FROM FilmSearchIndex
WHERE release_year BETWEEN 1990 AND 2000;
# SQLAlchemy
session.query(FilmSearchIndex).filter(FilmSearchIndex.release_year.between(1990, 2000))
Let’s first look at PostgreSQL search and filter operators for when you need to find data without losing your sanity.
Let’s query to find films matching “hacker dream”:
from sqlalchemy import text
search_query = "hacker dream"
results = session.query(FilmSearchIndex).filter(
FilmSearchIndex.search_text.op("@@")(func.to_tsquery('english', search_query))
).all()
# Returns: The Matrix (matches "hacker") + Inception (matches "dream")
Now let’s find all Sci-Fi films:
results = session.query(FilmSearchIndex).filter(
FilmSearchIndex.genres.contains(["Sci-Fi"])
).all()
# Returns: The Matrix, Inception
Let’s try combine multiple conditions and find Sci-Fi films with ratings > 8.5:
results = session.query(FilmSearchIndex).filter(
FilmSearchIndex.genres.contains(["Sci-Fi"]),
FilmSearchIndex.rating > 8.5
).all()
# Returns: Inception (8.8)
What about partial searches?
results = session.query(FilmSearchIndex).filter(
FilmSearchIndex.name.ilike("%matrix%")
).all()
# Returns: The Matrix
And there you have it, your PostgreSQL + SQLAlchemy search/filter toolkit is now fully loaded. Whether you’re hunting for cult classics or debugging performance nightmares, these operators will save you from writing SQL that looks like a cry for help.
Of course in this article, we've implemented a simple search and filtering system for clarity. In real-world applications, you'll be dealing with much more complex structures and running sophisticated queries, but once you properly understand the fundamentals, everything else becomes easier to handle.
Remember:
EXPLAIN ANALYZE
is your therapist when queries go rogue.Now go forth and query fearlessly but maybe keep pg_dump
on speed dial.