How Levels.fyi Built Scalable Search with PostgreSQL

Our philosophy on scaling is simple: avoid premature optimization.

At Levels.fyi, we understand the importance of providing users with a robust search function that enables them to quickly find the information they need. This post will walk you through our implementation of a scalable fuzzy search solution using PostgreSQL, which handles over 10 million search queries per month with a p99 query performance under 20ms.

Search Demo

Search Types

Type             Count
Company          35K+
City             6.2K+
Area             300+
Country          180+
Title            100+
Specialization   40+

In total, that is more than 42K records across six tables. Besides these, we also support compound search types such as Company x Title (Google Software Engineer) and Title x Company (Software Engineer Google / Software Engineer at Google), resulting in hundreds of thousands of additional records.


Starting Simple

Take the example of someone trying to search for JPMorgan Chase using a query like jp. A starting SQL query might look something like this:

@set q = 'jp%'
SELECT c.name FROM company c WHERE c.name ILIKE $q
UNION SELECT c.name FROM city c WHERE c.name ILIKE $q
...
ORDER BY name
LIMIT 10;

While it's a decent start, ranking is not taken into consideration here; we'll address that later. As written, the query performs a sequential scan on every table. To optimize this, we can simply add a lowercase BTREE index over the searched column and query using LIKE instead of ILIKE. Without normalizing the case, the database would be unable to use the index.

For efficiency during lookups, we create lowercase indexes:

CREATE INDEX company_name_lower_index ON company(LOWER(name));
CREATE INDEX city_name_lower_index ON city(LOWER(name));
...

⚠️ Using CREATE INDEX will lock writes and updates on the table. If this is an issue, use CREATE INDEX CONCURRENTLY (though it is more expensive). Read More

Optimizing Further

Creating so many indices and aggregating so many tables is sub-optimal. To optimize this, we employ materialized views, which create a separate disk-based entity and hence support indexing. The only downside is that we have to keep it updated.

CREATE MATERIALIZED VIEW search_view AS
    SELECT c.name FROM company c UNION
    SELECT c.name FROM city c UNION ...;
CREATE INDEX search_view_name_lower_index
    ON search_view(LOWER(name));

Now, our search query would look something like this:

@set q = 'jp%'
SELECT s.name FROM search_view s
WHERE LOWER(s.name) LIKE LOWER($q)
ORDER BY s.name;

To keep the materialized view synchronized, we use an EventBridge rule that invokes a Lambda every X minutes to refresh the materialized view. This means our search results could be stale for at most X minutes, but they remain eventually consistent.
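The refresh step the Lambda runs might look roughly like this (a sketch; the CONCURRENTLY option avoids blocking readers during the refresh, but it requires a unique index on the view, which the UNION-deduplicated name column should support):

```sql
-- REFRESH ... CONCURRENTLY needs at least one UNIQUE index on the view.
CREATE UNIQUE INDEX IF NOT EXISTS search_view_name_unique_index
    ON search_view (name);

-- Rebuilds the view without locking out concurrent SELECTs;
-- a plain REFRESH MATERIALIZED VIEW would block readers.
REFRESH MATERIALIZED VIEW CONCURRENTLY search_view;
```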

⚠️ Implementing INDEX ONLY SCAN by including all selected columns in the index can reduce the need for a heap lookup, but this approach does have its limitations. Read More
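As a sketch, a covering index (PostgreSQL 11+) can carry the displayed column alongside the searched expression, so a prefix lookup can often be answered from the index alone:

```sql
-- Stores the original-cased name inside the index leaf pages, letting
-- prefix searches skip the heap lookup (subject to the visibility map).
CREATE INDEX search_view_name_covering_index
    ON search_view (LOWER(name)) INCLUDE (name);
```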

Core Issues

Our search does have limitations. For instance, a search for jp morgan% wouldn't yield any results because the name we stored is JPMorgan Chase. Even a prefix + suffix query - %jp morgan% wouldn't help here. Expanding the search using arbitrary wildcards %jp%morgan% would help, but it would still fail to deliver results for a search like jp chase%. Phonetics and language present another challenge. If a user searches for the sofy, they could mean sofi, since the is a stop word and i can be misheard as y.

Wildcard searches are helpful, but as the queries become more arbitrary, these searches start to break down. Along with the diversity of our users’ searches, we also need to figure out how to rank our search results. So we decided to take a step forward.

⚠️ Non-prefix searches result in sequential scans. To efficiently search with arbitrary wildcards, you can use trigram indexing. Read More
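A minimal sketch of trigram indexing with the pg_trgm contrib extension (index name is illustrative):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- A GIN index over trigrams lets arbitrary-wildcard patterns like
-- ILIKE '%jp%morgan%' use an index scan instead of a sequential scan.
CREATE INDEX search_view_name_trgm_index
    ON search_view USING GIN (name gin_trgm_ops);

SELECT name FROM search_view WHERE name ILIKE '%jp%morgan%';
```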

Trigram

Direct searches on the name column are incredibly challenging. To improve search results, we switched to full-text search with tsvector, a data type built for text searching.

@set q = 'the:*&jp:*&chase:*'
SELECT
    s.name,
    to_tsvector('english', s.name),
    to_tsquery('english', $q),
    ts_rank(
        to_tsvector('english', s.name),
        to_tsquery('english', $q))
FROM search_view s;
name                 to_tsvector                      to_tsquery           ts_rank
JP Global Digital    'digit':3 'global':2 'jp':1      'jp':* & 'chase':*   0
JPS Health Network   'health':2 'jps':1 'network':3   'jp':* & 'chase':*   0
JPMorgan Chase       'chase':2 'jpmorgan':1           'jp':* & 'chase':*   0.09910322
JP Services          'jp':1 'servic':2                'jp':* & 'chase':*   0

To eliminate the overhead of converting the name to tsvector at query time, we embedded the tsvector as a precomputed column in the materialized view. We also added a Generalized Inverted Index (GIN) to optimize query run times. With these changes, our search query would look like this:
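The updated view definition might look roughly like this (a sketch; the column and index names are illustrative):

```sql
CREATE MATERIALIZED VIEW search_view AS
    SELECT c.name,
           to_tsvector('english', c.name) AS search_vector
    FROM company c
    UNION
    SELECT c.name,
           to_tsvector('english', c.name)
    FROM city c;
    -- ...UNION the remaining tables as before

-- GIN index over the precomputed tsvector column.
CREATE INDEX search_view_vector_index
    ON search_view USING GIN (search_vector);
```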

@set q = 'the:*&jp:*&chase:*'
SELECT s.name
FROM search_view s
ORDER BY ts_rank(
    s.search_vector,
    to_tsquery('english', $q)) DESC
LIMIT 10;

For a query like D E Shaw, the new method returns results ranked by ts_rank:

name                        ts_rank
The D. E. Shaw Group        0.26832977
Eric L. Davis Engineering   0.18784232

⚠️ There are other methods for converting a query to a vector like websearch_to_tsquery, different types of ranking methods like similarity, ts_rank_cd and highlighting helpers like ts_headline that might be better suited to your use case. Read More
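For instance, websearch_to_tsquery accepts free-form user input directly, so you don't have to assemble a 'jp:*&chase:*' string yourself:

```sql
-- Parses quoted phrases, 'or', and '-' the way web search engines do,
-- and never raises a syntax error on malformed user input.
SELECT websearch_to_tsquery('english', 'jp chase -bank');
-- roughly: 'jp' & 'chase' & !'bank' (stop words are dropped)
```

Note that, unlike a hand-built to_tsquery string, websearch_to_tsquery does not produce prefix (:*) matches.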

Search Aliases

Several companies have alternative names, either because of a business rebrand or because they have a strong connection between different names:

  • Google → Alphabet
  • Facebook → Meta
  • JPMorgan Chase → JPMC

Our business admins work hard to maintain a comprehensive list of possible alternative company names. To account for alternate names in our search, the search column becomes a combination of the name and its aliases, automatically accounting for them in the search results.
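One way to sketch that combination (assuming a hypothetical aliases column of space-separated alternative names on the company table):

```sql
-- Fold the canonical name and its aliases into one search vector,
-- weighting the real name ('A') above its aliases ('B').
SELECT setweight(to_tsvector('english', c.name), 'A') ||
       setweight(to_tsvector('english', coalesce(c.aliases, '')), 'B')
           AS search_vector
FROM company c;
```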

Tweaking Search Ranking

When we search for goo, it's likely that we are trying to visit Google, but the query returned these instead:

name            ts_rank
Goose Bay       0.06079271
Goose Prairie   0.06079271
Goodna          0.06079271
Google          0.06079271
Goodwin         0.06079271

While these results share the same rank under lexeme matching, that doesn't provide the best user experience. To address this, we've developed our own relevancy algorithm on top of similarity and ts_rank. The question is, how do we determine which result is more likely to be clicked? One simple indicator is the number of clicks a result receives. This ensures that the more popular a result becomes, the higher it ranks.

We can't share the exact formula for our search ranking, but here are the few parameters we consider:

  • Exact match (rank #1)
  • Frequency of matching lexemes using ts_rank
  • Similarity score using similarity
  • Type of record
  • Popularity of the search result
  • Similarity between the result’s alias and query
  • Inverse of the result’s string length

This approach allows us to fine-tune our search ranking and deliver relevant results to our users.
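As a rough illustration (not the actual formula), a blended score could combine several of these signals in the ORDER BY clause; the weights, the click_count column, and the query literal are all hypothetical:

```sql
-- Hypothetical blended relevancy score; weights are placeholder knobs.
SELECT s.name
FROM search_view s
ORDER BY
    0.5 * ts_rank(s.search_vector, to_tsquery('english', 'goo:*'))
  + 0.3 * similarity(s.name, 'goo')   -- pg_trgm similarity score
  + 0.2 * log(1 + s.click_count)      -- popularity (hypothetical column)
  - 0.1 * length(s.name) / 100.0      -- prefer shorter names
  DESC
LIMIT 10;
```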

Compound Search Types

We noticed that users were combining search types, for example, "Product Manager at Coinbase" or "Coinbase Product Manager". This posed a challenge because searching by company led users to a salary page, after which users still had to navigate to their title. To eliminate this extra step, we decided to add support for compound types, especially "Company x Title" and "Title x Company" combinations.

Since we're using a materialized view, updating its definition with the results of a cross-join with the company and title tables was straightforward. We also added some extra checks to avoid leading users to nonexistent pages.
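A sketch of the cross-join that generates the compound rows (the type tags are illustrative):

```sql
-- Produce "Company Title" and "Title Company" rows for the view.
SELECT c.name || ' ' || t.name AS name, 'company_title' AS type
FROM company c CROSS JOIN title t
UNION ALL
SELECT t.name || ' ' || c.name, 'title_company'
FROM company c CROSS JOIN title t;
```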

Conclusion

We have come a long way from using client-side indexOf matching to implementing a full-blown search solution. While exploring the latest tools can be tempting, it's crucial to approach decisions with a fresh perspective, considering the trade-offs of time, cost, and control. Dedicated search tools like OpenSearch come with their own set of challenges, such as keeping data in sync, maintaining strict security policies, and managing additional infrastructure and tools.


At Levels.fyi, we value PostgreSQL as highly as we value Google Sheets. By investing in PostgreSQL, we have empowered ourselves with advanced query capabilities and cost-effective, scalable solutions for our data needs. Our usage pattern and scale were well-served by PostgreSQL, and by utilizing our in-house solution, we have avoided potential expenses ranging from hundreds to thousands of dollars per month compared to alternative methods. Plus, it's great to finally be on a real database! 😂

Also check out: How Levels.fyi scaled to millions of users with Google Sheets as a backend
