Keep it simple stupid: Building full text search across all your tools with Postgres

Akshay Nathan
October 03, 2019

Monolist Search

Introduction

In the beginning there were tools, oh so many tools. There were code hosting tools, project management tools, messaging tools, alerting tools – an infinite set of tools. And for developers, all these tools contained tasks: code they had to review, bugs they needed to look at, questions to respond to, incidents to triage. Around a year ago, we asked "what if there was one tool where we could manage all our tasks from all our other tools", so we built it.

As our customers started using Monolist, we found that they were often asking a common question.

"I know that Sarah made a comment about the marketing page designs, but was that in the Jira task, or in the spec Google Doc, or on my pull request where I merged the designs, or on Slack?"

Of course, each one of these tools has a search feature, but remembering where it is, or what syntax it supports, is a massive pain. Monolist is already integrated with all these tools, and they all support search over their APIs, so we built a global search bar in Monolist. It was pretty cool! It would debounce the user's input, search any items already in Monolist, call out to the API of every integrated tool in parallel, merge and deduplicate the results, and present them to the user.

But it was slow... It was so slow that it took between 500ms and 5 seconds to deliver any results, which is unacceptable for a page load, let alone an autocomplete bar. You could already be typing "sea", and the autocomplete would only then finish the request for "s". What's more, it was so complex that it was painful to debug and painful to improve. We used a job system (Sidekiq) to make the API calls, Redis to store the result stream, and websockets to deliver the results asynchronously to the user. The code was cool, but it was overburdened with complexity, which led to a slow, buggy, and ultimately poor user experience.

In this blog post, we'll walk through the foundation of our second attempt at search. We tried to follow one guiding principle from our first try, the age-old cliche: KISS, or "Keep it simple, stupid." While we're not big on the "stupid" part (KIS), we've learned time and time again that complexity is the enemy, and that premature optimization, unnecessary abstractions, and cool technology for cool technology's sake are toxic to the end user experience. We'll use this principle throughout our journey, but first, let's look at the other objectives we had for the new experience.

Objectives

  1. It needs to be fast.

For an autocomplete experience to be effective, the results need to render quickly. We need the entire request, response, and render flow to take < 100ms, which is the largest amount of time an action can take while still being perceived as "instantaneous" to the user.

  2. The results need to be relevant and (mostly) comprehensive.

We talked to our users and found that while they expected breadth from the search on a frequent basis (they wanted results across all their tools), they didn't necessarily need depth as often (they weren't frequently searching for terms in the body of an email, content of Google Doc, or description of an Asana Task).

  3. And, of course, KIS.

We wanted to stay vigilant against complexity, which leads us to our next section.

What about ElasticSearch?

ElasticSearch is great. It's made for this purpose. It's unbelievably flexible. There are hosted solutions out there. But while it may have been easier to use ElasticSearch, and definitely easier to use a hosted search service like Algolia, simplicity != easiness.

For one, it would introduce another system dependency to our architecture. In a self-hosted world, we would need to stand up, monitor, and maintain an ElasticSearch cluster. In a hosted world, this would be a little easier, but we'd still need to build in failsafes and retry semantics to get our data in and out of the search service.

But even if we used a hosted service, we'd still need to think about data privacy. Our users trust us with sensitive business data, and we would have to guard against unauthorized access. We also take account deletion very seriously, and a separate search index is one more place we would have to ensure was properly wiped clean when a user deletes their account.

What if there was a service that we had already set up to be highly available, that already had sufficient monitoring in place, and that was already connected to our application and authorization system...

Just use the database (Postgres)

Luckily, Postgres natively supports full text search and is plenty flexible for our use case. (Note: The code samples in this section are taken from our real Ruby (on Rails) code, but should be easily adapted to any language)

Let's start with what we're searching for. Monolist search supports multiple result types, including Google Drive files, contacts, and various types of tasks. For the sake of simplicity, though, let's assume we only have two domain models: contacts and pull requests.

class Contact < Model
  attribute :user_id
  attribute :email
  attribute :display_name
  attribute :last_sent_at
end

class PullRequest < Model
  attribute :user_id
  attribute :title
  attribute :description
  attribute :repository_name
  attribute :created_at
end

These are some (adapted) examples from our own models. The fields are mostly self-explanatory. Contact#last_sent_at is a denormalized timestamp of the last time this contact was the recipient or sender of an email for our user. Now let's make these objects searchable.

There are two main database functions we will use for full text search in Postgres. First, to_tsvector parses and tokenizes a string into lexemes, which are basically words.

monolist=> SELECT to_tsvector('simple', 'A quick brown fox jumped over a lazy dog');
                               to_tsvector
--------------------------------------------------------------------------
 'a':1,7 'brown':3 'dog':9 'fox':4 'jumped':5 'lazy':8 'over':6 'quick':2
(1 row)
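As a rough mental model (plain Ruby, and only an approximation — Postgres's real parser handles punctuation, stemming configs, and more), a tsvector is a map from each lexeme to the positions where it occurs:

```ruby
# Toy analogue of to_tsvector('simple', text): lowercase word -> 1-based positions.
# This is just a mental model, not how Postgres actually parses text.
def simple_tsvector(text)
  text.downcase.scan(/[a-z0-9]+/)
      .each_with_index
      .group_by { |word, _index| word }
      .transform_values { |pairs| pairs.map { |_word, index| index + 1 } }
end

simple_tsvector('A quick brown fox jumped over a lazy dog')
# => {"a"=>[1, 7], "quick"=>[2], "brown"=>[3], "fox"=>[4],
#     "jumped"=>[5], "over"=>[6], "lazy"=>[8], "dog"=>[9]}
```

Note how the output mirrors the psql result above: 'a' appears at positions 1 and 7, 'fox' at 4, and so on.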

To query a tsvector, we can use the @@ operator and a tsquery.

monolist=> SELECT to_tsquery('simple', 'red') @@ to_tsvector('simple', 'A quick brown fox jumped over a lazy dog');
 ?column?
----------
 f
(1 row)

monolist=> SELECT to_tsquery('simple', 'fox') @@ to_tsvector('simple', 'A quick brown fox jumped over a lazy dog');
 ?column?
----------
 t
(1 row)

Note, the first argument to both functions is a Postgres full text search "config", which you can read more about in the Postgres documentation, but which we won't be using for now.

Simple, right? We can just query for results directly on the attributes, with something like SELECT * FROM contacts WHERE to_tsquery('simple', 'QUERY') @@ to_tsvector('simple', display_name). However, this only works for one table. If we have multiple tables (pull_requests), we'll have to outer join them as well. Instead, let's add a new denormalized table.

class TypeaheadResult < Model
  attribute :user
  attribute :searchable
  attribute :result_type
  attribute :object_id # contact or pull request id
  attribute :object_type # contact or pull request
end

Here, searchable refers to the indexed attribute (title for pull requests, or display_name for contacts).

But wait, another table? Aren't we storing unnecessary data? KIS: We don't have a storage problem yet, let's not prematurely optimize.

One thing we can (maturely) optimize is based on a learning from our initial search: these tables are quite large (10+ million contacts, and many more action items or Google Drive files), and even an indexed join takes up a significant portion of our 100ms allocation. Let's denormalize further and bring our whole base models into TypeaheadResult. We can make the searchable column JSON instead of text, and then write a service to populate it for a user.

class PopulateTypeaheadResultsForUser
  def call(user:)
    user.contacts.each do |contact|
      TypeaheadResult.create!({
        user: user,
        result_type: :contact,
        searchable: { email: contact.email, display_name: contact.display_name },
      })
    end
    user.pull_requests.each do |pr|
      TypeaheadResult.create!({
        user: user,
        result_type: :pr,
        searchable: { title: pr.title, description: pr.description, repo: pr.repo },
      })
    end
  end
end

Hold on, what about our to_tsvector function? We can't use that on JSON, can we? Of course we can, it's Postgres.

monolist=> SELECT to_tsquery('simple', 'cat') @@ to_tsvector('simple', '{ "name": "fox", "type": "canine" }'::jsonb);
 ?column?
----------
 f
(1 row)

monolist=> SELECT to_tsquery('simple', 'fox') @@ to_tsvector('simple', '{ "name": "fox", "type": "canine" }'::jsonb);
 ?column?
----------
 t
(1 row)

Let's write a service to actually retrieve the results for a user:

class GetTypeaheadResultsForUser
  def call(user:, query:)
    TypeaheadResult.find_by_sql([
      <<~SQL,
        SELECT * FROM typeahead_results
          WHERE user_id = ? AND to_tsquery('simple', ?) @@ to_tsvector('simple', searchable)
      SQL
      user.id,
      query,
    ])
  end
end

And finally, we can precompute the tsvectors and add an index to this column to make this query much, much faster.

CREATE EXTENSION IF NOT EXISTS btree_gin; -- lets GIN index the integer user_id column
ALTER TABLE typeahead_results ADD COLUMN searchable_vector tsvector;
UPDATE typeahead_results SET searchable_vector = to_tsvector('simple', searchable);
CREATE INDEX ON typeahead_results USING GIN (user_id, searchable_vector);
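To keep searchable_vector from drifting as rows are inserted or updated, one option (a sketch, assuming Postgres 10+, where to_tsvector accepts JSON directly) is a trigger:

```sql
-- Sketch: recompute the vector on every write so it never goes stale.
CREATE FUNCTION typeahead_results_refresh_vector() RETURNS trigger AS $$
BEGIN
  NEW.searchable_vector := to_tsvector('simple', NEW.searchable);
  RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER typeahead_results_refresh_vector_trigger
  BEFORE INSERT OR UPDATE ON typeahead_results
  FOR EACH ROW EXECUTE PROCEDURE typeahead_results_refresh_vector();
```

With this in place, the populate service doesn't need to know the vector column exists at all.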

Great! We have a comprehensive and simple solution. However, we still have no notion of relevancy. If a user types in "a", the first result may be "alice@monolist.co", or it may be "analytics-marketing-campaign@marketing.google.com".

Postgres actually has many solutions for relevancy, mainly around the handy ts_rank function. We could define a relevancy metric based on which indexed field we matched (repository name should be less relevant than title for a pull request), the length of the match, and the type of the result...

Or, we can KIS. From talking more to our users, we decided that (last_interacted_with || updated_at || created_at) was a good enough proxy for relevancy. In other words, take the items the user has interacted with most recently, or that have been updated or created most recently, first. For contacts, we can use last_sent_at, and for pull requests, we can use created_at. After adding a new column to typeahead_results, our new services look like this:

class PopulateTypeaheadResultsForUser
  def call(user:)
    user.contacts.each do |contact|
      TypeaheadResult.create!({
        user: user,
        result_type: :contact,
        searchable: { email: contact.email, display_name: contact.display_name },
        last_referenced_at: contact.last_sent_at,
      })
    end
    user.pull_requests.each do |pr|
      TypeaheadResult.create!({
        user: user,
        result_type: :pr,
        searchable: { title: pr.title, description: pr.description, repo: pr.repo },
        last_referenced_at: pr.created_at,
      })
    end
  end
end

class GetTypeaheadResultsForUser
  def call(user:, query:)
    TypeaheadResult.find_by_sql([
      <<~SQL,
        SELECT * FROM typeahead_results
          WHERE user_id = ? AND to_tsquery('simple', ?) @@ searchable_vector
          ORDER BY last_referenced_at DESC
      SQL
      user.id,
      query,
    ])
  end
end

There's still one more problem. Now that we're ordering results, the most frequent results will push out the others. For the query "a", there may be tons of pull requests in the "analytics" repository that flood out the "alice@monolist.co" contact who sent us an email last week. We could implement some sort of intelligent scheme, where depending on query length we prioritize one result type or another, or...

KIS: We could also just limit results by type:

class GetTypeaheadResultsForUser
  def call(user:, query:)
    TypeaheadResult.find_by_sql([
      <<~SQL,
        SELECT * FROM typeahead_results
          WHERE user_id = ? AND to_tsquery('simple', ?) @@ searchable_vector
          ORDER BY last_referenced_at DESC
      SQL
      user.id,
      query,
    ]).group_by(&:result_type)
      .flat_map { |_type, results| results.take(5) }
  end
end
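To convince ourselves the capping behaves, here's the same group-and-take logic on toy, hypothetical rows (already sorted by recency, as the SQL guarantees):

```ruby
# Toy stand-in for TypeaheadResult rows, already sorted by last_referenced_at DESC.
Row = Struct.new(:result_type, :label)

rows = [
  Row.new(:pr, 'analytics #6'), Row.new(:pr, 'analytics #5'),
  Row.new(:pr, 'analytics #4'), Row.new(:contact, 'alice@monolist.co'),
  Row.new(:pr, 'analytics #3'), Row.new(:pr, 'analytics #2'),
  Row.new(:pr, 'analytics #1'),
]

# group_by preserves encounter order, so each type keeps its most recent rows.
capped = rows.group_by(&:result_type)
             .flat_map { |_type, typed_rows| typed_rows.take(5) }

capped.count { |row| row.result_type == :pr }      # => 5
capped.any?  { |row| row.result_type == :contact } # => true
```

The "analytics" pull requests no longer crowd out the contact result.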

And now, the moment you (and we) are all waiting for. How does this relatively simple solution perform?

IT FLIES. Even with millions and millions of typeahead_results, the indexed query with no joins gives us results in around 100ms. With some clever React work (which we'll talk about in a subsequent post), we were able to get the median time for the entire request -> response -> render flow to just under 100ms (97ms).

Tradeoffs and future

Obviously, there are tradeoffs in this solution. For one, the denormalized table requires us to be very careful about updates and deletions. Luckily, Postgres supports indices on nested JSON keys, so we can index contact.email for example, and then just delete and reinsert all results when a specific contact is updated.
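For example, an expression index on the nested key (a sketch, using the column names from the toy schema above) makes the lookup-before-reinsert cheap:

```sql
-- Sketch: index the nested email key so a contact's rows are cheap to find.
CREATE INDEX typeahead_results_contact_email_idx
  ON typeahead_results ((searchable->>'email'));

-- On contact update: delete the stale rows, then reinsert fresh ones.
DELETE FROM typeahead_results WHERE searchable->>'email' = 'alice@monolist.co';
```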

Another large tradeoff is flexibility. ElasticSearch is obviously more powerful than Postgres, and from a scaling perspective has a better story. We can horizontally scale ES instances separately from our database, since our search index may grow relatively independently of live data volume.

Lastly, our search may be faster, but is also more shallow. We're no longer searching email or drive file content, but we haven't found this to be problematic for our users, since the "See all results" link still takes the user to a (slower) deep search experience when necessary.

All in all, the response to our search improvements has been unanimously positive, and we think these tradeoffs were necessary decisions in our pursuit of a fast, simple, and maintainable solution.


Want to see for yourself? Tired of going to 5 different tools to search for the "Product Spec"? Try Monolist for free here.
