Engineering

Fail Fast and Fail Often: Handling API Errors at Scale

Akshay Nathan
August 15, 2019

At Monolist, we’re building the software engineer’s ideal inbox. Our users depend on us to surface all relevant and actionable tasks and context across all the tools and services they use. For a typical engineer, this includes emails, outstanding Slack messages, GitHub pull requests, and Jira issues or Asana tasks.

To do this, we make millions of API requests each day, spread over a dozen different services. These requests are made via Ruby code running on multiple Sidekiq job-processing containers across many hosts. (The following post uses Ruby and Sidekiq as examples, but the lessons are relevant to any language and job processing framework.)

Once these requests succeed, much more processing has to occur to decide which data is relevant to the user, and how to present it to them. In the process of building Monolist, we’ve learned a lot about how to do this scalably, which is probably a good topic for another blog post. In this post, however, I want to share what we’ve learned about what to do with the 100,000 requests that fail each day.

Retry early but retry correctly

When working with APIs, we generally see two classes of errors. Ephemeral errors may not recur when the same request is repeated some time in the future; persistent errors, in contrast, happen consistently on every subsequent request. A connection timeout, for example, is an ephemeral error that will likely resolve on retry. A 401, however, probably indicates that a user is no longer authenticated, and will not resolve on its own.

Generally, persistent errors require us to change our business logic to make successful requests, while ephemeral errors will succeed on the next retry with no intervention.
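The distinction can be made concrete with a small classifier. This is an illustrative sketch, not Monolist’s actual code: it treats timeouts (no status at all), 429s, and 5xx responses as ephemeral and worth retrying, and other statuses as persistent.

```ruby
# Hypothetical classifier for API failures. Ephemeral errors are worth
# retrying; persistent errors need a code or credential fix first.
def ephemeral_error?(status)
  return true if status.nil?    # no response at all, e.g. a connection timeout
  return true if status == 429  # rate limited; a later retry should succeed
  (500..599).cover?(status)     # provider-side errors are usually transient
end
```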

At Monolist, we rely on Sidekiq’s default retry semantics for both classes of errors. For persistent errors, we deploy code fixes and let the next retry succeed. We see around 10 of these errors daily, and the failure rate decreases as our new integrations become more stable. The rest of our daily 100,000 errors are all ephemeral. While these errors usually resolve within 1 or 2 retries, we’ve had to carefully architect our systems to handle this failure volume successfully.

We believe in failing fast and often, rather than letting failed state propagate down our code paths with unintended consequences. As such, we raise exceptions on unexpected API responses, and avoid catching them in our workers.
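The fail-fast pattern looks roughly like this. The names here (ApiError, parse_pull_requests!) are illustrative, not Monolist’s actual code: the point is to raise immediately on an unexpected response rather than return nil and let bad state ripple onward.

```ruby
# Hypothetical error class for unexpected API responses.
class ApiError < StandardError; end

def parse_pull_requests!(response)
  unless response[:status] == 200
    raise ApiError, "unexpected status #{response[:status]}"
  end
  response.fetch(:pull_requests) # fetch also raises if the key is missing
end
```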

Below is a simplified snippet of real code that we use to synchronize our users’ inboxes with Gitplace. Given a user and an API client, we poll Gitplace for the user’s pull requests, and create Monolist action items for each one we find.

def poll_gitplace(user, client)
  client.get_pull_requests.each do |pull_request|
    comments = client.get_pull_request_comments(pull_request)

    create_action_item(user, pull_request, comments)
  end
end

Note that if the get_pull_request_comments call fails, Sidekiq will automatically retry the entire job, and hopefully it will succeed the next time. But is there anything wrong with this code?

The answer lies in create_action_item. Let’s say the job made it through 4 pull requests the first time before the comments call failed. Because Sidekiq will naively retry the entire worker, we’ll duplicate all 4 of the already-created action items on the next retry!

Obviously, this is not what we want. When relying on retries to resolve ephemeral errors, it’s imperative that our jobs are idempotent. An idempotent job has no additional side effects when successfully rerun. In other words, we should be able to run our jobs N times, and expect the same results regardless of N. Let’s fix that.

def poll_gitplace(user, client)
  client.get_pull_requests.each do |pull_request|
    comments = client.get_pull_request_comments(pull_request)

    unless user.action_items.find { |s| s.gitplace_id == pull_request.id }
      create_action_item(user, pull_request, comments)
    end
  end
end

Now when our job is rerun, we will only create action items that haven’t already been created. This means Sidekiq can run our job however many times it wants, and the end result for our users will still be correct.
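The idempotency property can be demonstrated in isolation. This in-memory sketch stands in for the database-backed check above, keying on a unique id (playing the role of gitplace_id): re-running the sync over the same pull requests creates nothing new.

```ruby
require 'set'

# In-memory sketch of an idempotent sync: existing_ids stands in for
# the user's persisted action items, keyed by a unique pull request id.
def sync_action_items(existing_ids, pull_request_ids)
  pull_request_ids.each do |id|
    next if existing_ids.include?(id) # already created on a prior run
    existing_ids.add(id)              # "create" the action item
  end
  existing_ids
end
```

In production, a unique database index on the (user, external id) pair is a common extra safeguard against two workers racing past the in-app check; we mention it as a general pattern rather than Monolist’s exact schema.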

Always make progress

While we’ve fixed one bug, there’s another, more silent and dangerous issue with the code above. Let’s say we have a user with thousands of pull requests. If we hit an ephemeral error, say a network timeout, while retrieving the comments for the 999th pull request, the retried job will start all over and make another 1,000 API calls.

Even more problematic: even if our ephemeral error rate is low, making more API calls per job increases our chance of hitting an ephemeral error somewhere in the job. Thus the jobs that are slowest and consume the most resources are not only the most likely to be retried, but also the least likely to succeed on each subsequent retry! Even with Sidekiq’s automatic exponential backoff, this can debilitate our queues when we have hundreds or even thousands of expensive jobs failing and retrying together.
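For a sense of scale, Sidekiq’s documented default retry delay grows roughly with the fourth power of the retry count, plus a random jitter; this sketch drops the jitter to stay deterministic, so treat it as an approximation rather than Sidekiq’s exact schedule.

```ruby
# Approximation of Sidekiq's default backoff: about retry_count**4 + 15
# seconds before the next attempt (real Sidekiq adds random jitter).
def approx_retry_delay(retry_count)
  (retry_count ** 4) + 15
end

approx_retry_delay(0)  # 15 seconds after the first failure
approx_retry_delay(5)  # 640 seconds, a bit over 10 minutes
approx_retry_delay(10) # 10015 seconds, roughly 2.8 hours
```

A big job that keeps failing therefore spends hours in backoff while still re-doing all of its earlier work on each attempt, which is why tracking progress matters so much.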

The solution to this problem is usually API specific, but follows a common principle: always track and make progress in your expensive jobs.

def poll_gitplace(user, client)
  time = user.gitplace_last_sync

  client.get_pull_requests({ created_after: time }).each do |pull_request|
    comments = client.get_pull_request_comments(pull_request)

    unless user.action_items.find { |s| s.gitplace_id == pull_request.id }
      create_action_item(user, pull_request, comments)
    end

    time = pull_request.created_at
  end
ensure
  user.update!({ gitplace_last_sync: time })
end

In the above code sample, we store the created_at field of the last synced pull request. When polling Gitplace, we only retrieve pull requests created after the last synced request. The ensure block then guarantees the stored progress tracker gets updated regardless of whether there is an exception. That way, even if our job retries, we’ll only make the API requests necessary to continue.

Note: This approach does have concurrency implications that are outside the scope of this post.

Track your failures

While our code can now handle all the ephemeral errors we can throw at it, our monitoring systems are a different story. At Monolist, we use Sentry for exception tracking and alerting, but the following strategies are applicable regardless of how your team handles exceptions.

Obviously, we can’t have 100,000 irrelevant exceptions flooding our Sentry daily. However, we can’t catch the exceptions in code, because we still need them to propagate up to Sidekiq so that the jobs are retried. We also don’t want to blindly ignore these exceptions -- while each incremental ephemeral error is usually just noise, we want to know if the rate of ephemeral errors changes significantly, since that can point to a larger problem with our integration, network, or service.

def track_exception(exception)
  redis.hincrby("tracked_exceptions", exception.class.to_s, 1)
  raise Monolist::TrackedException.new(exception)
end

def poll_gitplace(user, client)
  time = user.gitplace_last_sync

  client.get_pull_requests({ created_after: time }).each do |pull_request|
    comments = client.get_pull_request_comments(pull_request)

    unless user.action_items.find { |s| s.gitplace_id == pull_request.id }
      create_action_item(user, pull_request, comments)
    end

    time = pull_request.created_at
  end
rescue Gitplace::ConnectionTimeout => e
  track_exception(e)
ensure
  user.update!({ gitplace_last_sync: time })
end

In the end, we came up with a simple solution where we explicitly track each ephemeral exception that we’re aware of. We wrap the exceptions in a Monolist::TrackedException wrapper which we’ve added to our Sentry blacklist so we don’t see them in Sentry. Since Monolist::TrackedException is still an exception, Sidekiq will still retry the job as usual. At the same time, we increment a counter in Redis to keep track of the number of exceptions we’ve seen.
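A plausible shape for that wrapper is below. This is illustrative rather than Monolist’s exact class: the essentials are that it is still a StandardError (so Sidekiq retries as usual), that it has a single class name a Sentry blacklist can match on, and that it preserves the original exception’s details for logging.

```ruby
module Monolist
  # Wraps a known ephemeral exception so it can be blacklisted in
  # Sentry by class name while still triggering Sidekiq retries.
  class TrackedException < StandardError
    attr_reader :wrapped

    def initialize(exception)
      @wrapped = exception
      super("#{exception.class}: #{exception.message}")
    end
  end
end
```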

[Image: Grafana exceptions graph]

In our monitoring systems, we’ve surfaced the “tracked_exceptions” key in Redis to our Prometheus instance. This allows us to add graphs like the one above to our dashboards, and to alert when the rate of exceptions changes significantly.
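The export step can be sketched as follows, with hedged, hypothetical names: it turns the “tracked_exceptions” hash (exception class name mapped to a count, as read from Redis with HGETALL) into Prometheus text exposition format. In production this would live in an exporter or a custom collector rather than a bare method.

```ruby
# Render the tracked-exceptions counts (class name => count) in
# Prometheus text exposition format, as a counter with a class label.
def tracked_exceptions_metrics(counts)
  lines = ["# TYPE monolist_tracked_exceptions_total counter"]
  counts.each do |klass, count|
    lines << %(monolist_tracked_exceptions_total{class="#{klass}"} #{count})
  end
  lines.join("\n")
end
```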

Conclusion

While the APIs we interface with at Monolist vary wildly, we’ve consistently found that the quality of our integrations depends heavily on how resilient we are to API errors, and how much of that complexity we can hide from the end user.

By abstracting away retry behavior, ensuring that jobs are idempotent, and making sure that we’re always getting closer to success, our end users are fully oblivious to the errors, and can focus on staying productive, writing code, and being the best they can be at their jobs.

Don’t believe me? Try out Monolist for yourself.

Follow us on Twitter