Incrementally deploying ranking code at Glean

Jeremy Lilley

Engineering

In Glean search, we’re always trying to return the most relevant results. When we merge results from a variety of sources—from Slack threads to Jira bugs to O365 docs—there are many dimensions in which to experiment with our ranking functions.

To help build these features, we wanted frequent iterative releases for our ranking team. But rapid turn-around can be tricky in the context of a stateful service. In particular, our main Index Servers need to preload significant amounts of index data into memory when they restart. In many cases, preloading this data could take 15+ minutes, which required either scheduled service downtime or somewhat involved mitigation.

In practice, with this constraint, we found we weren’t deploying customer updates as often as we wanted. And our engineers found that their development cadence was unnecessarily slowed down by long server restart times.

How did we fix this problem, and move towards faster incremental deployment?

We realized that much of the problem would be solved if, in the common case, we could avoid restarting our Index Server to receive code updates.
Fortunately, this particular server was written in Java, which has a mechanism for dynamically loading code. After looking into custom ClassLoaders, we prototyped loading a Java JAR archive from Cloud Storage and using the classes in our process, without a restart.
Even with the ability to load code from Cloud Storage without restarting, Java only allows a single implementation for a given class name. That said, we realized we could add the release tag to our class package names. For instance, for the class com.glean.ranking.TermInfo, we could load any number of versions in the server if we wrote a tool to rewrite and put the release tag in the name: com.glean.ranking.release123.TermInfo, com.glean.ranking.release124.TermInfo, etc. We used some open source libraries (e.g. ObjectWeb ASM) to help build this remapping tool. In the end, we had a script to remap the build artifacts for a specified release tag and upload to the cloud in seconds.
With the ability to load multiple versions of code dynamically from Cloud Storage, it was then a matter of plumbing these release tags through the system, so that the correct release or experiment was being used for a given query.
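
The naming scheme and the load step above can be sketched as follows. This is a minimal illustration, not Glean's actual tool: the class `ReleaseNaming`, its methods, and the choice of `com.glean.ranking` as the only remapped package are assumptions made for the example. It also assumes the JAR has already been rewritten so that its classes live under the release-tagged package, and that it is reachable through a plain URL (for instance a local copy fetched from Cloud Storage). The bytecode rewriting itself, which the article does with ASM, is not shown.

```java
import java.net.URL;
import java.net.URLClassLoader;

/** Hypothetical helper; names are illustrative and not taken from Glean's code. */
final class ReleaseNaming {
    // Assumed root package whose classes are remapped per release.
    static final String BASE_PACKAGE = "com.glean.ranking";

    private ReleaseNaming() {}

    /**
     * Inserts the release tag right after the base package, e.g.
     * com.glean.ranking.TermInfo -> com.glean.ranking.release123.TermInfo.
     * Names outside the base package are returned unchanged.
     */
    static String taggedName(String className, String releaseTag) {
        if (!className.startsWith(BASE_PACKAGE + ".")) {
            return className;
        }
        return BASE_PACKAGE + "." + releaseTag
                + className.substring(BASE_PACKAGE.length());
    }

    /**
     * Loads the release-tagged variant of a class from an already remapped JAR.
     * The loader is intentionally left open: the returned class needs it to
     * resolve its dependencies lazily.
     */
    static Class<?> loadTagged(URL jarUrl, String className, String releaseTag)
            throws ClassNotFoundException {
        URLClassLoader loader = new URLClassLoader(
                new URL[] {jarUrl}, ReleaseNaming.class.getClassLoader());
        return Class.forName(taggedName(className, releaseTag), true, loader);
    }
}
```

Because every release gets distinct fully qualified names, several releases can be loaded side by side in the same JVM without colliding.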

We soon found the advantages of this approach:

More frequent, less disruptive releases: Since server restarts were no longer normally needed, we started rolling out ranking changes more frequently—nightly for internal deployment, and weekly for customers.
Experimentation: We also started using this mechanism for developer experiments. Rather than coordinating use of a development cluster and waiting for a server restart, developers could just upload a JAR file and quickly see the results.
Strong versioning: Errors could be associated with specific release tags, and different experiments were isolated from each other.
Rollback support: Rolling back a problematic release could be done instantly with an online configuration change.
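
To make the experimentation and rollback points concrete, here is one way the per-query choice of release tag might look. It is a sketch under assumptions: `ReleaseSelector` and its methods are hypothetical names, and a simple in-memory field stands in for whatever online configuration system actually carries the default tag.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch: decides which release tag should serve a query. */
final class ReleaseSelector {
    // Tag used when a query does not request a specific experiment.
    private final AtomicReference<String> defaultTag;

    ReleaseSelector(String initialTag) {
        this.defaultTag = new AtomicReference<>(initialTag);
    }

    /**
     * An experiment tag carried by the query takes precedence; a null
     * experimentTag means "use the currently configured default".
     */
    String tagFor(String experimentTag) {
        return experimentTag != null ? experimentTag : defaultTag.get();
    }

    /**
     * Stands in for an online configuration change. Rolling back is the
     * same call with the previous tag; no restart is involved.
     */
    void setDefaultTag(String tag) {
        defaultTag.set(tag);
    }
}
```

With this shape, a rollback from `release124` to `release123` is just `setDefaultTag("release123")`, while a developer's query that names its own tag is unaffected by the default.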

As with any implementation, some tuning was needed to make the mechanism work well. For instance:

Deciding which sets of packages to remap (the initial version didn’t remap some protobuf packages, which we quickly realized was a mistake!)
Caching/loading appropriately to avoid duplicate JAR file scans
Suppressing parallel fetches for the same classes
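
The caching and duplicate-fetch points can be illustrated with a small per-tag cache. This is a sketch, not the production code: `ReleaseCache` is a hypothetical name, and the `loader` function is assumed to do the expensive work (fetching the JAR and scanning its classes) for one release tag.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Hypothetical per-release cache; T is whatever a loaded release exposes. */
final class ReleaseCache<T> {
    private final ConcurrentMap<String, T> byTag = new ConcurrentHashMap<>();
    // Assumed to fetch and scan the JAR for a release tag (the slow part).
    private final Function<String, T> loader;

    ReleaseCache(Function<String, T> loader) {
        this.loader = loader;
    }

    /**
     * Returns the cached release, loading it at most once per tag.
     * ConcurrentHashMap.computeIfAbsent applies the loader atomically per
     * key, so concurrent callers asking for the same tag wait for a single
     * load instead of each starting their own fetch.
     */
    T get(String releaseTag) {
        return byTag.computeIfAbsent(releaseTag, loader);
    }
}
```

One trade-off worth noting: `computeIfAbsent` can block other updates to the map while the loader runs, so a slow fetch may briefly delay loads of unrelated tags. Caching a `CompletableFuture` per tag is a common alternative when that matters.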

We found that the first use of a given release tag adds about a second of latency, given the Cloud Storage retrieval and JAR decoding overhead, but that subsequent uses are cached and indistinguishable from a regular implementation. Hence, for production releases, we typically send a warmup request before switching the configuration.
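
The warm-then-switch ordering can be sketched as below. The names are hypothetical: `ReleaseSwitch` is invented for the example, and the `warmup` callback is assumed to issue a throwaway request against the new tag so that the JAR is fetched and decoded before any real traffic sees it.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

/** Hypothetical sketch: warm a release before making it the live one. */
final class ReleaseSwitch {
    private final AtomicReference<String> liveTag;
    // Assumed to run a throwaway request against the given tag.
    private final Consumer<String> warmup;

    ReleaseSwitch(String initialTag, Consumer<String> warmup) {
        this.liveTag = new AtomicReference<>(initialTag);
        this.warmup = warmup;
    }

    String liveTag() {
        return liveTag.get();
    }

    /**
     * Pays the first-use cost up front, then flips the live tag.
     * If the warmup throws, the exception propagates and the live tag
     * is left unchanged.
     */
    void promote(String newTag) {
        warmup.accept(newTag);
        liveTag.set(newTag);
    }
}
```

Ordering the two steps this way means the roughly one-second first-use cost is absorbed by the warmup request rather than by a user's query.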

That said, being able to dynamically load new releases and experiments without restarting a stateful service gives us the flexibility to iterate quickly on improving the Glean service.

If building or using a best-in-class search product sounds interesting to you, reach out!
