How Glean uses Apache Beam Streaming to efficiently read a large corpus from MySQL
At Glean, we crawl corpora of hundreds of millions of documents every day. Once crawled, these documents need to be continuously parsed and processed to keep content, permissions, and statistics fresh, so that we can deliver the best possible search experience.
Running a batch job for this purpose was costly, fragile, and difficult to integrate with MySQL. With a streaming job, however, we are able to re-process every document at least once a day. In this blog post, we describe how we achieved this using Cloud SQL (MySQL) and Apache Beam on Google Cloud Dataflow. All the pipelines described here run on a single multi-core worker on Google Cloud Platform (GCP); while Dataflow offers massive horizontal scalability, this setup is sufficient for our use cases.
When a new document is crawled, its raw content is written to a MySQL table. We also maintain a Pub/Sub queue that is notified with the unique ID of the newly crawled document. The document is then processed to parse and extract structured information, which is written back to the same MySQL table and indexed by our search system. The Pub/Sub queue ensures this happens in real time. Documents also need to be re-processed periodically, both to handle any events dropped by the Pub/Sub queue and to re-compute statistics that change as the corpus evolves over time. This is achieved by continuously scanning and re-processing data from the MySQL table. In the rest of this blog post, we focus on how the system reads and processes data from MySQL.
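The crawl-side write path can be sketched as follows. This is a minimal illustration, not Glean's actual code: the table schema, column names, and `on_document_crawled` helper are hypothetical, sqlite3 stands in for MySQL, and an in-memory queue stands in for Pub/Sub.

```python
import queue
import sqlite3

# In-memory stand-ins: sqlite3 for the MySQL table, queue.Queue for Pub/Sub.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE documents (
        doc_id INTEGER PRIMARY KEY,
        raw_content TEXT,
        parsed_content TEXT  -- filled in later by the processing pipeline
    )
""")
notifications = queue.Queue()  # stand-in for the Pub/Sub topic

def on_document_crawled(doc_id, raw_content):
    """Persist the raw document, then notify the real-time pipeline by ID."""
    db.execute(
        "INSERT OR REPLACE INTO documents (doc_id, raw_content) VALUES (?, ?)",
        (doc_id, raw_content),
    )
    db.commit()
    notifications.put(doc_id)  # streaming consumers process this ID promptly

on_document_crawled(42, "<html>...</html>")
print(notifications.get())  # → 42
```

Publishing only the document ID keeps the queue messages small; consumers fetch the raw content from the table when they process the notification.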
To read documents from the MySQL table we use a streaming pipeline. Two main factors influenced our choice of streaming over batch: scalability and cost. With an ever-growing corpus, a weekly batch job was not scalable, whereas a streaming pipeline lets us process each document at least once a day. A significant portion of the cost benefit came from the fact that we already maintained a streaming pipeline to handle the Pub/Sub queue. Those machines were underutilized, and could therefore perform a streaming scan of the MySQL database at no additional cost.
To set up the streaming pipeline (see Figure 1), we use Apache Beam with the Google Cloud Dataflow runner. We define a custom UnboundedSource, each instance of which creates a custom UnboundedReader that queries a cache of documents to fetch the next document to process. This cache is maintained by a static scanner object, which issues several parallel database reads against the MySQL table based on the number of cores available on the worker. These reads fetch documents in batches ordered by document ID, ensuring each document is processed at most once per full scan of the corpus. With this setup, we achieve processing rates on the order of hundreds of documents per second on a single core (see Figure 2).
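The at-most-once-per-scan guarantee falls out of keyset pagination: each batched read selects documents strictly after the last document ID already seen, ordered by ID. A sketch of that query pattern, again with sqlite3 standing in for MySQL and hypothetical table and column names:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, raw_content TEXT)")
db.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [(i, f"doc-{i}") for i in range(1, 11)],
)

BATCH_SIZE = 4

def fetch_next_batch(last_doc_id):
    """Fetch the next batch of documents strictly after last_doc_id.

    Ordering by the primary key means a document can appear in at most
    one batch during a single scan of the corpus.
    """
    return db.execute(
        "SELECT doc_id, raw_content FROM documents "
        "WHERE doc_id > ? ORDER BY doc_id LIMIT ?",
        (last_doc_id, BATCH_SIZE),
    ).fetchall()

seen = []
last_id = 0
while batch := fetch_next_batch(last_id):
    seen.extend(doc_id for doc_id, _ in batch)
    last_id = batch[-1][0]

print(seen)  # → [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] — each ID exactly once
```

Because the predicate is on an indexed primary key, each batch is an efficient range scan; parallelism can then come from running several such readers over disjoint ID ranges.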
The scanner also tracks the cache state: the ID of the last document read, the number of documents left in the cache, and so on. It uses this state to decide when to fetch more documents and to detect when a full scan of the corpus is complete. At that point it resets its state and starts scanning the corpus from the beginning again, effectively presenting our streaming pipeline with an infinite queue of data.
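The scanner's bookkeeping can be sketched as a small state machine: refill the cache with ID-ordered batches, detect scan completion when a fetch comes back empty, and reset to start over, yielding an endless stream. As before, sqlite3 stands in for MySQL, and the class and field names are illustrative rather than Glean's actual ones:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE documents (doc_id INTEGER PRIMARY KEY)")
db.executemany("INSERT INTO documents VALUES (?)", [(i,) for i in range(1, 6)])

class CorpusScanner:
    """Endlessly yields document IDs, restarting after each full scan."""

    def __init__(self, batch_size=2):
        self.batch_size = batch_size
        self.last_doc_id = 0   # ID of the last document handed out
        self.cache = []        # documents fetched but not yet handed out
        self.scans_completed = 0

    def _refill(self):
        rows = db.execute(
            "SELECT doc_id FROM documents WHERE doc_id > ? "
            "ORDER BY doc_id LIMIT ?",
            (self.last_doc_id, self.batch_size),
        ).fetchall()
        if not rows:  # full scan complete: reset and start over
            self.scans_completed += 1
            self.last_doc_id = 0
            return
        self.cache.extend(r[0] for r in rows)

    def next_doc_id(self):
        """What an UnboundedReader would call to advance to the next record."""
        while not self.cache:
            self._refill()
        doc_id = self.cache.pop(0)
        self.last_doc_id = doc_id
        return doc_id

scanner = CorpusScanner()
stream = [scanner.next_doc_id() for _ in range(12)]
print(stream)  # → [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2]
```

In the real pipeline the equivalent of `next_doc_id` backs each UnboundedReader's advance step, so from Beam's perspective the source simply never runs dry.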
This design, and the optimizations behind it, keep us scalable and cost-effective from our smallest to our largest customers. We are constantly iterating to ensure our users can find the documents they need, when they need them, to get things done at work.
We’ll go into the details of additional SQL queries and optimizations in an upcoming blog post, so stay tuned! If you found this post interesting and would like to work on such systems, please reach out!