Our Cloud SQL instances are a central part of the Glean index building architecture. Among other things, they serve as a repository for all our crawled data which eventually makes its way to the served index. They also store other information like answers created by Glean users. When we started Glean, we saved all this in Google Big Table, but later moved it to Cloud SQL due to cost considerations. Historically, all our Cloud SQL instances were MySQL5 instances. We decided to migrate these instances to MySQL8 recently after GCP started supporting MySQL8.
In addition to using the latest and greatest version, the main motivation behind moving with the migration were the JSON improvements introduced as a part of MySQL8. Before MySQL8 we had cases where some upgrades took close to 48 hours to complete as some Alter Table commands operating on the JSON columns took a long time. We believed using MySQL8 would enable us to significantly speed up such operations as we might be able to update JSON in place.
Given the above constraints, we decided to use GCP DMS to migrate our DBs from MySQL5 to MySQL8. DMS is a Google recommended service when moving SQL workloads from other infrastructure to Google managed cloud SQL instances. We decided to use the continuous type of migration jobs which use an initial dump phase and then use primary secondary replication in the CDC phase. We monitored the replication lag and when the replication lag was low, we were ready to switch from the old instance to the new instance. While this condition may be satisfied at any time of the day, we switched the instance only during nights and weekends so that customer service disruption was minimized.
Historically our upgrades were run by the push_master who is continuously monitoring the upgrades. They mostly run for a period of 2 hours. However given the large data size (point 1 above), the SQL upgrade was expected to run for many days. Also we had to upgrade close to 75 projects with 2 cloud SQL instances each which would mean a manual upgrade cumbersome. We decided to implement what we call the AsynchronousUpgradeframework to solve this issue.
At a very high level, this is a state machine. The current state of the upgrade is stored in a GCS bucket in a file called the state file. There is a periodic tick that comes in, loads the state file and sends GCP queries to look into the current state of the migration. Then depending upon these 2 inputs the state machine may update the state, and take some action. One example of this can be, the current state of migration jobs is running, we look into the replication lag and it’s quite high so we stay in the same state. However, in the case the replication lag is found to be low, and the use of the system is low (point 4 above), we decide to make a state transition by patching the new instance, switching it and triggering the restarts and redeployments.
All the operations performed by this framework need to be idempotent in case there are intermittent issues. Hence they need to leave the system in a consistent state for the next tick.
While a detailed comparison of the performance of MySQL8 vs MySQL5 for the Glean workloads could be the topic of a separate post, some early indications include:
Over the course of the project we learned many things.
Fill out the details below to get the full report delivered to your inbox.
While frequent iterative releases are helpful when making ranking improvements, they can be tricky to pull off, especially in a stateful service. Here’s how we do it.
Testing and deploying our search engine cluster is critical for delivering a best-in-class search product. Here’s how we do it.
Discover through Glean’s latest study why long-term employee success hinges on great onboarding, and how information inaccessibility can cost you.
It takes one year and seven months for most workers to thrive at a new job. Discover why, and how to better set employees up for success.
Glean’s single-tenant model of service deployment guarantees best-in-class security – but it also results in complex monitoring challenges as we scale our customer base. Discover how our unique in-house tool comprehensively tackles the problem while improving our productivity and error visibility.