Optimizing performance and resource costs for a modern cloud-only architecture often results in interesting technical challenges. Here’s how we discovered and debugged a Golang memory leak.
At Glean, we’re building a modern cloud-only architecture to solve some hard enterprise search and knowledge management problems. Performance and resource cost optimization is critical, and it often leads us into interesting technical challenges. Debugging one of them led to a discovery that may be useful for others working on similar problems.
At Glean, we use Golang for a moderately memory-intensive service, and Google Cloud Platform (GCP) for most of our deployment. We run this service as an App Engine flexible environment instance using a custom runtime image that includes Go 1.15. We saw the following behavior in our Golang service:
The memory would slowly ramp up until it reached the limit (we were using 3GB as the App Engine resource limit in this case), and then the instance would get killed, likely for exceeding that limit. Looking at the memory graph, the steady ramp-up smelled like a memory leak:
Thankfully, an application-level memory leak is not hard to debug in our case, since we have continuous profiling data from Cloud Profiler. In the past, we have seen unclosed gRPC connections cause such issues, and those were easy to debug: the flame graph in the profiler UI clearly showed heavy usage at a specific call site. That was not happening here. One interesting thing the profiles did reveal was that the average heap size (i.e. the memory held by in-use objects) was around 1.5GB, roughly half the memory footprint App Engine was reporting. This meant the rest of the memory was being held somewhere by the Golang runtime. Our immediate next thought was memory fragmentation, since Go's garbage collector is non-moving and never compacts the heap.
Luckily, it wasn’t too hard to conclude that fragmentation was not the culprit either. We added a background goroutine that periodically logs runtime.MemStats. An upper bound on fragmented memory can be obtained by subtracting HeapAlloc from HeapInuse. As the runtime documentation puts it, “HeapInuse minus HeapAlloc estimates the amount of memory that has been dedicated to particular size classes, but is not currently being used.” This amount was fairly small, ~3MB in our case.
It was also interesting to see that the values for HeapReleased were fairly large. We started looking more and came across this thread on similar issues.
The theory in the Golang issue thread is that Go started using MADV_FREE by default on Linux in Go 1.12. With MADV_FREE, memory is not returned to the OS immediately; the OS reclaims it only when it comes under memory pressure. However, containers are essentially just processes running under separate cgroups. The OS, therefore, might not feel memory pressure and will not reclaim the memory, even though the container might hit its own memory limit and get killed.
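One rough way to see this gap is to compare what the Golang runtime thinks it is holding with the resident set size the kernel reports. The sketch below is an illustration under assumptions, not something we ran for this investigation: parseVmRSSKB is a hypothetical helper, the program is Linux-only, and Sys minus HeapReleased is only a coarse estimate, because Sys counts reserved virtual address space rather than resident memory.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// parseVmRSSKB (hypothetical helper) extracts the VmRSS value, in kB, from
// the text of /proc/<pid>/status. It returns false if the field is missing
// or cannot be parsed.
func parseVmRSSKB(status string) (uint64, bool) {
	for _, line := range strings.Split(status, "\n") {
		fields := strings.Fields(line)
		if len(fields) >= 2 && fields[0] == "VmRSS:" {
			kb, err := strconv.ParseUint(fields[1], 10, 64)
			return kb, err == nil
		}
	}
	return 0, false
}

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	// Coarse estimate of what the runtime still considers its own memory.
	runtimeKB := (m.Sys - m.HeapReleased) / 1024

	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("could not read /proc/self/status (Linux only):", err)
		return
	}
	rssKB, ok := parseVmRSSKB(string(data))
	if !ok {
		fmt.Println("VmRSS not found in /proc/self/status")
		return
	}
	fmt.Printf("runtime estimate: %d kB, kernel RSS: %d kB\n", runtimeKB, rssKB)
}
```

If HeapReleased is large while the kernel's RSS stays high, that is consistent with pages that have been marked free but not yet reclaimed.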
Fortunately, there’s a Golang debug setting that flips this behavior to use MADV_DONTNEED instead: set the GODEBUG environment variable to “madvdontneed=1”. In fact, Go 1.16 reverted to MADV_DONTNEED as the default. After this change, the memory graph looks much better and holds steady at 2GB.
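Since the setting lives in an environment variable, it is easy to lose in a deployment change. A small startup check like the sketch below can log whether it is present. The helper name madvDontNeedEnabled is hypothetical, and the check only inspects the GODEBUG string; it does not verify what the runtime actually does. Treating the last occurrence of a repeated key as the winner is an assumption of this sketch.

```go
package main

import (
	"log"
	"os"
	"strings"
)

// madvDontNeedEnabled (hypothetical helper) reports whether a GODEBUG value,
// a comma-separated list of name=value pairs, sets madvdontneed=1.
// If the key appears more than once, this sketch lets the last one win.
func madvDontNeedEnabled(godebug string) bool {
	enabled := false
	for _, kv := range strings.Split(godebug, ",") {
		kv = strings.TrimSpace(kv)
		if strings.HasPrefix(kv, "madvdontneed=") {
			enabled = strings.TrimPrefix(kv, "madvdontneed=") == "1"
		}
	}
	return enabled
}

func main() {
	if madvDontNeedEnabled(os.Getenv("GODEBUG")) {
		log.Print("GODEBUG sets madvdontneed=1")
	} else {
		log.Print("GODEBUG does not set madvdontneed=1; on Go 1.12 through 1.15 the runtime defaults to MADV_FREE on Linux")
	}
}
```

On Go 1.16 and later the check is unnecessary, because MADV_DONTNEED is the default again.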