By Jing Zhao, DevOps Software Engineer at DoorDash
Monitoring is hugely important, especially for a site like DoorDash that operates 24/7. In modern-day DevOps, monitoring allows us to be responsive to issues and helps us maintain reliable services. At DoorDash, we use StatsD heavily for custom metrics monitoring. In this blog post, we take a deeper look at how DoorDash uses StatsD within its infrastructure.
According to The News Stack:
“StatsD is a standard and, by extension, a set of tools that can be used to send, collect, and aggregate custom metrics from any application. Originally, StatsD referred to a daemon written by Etsy in Node.js. Today, the term StatsD refers to both the protocol used in the original daemon, as well as a collection of software and services that implement this protocol.”
According to Etsy’s blog:
StatsD is built to “Measure Anything, Measure Everything”. Instead of TCP, StatsD uses UDP, which provides desirable speed with little overhead as possible.
At DoorDash, we use StatsD configured with a Wavefront backend to build a responsive monitoring and alerting system. We feed more than 100k pps (points per second) to Wavefront via StatsD, both from the infrastructure side and the application side. We use StatsD and Wavefront to monitor the traffic throughput of different API domains, RabbitMQ queue lengths, uWSGI and Envoy stats, restaurant delivery volumes and statuses, market health, etc. Wavefront’s powerful query language allows us to easily visualize and debug our time series data. The alerting system is flexible enough to notify us through email, Slack or PagerDuty based on severity. Well-tuned alerts helps us lower MTTD (Mean Time To Detect). We encourage our engineers to customize their own metrics to fully monitor their system’s health and performance.
As a startup company, scalable issues come with high growth rate. Last year, the scalability of our metrics infrastructure was challenged. Our Growth engineering team reported that the volume of our end-user activities was much lower than expected. After cross-referencing our tracking data in other sources, we confirmed that the problem lay within our monitoring system. Solving these scaling issues along the way led us to a more scalable infrastructure.
At the beginning, we had one StatsD process running on an eight core AWS EC2 instance.
The setup was quick and simple; however, when incidents happened, the load average alert on this EC2 instance never fires even if the StatsD process is overloaded. We didn’t notice the issue until Growth engineering team got paged for false alarms. Even though we have eight cores on the StatsD EC2, the Etsy version of StatsD we are running is single threaded. Thus the overall instance average of the CPU utilization was not enough to trigger the alert. We also spent some time to see how the Linux kernel handles UDP requests and gained some visibility into the server’s capacity. By looking at
/proc/net/udp, we can find the current queue size for each socket and whether it was dropping packets.
In the example above, there are lots of dropped packets and a high
1FBD (8125) at
local_address 00000000:1FBD(0.0.0.0:8125) which is listened by StatsD.
Meaning for some of the columns:
local_address: Hexadecimal local address of the socket and port number.
rx_queue: queue length for incoming UDP datagrams.
drops:The number of datagram drops associated with this socket. A non-zero number can indicate the StatsD was overloaded.
After understanding the problem better, we started to look for solutions. There are already plenty of blogs about how people scale their StatsD, but we also tried to explore other possible ways. For example, we tried to use NGINX as a UDP proxy. We did encounter some issues with the number of UDP connections. If we want to use NGINX, we also need to figure out a way to make consistent hashing for UDP requests so that metrics will always hit the same StatsD, otherwise counters won’t be able to accumulated correctly and gauge will be showing multiple StatsD. Also potentially unbalanced hashing will cause a certain StatsD node to be overloaded. So, we decided to pass the NGINX solution at the moment.
A quick patch we did to mitigate the packet loss issue (due to maxing out a single CPU core) was to setup something we called local StatsD. Basically we installed StatsD on the EC2 instance itself and, in rare cases, inside each application containers. It was an OK short term solution, but it increased our Wavefront cost since metrics were not as well batched as before. Also the higher cardinality made Wavefront time series queries slower.
To reduce the Wavefront cost and increase the performance, we needed to aggregate our metrics before sending to Wavefront. Each StatsD proxy is assigned to different ports on the EC2. Looking at
/proc/net/udp we know that it is useful if we can allocate more memory to the recv buffer so that we can hold more data during a traffic surge, assuming the StatsD process can consume the messages fast enough after the surge. There was some tuning we did with Linux kernel. We added the following configuration into
net.core.rmem_max = 2147483647 net.core.rmem_default = 167772160 net.core.wmem_default = 8388608 net.core.wmem_max = 8388608
Our Dasher dispatch system related applications pump a lot of matrices into Wavefront, which crashed our StatsD Proxy and StatsD EC2 often. So we started to hold three StatsD EC2 instances: dispatch StatsD for the dispatch system, monolith StatsD for our monolith application and global StatsD for all other microservices to share.
The previous solution worked for a while. But with continued growth, the sharded architecture could no longer enough to handle the traffic. And only a limited number of StatsD proxy processes and StatsD processes can run on a single host. We had a lot of dropped packets from the global StatsD server. Instead of putting multiple StatsD proxies and multiple StatsD’s on the same host, we built a tier of StatsD proxies fronting another tier of StatsD processes. In this horizontally scalable architecture, we can add more StatsD proxies and StatsD when there is any dropped data.
Special thanks to Stephen Chu, Jonathan Shih and Zhaobang Liu for their help in publishing this post.
See something we did wrong or could do better, please let us know! And if you find these problems interesting, come work with us!