Meeting Notes
Neda Beheshti, Facebook
Presented the results of a 2016 experiment on Express Backbone, which links their datacenters. Experiments were run with TCP Reno, without ECN, pacing, or a max cwnd. Linux TC was present on some hosts to enforce a contractual max rate. They measured utilization, loss, and RTT, aggregated over minutes.
They reported one particular experiment between ATN and FRC where they saw congestion. Each switch had a 3Gb shared buffer on ingress cards. They tested three load-balanced links, setting the virtual queue of the first to 500,000 packets, the second to 2,000 packets, and the third to 200 packets.
They observed how the buffer size impacted network metrics, including packet loss and link utilization. The smaller buffers saw more packet drops at higher utilizations (maxing out at 0.15%). They observed very large RTTs (300ms) for the 500,000-packet link, while the 200- and 2,000-packet links were closer to 50-100ms. Link utilization was lower for the smaller buffers: 3-4% lower for the 2,000-packet link, and 6-10% lower for the 200-packet link.
Losses showed up when the smoothed utilizations were roughly more than 60%.
They also tested the effect on flows with a synthetic loadgen test. They saw the lowest flow completion times with the 2,000-packet buffer.
They would like a better understanding of how the buffer size impacts application level metrics, would like finer-grained metrics, and would like to explore the effects of alternate congestion control algorithms.
Lincoln Dale, Google
Presented the conclusions of experiments between edge caches and YouTube clients for a congested peer in LA. They have multi-100G interfaces which are fully utilized. They tested by reducing the buffer size on these interfaces until they began to see drops, and observed network metrics and video rate. The traffic was a mix of QUIC with Cubic, and was paced. End-to-end delay was about 10ms. There are control loops which automatically reroute traffic, which may have impacted the experiment. Concluded that 10ms of buffering is enough for all their workloads, and that 5-10ms is enough for all but about 10% of the queues.
TY Huang, Netflix
Presented the effects of buffer sizes on video QoE. Showed the results of a live experiment between a CDN PoP and a congested ISP peer (which was not cellular). They had two routers, one with a 25MB virtual queue and another with a 500MB virtual queue, and traffic was roughly (though not identically) load balanced between the two. The traffic was video, using BSD’s implementation of TCP New Reno with RACK and without pacing. They observed both network-level and application-level metrics.
For network-level metrics, the results were as expected. They observed a 100ms improvement in the median min RTT of a video for the smaller buffer, and a 150ms improvement in the median max RTT of a video. Utilization was similar; packet loss was about 2% with the small buffer versus about 0.8% with the large buffer.
For application-level metrics, there was a significant reduction in rebuffers for the small buffer (about 30%). All percentiles of clients saw a reduction in rebuffers. The startup play delay decreased by about one second for the smaller buffer, which led to 20-30% more supplemental videos played. The play delay was not lower across all percentiles: clients in the top 35% of play delay metrics had higher play delays with the small buffer, while the rest of the clients had lower play delays with the small buffer. This seemed like a reasonable tradeoff. Video quality was higher with the small buffer, across all percentiles.
They concluded that smaller buffers seemed to improve QoE, and that application-level metrics are another interesting thing to measure in buffer sizing experiments.
Hongqiang Liu, Alibaba
Presented interviews with coworkers at Alibaba on their thoughts on buffer sizing. People described the usual trade-offs between higher RTTs and utilization/loss. In the WAN they care more about utilization than cost, and vice versa for datacenters. Pointed out that legacy applications in private clouds can require large buffers. Raised a concern about live video having super spiky traffic patterns, which could necessitate large buffers.
They would be interested in congestion control which reduces buffer size requirements (e.g. HULL) and having microsecond granularity buffering metrics.
Ken Duell, AT&T
In practice nobody cares about academic sizing rules. AT&T has a large diversity of traffic, with predictability/burstiness ranging from scheduled backups to live TV to video downloads to carrier traffic. They need to design buffers to meet service levels for their customers. Unlike Google, they have no control over their customers’ endpoints.
Buffers vary from tens of MB per chip (TR, datacenters) to GBs (backbones, varying by location). Buffer cost is no longer an issue; they pay about $1.50/GB for extra memory. They cannot afford retransmissions on cellular networks.
They have high-water-mark metrics for buffer utilization. In the core, most buffers are lightly loaded except for a few “microbursts” per day: spiky 3-second intervals where utilization is high.
They can adaptively size buffers on routers, and often do so in response to problems. They find increasing buffers tends to eliminate some problems for cellular/voice traffic and last-mile traffic. The sizing range is between 5 and 10ms, not orders of magnitude.
Are working on a related paper submission for SIGCOMM 2019 (ConQuest). They would like to understand why these microbursts happen, and have a better sense of why changing the buffer by a little bit seems to fix some problems.
Joel Jaeggli, Fastly
A CDN provider. Each cache is attached to transit and IX providers, and the caches perform full routing functions. Described a couple of situations where links became congested and packets were dropped. They share the microburst concern, but claim microbursts are rare and that their impact on buffer sizing is hard to quantify. Their network is 20ms in diameter, so 40ms buffers are not an option; a maximum buffer of about 5ms is closer to ideal.
They have 10G switches with 8MB-12MB buffers. They note that inter-cache and WAN traffic share the same buffer, which makes RTT-based sizing less obvious. For them, buying switches with small buffers is cheaper. They are not especially concerned about loss, except SYN loss, which causes a 1-3 second timeout.
Simon Leinen, SWITCH (Switzerland)
Discussed operator concerns around buffer sizing. There are economic concerns; buffers are a good way for vendors to differentiate products with different prices. As operators, they don’t care much about the throughput of a single long-lived TCP flow. He would like to avoid buying a switch with a too-small buffer, and then being responsible for the resulting problems. Suggested building tools to provide transparency about the buffer needs of a particular router, and sharing experimental results.
Bob Briscoe, CableLabs
Working on standardizing the new DOCSIS modems. They have designed a two queue AQM for two groups of congestion control algorithms. One is for classic congestion control algorithms (Reno, Cubic, BBR, etc…), and another for new “L4S” congestion control algorithms (e.g. DCTCP) and light traffic. Suggested that TCP with tiny sawteeth would be better from a buffer requirements perspective.
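As a rough sketch of the two-queue idea (not the DOCSIS or L4S specification itself; the thresholds, coupling factor, delay model, and packet fields below are all assumed for illustration), the Python below separates traffic by ECN codepoint, marks the L4S queue at a shallow delay threshold, and couples its marking to the classic queue’s drop probability:

    import random
    from collections import deque

    ECT1 = 0b01  # L4S-capable senders signal with the ECT(1) codepoint

    class DualQueueAQM:
        """Toy two-queue AQM in the spirit of the dual-queue design described
        above. Thresholds, coupling factor, and delay model are hypothetical."""

        def __init__(self, l4s_mark_thresh_us=500, coupling=2.0):
            self.l4s_q = deque()
            self.classic_q = deque()
            self.l4s_mark_thresh_us = l4s_mark_thresh_us  # shallow step threshold
            self.coupling = coupling                      # couples L4S marking to the classic AQM
            self.classic_drop_prob = 0.0                  # e.g. driven by a PI-style controller

        def l4s_delay_us(self):
            # Placeholder delay estimate: assume ~1 us of queueing per packet.
            return len(self.l4s_q) * 1.0

        def enqueue(self, pkt):
            if pkt.ecn == ECT1:
                # L4S queue: mark at a very shallow delay threshold, plus coupled
                # marking so L4S flows yield when classic flows build a queue.
                if (self.l4s_delay_us() > self.l4s_mark_thresh_us
                        or random.random() < self.coupling * self.classic_drop_prob):
                    pkt.ce = True
                self.l4s_q.append(pkt)
            else:
                # Classic queue: conventional probabilistic drop (or mark).
                if random.random() < self.classic_drop_prob:
                    return  # drop
                self.classic_q.append(pkt)

The coupling term is what lets the two groups share capacity: as the classic AQM becomes more aggressive, L4S flows see proportionally more marks.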
The new DOCSIS modems will have metrics for queue length.
He is interested in how flows in slow-start impact buffer size requirements, and how different congestion control algorithms impact each other.
Chuanxiong Guo, Bytedance
Talked about buffering in Bytedance’s datacenters. The top of rack switches have buffers of 12-32MB, and the aggregation switches have larger buffers of several GB. Were hoping to use ECN to reduce buffers for aggregation switches, but argued that precise ECN marking on egress queue length in VoQs is hard. However, Arista and Barefoot said they’ve fixed this using the packet latency instead of queue length for ECN.
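A minimal sketch of the contrast being drawn (both thresholds are invented, and the two functions are illustrations rather than any vendor’s implementation):

    ECN_QLEN_THRESH_PKTS = 1000   # hypothetical egress queue-length threshold
    ECN_LATENCY_THRESH_US = 50    # hypothetical per-packet latency threshold

    def mark_on_queue_length(estimated_egress_qlen_pkts):
        # Classic approach: mark when the estimated egress queue is long. With
        # virtual output queues the true egress occupancy is spread across many
        # ingress-side queues, which is what makes precise marking hard.
        return estimated_egress_qlen_pkts > ECN_QLEN_THRESH_PKTS

    def mark_on_packet_latency(enqueue_ts_us, dequeue_ts_us):
        # Alternative attributed in the notes to Arista and Barefoot: mark based
        # on how long the packet actually spent queued inside the switch.
        return (dequeue_ts_us - enqueue_ts_us) > ECN_LATENCY_THRESH_US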
They have a lot of traffic for machine learning, where there’s a pretty straightforward metric: time to finish training. Are considering doing some machine learning to optimize buffer size.
Igor Gashinsky, Oath
Described Yahoo’s experience with buffer sizing. Most of their traffic is Reno and Cubic, without pacing, and they have 10G links. Instead of pooling connections, they multiplex many TCP flows into one flow to reduce flow startup cost. This makes them very loss-sensitive, since a single loss cuts the cwnd for hundreds of flows simultaneously. DCTCP worked poorly for them – noisy ECN caused significant underutilization.
They did the usual test where they took load-balanced peering links and set 100ms, 20ms, etc… buffers. They increased the buffer size until zero loss occurred. They found 23.5ms was about the right number for this.
Their internal rule of thumb for buffer sizing is two ports’ worth of BDP, with an RTT of 1.25ms. For a 40G link, this is 2 × 1.25 ms × 40 Gb/s ≈ 10MB of buffers.
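Spelled out with units, and assuming the rule is exactly as stated (2 ports × 1.25 ms RTT × 40 Gb/s), the arithmetic is:

    B = 2 \times \mathrm{RTT} \times C = 2 \times 1.25\,\mathrm{ms} \times 40\,\mathrm{Gb/s} = 100\,\mathrm{Mb} \approx 12.5\,\mathrm{MB}

At 8 bits per byte this comes to about 12.5 MB; the quoted 10MB follows if one uses the common back-of-envelope conversion of 10 bits per byte, or simply rounds down.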
Some of their applications had a timeout of 200ms, which was a much higher delay than usual. They found that by reducing the fast retransmission timer, they were able to do about as well as with a deep-buffer switch.
Parvin Taheri, Cisco
Discussed an AQM Cisco is developing to reduce queueing delay for datacenters. They observe that most flows are small, but lots of traffic is sent by a few large flows. Elephant/mice flows are automatically classified by total amount of data transferred, then elephant flows are marked/dropped at a much lower threshold than mice flows. Mice and elephants are also queued separately. They showed some synthetic experiment results suggesting this results in lower flow completion time for mice without much penalty for elephants.
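A schematic version of that classification (all constants and the packet fields are invented for illustration; this is not Cisco’s implementation):

    # Schematic elephant/mice AQM: a flow is promoted to "elephant" once it has
    # sent more than a byte threshold, and elephants are marked/dropped much
    # earlier than mice. All numbers are illustrative.

    ELEPHANT_BYTES = 1 << 20          # promote a flow after ~1 MB sent (hypothetical)
    MICE_MARK_THRESH_PKTS = 1000      # deep threshold for the mice queue
    ELEPHANT_MARK_THRESH_PKTS = 50    # much shallower threshold for elephants

    flow_bytes = {}                   # bytes seen so far per flow identifier

    def enqueue(pkt, mice_q, elephant_q):
        flow_bytes[pkt.flow_id] = flow_bytes.get(pkt.flow_id, 0) + pkt.size
        if flow_bytes[pkt.flow_id] > ELEPHANT_BYTES:
            # Elephants get their own queue and an aggressive mark/drop threshold.
            if len(elephant_q) > ELEPHANT_MARK_THRESH_PKTS:
                pkt.ce = True         # or drop, for non-ECN traffic
            elephant_q.append(pkt)
        else:
            # Mice keep a deep threshold so short flows rarely see marks or drops.
            if len(mice_q) > MICE_MARK_THRESH_PKTS:
                pkt.ce = True
            mice_q.append(pkt)

Short flows finish before crossing the byte threshold, so they never face the aggressive elephant threshold, which is what keeps mice flow completion times low.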
Francois Labonte, Arista
Arista buffers are usually tens of MB and on-chip; one family has GBs of buffer and uses DRAM. Everything is input queued. They report that customers will often start using a switch with a larger buffer and see different behavior. Arista would like to add more knobs/metrics around buffer sizing.
Golan Schzukin, Dune/BCM
Discussed buffers on Broadcom’s Jericho chip. It has 16MB of on-chip buffers and 6GB off-chip; which is used is configurable. It exposes per-queue buffering metrics: number of dropped packets, max queue size, packet latency, low water marks, rates, etc.
Chang Kim, Barefoot
Showed a demo of how to use DTEL, a P4 library for data plane telemetry. Showed a graph of queue latency over time, with reports emitted when the queue latency changed by more than a certain threshold. At a 256us threshold the plot looked odd; at 16us you could see a sawtooth. Showed similar graphs with 25 and 50 flows. Presented another demo which showed how a queue was divided between flows over time.
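The “report only when queue latency changes by more than a threshold” behavior from the demo can be pictured roughly as follows; DTEL itself is a P4 data-plane library, so this Python is only an illustration of the reporting logic, not its API, and the sample trace is synthetic:

    # Illustrative "report on change" telemetry filter, in the spirit of the demo.

    def latency_reports(samples_us, threshold_us=256):
        """Yield (timestamp, latency) pairs whenever queue latency changes
        by more than threshold_us since the last reported value."""
        last_reported = None
        for ts, latency in samples_us:
            if last_reported is None or abs(latency - last_reported) > threshold_us:
                last_reported = latency
                yield ts, latency

    # Example with a synthetic sawtooth of ~200us amplitude: a 256us threshold
    # suppresses the detail, while a 16us threshold makes the sawtooth visible.
    samples = [(t, 500 + 200 * ((t % 50) / 50)) for t in range(200)]
    coarse = list(latency_reports(samples, threshold_us=256))
    fine = list(latency_reports(samples, threshold_us=16))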