Network Debugging


Unlike digital hardware design and software engineering, where there is an abundance of theoretical foundations and practical tools for verifying and debugging systems, the field of networking mostly relies on rudimentary tools such as ping and traceroute, and on the accrued wisdom and intuition of network administrators, to operate networks correctly.

Debugging networks is only becoming harder as networks grow bigger (a modern data center may contain 10,000 switches, a campus network may serve 50,000 users, and a 100 Gb/s long-haul link may carry 100,000 flows) and more complicated (there are over 6,000 RFCs, router software is built on millions of lines of source code, and network chips often contain billions of gates).

Troubleshooting a network is difficult for three reasons. First, the forwarding state, the set of rules that determines how an incoming packet is processed and forwarded by a network box, is distributed across multiple routers and firewalls and is defined by their forwarding tables, filter rules and other configuration parameters. Second, the forwarding state is hard to observe, because reading it typically requires manually logging into every box in the network. Third, many different programs, protocols and humans update the forwarding state simultaneously.

Header Space Analysis

To address these challenges, we started by defining a protocol-independent abstraction model for packets and for the forwarding functionality of network boxes, called Header Space Analysis (HSA) [1]: a packet, based on its header bits, is modeled as a point (and a flow as a region) in a {0,1}^L space called Header Space. Each dimension of the header space corresponds to one bit in the packet header, and L is an upper bound on the number of header bits we care about.
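
To make this concrete, here is a minimal sketch (illustrative, not the implementation from [1]) that represents a header-space region as a string over {0, 1, x}, where x is a wildcard bit, and defines intersection of two regions bit by bit:

    # Minimal sketch of a header-space region as a ternary string of
    # '0', '1' and 'x' (wildcard) bits; names are illustrative only.

    def intersect_bit(a, b):
        # Intersection of two ternary bits; None means empty.
        if a == 'x':
            return b
        if b == 'x' or a == b:
            return a
        return None  # '0' intersected with '1' is empty

    def intersect(h1, h2):
        # Intersection of two header-space regions of equal length L.
        out = []
        for a, b in zip(h1, h2):
            bit = intersect_bit(a, b)
            if bit is None:
                return None  # the regions do not overlap
            out.append(bit)
        return ''.join(out)

    # Example with L = 4: packets whose first bit is 1, intersected
    # with packets whose last bit is 0.
    print(intersect('1xxx', 'xxx0'))   # -> '1xx0'
    print(intersect('1xxx', '0xxx'))   # -> None (empty region)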

Networking boxes receive packets on an incoming port, process them based on the forwarding state of the box, and forward them to one or more outgoing ports. HSA therefore models a networking box as a Switch Transfer Function T(h, p), which transforms a packet with header h received on input port p into a set of packet headers on output ports: {(h1, p1), (h2, p2), ...}. In other words, h is a point in the header space of the input port p that is transformed into a set of points h_i in the header spaces of the output ports p_i.
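
As a toy example (with assumed rule semantics, not a model of any real box), the transfer function of a hypothetical two-port box that forwards packets whose first header bit is 1 from port 1 to port 2, rewriting that bit to 0, could look like this:

    # Illustrative transfer function T(h, p) for a hypothetical 2-port box:
    # headers starting with '1' arriving on port 1 are rewritten and sent
    # out of port 2; everything else is dropped.

    def T(h, p):
        # h: header as a string of '0'/'1' bits, p: input port number.
        # Returns a set of (header, output_port) pairs.
        if p == 1 and h.startswith('1'):
            return {('0' + h[1:], 2)}
        return set()  # dropped

    print(T('1010', 1))   # -> {('0010', 2)}
    print(T('0010', 1))   # -> set() (no matching rule)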

This easy-to-use formalism abstracts away the complexity of protocols and the vendor-specific semantics of network boxes, and gives us a model for analytically proving properties about networks that are otherwise hard to ensure. For example, in a typical multi-protocol network it is hard to determine the reachability between two ports, i.e. whether packets can get from port A to port B and, if so, which packets. Using HSA, one can build a switch transfer function for each box in the network by reading and parsing its forwarding state, and then compose those transfer functions to answer reachability questions.
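
A minimal reachability sketch, assuming the ternary header representation above and simple dictionaries for transfer functions and links (all names are illustrative), propagates the all-wildcard region hop by hop:

    # Sketch of HSA-style reachability: inject the all-wildcard region at
    # a source port, apply the transfer function of the box owning the
    # current port, then follow the link to the next box.

    def reachable_headers(src_port, dst_port, T_at, links, L, max_hops=16):
        # T_at: dict mapping an input port to its box's transfer function
        # links: dict mapping an output port to the input port it feeds
        frontier = [('x' * L, src_port)]          # all packets, at src_port
        reached = []
        for _ in range(max_hops):
            next_frontier = []
            for h, p in frontier:
                for h2, p_out in T_at[p](h, p):
                    if p_out == dst_port:
                        reached.append(h2)        # these headers reach dst
                    elif p_out in links:
                        next_frontier.append((h2, links[p_out]))
            frontier = next_frontier
            if not frontier:
                break
        return reached

    # Toy usage: two one-rule boxes in a chain, A (port 1 -> 2) and
    # B (port 3 -> 4), with a link wiring A's port 2 to B's port 3.
    T_A = lambda h, p: {(h, 2)} if p == 1 else set()
    T_B = lambda h, p: {(h, 4)} if p == 3 else set()
    print(reachable_headers(1, 4, {1: T_A, 3: T_B}, {2: 3}, L=4))   # -> ['xxxx']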

In [1], we used HSA to develop algorithms and build tools that answer questions critical to the operational correctness of networks: finding reachability, detecting forwarding loops, checking whether a loop is infinite, and ensuring isolation of network slices. We proved that, under realistic constraints, these algorithms scale quadratically with forwarding table size and linearly with network diameter, and we demonstrated their performance on enterprise-size networks such as the Stanford University backbone.
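
Loop detection follows the same propagation pattern as the reachability sketch above; the simplified sketch below (the algorithm in [1] additionally distinguishes finite from infinite loops) flags a loop when a propagated region revisits a port:

    # Simplified loop-detection sketch, using the same T_at/links model as
    # the reachability sketch: a loop is reported when a propagated region
    # arrives at an input port it has already visited.

    def find_loop(start_port, T_at, links, L, max_hops=16):
        # Each frontier entry: (region, current input port, visited ports).
        frontier = [('x' * L, start_port, {start_port})]
        for _ in range(max_hops):
            next_frontier = []
            for h, p, visited in frontier:
                for h2, p_out in T_at[p](h, p):
                    p_next = links.get(p_out)
                    if p_next is None:
                        continue                   # packet leaves the network
                    if p_next in visited:
                        return (h2, p_next)        # region h2 loops back to p_next
                    next_frontier.append((h2, p_next, visited | {p_next}))
            frontier = next_frontier
            if not frontier:
                break
        return None                                # no loop found within max_hops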

Real Time Network Policy Checking

Software Defined Networks (SDNs) decouple the control plane from the data plane. This gives us a logically centralized location at which to change the forwarding state of the network and to observe all the state changes that occur. It is a new opportunity to verify every state change against the design policies and invariants of the network before the change hits the data plane. Inspired by this, we designed and implemented a system called NetPlumber [2], which can verify, in real time, a wide range of policies including reachability between ports and the absence of forwarding loops and black holes. NetPlumber runs on an up-to-date snapshot of the network; when a change occurs, it performs an incremental computation to update the check results. In essence, NetPlumber is a fast knowledge base that reflects the dependencies among the captured network states using a graph called the plumbing network.
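
The following sketch is our own simplification of the plumbing-network idea, reusing the intersect() helper from the header-space sketch above (class and field names are assumptions): rule nodes push only the flows that change along their pipes, and probe nodes re-check a policy only on the flows they actually receive.

    # Simplified plumbing-graph sketch: each rule is a node, a "pipe"
    # connects rule a to rule b when packets output by a can match b,
    # and probe nodes check policies on the flows that reach them.

    class RuleNode:
        def __init__(self, match, rewrite=None):
            self.match = match        # header-space region the rule matches
            self.rewrite = rewrite    # optional rewrite function on regions
            self.pipes = []           # downstream RuleNode / ProbeNode objects
            self.flows = []           # regions currently passing through

        def receive(self, region):
            hit = intersect(region, self.match)   # intersect() from the HSA sketch
            if hit is None:
                return                            # nothing new flows through
            out = self.rewrite(hit) if self.rewrite else hit
            self.flows.append(out)
            for node in self.pipes:
                node.receive(out)                 # incremental propagation

    class ProbeNode:
        def __init__(self, policy):
            self.policy = policy      # predicate over a region, e.g. "must be empty"

        def receive(self, region):
            if not self.policy(region):
                print('policy violated by region', region)

When a rule is installed or removed, only the pipes touching that rule node are updated and only the downstream flows and probes affected by it are recomputed, which is what keeps per-update checking fast.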

We tested NetPlumber on three networks: Google's inter-datacenter WAN, the largest deployed SDN; the Stanford University backbone network; and the Internet2 nationwide network. With NetPlumber, checking the compliance of a rule update against a single policy takes 50-500 microseconds on average. This implies that NetPlumber can check the compliance of, on average, 2,000 rule updates (installations or removals of forwarding rules) per second.

Automatic Test Packet Generation

The only way to detect a data plane problem, such as a link or line card failure, high delay or low bandwidth, is through active testing or passive monitoring of the network at run time. Such problems manifest as a mismatch between the captured state of the network and its actual behavior. To detect them, we developed a framework called Automatic Test Packet Generation (ATPG) [3] that automatically generates a minimal set of packets to test the liveness of the underlying topology and the congruence between the data plane state and the observed behavior.

ATPG detects and diagnoses data plane errors by independently and exhaustively testing all forwarding entries, firewall rules, and any other packet processing rules in the network. This is the same idea as in software testing, where a good test suite provides complete coverage of the program, hitting every line and branch of the code. In ATPG, test packets are generated algorithmically from the device configuration files and FIBs, using the minimum number of packets required for complete coverage. ATPG uses the Header Space Analysis framework to generate test packets, so it can handle constraints on the location of test terminals and on the headers of test packets. The resulting test packets are fed into the network so that every rule is exercised directly from the data plane. Since ATPG treats links just like normal forwarding rules, full coverage guarantees that every link in the network is tested as well. Once an error is detected through test packets (e.g., a missing or unexpectedly delayed packet), a fault localization algorithm is invoked to pinpoint the exact location of the error.
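
The packet-selection step can be viewed as a Min-Set-Cover problem; the sketch below shows a greedy version on illustrative data (the candidate packets and the rule sets they exercise are assumed inputs, produced in practice from the HSA reachability analysis, not taken from [3]):

    # Greedy Min-Set-Cover sketch for test-packet selection: keep picking
    # the candidate packet that covers the most still-uncovered rules.

    def select_test_packets(candidates, all_rules):
        # candidates: dict mapping a test-packet id to the set of rule ids
        # it exercises along its path; all_rules: set of rule ids to cover.
        uncovered = set(all_rules)
        chosen = []
        while uncovered:
            best = max(candidates, key=lambda pkt: len(candidates[pkt] & uncovered))
            gained = candidates[best] & uncovered
            if not gained:
                break                    # remaining rules cannot be exercised
            chosen.append(best)
            uncovered -= gained
        return chosen, uncovered         # selected packets + untestable rules

    packets = {'p1': {'r1', 'r2'}, 'p2': {'r2', 'r3'}, 'p3': {'r3'}}
    print(select_test_packets(packets, {'r1', 'r2', 'r3'}))
    # -> (['p1', 'p2'], set())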

We tested ATPG on two real-world network snapshots, the Stanford University backbone and Internet2, and found that a small number of test packets suffices to test the maximum number of rules in these networks (4,000 and 35,000 packets, respectively). Sending these test packets 10 times per second consumes less than 1% of link capacity.