INFOCOM 2000 SCRIPT

Comments are in square brackets. Comments include: 1) Pause - pause at this moment. 2) Stress - stress the next word. 3) Begin animation. 4) End animation. 5) Begin read from slide. 6) End read from slide.

1) [Start slowly, loud, clear] Good evening everybody. In this talk I shall present the Parallel Packet Switch. This is joint work done by my colleague Amr, my professor Nick McKeown, and myself, Sundar Iyer, at Stanford University.

2) Our goal, as described here, is to build an extremely high-speed packet switch. [Pause]

3) As an example, we would like to build the following multi-terabit output queued switch. The switch contains 64 ports, with each line interface running at the OC3072 rate; equivalently, each interface can be considered a stream of 4 times the OC768 rate. This corresponds to an interface speed of 160 Gbits/sec, so a 40-byte cell arrives once every 2 ns.

What are the potential problems that one might face in building such a switch? [Pause] To understand the main problem, note that a packet switch receives packets from various input ports, buffers these packets internally, and then switches them out.

4) [Begin animation] Irrespective of the architecture of this switch, each interface will have to receive packets at a rate of 160 Gb/sec, i.e. one cell arrives every 2 ns. Similarly, a cell will have to be read out once every 2 ns. The sequence of events is as follows: 1) Cell 1 arrives and is written into memory. 2) In the next 2 ns a cell is written in and the switch fabric also reads out a cell. 3) In the next 2 ns the same set of events occurs again. This may continue for all time slots. So we need to write and read cells in memory at a rate of two every two nanoseconds, i.e. once every nanosecond.

Optical technology has made it possible to have line rates of the order of 160 Gb/sec.
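[Editorial aside, not part of the spoken script: the access-rate arithmetic above can be checked with a short Python sketch, using the line rate and cell size quoted in the talk.]

```python
# Memory access requirement for one buffered 160 Gb/s interface.
line_rate_gbps = 160           # 160 Gb/s interface, i.e. 160 bits per ns
cell_bits = 40 * 8             # one 40-byte cell
cell_time_ns = cell_bits / line_rate_gbps     # time between cell arrivals
# One write (arrival) plus one read (departure) must fit in each cell time:
access_time_ns = cell_time_ns / 2
print(cell_time_ns, access_time_ns)           # 2.0 1.0
```

That is one memory operation every nanosecond, which is the access rate the talk argues conventional memories cannot sustain.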
However, fast optical memories, which could potentially buffer these packets, are as yet not economically viable, and conventional memories do not seem to scale to these high access speeds in the near future. Therein lies the main problem: "How can we buffer cells in a memory with an access time of 1 ns?" To generalize this problem, we ask: "How can one buffer cells at the line rate?" [End animation] [Pause] We would like you to ponder different techniques to solve this problem.

5) With this in mind I would like to re-state our main objective: "to build an extremely high-speed packet switch with memories running [stress] slower than the line rate". One fairly [stress] simple approach that we took is to use parallelism to solve this problem. We wanted to be able to use multiples of the fastest [stress] conventionally available switches to build the high-speed packet switch. Here I would like to define the specific, fairly obvious architecture which we used. We call this switch a [stress] "Parallel Packet Switch".

6) [Begin read from slide] A parallel packet switch is composed of multiple identical lower-speed packet switches operating independently and in parallel. An incoming stream of packets is spread, packet-by-packet, by a demultiplexor across the slower packet switches, and then recombined by a multiplexor at the output. [End read from slide]

7) This shows the architecture of the parallel packet switch. It is important to note the following four key aspects of the parallel packet switch.

1) There is no buffering anywhere other than in the central stages of the switch. All these buffers run at a line rate of R/k. Specifically, the inputs (the demultiplexors) and the outputs (the multiplexors) have no buffers, and hence there is no buffer running at the line rate of R.
2) Each of the central switches is chosen such that it can be built using existing conventional technology.

3) The parallel packet switch is a single-stage switch, i.e. each packet sees only one center stage switch on its path from an input to an output.

4) We shall assume, for the purpose of this talk, that the center stage switches are all output queued. Each of them has N ports, i.e. the same number as the big switch that we are trying to emulate, and these ports accept an input line interface rate of R/k. However, these switches could very well be built from a variety of suggested switching architectures, such as input queued, combined input-output queued, etc. The use of these techniques can further reduce the bandwidth of each of the internal buffers.

8) Now that we have seen the architecture, we would like to ask the following questions: [Pause] [Read from slide] 1) Can a parallel packet switch emulate a single big output queued switch? 2) Can it provide delay guarantees? Can it support strict priority, WFQ, and other qualities of service? [End read from slide] Can we throw any traffic pattern at it? Will it maintain the throughput of an output queued switch running at the line rate R?

9) In fact we will demand a stricter requirement. We want to know whether the PPS will precisely emulate an output queued switch. [Pause] What do we mean by precisely emulate? Consider the following "thought experiment". [Begin animation] The first switch is an output queued switch, and the one shown below is a parallel packet switch. Let both of these switches receive the same traffic pattern. Then the PPS is said to precisely emulate an OQ switch if all packets leave both switches at identical times. The comparator in the figure checks whether corresponding packets from the OQ switch and the PPS leave the system at the same time.
For the PPS, however, we will tolerate a small added latency for each cell, since it travels over a slow-speed internal switch. It may appear at this stage that the switch will be able to emulate an OQ switch. [Pause] In fact, since all middle stage switches are output queued, it appears that the PPS as a whole will behave like an OQ switch and should be able to precisely emulate it. [Long pause. Let that comment sink in]

10) We will now see an example of why this might not occur. Consider the scenario in this PPS with N inputs and 3 middle stage layers. [Begin Animation] Consider all packets which are destined to output port two. The numbers on these packets denote the order in which they should depart from the PPS.

1) For the moment, assume that packets 1, 2, 3, 4 and 5 are consecutive packets to output port two and are queued as shown.
2) Initially, packets 1, 2 and 3 are read one after the other from the three different center stage switches.
3) Since the center stage switches run at a line rate of R/3, it takes [stress] three time slots for each packet to reach the output.
4) In the 4th time slot packet 1 leaves the system and packet 4 starts being read from center stage switch 1.
5) In the 5th time slot packet 2 leaves the system and packet 4 is still being read over the slower-speed link running at a line rate of [stress] R/3.
6) In the 6th time slot packet 3 leaves the system and packet 4 has reached the output.
7) In the 7th time slot packet 5 starts being read from center stage switch 1, while packet 4 leaves the system.
8) In time slots 8 and 9 packet 5 is still being read from the center stage switch while the external link is idle.
9) In the 10th time slot packet 5 leaves the system.

Thus there are 2 time slots during which there is a cell waiting for an output while the external link sits idle.
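[Editorial aside: the departure times in this example can be reproduced with a small Python simulation. The layer assignments 1, 2, 3, 1, 1 are taken from the animation above; the model that each read holds a layer's R/3 link for three external slots, and that the output starts at most one new read per slot, is a sketch of this example only.]

```python
k = 3                                        # layers; internal links run at R/k
layer_of = {1: 1, 2: 2, 3: 3, 4: 1, 5: 1}    # which layer holds each cell
layer_free_at = {1: 1, 2: 1, 3: 1}           # first slot each layer link is free
depart = {}
next_read_slot = 1                           # output begins one read per slot
for cell in sorted(layer_of):
    layer = layer_of[cell]
    start = max(next_read_slot, layer_free_at[layer])
    layer_free_at[layer] = start + k         # the R/3 link is busy for k slots
    depart[cell] = start + k                 # cell leaves after its k-slot read
    next_read_slot += 1
print(depart)   # {1: 4, 2: 5, 3: 6, 4: 7, 5: 10}
```

An ideal output queued switch would release these cells in slots 4 through 8; the simulation shows packet 5 leaving in slot 10, two slots late, exactly because packets 4 and 5 share layer 1.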
[End Animation] Thus we see that if multiple inputs send [stress] consecutive packets for a given output to the same center stage switch, the PPS cannot emulate an OQ switch. Obviously, if all packets for a given output were sent to the same middle stage switch, the throughput of the switch over that output would fall to as low as R/k. [Pause]

But why did this occur? Clearly, it happened because packets 4 and 5 were queued in the same layer. [Pause] You might ask: "Can we not avoid this phenomenon by sending packet 5 to a different layer?" In the paper we show that if the PPS is implemented as is, this phenomenon cannot be avoided. We show a specific traffic pattern which leads to consecutive cells for a given output reaching the same center stage switch. Thus we show that sometimes an input has no [stress] choice but to send a cell to a switch which already contains an immediately preceding cell destined to the same output, as in this example, where packet 5 is sent to the same layer as packet 4. Why does this lack of choice occur? The following example shows this clearly:

11) [Begin Animation]
1) Let 4 cells be destined to output port 2 in the first time slot. Each of these cells is spread over the different center stage switches as shown. Specifically, cells one, two and three are sent to center stage switches one, two and three respectively. Cell 4 is sent to center stage switch one.

A note on the color scheme: [Begin explain color scheme] Packets sent to output port two are shown in red; these are the packets of interest to us. Packets shown in green are destined to some other output port. A white box means that no cell arrives in that time slot. [End explain color scheme]

2) In the next time slot, only input one receives a cell. The cells which arrived in the previous time slot are still being sent to their respective center stage switches.
Assume that this newly arriving cell is not destined to output port 2. Since it cannot be buffered, it must be sent immediately. As link one is still busy sending packet one towards output port two, the input uses an alternate center stage switch, in this case switch 2.

3) Similarly, in the third time slot the input port again receives a cell not destined to output port two, and it sends this cell on its next available free link, which is link 3.

4) In the 4th time slot, input port one receives a cell destined to output port two. This newly arriving cell is shown in red. Since the only link which is free is link one, the input is forced to send "packet 5" to center stage switch one, leading to...

12) ...the fact that packets 4 and 5, which both belong to the same output, are queued in the same center stage switch. [Long pause] So we see that we are unable to prevent consecutive packets from reaching the same center stage switch. Well, is there a way to increase the choice of switches from which a given input can choose, so that input port one could have sent packet 5 to center stage switch 2? [Long Pause: Allow this fact to sink in]

13) One obvious way to increase the choice of links is to increase the speed of the intermediate links. If we increase the speed of the internal links by a constant factor "S", then we call "S" the speedup factor. This is equivalent to increasing the number of intermediate links by a factor of "S". In fact, if the intermediate links were all sped up by a factor "k", where "k" is the number of center stage switches, then there would be ample choice, because all links would be free in every time slot. However, that defeats the purpose of building a PPS, since we do not want internal links to run at a rate of k(R/k) = R. Can we do better?
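[Editorial aside: the forced choice at input port one, and the effect of the speedup discussed in the talk, both fit in a few lines of Python. The sketch assumes input one sends one cell in each of slots 1 to 3 on links 1, 2 and 3 as in the animation, and that each transmission holds its link for ceil(k/S) external slots; it models this example only, not the paper's general traffic pattern.]

```python
import math

def choices_when_cell5_arrives(k, S):
    # Input one used links 1, 2, 3 in slots 1, 2, 3; each transmission
    # holds its internal link for ceil(k/S) external time slots.
    busy_slots = math.ceil(k / S)
    free_at = {link: link + busy_slots for link in (1, 2, 3)}
    # Cell 5 arrives in slot 4; list the links free by then.
    return sorted(link for link in free_at if free_at[link] <= 4)

print(choices_when_cell5_arrives(3, 1))   # [1]: no speedup, forced onto link 1
print(choices_when_cell5_arrives(3, 2))   # [1, 2]: speedup 2 restores a choice
```

With no speedup, link 1 is the only free link, so packet 5 lands in the same layer as packet 4; with a speedup of 2, a second layer opens up, which is exactly the extra choice the next slides exploit.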
We ask whether we can have sufficient choice with a low speedup to prevent a single center stage switch from being used consecutively for a given output. To understand this better, we shall study the effect that increasing the speed of the intermediate links has on choice. We will take the previous example and see what happens with a speedup of two.

[Begin animation]
1) Under the same traffic pattern, the cells are sent by the PPS to the center stage switches in the same manner as before. [Continue animation]
2) However, we see that when cell 5 arrives there is a choice of sending it to either switch 1 or switch 2. Layer one will already have completed sending cell 1, but link 2 is also available. This is because the packet on link 2 was sent two time slots ago, and since the line rate of the link is now 2R/3, that packet has completed transmission by the time packet 5 arrives. [End animation]

Hence, increasing the speed of the internal links surely gives more choice and helps prevent multiple consecutive cells from being sent to the same layer.

14) We shall now see the generic effect of speedup on choice. Consider an input which receives packets at a rate of R. Assume that it sends these packets to 10 center stage switches, and let the speedup be equal to 2. Thus each link runs at a rate of 2 times (R/10), which is R/5.

[Begin Animation]
1) Let a cell that arrives in time slot 1 be sent to any link, say link 1.
2) Let a cell that arrives in time slot 2 be sent to some other link, say link 2. Note that link 1 cannot be used, since it is still in the process of sending cell 1; link 1 is a slow-speed link running at a rate of R/5.
3) Similarly, when a cell arrives in time slot 3, we cannot use links 1 or 2, so we use a third link, say link 3.
4) Proceeding ahead, by the end of 5 time slots we are forced to use 5 distinct links. Notice that at the end of 5 time slots, link one becomes free once again.
Hence, in time slot 6 the input can choose from any of the 5 as-yet-unused links, or from the one old link which has now become available again. Thus, in general, in any time slot only the links used in the previous k/S time slots are still transmitting packets, and the link used exactly k/S time slots ago becomes available again. [End Animation]

Is this choice sufficient to prevent consecutive packets from being sent to the same center stage switch? [Pause: Let audience think about this.]

15) To understand this better, we shall now formalize these concepts. Consider any cell which arrives at input port i in time slot n. [Read from slide] We define the available input link set, AIL(i,n), as the set of layers to which external input port i can start writing a cell in time slot n. [End read from slide]

There is a similar constraint at the output. To understand this constraint, we will now define the departure time of a cell in a PPS.

16) [Begin animation]
1) We again refer to the output queued switch we are trying to emulate. Let a cell arrive at time n to both the FIFO OQ switch and the PPS.
2) The cell is switched in the output queued switch and is queued in the output queue for sending out.
3) The cell may take some time to reach the head of the output queue, as the cells ahead of it are read out, again in FIFO order.
4) Finally, at some time n' in the [stress] future, the cell is switched out. [End animation]

17) Hence the departure time of a cell arriving at the PPS at time n is its departure time from an equivalent FIFO OQ switch which receives the same arrival traffic pattern. This departure time is denoted by n'.

18) Now consider the same cell, which arrives at input port i in time slot n. Let us assume that it is destined to output port j and should leave the system in time slot n' in the future.
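[Editorial aside: the size of the input's choice set follows directly from the link-occupancy argument just given. A link stays busy for ceil(k/S) external slots per cell, so at most ceil(k/S) - 1 links can still be mid-transmission when a new cell arrives; a minimal Python sketch with the numbers from the 10-layer example:]

```python
import math

def min_AIL(k, S):
    # At most ceil(k/S) - 1 links can still be transmitting when a new
    # cell arrives, so at least k - ceil(k/S) + 1 links are free.
    return k - math.ceil(k / S) + 1

print(min_AIL(10, 2))   # 6: the 5 as-yet-unused links plus the one just freed
print(min_AIL(3, 1))    # 1: with no speedup an input can be left with no choice
```

The same count applies, by symmetry, at an output reading cells from the layers.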
[Stress, go very slow] Output j will have read the previous consecutive cells from certain middle stage switches. Hence, depending on which center stage switches were used, output j can read from only certain other switches in time slot n'. [Read from slide] Thus, we define AOL(j,n') as the set of layers from which external output j can start reading a cell in time slot n' in the future. [End read from slide]

Note that the AIL allows the inputs to write cells smoothly to different center stage switches. Similarly, the AOL allows cells to be read smoothly from different center stage switches. We want to see under what conditions we always have enough choice so that the inputs can write smoothly and the outputs can read smoothly.

19) This part is key to understanding the PPS. Consider a cell which arrives at input port i in time slot n and is destined to output port j in time slot n' in the future. The cell must be sent to a layer in its AIL set. Also, at the moment when the cell is read out at its departure time n', it must meet its AOL. Hence, if every cell meets both its AIL and AOL sets, then each input and each output will be able to write or read consecutive cells from different center stage switches.

[Begin animation] We shall use the same example and traffic pattern as previously shown. We consider only packets arriving at a specific input port, in this case input port one. Let us assume that these packets are destined to a specific output port, in this case output port "N". In the example shown, since the internal links have a speedup of two, both links 1 and 2 are free when packet 5 arrives.

1) The layers in green belong to the AIL of input port 1.
2) The AOL layers are also shown in green. Output port N will have read cells 1, 2, 3 and 4, in that order, from center stage switches 1, 2, 3 and 1 before trying to read cell 5 at the future time n'.
3) Hence the newly arriving cell can be read from only center stage switches 2 and 3.
4) Since the cell can be written only to switches 1 and 2, and must be read from either of switches 2 and 3, it can only be sent to the intersection of these two sets, i.e. center stage switch 2.

[Go slow] But what happens in a generic PPS? Does such a layer always exist?

20) To answer this question, we must now see what the choice sets look like. Though the actual size of these sets may vary, we can make a statement about their minimum size. In fact, the minimum size of these sets is given by: k - (the maximum number of links which can have cells in progress) = k - k/S + 1.

21) Hence, to assure choice to both inputs and outputs, we need that: [Begin minor animation] the intersection of AIL and AOL cannot be empty. Since these sets are sets of layers, it suffices that |AIL| + |AOL| > k. Thus we need (k - k/S + 1) + (k - k/S + 1) > k, which implies that S > 2k/(k+2). [End minor animation]

22) The following is the main result of the PPS: [Begin read from slide] [Begin minor animation] If S > 2k/(k+2), which is approximately 2, then each cell is guaranteed to meet its AIL and AOL. Hence, if S > 2k/(k+2), a PPS can precisely emulate a FIFO output queued switch for all traffic patterns. [End minor animation] [End read from slide] [Long Pause]

23) Until now we have tried to emulate a FIFO queuing strategy. But what did we start out to achieve? We wanted to create an output queued switch using existing switches. Such an output queued switch might send packets out in an order other than FIFO, i.e. it may re-order packets.

24) More precisely, quality of service is an important requirement of switches. We then asked whether the PPS could support a WFQ strategy.
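[Editorial aside: the counting argument of slides 20 and 21 can be checked numerically. A small Python sketch verifying that a fixed speedup of 2 satisfies |AIL| + |AOL| > k for every k, consistent with the bound 2k/(k+2) always staying below 2:]

```python
import math

def min_set_size(k, S):
    # Minimum size of AIL and of AOL (slide 20): k - ceil(k/S) + 1.
    return k - math.ceil(k / S) + 1

# Slide 21: a common layer is guaranteed whenever |AIL| + |AOL| > k.
for k in range(1, 101):
    assert 2 * min_set_size(k, 2) > k     # S = 2 always leaves an overlap
    assert 2 * k / (k + 2) < 2            # the algebraic bound is below 2
print("S = 2 suffices for k = 1..100")
```

For instance, in the 3-layer example each set has at least 3 - 2 + 1 = 2 members, and 2 + 2 > 3, so an intersection layer exists.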
We found that: [Begin read from slide] If S > 3k/(k+3), which is approximately 3, then a PPS can precisely emulate a switch with WFQ, or strict priorities, for all traffic patterns. [End read from slide]

25) Well, we started out trying to build a very high-speed switch. Are the results that we now have practical? It turns out that, as stated, they are not. In all our discussions up to now, in order to decide which layer to send a cell to, we needed two pieces of information. The AIL set is local to the input and is maintained by each input. However, all inputs needed precise information about the output, i.e. we needed to know how the cells were distributed among the center stage switches for a given output. Thus the AOL information had to be globally available to all inputs. This means a centralized algorithm in which a decision is made by each output for all inputs, which leads to a very impractical solution when N is large.

Hence we came up with a distributed algorithm in which each input maintains both the AIL and a local copy of the AOL. Then the following result holds true: If S > 2k/(k+2), which is approximately 2, then a PPS can precisely emulate a FIFO output queued switch with a latency of Nk/S external time slots, in a distributed manner, for all traffic patterns. [Pause]

This result is very important because in the distributed PPS algorithm each input acts independently, based entirely on its own local information. Hence such a distributed PPS algorithm is very practical to implement. Also, the associated latency is small, considering that these latencies are of the order of time slots in nanoseconds.

26) Conclusions. [Begin read from slide] We can thus conclude that it is possible to build a very high-speed single-stage switch using multiple smaller switches. Such a switch can emulate an output queued switch running at the aggregate rate. [End read from slide]

Thank You.