INFOCOM 2000 SCRIPT

Comments are in square brackets. Comments include: 1) Pause - pause at this moment. 2) Stress - stress the next word. 3) Begin animation. 4) End animation. 5) Begin read from slide. 6) End read from slide.

1) [Start slowly, loud, clear] Good evening everybody. In this talk I shall present the Parallel Packet Switch. This is joint work done by my colleague Amr, my professor Nick McKeown, and myself, Sundar Iyer, at Stanford University.

2) Our goal, as described here, is to build an extremely high-speed packet switch. [Pause]

3) As an example, we would like to build the following multi-terabit output queued switch. The switch contains 64 ports, with each line interface running at the OC3072 rate; equivalently, each interface can be considered a stream of 4 times the OC768 rate. This corresponds to an interface speed of 160 Gbits/sec, so a 40-byte cell arrives once every 2 ns.

What are the potential problems that one might face in building such a switch? [Pause] To understand the main problem, note that a packet switch receives packets from various input ports, buffers these packets internally, and then switches them out.

4) [Begin animation] Irrespective of the architecture of this switch, each interface will have to receive packets at a rate of 160 Gb/sec, i.e. one cell arrives every 2 ns. Similarly, a cell will have to be read out once every 2 ns. The sequence of events is as follows: 1) Cell 1 arrives and is written into memory. 2) In the next 2 ns a cell is written in and the switch fabric also reads out a cell. 3) In the next 2 ns the same set of events occurs again. This may continue for all time slots. So we need to write and read cells in memory at a rate of two every two nanoseconds, i.e. once every nanosecond.

Optical technology has made it possible to have line rates of the order of 160 Gb/sec.
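[Editorial aside, not part of the spoken script: the access-rate arithmetic above can be checked with a short Python sketch, using the line rate and cell size quoted in the talk.]

```python
# Memory access requirement for one buffered 160 Gb/s interface.
line_rate_gbps = 160           # 160 Gb/s interface, i.e. 160 bits per ns
cell_bits = 40 * 8             # one 40-byte cell
cell_time_ns = cell_bits / line_rate_gbps     # time between cell arrivals
# One write (arrival) plus one read (departure) must fit in each cell time:
access_time_ns = cell_time_ns / 2
print(cell_time_ns, access_time_ns)           # 2.0 1.0
```

That is one memory operation every nanosecond, which is the access rate the talk argues conventional memories cannot sustain.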
However, fast optical memories, which could potentially buffer these packets, are as yet not economically viable, and conventional memories do not seem to scale to these high access speeds in the near future. Therein lies the main problem: "How can we buffer cells in a memory with an access time of 1 ns?" To generalize this problem, we ask: "How can one buffer cells at the line rate?" [End animation] [Pause] We would like you to ponder different techniques to solve this problem.

5) With this in mind I would like to re-state our main objective: "to build an extremely high-speed packet switch with memories running [stress] slower than the line rate". One fairly [stress] simple approach that we took is to use parallelism to solve this problem. We wanted to be able to use multiples of the fastest [stress] conventionally available switches to build the high-speed packet switch. Here I would like to define the specific, fairly obvious architecture which we used. We call this switch a [stress] "Parallel Packet Switch".

6) [Begin read from slide] A parallel packet switch is composed of multiple identical lower-speed packet switches operating independently and in parallel. An incoming stream of packets is spread, packet-by-packet, by a demultiplexor across the slower packet switches, and then recombined by a multiplexor at the output. [End read from slide]

7) This shows the architecture of the parallel packet switch. It is important to note the following four key aspects of the parallel packet switch.

1) There is no buffering anywhere other than in the central stages of the switch. All these buffers run at a line rate of R/k. Specifically, the inputs (the demultiplexors) and the outputs (the multiplexors) have no buffers, and hence there is no buffer running at the line rate of R.
2) Each of the central switches is chosen such that it can be built using existing conventional technology.

3) The parallel packet switch is a single-stage switch, i.e. each packet sees only one center stage switch on its path from an input to an output.

4) We shall assume, for the purpose of this talk, that the center stage switches are all output queued. Each of them has N ports, i.e. the same number as the big switch that we are trying to emulate, and these ports accept an input line interface rate of R/k. However, these switches could very well be built from a variety of suggested switching architectures, such as input queued, combined input-output queued, etc. The use of these techniques can further reduce the bandwidth of each of the internal buffers.

8) Now that we have seen the architecture, we would like to ask the following questions: [Pause] [Read from slide] 1) Can a parallel packet switch emulate a single big output queued switch? 2) Can it provide delay guarantees? Can it support strict priority, WFQ, and other qualities of service? [End read from slide] Can we throw any traffic pattern at it? Will it maintain the throughput of an output queued switch running at the line rate R?

9) In fact we will demand a stricter requirement. We want to know whether the PPS will precisely emulate an output queued switch. [Pause] What do we mean by precisely emulate? Consider the following "thought experiment". [Begin animation] The first switch is an output queued switch, and the one shown below is a parallel packet switch. Let both of these switches receive the same traffic pattern. Then the PPS is said to precisely emulate an OQ switch if all packets leave both switches at identical times. The comparator in the figure checks whether corresponding packets from the OQ switch and the PPS leave the system at the same time.
For the PPS, however, we will tolerate a small added latency for each cell, since it travels over a slow-speed internal switch. It may appear at this stage that the switch will be able to emulate an OQ switch. [Pause] In fact, since all middle stage switches are output queued, it appears that the PPS as a whole will behave like an OQ switch and should be able to precisely emulate it. [Long pause. Let that comment sink in]

10) We will now see an example of why this might not occur. Consider the scenario in this PPS with N inputs and 3 middle stage layers. [Begin Animation] Consider all packets which are destined to output port two. The numbers on these packets denote the order in which they should depart from the PPS.

1) For the moment, assume that packets 1, 2, 3, 4 and 5 are consecutive packets to output port two and are queued as shown.
2) Initially, packets 1, 2 and 3 are read one after the other from the three different center stage switches.
3) Since the center stage switches run at a line rate of R/3, it takes [stress] three time slots for each packet to reach the output.
4) In the 4th time slot packet 1 leaves the system and packet 4 starts being read from center stage switch 1.
5) In the 5th time slot packet 2 leaves the system and packet 4 is still being read over the slower-speed link running at a line rate of [stress] R/3.
6) In the 6th time slot packet 3 leaves the system and packet 4 has reached the output.
7) In the 7th time slot packet 5 starts being read from center stage switch 1, while packet 4 leaves the system.
8) In time slots 8 and 9 packet 5 is still being read from the center stage switch while the external link is idle.
9) In the 10th time slot packet 5 leaves the system.

Thus there are 2 time slots during which there is a cell waiting for an output while the external link sits idle.
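[Editorial aside: the departure times in this example can be reproduced with a small Python simulation. The layer assignments 1, 2, 3, 1, 1 are taken from the animation above; the model that each read holds a layer's R/3 link for three external slots, and that the output starts at most one new read per slot, is a sketch of this example only.]

```python
k = 3                                        # layers; internal links run at R/k
layer_of = {1: 1, 2: 2, 3: 3, 4: 1, 5: 1}    # which layer holds each cell
layer_free_at = {1: 1, 2: 1, 3: 1}           # first slot each layer link is free
depart = {}
next_read_slot = 1                           # output begins one read per slot
for cell in sorted(layer_of):
    layer = layer_of[cell]
    start = max(next_read_slot, layer_free_at[layer])
    layer_free_at[layer] = start + k         # the R/3 link is busy for k slots
    depart[cell] = start + k                 # cell leaves after its k-slot read
    next_read_slot += 1
print(depart)   # {1: 4, 2: 5, 3: 6, 4: 7, 5: 10}
```

An ideal output queued switch would release these cells in slots 4 through 8; the simulation shows packet 5 leaving in slot 10, two slots late, exactly because packets 4 and 5 share layer 1.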
[End Animation] Thus we see that if multiple inputs send [stress] consecutive packets for a given output to the same center stage switch, the PPS cannot emulate an OQ switch. Obviously, if all packets for a given output were sent to the same middle stage switch, the throughput of the switch over that output would fall to as low as R/k. [Pause]

But why did this occur? Clearly, it happened because packets 4 and 5 were queued in the same layer. [Pause] You might ask: "Can we not avoid this phenomenon by sending packet 5 to a different layer?" In the paper we show that if the PPS is implemented as is, this phenomenon cannot be avoided. We show a specific traffic pattern which leads to consecutive cells for a given output reaching the same center stage switch. Thus we show that sometimes an input has no [stress] choice but to send a cell to a switch which already contains an immediately preceding cell destined to the same output, as in this example, where packet 5 is sent to the same layer as packet 4. Why does this lack of choice occur? The following example shows this clearly:

11) [Begin Animation]
1) Let 4 cells be destined to output port 2 in the first time slot. Each of these cells is spread over the different center stage switches as shown. Specifically, cells one, two and three are sent to center stage switches one, two and three respectively. Cell 4 is sent to center stage switch one.

A note on the color scheme: [Begin explain color scheme] Packets sent to output port two are shown in red; these are the packets of interest to us. Packets shown in green are destined to some other output port. A white box means that no cell arrives in that time slot. [End explain color scheme]

2) In the next time slot, only input one receives a cell. The cells which arrived in the previous time slot are still being sent to their respective center stage switches.
Assume that this newly arriving cell is not destined to output port 2. Since it cannot be buffered, it must be sent immediately. As link one is still busy sending packet one towards output port two, the input uses an alternate center stage switch, in this case switch 2.

3) Similarly, in the third time slot the input port again receives a cell not destined to output port two, and it sends this cell on its next available free link, which is link 3.

4) In the 4th time slot, input port one receives a cell destined to output port two. This newly arriving cell is shown in red. Since the only link which is free is link one, the input is forced to send "packet 5" to center stage switch one, leading to...

12) ...the fact that packets 4 and 5, which both belong to the same output, are queued in the same center stage switch. [Long pause] So we see that we are unable to prevent consecutive packets from reaching the same center stage switch. Well, is there a way to increase the choice of switches from which a given input can choose, so that input port one could have sent packet 5 to center stage switch 2? [Long Pause: Allow this fact to sink in]

13) One obvious way to increase the choice of links is to increase the speed of the intermediate links. If we increase the speed of the internal links by a constant factor "S", then we call "S" the speedup factor. This is equivalent to increasing the number of intermediate links by a factor of "S". In fact, if the intermediate links were all sped up by a factor "k", where "k" is the number of center stage switches, then there would be ample choice, because all links would be free in every time slot. However, that defeats the purpose of building a PPS, since we do not want internal links to run at a rate of k(R/k) = R. Can we do better?
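[Editorial aside: the forced choice at input port one, and the effect of the speedup discussed in the talk, both fit in a few lines of Python. The sketch assumes input one sends one cell in each of slots 1 to 3 on links 1, 2 and 3 as in the animation, and that each transmission holds its link for ceil(k/S) external slots; it models this example only, not the paper's general traffic pattern.]

```python
import math

def choices_when_cell5_arrives(k, S):
    # Input one used links 1, 2, 3 in slots 1, 2, 3; each transmission
    # holds its internal link for ceil(k/S) external time slots.
    busy_slots = math.ceil(k / S)
    free_at = {link: link + busy_slots for link in (1, 2, 3)}
    # Cell 5 arrives in slot 4; list the links free by then.
    return sorted(link for link in free_at if free_at[link] <= 4)

print(choices_when_cell5_arrives(3, 1))   # [1]: no speedup, forced onto link 1
print(choices_when_cell5_arrives(3, 2))   # [1, 2]: speedup 2 restores a choice
```

With no speedup, link 1 is the only free link, so packet 5 lands in the same layer as packet 4; with a speedup of 2, a second layer opens up, which is exactly the extra choice the next slides exploit.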
We ask whether we can have sufficient choice with a low speedup to prevent a single center stage switch from being used consecutively for a given output. To understand this better, we shall study the effect that increasing the speed of the intermediate links has on choice. We will take the previous example and see what happens with a speedup of two.

[Begin animation]
1) Under the same traffic pattern, the cells are sent by the PPS to the center stage switches in the same manner as before. [Continue animation]
2) However, we see that when cell 5 arrives there is a choice of sending it to either switch 1 or switch 2. Layer one will already have completed sending cell 1, but link 2 is also available. This is because the packet on link 2 was sent two time slots ago, and since the line rate of the link is now 2R/3, that packet has completed transmission by the time packet 5 arrives. [End animation]

Hence, increasing the speed of the internal links surely gives more choice and helps prevent multiple consecutive cells from being sent to the same layer.

14) We shall now see the generic effect of speedup on choice. Consider an input which receives packets at a rate of R. Assume that it sends these packets to 10 center stage switches, and let the speedup be equal to 2. Thus each link runs at a rate of 2 times (R/10), which is R/5.

[Begin Animation]
1) Let a cell that arrives in time slot 1 be sent to any link, say link 1.
2) Let a cell that arrives in time slot 2 be sent to some other link, say link 2. Note that link 1 cannot be used, since it is still in the process of sending cell 1; link 1 is a slow-speed link running at a rate of R/5.
3) Similarly, when a cell arrives in time slot 3, we cannot use links 1 or 2, so we use a third link, say link 3.
4) Proceeding ahead, by the end of 5 time slots we are forced to use 5 distinct links. Notice that at the end of 5 time slots, link one becomes free once again.
Hence, in time slot 6 the input can choose from any of the 5 as-yet-unused links, or from the one old link which has now become available again. Thus, in general, in any time slot only the links used in the previous k/S time slots are still transmitting packets, and the link used exactly k/S time slots ago becomes available again. [End Animation]

Is this choice sufficient to prevent consecutive packets from being sent to the same center stage switch? [Pause: Let audience think about this.]

15) To understand this better, we shall now formalize these concepts. Consider any cell which arrives at input port i in time slot n. [Read from slide] We define the available input link set, AIL(i,n), as the set of layers to which external input port i can start writing a cell in time slot n. [End read from slide]

There is a similar constraint at the output. To understand this constraint, we will now define the departure time of a cell in a PPS.

16) [Begin animation]
1) We again refer to the output queued switch we are trying to emulate. Let a cell arrive at time n to both the FIFO OQ switch and the PPS.
2) The cell is switched in the output queued switch and is queued in the output queue for sending out.
3) The cell may take some time to reach the head of the output queue, as the cells ahead of it are read out, again in FIFO order.
4) Finally, at some time n' in the [stress] future, the cell is switched out. [End animation]

17) Hence the departure time of a cell arriving at the PPS at time n is its departure time from an equivalent FIFO OQ switch which receives the same arrival traffic pattern. This departure time is denoted by n'.

18) Now consider the same cell, which arrives at input port i in time slot n. Let us assume that it is destined to output port j and should leave the system in time slot n' in the future.
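[Editorial aside: the size of the input's choice set follows directly from the link-occupancy argument just given. A link stays busy for ceil(k/S) external slots per cell, so at most ceil(k/S) - 1 links can still be mid-transmission when a new cell arrives; a minimal Python sketch with the numbers from the 10-layer example:]

```python
import math

def min_AIL(k, S):
    # At most ceil(k/S) - 1 links can still be transmitting when a new
    # cell arrives, so at least k - ceil(k/S) + 1 links are free.
    return k - math.ceil(k / S) + 1

print(min_AIL(10, 2))   # 6: the 5 as-yet-unused links plus the one just freed
print(min_AIL(3, 1))    # 1: with no speedup an input can be left with no choice
```

The same count applies, by symmetry, at an output reading cells from the layers.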
[Stress, go very slow] Output j will have read the previous consecutive cells from certain middle stage switches. Hence, depending on which center stage switches were used, output j can read from only certain other switches in time slot n'. [Read from slide] Thus, we define AOL(j,n') as the set of layers from which external output j can start reading a cell in time slot n' in the future. [End read from slide]

Note that the AIL allows the inputs to write cells smoothly to different center stage switches. Similarly, the AOL allows cells to be read smoothly from different center stage switches. We want to see under what conditions we always have enough choice so that the inputs can write smoothly and the outputs can read smoothly.

19) This part is key to understanding the PPS. Consider a cell which arrives at input port i in time slot n and is destined to output port j in time slot n' in the future. The cell must be sent to a layer in its AIL set. Also, at the moment when the cell is read out at its departure time n', it must meet its AOL. Hence, if every cell meets both its AIL and AOL sets, then each input and each output will be able to write or read consecutive cells from different center stage switches.

[Begin animation] We shall use the same example and traffic pattern as previously shown. We consider only packets arriving at a specific input port, in this case input port one. Let us assume that these packets are destined to a specific output port, in this case output port "N". In the example shown, since the internal links have a speedup of two, both links 1 and 2 are free when packet 5 arrives.

1) The layers in green belong to the AIL of input port 1.
2) The AOL layers are also shown in green. Output port N will have read cells 1, 2, 3 and 4, in that order, from center stage switches 1, 2, 3 and 1 before trying to read cell 5 at the future time n'.
3) Hence the newly arriving cell can be read from only center stage switches 2 and 3.
4) Since the cell can be written only to switches 1 and 2, and must be read from either of switches 2 and 3, it can only be sent to the intersection of these two sets, i.e. center stage switch 2.

[Go slow] But what happens in a generic PPS? Does such a layer always exist?

20) To answer this question, we must now see what the choice sets look like. Though the actual size of these sets may vary, we can make a statement about their minimum size. In fact, the minimum size of these sets is given by: k - (the maximum number of links which can have cells in progress) = k - k/S + 1.

21) Hence, to assure choice to both inputs and outputs, we need that: [Begin minor animation] the intersection of AIL and AOL cannot be empty. Since these sets are sets of layers, it suffices that |AIL| + |AOL| > k. Thus we need (k - k/S + 1) + (k - k/S + 1) > k, which implies that S > 2k/(k+2). [End minor animation]

22) The following is the main result of the PPS: [Begin read from slide] [Begin minor animation] If S > 2k/(k+2), which is approximately 2, then each cell is guaranteed to meet its AIL and AOL. Hence, if S > 2k/(k+2), a PPS can precisely emulate a FIFO output queued switch for all traffic patterns. [End minor animation] [End read from slide] [Long Pause]

23) Until now we have tried to emulate a FIFO queuing strategy. But what did we start out to achieve? We wanted to create an output queued switch using existing switches. Such an output queued switch might send packets out in an order other than FIFO, i.e. it may re-order packets.

24) More precisely, quality of service is an important requirement of switches. We then asked whether the PPS could support a WFQ strategy.
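[Editorial aside: the counting argument of slides 20 and 21 can be checked numerically. A small Python sketch verifying that a fixed speedup of 2 satisfies |AIL| + |AOL| > k for every k, consistent with the bound 2k/(k+2) always staying below 2:]

```python
import math

def min_set_size(k, S):
    # Minimum size of AIL and of AOL (slide 20): k - ceil(k/S) + 1.
    return k - math.ceil(k / S) + 1

# Slide 21: a common layer is guaranteed whenever |AIL| + |AOL| > k.
for k in range(1, 101):
    assert 2 * min_set_size(k, 2) > k     # S = 2 always leaves an overlap
    assert 2 * k / (k + 2) < 2            # the algebraic bound is below 2
print("S = 2 suffices for k = 1..100")
```

For instance, in the 3-layer example each set has at least 3 - 2 + 1 = 2 members, and 2 + 2 > 3, so an intersection layer exists.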
We found that: [Begin read from slide] If S > 3k/(k+3), which is approximately 3, then a PPS can precisely emulate a switch with WFQ, or strict priorities, for all traffic patterns. [End read from slide]

25) Well, we started out trying to build a very high-speed switch. Are the results that we now have practical? It turns out that, as stated, they are not. In all our discussions up to now, in order to decide which layer to send a cell to, we needed two pieces of information. The AIL set is local to the input and is maintained by each input. However, all inputs needed precise information about the output, i.e. we needed to know how the cells were distributed among the center stage switches for a given output. Thus the AOL information had to be globally available to all inputs. This means a centralized algorithm in which a decision is made by each output for all inputs, which leads to a very impractical solution when N is large.

Hence we came up with a distributed algorithm in which each input maintains both the AIL and a local copy of the AOL. Then the following result holds true: If S > 2k/(k+2), which is approximately 2, then a PPS can precisely emulate a FIFO output queued switch with a latency of Nk/S external time slots, in a distributed manner, for all traffic patterns. [Pause]

This result is very important because in the distributed PPS algorithm each input acts independently, based entirely on its own local information. Hence such a distributed PPS algorithm is very practical to implement. Also, the associated latency is small, considering that these latencies are of the order of time slots in nanoseconds.

26) Conclusions. [Begin read from slide] We can thus conclude that it is possible to build a very high-speed single-stage switch using multiple smaller switches. Such a switch can emulate an output queued switch running at the aggregate rate. [End read from slide]

Thank You.