Techniques for Fast Packet Buffers
----------------------------------

Sundar Iyer, Ramana Rao Kompella, Nick McKeown
Department of Computer Science, Stanford University, Stanford, CA 94305
{sundaes, ramana, nickm}@stanford.edu

All packet switches (e.g. IP routers and ATM switches) must have packet buffers to hold packets during times of congestion. Frequently, the maximum speed at which a packet switch can operate is determined by the speed of its packet buffers, because a packet buffer must have enough bandwidth to both write (store) and read (retrieve) every packet. For example, an OC768c interface must be able to write arriving packets into its packet buffer, and retrieve them again, as fast as they arrive. For 40-byte minimum-length IP packets, this corresponds to at least one write and one read operation every 8 ns -- within the capabilities of commercially available SRAM, but beyond the speeds of commercially available DRAM. Unfortunately, because the buffers must be large, there is a need to use DRAM instead of SRAM: amongst the memory technologies available today, only DRAM supports large buffer sizes, but DRAM does not give fast random access into memory. Thus the design of packet buffers for fast line interfaces poses a real challenge, and one that will grow as line rates increase.

Since neither DRAM nor SRAM alone is suitable, a number of different (usually proprietary) techniques are used. These include: using a bank of DRAMs; cell striping and packet interleaving amongst multiple DRAMs; and combinations of SRAMs and DRAMs. These techniques are generally implemented in an ad hoc manner, and often provide only statistical guarantees (hoping that worst-case traffic patterns do not occur).

We aim to provide a variety of deterministic results and techniques that are useful to the design community at large. Armed with these results, we hope that designers need not re-invent the wheel. We will assume a very general model for a packet buffer.
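The 8 ns figure above follows directly from the line rate: a 40-byte packet occupies 320 bits, and at roughly 40 Gb/s (the approximate OC768c payload rate) a new minimum-length packet can arrive every 8 ns. A back-of-envelope sketch:

```python
# Sanity check of the 8 ns per-packet budget for an OC768c interface.
LINE_RATE_BPS = 40e9       # OC768c, approximated as 40 Gb/s
MIN_PACKET_BITS = 40 * 8   # 40-byte minimum-length IP packet

# Time between back-to-back minimum-length packet arrivals, in ns.
# The buffer must complete one write and one read in this interval.
arrival_time_ns = MIN_PACKET_BITS / LINE_RATE_BPS * 1e9
print(arrival_time_ns)  # 8.0
```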
The buffer consists of Q FIFO queues. Q could be one (if the buffer contains only a single FIFO queue), Q could be a small number (if the packet switch performs class-based queueing), Q could be quite large (if the packet buffer contains the virtual output queues of an input-buffered switch), or Q could be very large (if the packet buffer contains the per-flow queues of a traffic manager). When packets arrive at the packet buffer, they are added to the tail of the appropriate FIFO. Packets leave according to the decisions of an external scheduler -- for example, a switch arbiter or an output link scheduler. All that matters here is that the scheduler tells the packet buffer to give it the packet from the head of one of its FIFO queues. The packet buffer cannot predict which queue the scheduler will request next, and must operate correctly no matter what sequence of requests it receives from the scheduler.

It will be useful to consider two cases. In the first, the scheduler is impatient and cannot wait for packets: when it requests the packet from a FIFO queue, the packet buffer must respond immediately, with as near as possible zero delay. In the second, the scheduler is more patient and can tolerate a small, but bounded, delay for each packet. In this case, it is important to bound the delay so that the retrieval process can be pipelined and can keep up with the rate of new requests from the scheduler.

We will consider the following packet buffer architecture to meet these needs. It consists of 'b' large off-chip DRAMs operating in parallel, and a small on-chip SRAM buffer. The 'b' DRAMs are used to match the line rate; the SRAM is used as temporary storage so that packets can be retrieved quickly from the buffer. We can think of the SRAM as a "cache" holding just the heads and tails of each of the FIFO queues, while the body of each FIFO resides in the DRAM banks.
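The head/tail caching idea can be sketched as a simple software model (illustrative only; class and method names are our own, and real hardware would move b cells per DRAM access in a single wide transfer):

```python
from collections import deque

class PacketBuffer:
    """Toy model of the SRAM/DRAM buffer: each of the Q FIFOs keeps its
    head and tail in fast SRAM, while the body lives in slow DRAM that
    is read and written b cells at a time."""

    def __init__(self, Q, b):
        self.b = b                                    # DRAM access granularity
        self.heads = [deque() for _ in range(Q)]      # SRAM head cache
        self.tails = [deque() for _ in range(Q)]      # SRAM tail cache
        self.dram  = [deque() for _ in range(Q)]      # FIFO bodies (modelled)

    def write(self, q, cell):
        """Arriving cells accumulate in the SRAM tail; a full batch of
        b cells is flushed to DRAM in one wide access."""
        self.tails[q].append(cell)
        if len(self.tails[q]) == self.b:
            self.dram[q].extend(self.tails[q])
            self.tails[q].clear()

    def read(self, q):
        """Serve the scheduler from the SRAM head; refill from DRAM
        (up to b cells per access) when the head cache runs dry."""
        if not self.heads[q]:
            if self.dram[q]:
                for _ in range(min(self.b, len(self.dram[q]))):
                    self.heads[q].append(self.dram[q].popleft())
            else:
                # Body is empty: drain directly from the tail cache.
                self.heads[q], self.tails[q] = self.tails[q], self.heads[q]
        return self.heads[q].popleft() if self.heads[q] else None
```

For example, writing ten cells to one queue with b = 4 sends two batches of four cells to DRAM, leaves two in the tail cache, and reads return all ten in FIFO order.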
With sufficiently large 'b' and a large enough SRAM cache, an arbitrarily fast packet buffer can be built. An obvious, and important, question to ask is: how large does the on-chip SRAM buffer need to be to meet our design requirements? And is there an algorithm for deciding which packets should be in the SRAM cache so as to minimize the SRAM buffer size? Our results are as follows:

Result #1: An impatient scheduler. If all the DRAMs share the same address bus, then all requests from the scheduler can be served immediately if the SRAM buffer is of size Qb(2 + ln Q). We will present an algorithm, which we call MDF (Maximum Deficit First), that achieves this bound.

Result #2: A patient scheduler. If all the DRAMs share the same address bus, then all requests from the scheduler can be served within a bounded latency of Q(b-1)+1 time slots if the SRAM buffer is of size Q(b-1). The latency arises because the buffer management algorithm uses a lookahead buffer to gain a "sneak preview" of the requests it will receive next. We show that a buffer of size Q(b-1) is necessary, and that the bound can be achieved using an algorithm we call ECQF (Earliest Critical Queue First).

Result #3: A patient scheduler, and 'b' address buses. If each DRAM is addressed using a separate address bus, then the SRAM buffer can be eliminated completely, so long as the scheduler can wait for 2Q DRAM random access times.

As an example, consider the design of the output buffers for an OC768c linecard with 32 Diffserv queues. We can meet the bandwidth requirements using 51.2 ns DRAMs and cell segments of 64 bytes, with an SRAM buffer of 115 kbits, and the ECQF algorithm. When the output link scheduler decides which FIFO queue to serve next, the packet will be available within 2.9 us.
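The lookahead idea behind ECQF can be illustrated with a simplified sketch (our own rendering, not the paper's definition; the function name and data representation are assumptions). A queue becomes "critical" at the time slot where the lookahead window contains more requests for it than its head cache can serve; the replenishment algorithm refills the queue that becomes critical earliest:

```python
def earliest_critical_queue(occupancy, lookahead):
    """Simplified sketch of the ECQF idea.

    occupancy: dict mapping each queue to the number of cells currently
               in its SRAM head cache (every queue that can appear in
               the lookahead must have an entry).
    lookahead: list of future scheduler requests, one queue id per slot.

    Scans the lookahead window and returns (queue, slot) for the queue
    whose head cache would underflow first -- the one to replenish from
    DRAM next -- or None if no queue is critical in the window.
    """
    remaining = dict(occupancy)       # cells left in each head cache
    for slot, q in enumerate(lookahead):
        remaining[q] -= 1
        if remaining[q] < 0:          # a request with no cached cell:
            return q, slot            # q is the earliest critical queue
    return None
```

For instance, with head-cache occupancies {0: 1, 1: 2} and upcoming requests [0, 0, 1, 1, 1], queue 0 runs dry at the second request, so it is replenished first.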