WO2000001123A1 - Congestion control in a reliable multicast communication system - Google Patents

Congestion control in a reliable multicast communication system

Info

Publication number
WO2000001123A1
Authority
WO
WIPO (PCT)
Prior art keywords
rate
station
congestion
ack
message
Prior art date
Application number
PCT/US1999/014541
Other languages
English (en)
Inventor
Dah Ming Chiu
Miriam C. Kadansky
Stephen R. Hanna
Stephen A. Hurst
Joseph S. Wesley
Philip M. Rosenzweig
Radia J. Perlman
Original Assignee
Sun Microsystems, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US09/336,670 external-priority patent/US6526022B1/en
Priority claimed from US09/336,660 external-priority patent/US6507562B1/en
Priority claimed from US09/336,659 external-priority patent/US6505253B1/en
Application filed by Sun Microsystems, Inc. filed Critical Sun Microsystems, Inc.
Priority to AU47229/99A priority Critical patent/AU4722999A/en
Priority to EP99930769A priority patent/EP1018248A1/fr
Publication of WO2000001123A1 publication Critical patent/WO2000001123A1/fr

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00Traffic control in data switching networks
    • H04L47/10Flow control; Congestion control
    • H04L47/26Flow control; Congestion control using explicit feedback to the source, e.g. choke packets
    • H04L47/263Rate modification at the source after receiving feedback
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00Traffic control in data switching networks
    • H04L47/10Flow control; Congestion control
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00Traffic control in data switching networks
    • H04L47/10Flow control; Congestion control
    • H04L47/15Flow control; Congestion control in relation to multipoint traffic
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00Traffic control in data switching networks
    • H04L47/10Flow control; Congestion control
    • H04L47/27Evaluation or update of window size, e.g. using information derived from acknowledged [ACK] packets

Definitions

  • This invention relates to multicast communication in computer networks, and more particularly to flow control in reliable multicast communication.
  • Communication between computers in a computer network can be established by one of several methods. These include unicast messaging (where a source station and a destination station exchange messages over a point to point path), broadcast communication (where a sender station transmits messages which may be received by all stations attached to the network), multicast communication (where a sender station transmits messages which may be received by a predetermined group of stations), and so forth.
  • Detecting congestion in a computer network is difficult as the congestion often happens at intermediate nodes deep within the network, and the effect of congestion occurs at end stations which do not receive packets.
  • The end station knows that it missed packets because the packets carry sequence numbers: a gap in the received sequence numbers reveals which packets are missing.
  • Once an end station has detected congestion it can send a message to the transmitting station so that the transmitting station can reduce the rate at which it is transmitting data onto the network.
  • Upon detection of congestion, a transmitting station ordinarily reduces its rate of transmission. The reduction is often done by multiplying the current transmission rate by a reduction fraction, for example 0.50 for a 50% reduction, or 0.25 for a 75% reduction. It is common engineering practice to then have the transmitting station increase its rate of transmission when no congestion is reported. The increases are often accomplished by adding a constant amount to the current rate of transmission, and continuing to add the constant amount periodically until a desired transmission rate is reached or congestion is once again detected on the network. Upon again detecting congestion, the station again reduces the rate of transmission by multiplying the current rate by the reduction fraction, and again begins a sequence of additive increases. The use of multiplicative rate reduction and additive rate increase is an algorithm employed in standard engineering practice.
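The multiplicative-decrease, additive-increase scheme described above can be illustrated with a minimal sketch. The function names, the default parameter values, and the idea of feeding the controller a list of per-period congestion flags are assumptions made for illustration; they do not come from the patent text.

```python
def aimd_step(rate, congested, reduction_fraction=0.5, increment=1.0, target_rate=100.0):
    """Return the next transmission rate for one feedback period.

    All parameter names and defaults are hypothetical illustration values.
    """
    if congested:
        # Multiplicative decrease: scale the current rate by the reduction fraction.
        return rate * reduction_fraction
    # Additive increase: add a constant amount, never exceeding the desired rate.
    return min(rate + increment, target_rate)


def simulate(initial_rate, feedback, **kwargs):
    """Apply aimd_step for each congestion flag and return the rate history."""
    rates = [initial_rate]
    for congested in feedback:
        rates.append(aimd_step(rates[-1], congested, **kwargs))
    return rates
```

For example, starting at a rate of 80 with feedback `[True, False, False, True]`, a reduction fraction of 0.5 and an increment of 10, the history is 80, 40, 50, 60, 30.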
  • A small reduction in the transmission rate may not be enough to cure the congestion quickly, that is, with one rate reduction. Alternatively, a reduction fraction such as 0.25 (a 75% rate reduction) may cure the congestion in one reduction step, but the additive increases may then take a long time to bring the rate back to its initial value.
  • A sequence of rate reductions using a reduction fraction such as 0.25 or 0.50, in a network environment of persistent congestion, may cause the rate of transmission to sink to an unacceptably low value. That is, repeated multiplicative rate reductions may tend to drive the transmission rate to a very low value, so low that the transmission of the desired message becomes impractical.
  • What is needed is a method for a transmitting station to respond to congestion in a computer network that both reduces the transmission rate enough to cure the congestion and remains stable in an environment of persistent congestion on the network.
  • A window is a length of time during which a transmitting station is permitted to transmit onto the network; during a subsequent waiting time the station does not transmit. Upon expiration of the waiting time, the window is again "open" and the station again transmits for the permitted length of time.
  • the average transmission rate is controlled by use of a window because during the open window length of time, the station transmits at the rate determined by the communications media, and during the waiting time the station does not transmit. Although the average rate of transmission is controlled, the network must absorb a burst of message traffic during the open window length of time.
  • Acknowledgment by a receiving station that it has received all of the messages transmitted by a station is accomplished by the receiving station sending an acknowledgement (ACK) message to the transmitting station.
  • the receiving station knows that it has received all of the transmitted messages because the messages contain a sequence number, and the receiving station keeps track of the sequence numbers of messages which it has received.
  • Standard protocols permit the receiving station to simply not send an ACK message to the transmitting station, in which case an ACK timer in the transmitting station expires and triggers retransmission of the missing message.
  • Alternatively, protocols permit the receiving station to transmit a non-acknowledge (NACK) message to the transmitting station as soon as the receiving station determines that it is missing a message, and receipt by the transmitting station of the NACK message triggers retransmission of the missing message.
  • a receiving station transmits its ACK or NACK messages to the transmitting station as soon as it determines the status of the messages.
  • the ACK and NACK messages may be included with other message traffic which the receiving station is sending to the transmitting station, a process referred to as "piggybacking".
  • a method of detecting congestion in a computer network uses a receiving station which determines a first number of messages missing in a first acknowledgment window. The station then determines a second number of messages missing in a subsequent acknowledgement window. The station then measures congestion on the network in response to an increase between the first number of missing messages in the first acknowledgement window and the second number of missing messages in the second acknowledgement window.
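A minimal sketch of the detection rule above: count the sequence numbers missing from two successive acknowledgment windows and report congestion when the count rises. The helper names and the representation of a window as a starting sequence number plus a size are assumptions for illustration only.

```python
def count_missing(received_seqs, window_start, window_size):
    """Count sequence numbers in [window_start, window_start + window_size) not received."""
    received = set(received_seqs)
    expected = range(window_start, window_start + window_size)
    return sum(1 for seq in expected if seq not in received)


def congestion_detected(first_missing, second_missing):
    """Report congestion when the later window lost more messages than the earlier one."""
    return second_missing > first_missing
```

With windows of four messages, receiving 1, 2, 4 in the first window (one missing) and 5, 8 in the second (two missing) is reported as congestion under this rule.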
  • a transmitting station responds to messages indicating congestion on the network by reducing its transmission rate by multiplying the current rate by a reduction fraction
  • A small number of steps is chosen, for example four (4), so that the station returns to the measured transmission rate after only that many additive increases, for example four, regardless of how low a sequence of reductions has driven the rate of transmission.
  • the amount of additive increase in the rate of transmission is computed by the formula:
  • increment = ("historically highest rate" - "current rate") / M
  • "current rate" is the rate of transmission after the last reduction was done;
  • "M" is a small constant, for example 4, and is the number of additive steps which the station needs to perform in order to return the transmission rate from the "current rate" to the "historically highest rate".
  • the "historically highest rate" is measured by the station, in one embodiment of the invention, by determining the highest rate that the station has achieved in the network since the beginning of a session. By always attempting to return to this measured rate after each rate reduction, the station avoids converging its rate of transmission to an unacceptably low value.
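The following sketch illustrates how an increment sized from the historically highest rate returns the station to that rate in M steps, however many reductions preceded it. It assumes the increment is (historically highest rate minus current rate) divided by M, as the definitions of "current rate" and "M" suggest. The class and method names are hypothetical, and the highest rate is simply supplied at construction rather than measured during a session.

```python
class RateController:
    """Illustrative controller; names and defaults are assumptions, not the patent's."""

    def __init__(self, highest_rate, reduction_fraction=0.25, steps=4):
        self.highest_rate = highest_rate      # historically highest rate (given here)
        self.rate = highest_rate              # current transmission rate
        self.reduction_fraction = reduction_fraction
        self.steps = steps                    # M, the number of additive steps
        self.increment = 0.0

    def on_congestion(self):
        # Multiplicative decrease, then size the increment so that M additive
        # steps lead back to the historically highest rate.
        self.rate *= self.reduction_fraction
        self.increment = (self.highest_rate - self.rate) / self.steps
        return self.rate

    def on_no_congestion(self):
        # Additive increase, capped at the historically highest rate.
        self.rate = min(self.rate + self.increment, self.highest_rate)
        return self.rate
```

Starting from 100 with a reduction fraction of 0.25, two consecutive reductions give 6.25, and four increases of 23.4375 then bring the rate back to 100.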
  • a multicast repair tree is established, the repair tree having one sender station and a plurality of repair head stations.
  • a repair head station has an affiliated group of member stations.
  • a repair head station retransmits a lost message to its affiliated group of member stations upon receipt from a member station of a NACK message indicating that the selected message was not received.
  • Acknowledgment windows (ACK windows) are established in a member station for transmission of ACK or NACK message by the member station.
  • a number of messages transmitted by the sender station during a transmission window is established. An ACK window of the same size is also established in the receiving stations, with a slot in the ACK window corresponding to each message transmitted by the repair head station.
  • Each receiving station is assigned a slot in the ACK window during which time that receiving station transmits its ACK or NACK messages.
  • the ACK window slots assigned to receiver stations for transmission of ACK/NACK messages are staggered so that different receiver stations transmit their ACK/NACK messages at different times.
  • the assignment may be done, for example, by each receiving station using a random process to choose its ACK window; or, for further example, the sender station or the repair head station may assign the ACK window to a receiving station by use of management messages, etc.
  • the sender station transmits thirty two (32) messages during its transmission window.
  • an ACK message contains a sequence number of the last correctly received message from the last transmission window of messages, and contains thirty two (32) bits, each bit representing a message with the following sequence number. When the bit is clear the corresponding message was correctly received, but when the bit is set the corresponding message was not received. The set bit then triggers the repair head station to retransmit the message.
  • Fig. 1 is a block diagram of a multicast repair tree in accordance with the invention.
  • Fig. 2 is a schematic block diagram of a computer internetwork comprising a collection of interconnected communication media attached to a plurality of stations, including end stations;
  • Fig. 3 is a schematic block diagram of a station, such as an end station, configured to operate in accordance with a reliable multicast transport protocol of the present invention;
  • Figs. 4A, 4B, and 4C are schematic diagrams of a network with many stations formed into repair groups, and the diagrams represent a model used for simulation.
  • Fig. 5 is a graph of the response of a simulated network.
  • Fig. 6 is a graph of the response of a simulated network.
  • Fig. 7 is a block diagram of protocol stacks for communication between computers.
  • Fig. 8 is a block diagram showing a multicast repair tree.
  • Fig. 9 is a block diagram of a multicast tree showing receiver group relationships.
  • Fig. 10 is a block diagram of a HState transition diagram.
  • Fig. 11 is a chart showing various messages and sub-messages used in TRAM.
  • Fig. 12 is a table showing the timers used by TRAM.
  • Fig. 13 is a table showing counters used by TRAM.
  • Fig. 14 is a block diagram showing a multicast packet format for a sender Beacon Message.
  • Fig. 15 is a block diagram showing a packet format for a Data Message.
  • Fig. 16 is a block diagram showing a packet format for an HA Message.
  • Fig. 17 is a block diagram showing a packet format for an MS Message.
  • Fig. 18 is a block diagram showing a packet format for a Hello Message.
  • Fig. 19 is a block diagram of a packet format for an ACK Message.
  • Fig. 20 is a block diagram of a unicast packet format for a Hello Uni Message.
  • Fig. 21 is a block diagram of a packet format for a Head Bind Message.
  • Fig. 22 is a block diagram of a packet format for an Accept Membership Message.
  • Fig. 23 is a block diagram of a packet format for a Reject Membership Message.
  • Fig. 24 is a block diagram of flag fields for a Sender Beacon Message.
  • Fig. 25 is a block diagram of flag fields for a Hello Message.
  • Fig. 26 is a block diagram of flag fields for an ACK Message.
  • Fig. 27 is a block diagram of flag fields for a Hello-Uni Message.
  • Fig. 28 is a block diagram of flag fields for an HA Message.
  • Fig. 29 is a block diagram of flag fields for a Data Message.
  • The reliability of communication, for example of a file transfer comprising a plurality of messages, is an important element in computer networking.
  • Reliable unicast communication is established by implementations based upon the concept of a protocol stack.
  • A protocol stack has several layers. At the lowest, or "physical", layer, a physical connection is established between two computers.
  • The physical connection permits hardware in the "physical" layer to exchange signals between two computers.
  • At the "data link" layer, frames are constructed in accordance with the requirements of the communication protocol used on the physical layer.
  • the data link layer provides a best effort, but unreliable, transfer of packets between a sending computer and a receiving computer. Each packet is numbered by a "sequence number" for use by that layer of the protocol for establishing reliable communication.
  • the next higher layer permits establishment of reliable communication.
  • a cache of already transmitted packets is maintained, including the sequence number of each.
  • the receiver checks the sequence number of the received packets and determines if any packets are missing. Packets may be missing because of congestion on the network, unreliability of the medium, static on the line, or any one of many possible reasons.
  • the receiver transmits an acknowledgment (ACK) message to the transmitter indicating that a packet has been received, and also transmits a negative-acknowledge (NACK) message when it determines that a packet with a particular sequence number is missing.
  • Upon receipt of an ACK, the transmitter flushes the packet from its cache (retransmit cache) used for retransmission of lost packets. Upon receipt of a NACK, the transmitter queues the packet from its retransmit cache and retransmits the packet. The transmitter continues to wait for receipt of an ACK before flushing the packet from its retransmit cache.
  • Some protocols use a time-out period with a timer rather than using ACK and NACK messages to signal that a packet should be retransmitted. Some protocols establish reliable communication on every hop of a communication pathway, and some do not.
  • the IP portion of TCP/IP is a layer 3 protocol, and is used to establish unreliable transfer of messages between end stations, for example across the Internet. Layer 3 handles addressing, routing, etc.
  • the TCP portion of TCP/IP (Transmission Control Protocol) is a layer 4 protocol and establishes reliable communication between the end stations by causing retransmission of packets using the IP protocol.
  • a “frame” is used as the messaging unit transferred by the physical layer on a hop between two computers.
  • Unreliable multicast communication is relatively simple to implement, as the source station simply transmits the datagrams with an address that the designated computers can recognize as a multicast address, and which routers forward. The destination stations then receive any datagrams which they detect. No attempt is made to either identify or retransmit lost datagrams.
  • One solution to the reliable multicast problem where the multicast message is to be received by a group of destination computers, has been to have an administrator (a person or a computer program operated by the person) set up a repair tree.
  • certain computers are designated as a "repair head”.
  • the rest of the computers of the group of destination computers are assigned to a designated repair head.
  • a source station transmits a multicast datagram onto the network.
  • the datagram should be received by all members of the destination group. Since the datagrams carry a sequence number, each destination station can determine whether it has missed a datagram.
  • Each station sends an ACK to its repair head upon successful reception of a window of datagrams, and sends a NACK to its repair head upon determining that it has missed a datagram.
  • Once every member of its repair group has acknowledged a datagram, the repair head flushes the datagram from its cache. The repair head retransmits any datagram for which it receives a NACK, until all members of its repair group respond with an ACK for each datagram.
  • In the event that a repair head is missing a datagram, it NACKs to the source station, and the source station retransmits the datagram.
  • the source station maintains a cache of transmitted datagrams and flushes them after receipt of an ACK from each of the repair heads affiliated with the original source station.
  • Congestion on the network can result from large numbers of ACK and NACK messages. Particularly, a destination station which is slower than the transmitting source station will miss many multicast datagrams. The resulting NACK messages can cause a NACK implosion and contribute to network congestion. Upon receipt of a NACK message, a source station or repair head will begin retransmission of datagrams, thereby contributing to even more congestion. Congestion can particularly increase when a low bandwidth link is responsible for a number of destination stations being slower than the source station. Each destination station will miss numerous datagrams, and will flood the network with NACK messages, followed by more retransmissions in a feedback cycle which increases congestion.
  • In Fig. 1 there is shown a multicast repair tree 100.
  • Sender station 102 is transmitting a multicast message to the other stations shown in Fig. 1.
  • Communication path 104 represents the fact that sender station 102 transmits a message having a multicast address, and this message is received by all of the addressed stations, 110-1, 110-2, ... 110-N directly from sender station 102.
  • Communication path 104 may include, physically, many hops through many physical networks.
  • Communication path 104 simply represents that the destination stations receive the multicast message directly from sender station 102.
  • Multicast repair tree 100 may be established as a static structure, by for example a person establishing the sender, head, and member status on each station.
  • the person normally has "network manager" status, and uses a computer program to establish the tree 100 by configuring the various stations through setting status in each member station by use of management control messages.
  • multicast repair tree 100 may be dynamically configured, as described more fully in the related application by M. C. Kadansky, et al. entitled “Dynamically Configured Tree Based Repair in Reliable Multicast Protocol", incorporated hereinabove by reference.
  • sender station 102 transmits beacon messages in order to assist in establishing the repair tree 100.
  • Beacon messages transmitted by sender station 102 are also used in management of congestion control, in accordance with the invention.
  • Destination station 110-4 is selected to be a repair head by the tree forming process, either static or dynamic. Destination stations 110-1, 110-2, 110-3, and 110-4 form a repair group 115. Repair head 110-4 caches messages received from sender station 102, and repair head 110-4 transmits ACK messages to sender station 102 along path 120 as numbered messages are successfully received by repair head 110-4.
  • Sender station 102 maintains a cache of messages which it has transmitted, and maintains a log of ACK messages received from various repair head stations so that it can clear a message from its cache after ACK messages have been received from all repair head stations, as will be more fully described hereinbelow.
  • Path 104 represents the multicast path where data, retransmission, and beacon messages flow.
  • Paths 120, 122, 124 and 126, etc. represent unicast flows of ACK and congestion messages.
  • Repair head 110-4 receives ACK messages from the destination stations in its repair group 115, including destination station 110-1 along path 122, destination station 110-2 along path 124, and destination station 110-3 along path 126. Repair head 110-4 maintains a cache of messages transmitted by sender station 102, and upon receipt of ACK messages from all of the member stations of its repair group, deletes the message from its cache.
  • Repair group 115 is illustrated in Fig. 1 as having four (4) members, receiver members 110-
  • a repair head may have many members in its repair group.
  • repair head 110-4 acts to receive the ACK messages from members of its repair group 115, and "repairs" missing messages transmitted by sender station 102 to destination stations 110-1, 110-2, and 110-3 of its repair group 115.
  • repair head 110-4 provides reliable multicast communication from sender station 102 to members of repair group 115.
  • Repair group 117 has as members stations 110-5, and 110-6, and 110-7, and 110-18, with member station 110-7 being the repair head.
  • Repair head 110-7 caches messages received from sender station 102 and transmits its ACK messages to sender station 102 along path 130.
  • Sender station 102 then retransmits a message for which it receives a NACK message from repair head 110-7.
  • Ordinary members of repair group 117 transmit their ACK messages to repair head 110-7: station 110-5 along path 132, station 110-6 along path 133, and station 110-18 along path 135.
  • Repair head 110-7 maintains a cache of all messages transmitted by sender 102, and deletes the messages as soon as an ACK is received from each of the member stations of repair group 117.
  • Repair group 119 illustrates a second level in the repair tree hierarchy.
  • Station 110-18 is a member of repair group 117.
  • Station 110-18 is also a repair head for repair group 119.
  • Repair group 119 has members 110-18, its repair head, and also station 110-8, station 110-9, and station 110-10.
  • Repair head station 110-18 maintains a cache of messages transmitted by sender station 102. Any messages missed by repair head station 110-18 are repaired by use of path 135 for sending ACKs (NACKs) to its repair head 110-7.
  • Repair head station 110-18 receives ACK messages from member stations: station 110-8, station 110-9, and station 110-10, and when an ACK has been received from all member stations of its repair group, repair head station 110-18 deletes the message from its cache.
  • repair group 140 has repair head 110-13 with additional member stations 110-11, station 110-12, and station 110-14.
  • Repair head station 110-13 transmits its ACK messages to sender station 102, and so is in the first level of the hierarchical multicast repair tree.
  • Station 110-14 is also a repair head station for repair group 150, and so is a second-level repair head station.
  • Member station 110-17 of repair group 150 is also a repair head station for repair group 160, and so is a third-level repair head station in the repair tree hierarchy.
  • the ACK messages are distributed among a plurality of repair head stations.
  • the number of members of each repair group is limited so that each repair head station can handle the ACK messages, and can also handle the retransmission of messages for which NACK information is received. No "ACK implosion" and no "NACK implosion" occur, both because the repair work is distributed over many computer stations and because congestion and flow control prevent excessive packet loss, and so reliable multicast communication is established.
  • the invention avoids an ACK implosion by spreading out the ACK (and NACK) messages so that a flood of them does not reach the repair head simultaneously.
  • the use by members of the ACK window for timing of transmission of the ACK messages helps to prevent too many ACK messages from reaching the transmitting station at the same time.
  • the ACK messages contain both acknowledgment information for packets received by the member station, and contain NACK information for packets not received by the member station, as based on the sequence numbers of the packets.
  • the term "ACK message" will be used throughout this patent to indicate a message returned by a receiving station to a transmitting station, where the message carries both ACK and NACK information.
  • the ACK window is defined for a multicast session by establishing the number of packets which make a full sequence of ACK windows. Receipt of a full window of packets is an event which triggers transmission of an ACK message by a member station.
  • the ACK window size is configurable, and the default number of packets which make a full sequence of ACK windows is thirty two (32) packets.
  • ACK messages are distributed over the next ACK window.
  • Each member is assigned a window (for example between 1 and 32) for sending its ACK messages. For example, one member may send ACKs after receiving messages 32, 64, 96, etc., while another sends ACKs at messages 10, 42, 74, etc.
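The staggering in the example above can be sketched as follows: a member with an assigned slot sends an ACK at that slot and then once per window thereafter. The function names are hypothetical, and the random slot choice is only one of the assignment options the text mentions.

```python
import random


def ack_points(slot, window_size=32, count=3):
    """Return the first `count` message numbers at which a member in `slot` sends an ACK."""
    return [slot + k * window_size for k in range(count)]


def choose_slot(window_size=32, rng=random):
    """Pick a slot between 1 and window_size at random (one possible assignment method)."""
    return rng.randint(1, window_size)
```

A member in slot 32 acknowledges at messages 32, 64, 96, and a member in slot 10 at messages 10, 42, 74, so their ACK messages reach the repair head at different times.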
  • the ACK messages may be sent as the next window of packets are being received, because at the levels of the protocol stack at which the invention operates, communication is full duplex.
  • acknowledgments are also sent if an ACK timer counts up to, for example, greater than 1.5 times an estimated ACK interval.
  • the estimated ACK interval is computed at each receiver when an ACK is sent.
  • the estimated ACK interval estimates the amount of time it takes to receive an ACK window's worth of messages. The formula is:
  • ACK interval = ACK window * (Time since last ACK / Packets since last ACK)
  • this timer indicates that the sender has paused and allows members to report and recover any lost packets without having to wait for the sender to start sending new data.
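As a worked illustration of the estimated ACK interval and the ACK timer, the sketch below computes the interval from the quantities named in the formula and checks the example threshold of 1.5 times the interval. The function names and the choice of seconds as the time unit are assumptions.

```python
def estimated_ack_interval(ack_window, time_since_last_ack, packets_since_last_ack):
    """Estimate the time needed to receive one ACK window's worth of messages."""
    if packets_since_last_ack <= 0:
        raise ValueError("at least one packet must have arrived since the last ACK")
    return ack_window * (time_since_last_ack / packets_since_last_ack)


def ack_timer_expired(elapsed, ack_interval, factor=1.5):
    """Return True when the time since the last ACK exceeds factor * ack_interval."""
    return elapsed > factor * ack_interval
```

With an ACK window of 32 packets and 16 packets received in the 2 seconds since the last ACK, the estimated interval is 4 seconds, so the timer would fire once more than 6 seconds pass without an ACK being sent.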
  • the ACK message format which both reports NACK information (that is, packet loss information) and acknowledgment information (ACK information) uses a bit map length field. Each ACK message contains the bit map length field and a start sequence number. If no packets were missing, the bit map length is "0" and the sequence number indicates that all packets prior to and including this packet were successfully received. The repair head saves this information and uses it to flush packets from its cache.
  • the start sequence number indicates the first missing packet.
  • a bit map must follow. Each bit in the map represents a packet sequence number starting with the start sequence number. If the bit is set, for example, that packet is missing and must be retransmitted.
  • When the repair head receives an ACK message with a missing packets bit map, the sequence number specified minus 1 is saved for this member. This indicates that all packets prior to and including this sequence number have been received successfully. The repair head then scans the bit map looking for missing packets. It immediately places these packets onto the transmit queue unless they have recently been retransmitted or are already on the queue from another request. Missing packet retransmission receives first priority in the transmission queue, so that packets may be flushed from the transmitter cache.
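The repair head's handling of an ACK message might look roughly like the sketch below. The representation of the bit map as a Python list of 0/1 flags, the use of a plain list of sequence numbers as the transmit queue, and all names are assumptions; the actual wire format is defined by the figures, which are not reproduced here.

```python
def process_ack(start_seq, bitmap, transmit_queue, recently_retransmitted):
    """Process one ACK message and return the highest sequence number acknowledged.

    bitmap[i] == 1 is taken to mean that packet start_seq + i is missing.
    """
    if not bitmap:
        # Bit map length 0: everything up to and including start_seq was received.
        return start_seq
    for offset, bit in enumerate(bitmap):
        seq = start_seq + offset
        if not bit:
            continue
        # Skip packets already queued or recently retransmitted (duplicate avoidance).
        if seq in transmit_queue or seq in recently_retransmitted:
            continue
        transmit_queue.append(seq)
    # start_seq is the first missing packet, so start_seq - 1 is fully acknowledged.
    return start_seq - 1
```

For instance, with packet 12 already queued and packet 13 recently retransmitted, an ACK with start sequence 10 and bit map `[1, 0, 1, 1]` adds only packet 10 to the queue and records 9 as acknowledged for that member.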
  • the cache buffers in the source station, and in the repair head stations hold packets which have been transmitted but for which an ACK has not yet been received. It is necessary to prevent overflow of this buffer, and accordingly, the "fill level" of this buffer is monitored.
  • a threshold is assigned for the fill level. When the threshold for the fill level is exceeded, the sender station stops sending packets and waits for ACK messages so that it can flush acknowledged packets from its buffer. This wait is a pause in the transmission of packets, and causes the ACK timer in members to expire, and the members then to transmit an ACK message.
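A minimal sketch of the fill-level check follows. It assumes a cache keyed by sequence number, a threshold expressed as a fraction of capacity, and a single cumulative acknowledgment point; a real sender would flush only after hearing from all of its repair heads. All names are hypothetical.

```python
class RetransmitCache:
    """Illustrative cache of sent-but-unacknowledged packets."""

    def __init__(self, capacity, threshold):
        self.capacity = capacity      # maximum number of cached packets
        self.threshold = threshold    # fraction of capacity that triggers a pause
        self.packets = {}             # sequence number -> payload

    def add(self, seq, payload):
        self.packets[seq] = payload

    def flush_through(self, acked_seq):
        # Drop every packet whose sequence number is acknowledged.
        for seq in [s for s in self.packets if s <= acked_seq]:
            del self.packets[seq]

    def fill_level(self):
        return len(self.packets) / self.capacity

    def should_pause(self):
        # The sender stops sending while the fill level exceeds the threshold.
        return self.fill_level() > self.threshold
```

With a capacity of four packets and a threshold of 0.5, caching a third packet pushes the fill level to 0.75 and pauses the sender; flushing through sequence number 2 lowers it to 0.25 and sending can resume.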
  • Upon receipt of an ACK message indicating that packets have been lost, the transmitting station reduces its transmission rate by multiplying its current rate by a fraction, for example 0.25 (multiplicative decrease).
  • A transmitting station has a "slow start". That is, the transmitting station begins transmission at a "slow" or minimum rate of packet transmission and slowly increases its rate by "additive increase" until it receives notice that packets are being lost, at which time the transmitting station reduces its transmission rate. After the reduction, the station again begins increasing its transmission rate. The transmission rate therefore oscillates, and attempts to fully utilize the bandwidth of the network.
  • a key point about the slow start phase of the multicast transmission session is that the historical high rate is established as the maximum rate for which packets were not lost.
  • a further feature of the invention in retransmission operation is to avoid duplicate retransmission. Whenever an ACK message indicates that a packet must be retransmitted, first a check is done in order to determine if that packet is already in the transmit queue. If the packet is already in the transmit queue, the new ACK request is ignored.
  • Stable system operation is achieved by the invention.
  • synchronization between feedback, from reading ACK messages, and control in increasing or decreasing the transmission rate is maintained in order to achieve stable system operation.
  • Synchronization is maintained by requiring the transmitting station to immediately decrease the rate after receiving feedback which indicates congestion. The station is then required to wait for a number of windows before implementing either another decrease or an increase in rate. This wait permits the effect of the change to be felt throughout the network, and for the most recent feedback to be in response to that change.
  • Pruning of receiving stations from the multicast network is done when a station gives evidence that it cannot keep up with the transmitting station when it is operating at its minimum transmit rate. Evidence that a station cannot keep up appears at the transmitting station as an excessive number of retransmission requests. When the number of retransmission requests becomes excessive, the offending station is dropped (that is, pruned) from the multicast tree.
  • a still further feature in accordance with the invention is that an "end of packet" beacon frame is transmitted by the source station after all packets of the multicast transmission sequence have been transmitted. This "end of packet" beacon frame informs the various stations of the sequence number of the last packet, so that retransmission requests can be appropriately formulated.
  • Communications
  • Fig. 2 is a schematic block diagram of a computer internetwork 200 comprising a collection of interconnected communication media attached to a plurality of stations.
  • the communication media may be configured as local area networks (LANs) 210 and 220, although other media configurations such as point-to-point network links may be advantageously employed.
  • the stations are typically computers comprising source and destination end stations 302, 312, such as personal computers or workstations, and intermediate stations 320a-e such as routers, bridges, switches and/or firewalls. Communication among the stations is typically effected by exchanging discrete data frames or packets between the communicating nodes according to a predefined protocol, such as the Internet protocol (IP), Internet Packet Exchange protocol, AppleTalk protocol or DECNet protocol.
  • An exemplary embodiment of the invention is referred to as the Tree based Reliable Multicast protocol, or TRAM model as a shorthand.
  • the inventive multicast transport protocol utilizes a hierarchical tree-based multicasting system for transferring identical data from a single sender to multiple receivers (clients).
  • the sender and receiving clients in a multicast session interact with each other to dynamically form repair groups.
  • the repair groups are linked together in the hierarchical tree with the sender at the root of the tree.
  • Each repair group has a receiver that functions as a group head for other receiving clients in the tree. Except for the sender, each repair group head in the system is a member of another repair group.
  • Group members report lost and successfully received messages to their associated group head using a selective acknowledgment mechanism.
  • the repair heads store ("cache") every message received from the sender, and provide repair services (i.e., retransmission) of messages that are reported lost by the members.
  • Each repair head monitors the operation of the members of its respective repair group to ensure that the members are functioning properly. Likewise, each of the members of a given repair group monitor the operation of the repair head associated with that group to ensure proper functioning of the head. If a repair head determines that a member of its group is no longer functioning (e.g., as a result of failure of the member to acknowledge receipt of special monitoring messages after a predetermined number of messages have been transmitted and/or a predetermined time period for response has elapsed), the repair head may prune that member from its group.
  • If a member of a repair group determines that the current repair head that it is associated with is no longer functioning properly (e.g., if the member does not receive special monitoring messages from the head), the member may seek to re-affiliate itself with a different repair head that it has learned of as a result of receipt of monitoring messages from that different head.
  • the members of the group may also re-affiliate themselves with a different repair head if the current repair head with which they are associated resigns from being the repair head for that group. Such resignation may occur if the repair head determines that it is redundant in the region of the system in which it resides.
  • the flow and congestion control mechanism is generally rate-based and adjustable, based upon network congestion. That is, the transmission rate of multicast data packets is dynamically adjusted based upon the rate at which the receiving clients can accept and process the data packets.
  • a pruning mechanism is also provided whereby receiving clients that are unable to receive and process the data packets at a minimum threshold rate are removed from the tree system.
  • each multicast data packet transmitted from the sender includes a unique sequence number. Receiving clients utilize these numbers to detect out-of-order and missing packets, and to request retransmission of the missing packets from the repair head with which they are associated.
  • Each of the repair heads maintains a cache of multicast packets received from the sender and flushes the packets out of the cache after receipt of the cached packets by all of the members of its repair group has been acknowledged.
  • each member of a repair group selects a random packet between one and a predetermined acknowledgment window size to begin transmission of acknowledgment messages.
  • each repair head computes the average data rate of all packets it receives, and sends retransmissions to the members of its group at this rate.
  • Congestion is detected at the receiving clients and repair heads, and is used to dynamically adjust transmission rate of packets in the system. More specifically, the receiving clients transmit congestion messages to their repair heads based upon changes in the number of data packets that the receiving clients failed to receive between the preceding two acknowledgment windows.
  • When the repair head receives these congestion messages, it generates a congestion message for each acknowledgment window and forwards that message to its own repair head.
  • Each repair head also generates congestion messages when its data cache (i.e., for retransmission purposes) equals or exceeds a predetermined maximum fill level.
  • the repair head may also adjust upwardly its maximum cache fill level, if possible.
  • the sender adjusts its data transmission rate based upon the congestion it receives as well as its own cache fill level for its immediate group members, while staying within predetermined minimum and maximum data transmission rates.
  • the sender increases transmission rate every second acknowledgment window in the absence of congestion reports.
  • the sender immediately reduces transmission rate and records the window for which the congestion report was generated, and thereafter, the sender does not further adjust transmission rate until a predetermined number N of acknowledgment windows have transpired, wherein N is proportional to the current data transmission rate divided by the historically highest achieved transmission rate.
  • After each rate decrease, the next increase in transmission rate is equal to the historically highest achieved rate minus the current data transmission rate, divided by a number, for example 4.
  • After receipt of a congestion report, the sender reduces its data transmission rate by a predetermined percentage (e.g., 50% or 25%) of the current data transmission rate.
  • the sender's data transmission rate never exceeds, and never falls below, respective predetermined maximum and minimum thresholds.
  • the sender notifies all members of the session when it has completed data transmission by transmitting a beacon packet that includes the sequence number of the last data packet transmitted. The sender retransmits this packet periodically until all of the members of its immediate repair group have acknowledged receipt of all packets sent.
  • When a member receives the beacon packet, it immediately sends an acknowledgment to its repair head indicating whether it has received all of the packets transmitted, or requires packet retransmission. If the beacon from the sender is received, but a member has not acknowledged receipt of all data packets, a monitoring message is transmitted from the repair head associated with that member. If the member does not acknowledge receipt of such message to the repair head sending the monitoring message, the repair head may retransmit the monitoring message. If, after a predetermined number of retransmissions of the monitoring message, the member has still failed to acknowledge receipt, the repair head prunes the member from the tree. When all members have either acknowledged receipt of all data packets to the repair head or have been pruned from the tree, the repair head terminates its session.
  • Receivers are organized into a tree, with a sender being at the root of the tree.
  • the tree is used primarily for distributing the load of retransmitting lost packets, but also serves as a channel for aggregating feedback from receivers to the sender. This tree is referred to as the repair tree. ACKs and scattering them apart.
  • Each multicast packet has a sequence number.
  • the multicast packets are grouped into windows, each window having a predetermined number of packets, for example, packets (with sequence numbers) 1-32 are grouped into Window 1, packets 33-64 are grouped into Window 2, etc.
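  • The grouping above can be sketched as a small helper. This is a minimal illustration, assuming 1-based sequence numbers and the example window size of 32; the name `window_of` is hypothetical and not part of the protocol.

```python
def window_of(seq: int, window_size: int = 32) -> int:
    """Return the 1-based window number for a 1-based packet sequence number."""
    if seq < 1:
        raise ValueError("sequence numbers start at 1")
    # Packets 1..32 fall in window 1, packets 33..64 in window 2, and so on.
    return (seq - 1) // window_size + 1
```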
  • a receiver reports to its parent (in the repair tree) those packets in the window that were received and those which were lost.
  • each receiver in the group associated with the parent tries to choose a different point in the window to report to its parent, so that the parent does not receive a flurry of reports at once.
  • the parent is responsible for retransmitting the lost packets to its children. To fulfill this function, the parent keeps all packets that have not been reported as received. These packets are preferably kept in a retransmission buffer.
  • Congestion detection
  • Either of the following two conditions triggers congestion feedback:
  • a receiver which is a repair node (non-leaf repair head) in the repair tree determines that its retransmission buffer has reached a high water mark (e.g. 4 ACK windows of packets).
  • Congestion feedback includes the following information:
  • the congestion feedback is sent by the receiver to its parent. It is usually sent as part of an ACK message.
  • Each parent keeps track of the latest window for which a congestion report has been made. If a congestion feedback contains a window number no higher than the last counted window reported, it is ignored; otherwise, it is propagated up towards the root of the tree (and the latest window at the current node is reset).
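  • A minimal sketch of this per-node filtering, assuming window numbers are positive integers. The class and method names (`CongestionFilter`, `should_propagate`) are hypothetical and used only for illustration.

```python
class CongestionFilter:
    """Tracks the latest window for which congestion was reported at one node."""

    def __init__(self) -> None:
        self.latest_window = 0  # no congestion report seen yet

    def should_propagate(self, window: int) -> bool:
        """Return True if the report is new and should go up towards the root."""
        if window <= self.latest_window:
            return False  # window already counted: ignore the report
        self.latest_window = window  # reset the latest window at this node
        return True
```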
  • packets are scheduled for transmission according to a pre-determined rate. This is achieved by injecting the right amount of sleep time between packet departures so that the average data rate matches the predetermined rate.
  • the sender computes the following:
  • the rate is adjusted based on congestion feedback, staying in between a pre-configured minimum and maximum rate.
  • the goal is to adaptively find the optimal rate based on the available bandwidth of all the links involved and the speed of all the receivers involved. In the absence of congestion feedback, the rate is gradually increased; in response to congestion, the rate is decreased.
  • a key is how exactly to increase and decrease so as to adapt to the changing optimal rate quickly, while minimizing oscillation due to overshooting, undershooting and other non-optimal behavior.
  • the amount of adjustment to the rate in the face of congestion is adaptively determined. For example, if the bottleneck bandwidth is 100 Mb/s, then increasing 10 Kb/s at a time is too small; on the other hand, if the bottleneck bandwidth is 100 Kb/s, increasing 1 Mb/s at a time is too large.
  • One adaptive technique is to increase or decrease the rate by a percentage (e.g., 10%) of the current rate. Using this technique, following a period of severe congestion the current rate would be very low; increasing by 10% of that low rate would lead to a long recovery time, if not a failure to recover.
  • Rate oscillation is by design the way to adapt to the network as it changes over time. Even if the network stays still, the sender will keep increasing its rate in the absence of congestion until it senses congestion, and will then decrease its rate. The ideal outcome is to oscillate by a small amount (say +/- 25%) around the optimal bandwidth.
  • the rate controller does not increase the rate beyond this maximum.
  • the increment is added to the current rate when no congestion is sensed after a small number of windows of packets transmitted, for example 2 windows.
  • a "historical high rate" is remembered. From that point on, the amount the rate is increased each time is based on this historical high. Later, when a new high is reached, the historical high is replaced by the new value.
  • the initial phase of searching for the historical high is referred to as the "initial network sensing phase" (analogous to the "slow start phase" in TCP).
  • Rules for Decreasing the Rate
  • the sender decreases the rate by a percentage (for example 50%) without going below the minimum rate. A percentage decrease is a relative value, so it adapts to different networks. The use of a rather large value, for example 50%, is justifiable because when congestion is detected the rate typically has already gone far beyond the optimal rate.
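  • The decrease rule can be sketched as follows. This is an illustration only: the function name `decrease_rate` and the parameter `fraction` are hypothetical, and 0.5 is simply the example value from the text.

```python
def decrease_rate(rate: float, min_rate: float, fraction: float = 0.5) -> float:
    """Cut the rate by `fraction` of its current value, never below `min_rate`."""
    return max(min_rate, rate * (1.0 - fraction))
```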
  • the current transmission rate is not a good indicator of current throughput (e.g., that achieved during a window):
  • the sender monitors its actual achieved rate for each interval between successive rate changes. This includes transmission of both new packets and retransmission of old packets. After a rate increase, if the new rate is higher than the actual rate, the rate is reset to the average of the new rate and the actual rate. By limiting the rate from becoming too far ahead of the actual rate, this algorithm makes the transmission less bursty and avoids losses in some scenarios, without significantly affecting long term throughput.
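  • A minimal sketch of this damping step, assuming both rates are expressed in the same unit. The name `damp_rate` is hypothetical.

```python
def damp_rate(new_rate: float, actual_rate: float) -> float:
    """After an increase, pull the rate back towards what was actually achieved."""
    if new_rate > actual_rate:
        # Reset to the average of the new rate and the monitored actual rate.
        return (new_rate + actual_rate) / 2.0
    return new_rate
```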
  • the receivers report congestion (or lack thereof) every window full of packets (as part of the ACK).
  • the window size is one of the basic parameters of a multicast session. If not configured, a default is used.
  • the sender reacts to congestion (or lack thereof) for every n windows, where n is a function of the following factors: current rate; historical high rate; latency information (from receiver to sender); and number of levels of repairers between senders and receivers.
  • $n = \max\{N,\ (\text{current rate} / \text{historical high}) \times M\}$, where N and M are constants, such as 2 and 4.
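  • Reading the rule above as n = max(N, (current rate / historical high) × M), a sketch of the computation might look like this. The function name is hypothetical, the defaults are the example constants 2 and 4, and whether n is rounded to a whole number of windows is not stated in the text, so the raw value is returned.

```python
def windows_to_wait(current_rate: float, historical_high: float,
                    n_min: float = 2, m: float = 4) -> float:
    """Number of ACK windows between successive reactions to feedback."""
    if historical_high <= 0:
        raise ValueError("historical high rate must be positive")
    # Assumption: no rounding is applied; callers may round as they see fit.
    return max(n_min, (current_rate / historical_high) * m)
```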
  • the sender temporarily stops sending packets over the network if more than H windows of packets are in the network without any report of their status, where H is a constant, for example 4. This is used to limit the extent the receiver and sender get out of synch with each other.
  • the sender uses its current transmission rate to schedule retransmissions.
  • Each repairer uses the average rate (from the beginning of transmission to current time) as the rate to schedule retransmission. Retransmissions go out before new packets at the sender.
  • Additional information in feedback to account for retransmission
  • Congestion feedback may contain additional information to help the sender deal with congestion. These are:
  • If the congestion report comes from a receiver in the leaf of the tree, it may contain the number of missing packets at this receiver; this helps whoever is to do the repair for this receiver estimate the amount of time needed;
  • If the congestion report comes from an interior node in the reporting tree, it may contain the time estimate for doing retransmissions.
  • the scalable, reliable multicast transport protocol supports bulk data transfer with a single sender and multiple receivers of a computer internetwork, such as an intranet or Internet.
  • TRAM uses reliable multicast repair trees that are optimized to implement local error recovery, and to scale to a large number of receivers without substantially impacting the sender.
  • the protocol includes a flow and congestion control technique that enables reliable, efficient and fair operation of TRAM with other protocols across a wide variety of link and entity characteristics of the computer internetwork.
  • the TRAM model is tree-based.
  • the ACK reporting mechanism is window-based, including optimizations to reduce burstiness and processing overhead.
  • the flow control mechanism is rate-based and adapts to network congestion.
  • the sender senses and adjusts to the rate at which the receivers can accept the data. Receivers that cannot keep up with a minimum data rate can be dropped from the repair tree.
  • TRAM guarantees delivery of data to any receiver in the tree that is able to keep up with the minimum transmission speed specified by the sender. While this level of guarantee cannot ensure applications against delivery failure, features can be used to closely keep track of individual members' status.
  • Each member of the tree periodically reports statistics to its repair head. This includes statistics that assist in building the tree in dynamic tree embodiments of the invention (for instance, the number of available repair heads on the tree) as well as reports on congestion conditions. Reports on congestion conditions from repair heads allow the sender to adapt its data rate to network conditions. This information is aggregated at each level of the tree in order to reduce control traffic to the sender.
  • Each repair head is responsible for ensuring that the data is received by all of its members. This means that a repair head must cache data until it is sure that all of its members have received it.
  • TRAM requires positive acknowledgments from members when data is received. This enables repair heads to reclaim cache buffers containing data that has been received by all members.
  • Both members and repair heads monitor each other to detect unreachability. Non-responsive members can be dropped from the repair group and corresponding cache buffers can be reclaimed.
  • Non-responsive repair heads can be abandoned by their members in favor of an active repair head.
  • Repair heads are also responsible for detecting receivers which cannot keep up with the minimum transmission rate specified by the sender. While such members cannot be dropped from the multicast group, they can be denied repair head support and receive no repairs.
  • TRAM has been designed to be scalable in many situations, such as large numbers of receivers and sparsely or densely populated receiver groups. TRAM also accommodates wide ranges of receiver capabilities. Control message traffic is designed to be limited in all of these cases.
  • Flow Control
  • TRAM is designed to transfer bulk data from one sender to many receivers.
  • the data is transmitted at a rate that adjusts automatically between a specified minimum and maximum.
  • Sequence numbers in the data packets allow receivers to identify missing packets.
  • Each member is bound to a repair head that retransmits lost packets when requested.
  • Acknowledgments sent by receivers contain a bitmap indicating received and missing packets. Missing packets are repaired by the repair head. Packets acknowledged by all members are removed from the repair head's cache.
  • the sender in a TRAM application transmits data packets to every receiver in the multicast group.
  • TRAM sends the packets at a specified rate.
  • Each packet is given a unique sequence number starting with one (1).
  • Receivers use these numbers to detect out of order and missing packets.
  • the ACK window size is configurable; the default is 32. For example, in the default case, members send ACKs every 32 packets.
  • ACK messages are distributed over the window.
  • Each member selects a random packet between 1 and the ACK window size to start sending ACK messages. For example, one member may send ACKs at packets 32, 64, 96, etc., while another sends ACKs at packets 10, 42, 74, etc.
  • Acknowledgments are also sent if a timer equal to 1.5 times the estimated ACK interval expires.
  • This timer is canceled if an ACK is sent using the triggering mechanism described above. If this timer expires, it indicates that the sender has paused and allows members to report and recover any lost packets without having to wait for the sender to start sending new data.
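  • The scattering of ACKs and the fallback timer can be sketched as below. This is an illustration under stated assumptions: the helper names are hypothetical, and the random source is Python's standard `random` module rather than anything prescribed by the protocol.

```python
import random
from typing import Optional

ACK_WINDOW = 32  # default ACK window size from the text


def pick_ack_offset(window_size: int = ACK_WINDOW,
                    rng: Optional[random.Random] = None) -> int:
    """Pick the packet (1..window_size) at which this member starts ACKing."""
    rng = rng or random.Random()
    return rng.randint(1, window_size)


def is_ack_point(seq: int, offset: int, window_size: int = ACK_WINDOW) -> bool:
    """True when an ACK is due at packet `seq` for a member with this offset."""
    return seq >= offset and (seq - offset) % window_size == 0


def ack_timeout(estimated_ack_interval: float) -> float:
    """Fallback timer: 1.5 times the estimated ACK interval."""
    return 1.5 * estimated_ack_interval
```

  • With an offset of 10, `is_ack_point` selects packets 10, 42, 74 and so on, matching the example in the text.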
  • Each ACK message contains a start sequence number and a bit map length. If no packets were missing, the bit map length is "0" and the sequence number indicates that all packets prior to and including this packet were successfully received.
  • the repair head saves this information and uses it to remove packets from its cache.
  • If packets are missing, the start sequence number indicates the first missing packet.
  • In that case a bit map must follow. Each bit in the map represents a packet sequence number starting with the start sequence number. If the bit is set, that packet is missing and must be retransmitted. A bit map length indicates how many valid bits are present.
  • When the repair head receives an ACK message with a "missing packets" bit map, the sequence number specified minus 1 is saved for this member. This indicates that all packets prior to and including this sequence number have been received successfully. The repair head then scans the bit map looking for missing packets. It immediately places these packets onto the transmit queue unless they have recently been retransmitted or are already on the queue from another request.
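  • A sketch of how a repair head might interpret an ACK message as described above. The representation of the bit map as a list of booleans and the name `decode_ack` are assumptions made for illustration; the wire format is not given in the text.

```python
from typing import List, Tuple


def decode_ack(start_seq: int, bitmap: List[bool]) -> Tuple[int, List[int]]:
    """Return (highest contiguous sequence received, missing sequence numbers)."""
    if not bitmap:
        # Bit map length 0: everything up to and including start_seq arrived.
        return start_seq, []
    # Bit i set means packet start_seq + i is missing and needs retransmission.
    missing = [start_seq + i for i, bit in enumerate(bitmap) if bit]
    # start_seq is the first missing packet, so start_seq - 1 is saved.
    return start_seq - 1, missing
```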
  • When a repair head receives a request to retransmit a packet, it retransmits it as soon as possible. Retransmissions take priority over new data packets. Retransmitted packets are sent at the current rate used for new data packets from the sender. Each repair head computes the average data rate of all packets it receives and sends retransmissions at this rate.
  • TRAM Duplicate Retransmission Avoidance
  • TRAM sends the packet immediately for the first request. Subsequent requests are ignored if they are received within 1 second of the first request.
  • Every repair head in TRAM keeps track of the lowest packet sequence number that all members have received. Before a repair head retransmits a packet that has been waiting to be retransmitted, it again checks the sequence number of the packet to be retransmitted against this lowest packet number. If the retransmission packet sequence number is lower, the repair head skips this retransmission because all of its members have already acknowledged the receipt of the packet. This can happen when multiple repair heads retransmit the same packets and their transmission range overlaps.
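  • The two avoidance checks above (suppressing repeat requests within 1 second, and skipping packets that all members have already acknowledged) can be sketched together. The class `RetransmitQueue` and its fields are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```python
from typing import Dict, List, Optional


class RetransmitQueue:
    """Illustrative duplicate-retransmission avoidance for one repair head."""

    def __init__(self, suppress_seconds: float = 1.0) -> None:
        self.suppress_seconds = suppress_seconds
        self.first_request: Dict[int, float] = {}  # seq -> time of first request
        self.queue: List[int] = []
        self.lowest_acked_by_all = 0  # lowest sequence number tracked for all members

    def request(self, seq: int, now: float) -> bool:
        """Queue a retransmission unless it duplicates a recent or pending one."""
        first = self.first_request.get(seq)
        if first is not None and now - first < self.suppress_seconds:
            return False  # within 1 second of the first request: ignore
        if seq in self.queue:
            return False  # already on the transmit queue
        self.first_request[seq] = now
        self.queue.append(seq)
        return True

    def next_to_send(self) -> Optional[int]:
        """Pop the next packet, skipping ones every member already acknowledged."""
        while self.queue:
            seq = self.queue.pop(0)
            if seq < self.lowest_acked_by_all:
                continue  # lower than the tracked number: skip retransmission
            return seq
        return None
```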
  • TRAM's packet scheduler computes the amount of time to delay each packet in order to achieve the desired data rate.
  • the delay is computed with the formula:
  • TRAM then sleeps for the calculated period, sends the packet, and the cycle continues. This is similar to the widely known token bucket algorithm.
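  • The formula itself does not appear at this point in the text. As a hedged illustration only, one common way to pace packets so that the average rate matches a target is to wait for the packet size divided by the rate; this is an assumption, not the formula from the source, and the function name is hypothetical.

```python
def inter_packet_delay(packet_bytes: int, rate_bytes_per_sec: float) -> float:
    """Seconds to sleep before sending a packet of the given size.

    Assumption: delay = packet size / target rate (not taken from the source).
    """
    if rate_bytes_per_sec <= 0:
        raise ValueError("rate must be positive")
    return packet_bytes / rate_bytes_per_sec
```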
  • TRAM's flow control uses various algorithms such as slow start and congestion control to dynamically adapt to network conditions. A maximum and a minimum rate can be specified to limit the operation of these algorithms. The minimum rate effectively defines the receiver population. Any receiver that cannot keep up with the minimum rate will be pruned (no longer guaranteed repairs).
  • TRAM sessions go through two phases of flow control: slow start; and congestion control.
  • the slow start phase is the initial phase during which TRAM carefully tests the network to find an appropriate operating point. This is analogous to TCP's slow start. After the slow start phase, TRAM will have established some boundaries for its operation and enters the congestion control phase.
  • the initial data rate starts at 10% of the maximum, or the minimum rate if that is greater. Every two ACK windows this rate is increased another 10% of the maximum data rate. This process continues until the maximum rate is reached or congestion causes the rate to decrease.
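  • A minimal sketch of the slow start schedule described above. The function names are hypothetical; the caller is assumed to invoke `slow_start_step` once every two ACK windows while no congestion is reported.

```python
def initial_rate(min_rate: float, max_rate: float) -> float:
    """Start at 10% of the maximum, or at the minimum rate if that is greater."""
    return max(max_rate / 10.0, min_rate)


def slow_start_step(rate: float, max_rate: float) -> float:
    """Add another 10% of the maximum rate, capped at the maximum."""
    return min(rate + max_rate / 10.0, max_rate)
```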
  • Congestion is detected at the receivers and repair heads.
  • Receivers detect and report congestion based on missing packets.
  • Repair heads detect and report congestion based on their cache usage.
  • Receivers detect and report congestion based on an increase in the number of missing packets between two ACK windows. For example, if a receiver detects 5 missing packets during an ACK window, and has 10 packets missing in the next window, a congestion message is sent to its repair head. The congestion message contains the highest sequence number received. When the repair head receives the congestion message, it determines whether this is a new congestion report and if so, forwards it immediately up to its repair head. Each repair head will forward one congestion packet from its members for each ACK window. The repair head computes the ACK window from the sequence number specified in the congestion message with the formula: sequence number / ACK window size.
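  • The receiver-side test and the window computation can be sketched as follows. The names are hypothetical, and treating the division as integer division is an assumption, since the text gives only "sequence number / ACK window size".

```python
def congestion_detected(missing_prev: int, missing_now: int) -> bool:
    """Congestion is reported when losses grew between two ACK windows."""
    return missing_now > missing_prev


def report_window(highest_seq: int, ack_window_size: int = 32) -> int:
    """ACK window of a congestion message (assumed integer division)."""
    return highest_seq // ack_window_size
```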
  • congestion reports for its ACK window and previous ACK windows will be ignored. The sender does not react to multiple congestion reports for the same window.
  • Repair heads also generate congestion messages when their data caches begin to fill up. Each repair head maintains a low and high water mark on its cache. When the number of packets in the cache reaches the high water mark, an attempt is made to purge the cache. If it can't purge the cache below the high water mark because a member has not acknowledged these packets, a congestion message is forwarded up the repair tree. In this situation the repair head temporarily increases its high water mark to the current value plus the number of packets in an ACK window. The repair head performs the same test when this new temporary high water mark is reached. If the cache is exhausted, new packets are dropped without acknowledging them.
  • the sequence number in a repair-cache-generated congestion message is the highest sequence number the repair head has received.
  • the sender also maintains a cache for its immediate group members. If its cache fills up to the high water mark and can't be reduced, it reacts as if it received a congestion message for that window. It also temporarily increases its high water mark to the current value plus the size of an ACK window. If the cache fills to this new level and can't be reduced, it reacts again.
  • the application can start sending data again.
  • the sender reacts to congestion feedback as follows: react to selected congestion reports; decrease the rate in the face of congestion; and, increase the rate in the absence of congestion.
  • TCP congestion control is based on TCP traffic and the algorithms implemented in the TCP protocol.
  • One of the key ingredients of TCP's algorithm is to follow the additive increase/multiplicative decrease rule.
  • Because TRAM transmission is rate based, an immediate problem is determining the right rate increase in the absence of congestion.
  • TCP is a window based protocol in which increases are done by incrementing the congestion window by one, a dimensionless parameter.
  • the correct amount to increase TRAM's rate would be a small fraction of the bottleneck bandwidth. Plus, this amount would need to be adopted by all the flows sharing the same bottleneck. This increase is not easily determined. A constant increase amount will always be wrong for some topologies. Although we used 10% of the maximum rate for slow start, it does not seem suitable since the maximum rate may be far from the current bottleneck rate. Instead, TRAM derives the increase amount dynamically as follows. TRAM keeps track of the historically highest achieved rate (HHR). After each rate decrease, a new increase amount is calculated as:
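  • The formula is missing at this point in the text. The summary earlier in the document states that the next increase equals the historically highest achieved rate minus the current rate, divided by a number such as 4. A minimal sketch under that reading, with a hypothetical function name:

```python
def increase_amount(hhr: float, current_rate: float, divisor: float = 4.0) -> float:
    """Increase step computed after a rate decrease: (HHR - current) / divisor."""
    return (hhr - current_rate) / divisor
```

  • Because the step is derived from the gap to the historical high, it shrinks as the rate approaches that high and grows after a deep cut.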
  • the receipt of a congestion report causes the data rate to drop by a percentage (for example 50%, and we are also experimenting with 25%). This is the same as TCP. Adjusting the rate by a percentage is very appealing since the adjusted amount is dimensionless, hence there are no calibration problems.
  • the next important aspect of a congestion control scheme is how to keep the feedback and control in synchrony with each other.
  • each control action is based on the feedback from the network that reflects the consequence of the previous control. From systems theory, we know that there is a chance of building a control that leads to optimal and stable behavior.
  • In order to make sure that each congestion feedback includes the previous control actions, the congestion control algorithm must wait for several windows before acting on the feedback. This tends to make the system less responsive to topologies with a small number of receivers.
  • TRAM increases the rate every other ACK window in the absence of congestion reports.
  • TRAM includes mechanisms to take into account the effect of retransmission when determining the rate during periods of congestion.
  • the receivers report an estimate of how long it will take to do local repairs. This information is aggregated back to the sender.
  • In reaction to a congestion message, the sender not only reduces its rate, but also pauses briefly to let the local repairs complete.
  • the NS simulation tool used was version 2 of the Network Simulator, http://www.mash.cs.berkeley.edu ns.
  • the TRAM protocol is modeled using the simulation tool NS.
  • the object-oriented simulation environment, the scripting language support, plus the companion animation tool, NAM, enable effective experiments.
  • a network topology 900 used in simulation is shown. The results from two variants of the topology shown in Figs. 4A, 4B, and 4C are disclosed.
  • the basic network consists of two hundred one (201) nodes.
  • the sender agent 902 runs on one node, repairer agents run on 24 of the nodes, and pure receiver agents run on 168 of the nodes.
  • the other 8 nodes are router-only nodes.
  • the whole network is symmetric.
  • Each repairer is 2 hops away from the sender, and each receiver is 1 hop away from its parent (repair head).
  • the links from the sender to the first tier routers are 1.5 Mb/s links with 50 msec delay.
  • the rest of the links are 1.5 Mb/s links with 10 msec delay, except 3 of these links are 0.5 Mb/s links with 10 msec delay. These 3 slow links are the bottleneck of the multicast session.
  • the 3 slow links are further programmed to deterministically go up and down. Every 3 seconds, they go down for 0.05 seconds. On average the down time is less than 2%.
  • the experiment is to send 1000 packets of 1400 bytes each. The sending starts at 1.5 seconds from the beginning of the simulation. Since the limiting bandwidth is 0.5 Mb/s, assuming ideal scheduling the whole transmission should take 22.4 and 23 seconds for the two variants (without and with link up/down dynamics), respectively.
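  • The 22.4 second figure follows directly from the payload size and the bottleneck bandwidth. The second figure is not derived in the text; one plausible reading, shown here as an assumption, scales the ideal time by the fraction of time the slow links are up (0.05 seconds down in every 3 seconds), which gives a value slightly under 23 seconds.

```python
def ideal_transfer_seconds(packets: int, packet_bytes: int, link_bps: float) -> float:
    """Time to push the whole payload through the bottleneck at full rate."""
    return packets * packet_bytes * 8 / link_bps


ideal = ideal_transfer_seconds(1000, 1400, 0.5e6)  # static network
availability = 1 - 0.05 / 3  # assumed: links are down 0.05 s out of every 3 s
with_outages = ideal / availability  # assumed model for the up/down variant
print(round(ideal, 1), round(with_outages, 1))
```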
  • Fig. 5 and Fig. 6 show how TRAM did for each of the two cases.
  • the X axis is time in units of seconds.
  • the Y axis scale is used for a number of things:
  • Fig. 5 is a graph 10,002 showing results for a network without up/down dynamics.
  • Fig. 6 is a graph 11,002 showing results for a network with up/down dynamics.
  • Fig. 5 for a static 201 node network is next discussed.
  • the top curves represent the monitored rate and the send rate.
  • the next curve is the cache occupancy observed at a repair head that is responsible for a receiver that is behind the slow link. Losses are shown along the bottom. As can be seen, the buffer occupancy shoots up soon after the losses start to occur.
  • the maximum possible transmission rate is 50 (Kb/s).
  • TRAM manages to keep the rate oscillating between 30 and 60. The initial spike is bigger, the result of slow start when there is no hint what the possible maximum rate is. The subsequent improved performance is what we expected.
  • Fig. 6 shows results for the simulation network of Fig. 4 with up/down dynamics added to some links.
  • the link up/down dynamics clearly induce very periodic losses at the times the links turn off (every 3 seconds).
  • TRAM adapts quite well, except right after the first link down, when the losses induced a high cache occupancy. This is because a retransmission failed to get to the receiver behind the faulty link for quite a long time. After it overcame the initial difficulty, the rate oscillated between 30 and 60 as in the static network case. The time of completing the test is only marginally longer than the static network case.
  • Receivers joining the multicast group after data transmission has started have two options for recovering data previously sent:
  • a repair head typically has at least the last 50 packets sent, in its cache.
  • End Of Transmission: Receivers must be able to determine when the session has completed, to ensure they have received all of the data before exiting.
  • end of transmission is signaled throughout the multicast group.
  • the sender notifies all members of session completion with a beacon packet that has the TXDONE flag set. This packet also includes the sequence number of the last data packet sent. The sender transmits this packet periodically until all of its immediate members acknowledge the receipt of all packets sent. The sender can then exit.
  • When a member receives the beacon packet with the TXDONE flag set, it immediately sends an ACK message to its repair head indicating whether it has received all the packets transmitted or requires more retransmissions.
  • TRAM notifies the application when it receives all of the packets.
  • When a repair head receives the beacon packet with the TXDONE flag set, it communicates with its repair head just as a receiver does. The repair head must wait for all of its members to acknowledge all packets before it can close the session. If a member requires retransmission, the repair head must retransmit all the packets required by its members prior to closing itself. If the beacon from the sender with the TXDONE flag set is received but one or more members have not acknowledged all packets, a Hello message is sent to these members with the same information contained in the beacon packet. Members receiving this Hello message must respond in the same way that they would if they received the beacon. If the repair head still doesn't hear from its members after sending the Hello, it retries several times. After a period of time it gives up on the member and removes it from the member list.
  • the repair head can close its session.
  • the source of a multicast stream can operate in a mode that is sensitive or insensitive to the data reception feedback from the receivers.
  • the drawbacks of being insensitive are a lack of response to network congestion and an inability to deliver the data to as many receivers as possible. Being sensitive overcomes these drawbacks but introduces a new one: the sender may operate at a rate slower than the application desires. To overcome this drawback, the multicast delivery system must support a pruning mechanism that enables receivers that do not meet the reception criteria to be isolated and removed from the repair mechanism. In TRAM, the reception characteristics of all the receivers are distributed knowledge.
  • TRAM adopts a collaborative pruning technique that involves the sender and all the repair heads in the system.
  • the technique requires the sender to orchestrate the pruning operation by providing a MinimumDataRate signal.
  • the signal is included in the header of multicast data and beacons sent by the sender.
  • the signal is set to OFF when no congestion is being reported from the sender.
  • the sender attempts to reduce the rate of transmission to accommodate the slow receivers.
  • the sender sets the MinimumDataRate signal ON when the sender is operating at the minimum rate specified by the application.
  • the MinimumDataRate signal informs repair heads in the distribution tree to prune any poorly performing receivers.
  • the repair heads may respond to receiving the MinimumDataRate signal by pruning members.
  • Pruned members can be members that are slow, members that are requesting excessive repairs or members that have become unresponsive as a result of a network partition or for some other reason.
  • the members that are pruned are notified of membership termination via the Hello-Unicast message.
  • the repair head may stop honoring repair requests from members that are pruned.
  • repair heads can independently perform the pruning operation (i.e., without a sender signal). This may result in premature pruning of the members, as the repair heads may not know whether or not the sender is operating at the configured minimum rate.
  • Communication in a computer internetwork involves the exchange of data between two or more entities interconnected by communication media.
  • the entities are typically software programs executing on hardware computer platforms, such as end stations and intermediate stations.
  • communication software executing on the end stations correlates and manages data communication with other end stations.
  • the stations typically communicate by exchanging discrete packets or frames of data according to predefined protocols.
  • a protocol in this context, consists of a set of rules defining how the stations interact with each other.
  • the hardware and software components of these stations generally comprise a communications network and their interconnections are defined by an underlying architecture.
  • Modern communications network architectures are typically organized as a series of hardware and software levels or "layers" within each station. These layers interact to format data for transfer between, e.g., a source station and a destination station communicating over the internetwork. Predetermined services are performed on the data as it passes through each layer and the layers communicate with each other by means of the predefined protocols.
  • Examples of communications architectures include the Internet Packet Exchange (IPX) communications architecture and, as described below, the Internet communications architecture.
  • the Internet architecture is represented by four layers which are termed, in ascending interfacing order, the network interface, internetwork, transport and application layers. These layers are arranged to form a protocol stack in each communicating station of the network.
  • In FIG. 7 there is illustrated a schematic block diagram of prior art Internet protocol stacks 12,125 and 12,175 used to transmit data between a source station 12,110 and a destination station 12,150, respectively, of an internetwork 12,100.
  • the stacks 12,125 and 12,175 are physically connected through a communications medium 12,180 at the network interface layers 12,120 and 12,160.
  • the protocol stack 12,125 will be described.
  • the lower layers of the communications stack provide Internetworking services and the upper layers, which are the users of these services, collectively provide common network application services.
  • the application layer 12,112 provides services suitable for the different types of applications using the internetwork, while the lower network interface layer 12,120 accepts industry standards defining a flexible network architecture oriented to the implementation of local area networks (LANs).
  • the network interface layer 12,120 comprises physical and data link sublayers.
  • the physical layer 12,126 is concerned with the actual transmission of signals across the communication medium and defines the types of cabling, plugs and connectors used in connection with the medium.
  • the data link layer is responsible for transmission of data from one station to another and may be further divided into two sublayers: Logical Link Control (LLC 12,122) and Media Access Control (MAC 12,124).
  • the MAC sublayer 12,124 is primarily concerned with controlling access to the transmission medium in an orderly manner and, to that end, defines procedures by which the stations must abide in order to share the medium. In order for multiple stations to share the same medium and still uniquely identify each other, the MAC sublayer defines a hardware or data link address called a MAC address. This MAC address is unique for each station interfacing to a LAN.
  • the LLC sublayer 12,122 manages communications between devices over a single link of the internetwork.
  • IP Internet protocol
  • TCP Transmission Control Protocol
  • Data transmission over the internetwork 12,100 therefore consists of generating data in, e.g., sending process 12,104 executing on the source station 12,110, passing that data to the application layer 12,112 and down through the layers of the protocol stack 12,125, where the data are sequentially formatted as a frame for delivery onto the medium 12,180 as bits. Those frame bits are then transmitted over an established connection of medium 12,180 to the protocol stack 12,175 of the destination station 12,150 where they are passed up that stack to a receiving process 12,174. Data flow is schematically illustrated by solid arrows.
  • each layer is programmed as though such transmission were horizontal. That is, each layer in the source station 12,110 is programmed to transmit data to its corresponding layer in the destination station 12,150, as schematically shown by dotted arrows. To achieve this effect, each layer of the protocol stack 12,125 in the source station 12,110 typically adds information (in the form of a header) to the data generated by the sending process as the data descends the stack.
  • the internetwork layer encapsulates data presented to it by the transport layer within a packet having a network layer header.
  • the network layer header contains, among other information, source and destination network addresses needed to complete the data transfer.
  • the data link layer encapsulates the packet in a frame, such as a conventional Ethernet frame, that includes a data link layer header containing information, such as MAC addresses, required to complete the data link functions.
  • the destination of a data frame ("message") issued by a source ("sender") may be more than one, but less than all, of the entities ("receivers") on a network; this type of multicast data transfer is typically employed to segregate communication between groups of receivers on the network.
  • IP multicasting in particular, may be used to disseminate data to a large group of receivers on the network.
  • any number of data messages may be lost in transit due to errors or overloading of networking equipment. Ensuring that each receiver/member of a multicast group has received all of the data messages is difficult for a single sender to determine once the group is of any size, since messages from each member to the sender can overload the sender.
  • One approach to providing scalable reliable multicasting is to organize the receivers into a tree structure so that each internal "node" of the tree is responsible for helping its subordinates recover any lost packets and communicating status back to the sender.
  • Many conventional algorithms exist for constructing such a tree. For example, reliable multicast protocols such as TMTP and RMTP build trees that are used for an entire data transfer session without optimization.
  • Lorax describes methods for generally enforcing member limits. After such a tree is constructed, it may be further optimized as network conditions change.
  • a sending process generally specifies a destination IP address that is a multicast address for the message.
  • Receiving processes typically notify their internetwork layers that they want to receive messages destined for the multicast address; this is called “joining a multicast group". These receiving members then "listen" on the multicast address and, when a multicast message is received at a receiver, it delivers a copy of the message to each process that belongs to the group. The result is that the message traverses each link between the sender and receivers only once.
  • a multicast flow occurs.
  • Flow and congestion control for multicast transport is a relatively new research topic. For minutes of an Internet Research Task Force meeting on this topic in September 1997, see http://www.east.isi.edu/RMRG/notes-revO.html.
  • flow and congestion control algorithms adaptively find an optimal (transmission) rate for a multicast flow, based on available bandwidth of all links involved in the transmission and the speed of all the receivers involved.
  • the flow and congestion control algorithm should exhibit some level of fairness in using the congested resources.
  • Adaptive control of transmission rate is based on feedback from the network, as is done in unicast flows.
  • a multicast flow tends to traverse more links and depend on the speed of more receivers than a unicast flow. This dependence on more resources makes the multicast flow control problem substantially more complicated than the case for a unicast flow.
  • the present invention is directed to an efficient flow and congestion control technique for multicast flows.
  • the present invention generally relates to a scalable, reliable multicast transport protocol (TRAM) that supports bulk data transfer with a single sender and multiple receivers of a computer internetwork, such as an intranet or Internet.
  • TRAM uses reliable multicast repair trees that are optimized to implement local error recovery and to scale to a large number of receivers without substantially impacting the sender.
  • the protocol includes a flow and congestion control technique that enables reliable, efficient and fair operation of TRAM with other protocols across a wide variety of link and entity characteristics of the computer internetwork.
  • TRAM is a tree based reliable multicast protocol.
  • TRAM enables applications requiring reliable multicast to be essentially free of transport related issues such as: reliable transmission of data between the sender and the receivers; direct interaction between the sender and receiver applications; and congestion control, ACK implosion and other scalability issues.
  • TRAM requires no prior knowledge of the receiver community. Also, scalability is non-trivial in former reliable multicast technology, and TRAM achieves this by dynamically grouping the tuned receiver community into hierarchical groups. Grouping enables TRAM to avoid the ACK/NACK implosion problem and to perform local repair operations.
  • the features of the invention include, for example: reliable multicast; single source to many receivers; scalable - ability to support a large receiver community; support local repair; support adaptive congestion control mechanisms to prevent network flooding; ordered data delivery; support unidirectional and multidirectional multicast environments during the initial building of the tree and for late joins, and reaffiliation during data transfer; control bandwidth used by multicast control messages during tree formation and data transfer; scalable up to a million receivers; late joins without data recovery; support for real-time data and resilient category of applications; and, unordered data delivery.
  • Introduction To Reliable Multicast: Multicasting provides an efficient way of disseminating data from a sender to a group of receivers.
  • the RM group forming within the IRTF has broadly classified the applications requiring multicast into the following categories:
  • the receivers in TRAM are dynamically grouped into hierarchical groups to form a tree 14,000.
  • the sender 14,002 is at the head of the tree.
  • the parent 14,012; 14,014; 14,016; 14,018; 14,020 of each respective group 14,012-1; 14,014-1; 14,016-1; 14,018-1; 14,020-1 is a repair head.
  • Data is multicast by the sender and all the receivers receive it.
  • the repair heads in the tree cache the received data messages.
  • the members of a group need not cache the data.
  • Caches 14,030-X are shown for each respective repair head.
  • the members send acknowledgments of receiving the data to the associated/affiliated head.
  • the heads can free the cached data messages upon receiving acknowledgments (ACK messages) from all the members.
  • NACK messages retransmission requests
  • the group heads retransmit the requested message with a local TTL scope that is large enough to reach all its members.
  • TRAM dynamically organizes the tuned receiver community into multi-level hierarchical groups 15,002; 15,004; 15,006; 15,008; 15,010; 15,012; 15,014 named RxGroups.
  • Every RxGroup comprises a group head known as RxGroup-head and a configurable number of group members known as RxGroup-members.
  • For example, for RxGroup 15,014, the group head 15,014-H and members 15,014-1; 15,014-2; 15,014-3 are shown.
  • the transport supporting the sender 15,020 application is by default a RxGroup-head.
  • a RxGroup in which the transport supporting the sender operates as the group head is known as a Primary-RxGroup, and all the rest as Secondary-RxGroups.
  • a RxGroup-Member of one RxGroup can in turn play the role of a group head to its lower level RxGroup, as member 15,008-1 of group 15,008 is a group head for group 15,010.
  • a RxGroup-head is primarily responsible for caching the sent/received multicast data to participate in local repair/retransmission operations.
  • Multicast messages received by the RxGroup-members are acknowledged with the aid of unicast ACK messages.
  • the ACK messages are sent to the respective RxGroup-heads to distribute and overcome the ACK implosion problem.
  • the ACK reporting is done using a window mechanism.
  • a receiver node which is not part of any RxGroup or is in the process of affiliating to a RxGroup is known as a RxNode, as illustrated by RxNode 15,030.
  • the HState information is maintained and advertised by TRAMs that are currently performing the role of a RxGroup-head. This information is included in the RxGroup Management messages that are multicast with a local scope.
  • the RxGroup-members use the HState information to decipher the current state of a RxGroup-head in the neighborhood.
  • the different HState states are:
  • In Fig. 10 there is shown an HState transition diagram 16,000.
  • the Accepting_Members state 16,002 indicates that the RxGroup-head has the potential of accepting new RxGroup-members.
  • the Not_Accepting_Members state 16,004 means the opposite.
  • the Resigning state 16,006 means that the RxGroup-head is in the process of giving up the RxGroup-head role and is indicating to its dependent RxGroup-members to re-affiliate to a different RxGroup-head.
  • Re-affiliation of RxGroups is triggered when a group member decides that it wants to affiliate with a different head. This may occur because its old head is resigning or not responding, or because the member has discovered a better head (in terms of closeness).
  • a functioning head can typically resign when the user is attempting to exit out of the multicast group, or when the functioning head has determined itself to be redundant in the region. Detection of a better head and redundant heads in a region are made possible by reception and processing of various multicast control messages generated by the heads and members in a region. The various steps involved in the re-affiliation process are listed below:
  • a member decides to re-affiliate. It finds a head that it wants to re-affiliate to (by checking Hellos, HAs, or using MTHA).
  • the member uses the normal TRAM affiliation mechanisms to affiliate with the new head (sending a Head Bind and receiving an Accept Member or Reject Member). If this affiliation fails, it goes back to step 1 (finding a head).
  • Once the member has affiliated to the new head, it maintains its affiliation to its old head until it successfully receives all missing packets that are earlier than the starting sequence number of the packets that are guaranteed to be cached by the new head.
  • the new head reports the starting sequence number of the packets that will be cached via the AM message.
  • the member sends ACKs to both the old and the new heads (unless the old head is dead). If the new head becomes unresponsive during this interval, the member goes back to step 1 (finding a head). This interval is known as the Transition Interval.
  • If the member is itself a head, it continues to function as a head and is not allowed to accept new members.
  • the whole re-affiliation process is straightforward and simple when the re-affiliating member is not performing the role of a head.
  • certain additional checks have to be performed while selecting the new head so as to avoid forming malformed repair tree or loops.
  • a loop can be formed when a head higher up in the tree hierarchy re-affiliates with a head that is a descendent of itself.
  • TRAM avoids the loop formation by propagating a tree level information called RxLevel as part of the tree management information.
  • the sender is said to be at RxLevel 1
  • the heads that are members of the sender's RxGroup are said to be at RxLevel 2 and so on.
  • Loops are avoided by adopting the policy that a member performing the role of a head will not re-affiliate with any head whose RxLevel is equal to or greater than its own RxLevel. Further, a head that loses its own head and is unable to find a suitable head for more than 5 minutes is forced to resign. This is important, since members of an unaffiliated head are disconnected from the sender. They may not receive repairs and cannot provide congestion feedback.
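The RxLevel policy can be illustrated with a small predicate. This is a minimal sketch, assuming RxLevel is available as an integer (1 for the sender, larger values deeper in the tree); the function name and its parameters are hypothetical and are not part of TRAM's interface.

```python
def may_reaffiliate(own_rx_level, candidate_rx_level, is_head):
    """Return True if re-affiliating with the candidate head cannot create a loop.

    RxLevel 1 is the sender; larger values are deeper in the repair tree.
    """
    if not is_head:
        # Assumption: a plain member has no descendants, so the RxLevel
        # restriction (which the text states for heads) does not apply.
        return True
    # A head only accepts a candidate strictly higher up the tree than itself.
    return candidate_rx_level < own_rx_level
```

For example, a head at RxLevel 3 may re-affiliate with a head at RxLevel 2, but not with one at RxLevel 3 or 4, since the latter could be one of its own descendants.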
  • the Sender in a TRAM application multicasts data packets to all of the receivers in the multicast group.
  • the application calls the putPacket method (or write method for the stream interface) to queue up packets for transmission.
  • the output dispatcher sends the packets at the specified rate. Each packet is given a unique sequence number starting at one (1). Receivers use these numbers to detect out of order and missing packets.
  • When the sender receives a request for retransmitting a packet, it queues the requested packet up immediately. Retransmissions take priority over new data packets. Retransmitted packets are sent at the same rate as regular data packets from the sender. Each repair head computes the average data rate of all packets it receives and sends retransmissions at this rate.
  • TRAM Duplicate Retransmission Avoidance
  • TRAM sends the packet immediately for the first request. Subsequent requests are ignored if they are received within a chosen time interval of the first request; in an exemplary embodiment of the invention the chosen time interval is one (1) second. Occasionally many packets are queued up waiting for retransmission. If a new request for a packet is received and that packet is already on the transmit queue, the request is ignored.
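The two suppression rules above (ignore repeats within the interval, ignore requests for packets already queued) can be sketched as follows. The class and method names are hypothetical, and the caller is assumed to supply the current time in seconds; this is an illustration of the rules, not the protocol's actual data structures.

```python
class RetransmitFilter:
    """Decides whether a repair request should be queued (hypothetical helper)."""

    def __init__(self, suppress_seconds=1.0):
        self.suppress_seconds = suppress_seconds
        self.honoured_at = {}  # sequence number -> time the last request was honoured
        self.queued = set()    # sequence numbers waiting on the transmit queue

    def should_queue(self, seq, now):
        if seq in self.queued:
            return False  # already waiting for retransmission
        last = self.honoured_at.get(seq)
        if last is not None and now - last < self.suppress_seconds:
            return False  # repeat request inside the suppression interval
        self.honoured_at[seq] = now
        self.queued.add(seq)
        return True

    def mark_sent(self, seq):
        """Call once the packet has left the transmit queue."""
        self.queued.discard(seq)
```

A request for packet 7 at time 0.0 is queued; a second request at 0.5 seconds is ignored even if the packet has already been sent, while a request at 1.5 seconds is honoured again.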
  • each member sends an ACK when packets 32, 64, and 96 arrive.
  • each member selects a random packet between 1 and the ACK window to start sending ACK messages.
  • the start point for sending ACK might be packet 10.
  • the member sends an ACK message when packet 10 (or greater) arrives, another at packet 42, the third at packet 74, and so on.
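The staggering of ACKs can be sketched as below, assuming an ACK window of 32 packets as the example numbers suggest. The function name and the injectable random source are illustrative assumptions.

```python
import random


def ack_points(ack_window, last_seq, rng=random):
    """Sequence numbers at which one member sends its ACK messages."""
    # Each member picks a random starting packet inside the first window,
    # so members of a group do not all acknowledge at the same time.
    start = rng.randint(1, ack_window)
    return list(range(start, last_seq + 1, ack_window))
```

With a start point of 10 and a window of 32, the member acknowledges at packets 10, 42 and 74, matching the example in the text.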
  • Each ACK message contains a start sequence number and a bit mask length. If no packets were missing the bit mask length is zero (0) and the sequence number indicates that all packets prior to and including this packet were successfully received.
  • the repair head saves this information and uses it to remove packets from its cache.
  • the start sequence number indicates the first missing packet.
  • a bit mask must follow. Each bit in the mask represents a packet sequence number starting with the start sequence number. If the bit is set, that packet is missing and must be retransmitted. A bit mask length indicates how many valid bits are present.
  • When the repair head receives an ACK message with a missing packets bit mask, the sequence number specified minus one (1) is saved for this member. This indicates that all packets prior to this sequence number have been received successfully. The repair head then scans the bit mask looking for missing packets. It immediately places these packets onto the transmit queue unless they have recently been retransmitted or are already on the queue from another request.
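The ACK contents described above can be modelled in a few lines. The sketch represents the bit mask as a list of 0/1 values rather than a packed wire format, which the text does not specify; both function names are hypothetical.

```python
def build_ack(highest_seq, missing):
    """Return (start_seq, bits) for an ACK message.

    highest_seq: highest sequence number the member has seen.
    missing: set of sequence numbers <= highest_seq not yet received.
    """
    if not missing:
        # Zero-length mask: every packet up to and including highest_seq arrived.
        return highest_seq, []
    start = min(missing)  # first missing packet
    bits = [1 if seq in missing else 0 for seq in range(start, highest_seq + 1)]
    return start, bits


def parse_ack(start_seq, bits):
    """Return (acked_through, to_retransmit) as a repair head would derive them."""
    if not bits:
        return start_seq, []
    to_retransmit = [start_seq + i for i, bit in enumerate(bits) if bit]
    # Everything before the first missing packet is known to be received.
    return start_seq - 1, to_retransmit
```

For instance, a member that has seen up to packet 64 but is missing 60 and 62 reports start sequence number 60 with a five-bit mask; the repair head records 59 as acknowledged and queues 60 and 62 for retransmission.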
  • the sender maintains a data rate between a minimum and maximum specified rate. The rate is increased every two (2) ACK windows and decreased for each new congestion report. If the sender's data cache fills up, the sender stops sending new data until it can reduce its cache below the high water mark.
  • the actual rate scheduler is implemented as follows. When the application places a packet on the transmit queue, the output dispatcher sends the packet on the multicast socket. It then computes the amount of time to delay in order to achieve the desired data rate. The delay is computed with the formula:
  • the overhead in processing the packet is subtracted from this delay.
  • the output dispatcher then sleeps for the calculated period and the cycle continues.
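The delay formula itself is not reproduced in this text. The sketch below therefore assumes the simplest form consistent with the description, namely packet size divided by the desired data rate, less the measured processing overhead; the function name, parameters and units are assumptions.

```python
def send_delay(packet_bytes, rate_bytes_per_sec, overhead_sec=0.0):
    """Seconds the output dispatcher sleeps after sending one packet."""
    if rate_bytes_per_sec <= 0:
        raise ValueError("rate must be positive")
    # Assumed form of the elided formula: time one packet occupies at the target rate.
    delay = packet_bytes / rate_bytes_per_sec
    # The overhead of processing the packet is subtracted; never sleep a negative time.
    return max(0.0, delay - overhead_sec)
```

Under this assumption, a 1400-byte packet at 14,000 bytes per second yields a delay of 0.1 seconds before overhead is subtracted.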
  • the initial data rate starts at 10% of the maximum or the minimum rate if that is greater. Every two (2) ACK windows this rate is increased another 10% of the maximum data rate. This process continues until the maximum rate is reached or congestion causes the rate to decrease.
  • Congestion reports from the receivers cause the data rate to drop 25%. After congestion the rate increments are more conservative in an attempt to alleviate the congestion.
  • the new rate increment is computed from the previous rate increment value as follows:
  • This algorithm allows the data rate to increment quickly back to the point where congestion was reported.
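The slow-start and back-off behaviour described above can be sketched as a small controller. The formula for the post-congestion increment is not reproduced in this text, so the sketch takes it as a caller-supplied function (`shrink_increment`, a hypothetical name); the class and method names are likewise illustrative.

```python
class RateController:
    """Sender data-rate control, following the percentages given in the text."""

    def __init__(self, min_rate, max_rate, shrink_increment):
        self.min_rate = min_rate
        self.max_rate = max_rate
        # Start at 10% of the maximum, or at the minimum rate if that is greater.
        self.rate = max(0.10 * max_rate, min_rate)
        self.increment = 0.10 * max_rate
        # Stand-in for the increment formula, which the text does not reproduce.
        self.shrink_increment = shrink_increment

    def on_two_ack_windows(self):
        """Called every two ACK windows: step the rate up, capped at the maximum."""
        self.rate = min(self.max_rate, self.rate + self.increment)

    def on_congestion(self):
        """Called for each new congestion report: drop the rate by 25%."""
        self.rate = max(self.min_rate, 0.75 * self.rate)
        self.increment = self.shrink_increment(self.increment)
```

As a usage illustration only, `RateController(5, 100, lambda inc: inc / 2)` starts at 10, rises to 20 after two ACK windows, falls to 15 on a congestion report and then climbs in steps of 5; the halving rule here is a placeholder, not the patent's formula.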
  • Congestion Control: Congestion is detected at the receivers and repair nodes. Receivers detect and report congestion based on missing packets. Repair heads detect and report congestion based on their cache content.
  • Rate Based Congestion Detection: Receivers detect and report congestion when the number of outstanding missing packets between two ACK windows increases. For example:
  • a congestion message is sent to its repair head.
  • the congestion message contains the highest sequence number received.
  • the repair head determines whether this is a new congestion report and if so, forwards it immediately up to its repair head.
  • Each head will forward one congestion packet from its members for each ACK window.
  • the head computes the ACK window from the sequence number specified in the congestion message with the formula:
  • the repair head will send one congestion message up the tree for each ACK window. Once a congestion message has been forwarded up the tree, congestion reports for previous ACK windows will be ignored. The sender will also ignore any congestion messages for the same or earlier windows.
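The one-report-per-ACK-window rule can be sketched as follows. The formula for deriving the ACK window from the sequence number is not reproduced in this text; the sketch assumes integer division of the sequence number by the ACK window size. The class and method names are hypothetical.

```python
class CongestionForwarder:
    """Forwards at most one congestion message per ACK window (illustrative)."""

    def __init__(self, ack_window):
        self.ack_window = ack_window
        self.last_forwarded_window = -1

    def window_of(self, seq):
        # Assumed form of the elided formula: integer division by the window size.
        return seq // self.ack_window

    def should_forward(self, seq):
        window = self.window_of(seq)
        if window <= self.last_forwarded_window:
            return False  # same or earlier window has already been reported
        self.last_forwarded_window = window
        return True
```

With an ACK window of 32, a report carrying sequence number 70 is forwarded, later reports carrying 90 or 40 are dropped, and a report carrying 100 is forwarded as a new window.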
  • Repair heads also generate congestion messages when their data cache begins to fill up. Each head maintains a low and a high water mark on its cache. When the number of packets in the cache reaches the high water mark, an attempt is made to purge the cache back to the low water mark. If it can't purge the cache below the high water mark because a member has not acknowledged these packets, a congestion message is forwarded up the repair tree. In this situation the repair head increases its high water mark to the current value plus the number of packets in an ACK window. The repair head performs the same test when this new threshold is reached.
  • the sequence number in a congestion message generated by the repair cache is the highest sequence number the head has received.
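The water-mark behaviour can be sketched with a simple in-memory cache. The sketch assumes packets are cached in sequence order and that the head tracks a single value, the highest sequence number acknowledged by every member; the class, its fields and the `add` method are hypothetical names.

```python
class RepairCache:
    """Repair-head cache with low/high water marks (illustrative sketch)."""

    def __init__(self, low_mark, high_mark, ack_window):
        self.low_mark = low_mark
        self.high_mark = high_mark
        self.ack_window = ack_window
        self.packets = []  # cached sequence numbers, oldest first

    def add(self, seq, acked_by_all):
        """Cache a packet; return True if a congestion message should be sent.

        acked_by_all: highest sequence number acknowledged by every member.
        """
        self.packets.append(seq)
        if len(self.packets) < self.high_mark:
            return False
        # Purge toward the low water mark, dropping only fully acknowledged packets.
        while len(self.packets) > self.low_mark and self.packets[0] <= acked_by_all:
            self.packets.pop(0)
        if len(self.packets) < self.high_mark:
            return False
        # Could not get below the high mark: raise it by one ACK window and report.
        self.high_mark = len(self.packets) + self.ack_window
        return True
```

With marks of 2 and 4 and an ACK window of 3, caching four unacknowledged packets triggers a congestion message and moves the high water mark to 7; once members acknowledge, the next time the mark is reached the cache is purged back to two packets.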
  • When a receiver joins the multicast group after data transmission has started, it has two options in TRAM.
  • This packet also includes the last data packet sequence number sent. The sender transmits this packet periodically until all of its immediate members acknowledge the receipt of all packets sent. The sender can then close its session.
  • When a member receives the Beacon packet with the TXDONE flag set, it immediately sends an ACK message to its head indicating whether it has received all the packets transmitted or requires more retransmissions. The receiver returns a SessionDone Exception to the application when the application has received all the packets.
  • When the repair head receives the Beacon packet with the TXDONE flag set, it communicates with its head just as a receiver does. The head must wait for all of its members to respond with their final ACKs before it can close the session. If a member requires retransmission, the head must retransmit all the packets required by its members prior to closing itself. If the Beacon from the sender with the TXDONE flag set is received but one or more members do not respond with their final ACK message, a Hello message is sent to these members with the same information contained in the Beacon packet. Members receiving this Hello message must respond in the same way they would had they received the Beacon. If the head still doesn't hear from its members after sending the Hello, it retries several times. After a period of time it gives up on the member and removes it from the member list.
  • the head can close its session.
  • TRAM Operation: TRAM operation is best described by considering the TRAM protocol at the sender and at the receiver separately.
  • the sender application opens the TRAM session by specifying the transport profile.
  • the transport profile includes details such as the multicast address, port, minimum and maximum rates of transmissions, Transport mode, and various other protocol related parameters.
  • the sender TRAM after validating the transport profile, joins the multicast group.
  • Because the Transport mode in this case is SEND_ONLY, the transport assumes the role of Group-head and starts generating the sender-beacon to initiate the RxGroup formation process.
  • TRAM relies upon the application to decide when it is appropriate to start the multicast data transmission.
  • TRAM maintains information, such as the size of the tuned receiver community at any time, which can be polled by the application to make this decision.
  • the slow start mechanism involves starting the data transmission at a minimum rate and gradually increasing the data rate in steps until a suitable maximum rate is achieved.
  • the sender TRAM opts to transmit at the minimum rate so as to alleviate and allow repair operations to take place.
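  • As a rough illustration of the slow start described above, the following Python sketch ramps the rate from the minimum to the maximum in fixed steps. The function name `slow_start_rates`, the fixed additive `step`, and the rate units are assumptions made for this example; the text does not say how the step size is chosen.

```python
def slow_start_rates(min_rate, max_rate, step):
    """Yield successive sending rates of a stepwise slow-start ramp.

    min_rate/max_rate correspond to the rates in the transport profile;
    `step` is a hypothetical fixed increment (not specified in the text).
    """
    if not (0 < min_rate <= max_rate) or step <= 0:
        raise ValueError("need 0 < min_rate <= max_rate and step > 0")
    rate = min_rate
    while rate < max_rate:
        yield rate
        # Never overshoot the configured maximum.
        rate = min(rate + step, max_rate)
    yield max_rate


def rate_on_congestion(min_rate):
    """On a congestion report the sender falls back to the minimum rate."""
    return min_rate
```

  • For example, `list(slow_start_rates(100, 400, 150))` walks through 100, 250 and 400 before settling at the maximum.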
  • Sender TRAM provides no data rate guarantees other than attempting to hand over multicast data to the underlying network layer within the transmission framework specified by the application.
  • The rate at which data messages are handed over to the lower layer refers to the messages on the DataQ (that is, messages being transmitted for the first time) and does not take into account the messages that are being retransmitted.
  • The data message is encapsulated in a TRAM header message and is sent to the multicast group.
  • The TRAM header, among other things, includes a sequence number which enables the receiver TRAMs to order messages (if required) and to detect packet loss.
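  • A minimal sketch of how a receiver could use the sequence number to detect loss is shown below. The class `LossDetector`, its method names, and the assumption that sequence numbers increase by one without wrapping are illustrative choices, not details taken from the TRAM description.

```python
class LossDetector:
    """Track missing sequence numbers at a receiver (illustrative only)."""

    def __init__(self, first_seq=1):
        # Hypothetical starting sequence number; wrap-around is ignored.
        self.next_expected = first_seq
        self.missing = set()

    def on_data(self, seq):
        """Record an arriving message and return the sorted list of gaps."""
        if seq >= self.next_expected:
            # Everything between the expected number and seq was skipped.
            self.missing.update(range(self.next_expected, seq))
            self.next_expected = seq + 1
        else:
            # A late or retransmitted message fills an earlier gap.
            self.missing.discard(seq)
        return sorted(self.missing)
```

  • The returned list is what a member would ask its RxGroup-head to retransmit.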
  • The message is moved to the RetransQ.
  • the RxGroup-members use a window mechanism to acknowledge the receipt of the multicast messages.
  • the message on the RetransQ undergoes the state transition (described earlier) before being freed. If data cache usage is found to be above the high water mark, then the congestion control and analysis operation on the RetransQ is initiated to isolate and recover from the condition.
  • The RxGroup-members, upon detecting data loss, request retransmission from the RxGroup-head.
  • The sender performs the retransmissions using the local TTL scope. If a retransmission request is made for a message that has been released from the data cache, the sender informs the requester of the unavailability of the message via a Hello message. This is one of the rare occasions when the sender TRAM generates a Hello message.
  • The DATA_END sub-message type in the data message indicates the end of data transmission. Further, to enable the receivers to identify that the data transmission has ended, the sender continues to send a few sender-beacons with the data-transmission-complete bit set in the flag space. The sender-beacon also includes the sequence number of the last message. This enables receivers that may have missed the last message to request retransmissions from their heads. When all the data messages have been successfully received, the members can terminate their RxGroup membership. The RxGroup-heads have to stay on until every member acknowledges every message on the RetransQ. Optionally, the sender TRAM can be configured to remain active for a specified interval of time to gather certain statistics related to the multicast data transmission. The sender-beacon is used under this condition to maintain the RxGroup relationship.
  • the receiver application starts the TRAM session by specifying the transport profile.
  • The receiver TRAM, after validating the transport profile, joins the multicast group and stays idle until the sender-beacon is received.
  • Upon receiving the sender-beacon, the multicast data message, or a HA message from another node (as described hereinbelow), the receiver TRAM starts participating in the RxGroup-formation process.
  • a RxGroup-member intending to be a RxGroup-head can optionally cache the multicast data before actually assuming the role.
  • RxNode(s) can receive and store the multicast messages on their RetransQ but are not allowed to seek retransmissions until they are affiliated to a RxGroup-head.
  • the receiver TRAMs acknowledge the received multicast messages with the help of the ACK messages.
  • RxGroup-members willing to perform the role of the head can send HA messages.
  • the RxGroup-member starts performing the role of a head upon receiving the first AM (as further described hereinbelow) from a RxNode.
  • the Hello messages are initially generated using the extracted TTL from HB messages (as further described hereinbelow). If the multicast path to the member is not symmetric, then the TTL may not be appropriate.
  • The member will inform the head if the Hellos are not being received. In this case the head will have to go through a correction phase until the member indicates that the Hellos are being received.
  • Retransmissions are multicast by the head(s) with a TTL scope that is just enough to reach its farthest RxGroup-member.
  • The TTL value is maintained and updated by the RxGroup-head every time a new member is accepted.
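  • One way to picture the TTL bookkeeping described above is the short Python sketch below, in which the head keeps the largest TTL needed by any accepted member. The class `RetransScope`, the per-member `member_ttl` argument, and the 1 to 255 range check are assumptions for illustration.

```python
class RetransScope:
    """Keep the retransmission TTL just large enough for the farthest member."""

    def __init__(self):
        self.ttl = 0  # no members accepted yet

    def accept_member(self, member_ttl):
        """Update the scope when a new member is accepted; return the new TTL."""
        if not 1 <= member_ttl <= 255:
            raise ValueError("TTL must be in the range 1..255")
        # The scope only needs to grow when the new member is farther away.
        self.ttl = max(self.ttl, member_ttl)
        return self.ttl
```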
  • the TRAM at the receiver can be configured to continue or abort when late join or irrecoverable data loss is detected. If data loss is accepted, TRAM signals the event to the application.
  • Fig. 11 is a chart showing various messages and sub-messages used in TRAM.
  • Multicast management, or MCAST_MANAGEMENT messages have Sub-Message types as follows: BEACON; HELLO; HA Head Advertisement; and, MS Member Solicitation.
  • Multicast data, or MCAST_DATA, messages have the Sub-Message types: DATA, a TRAM data message; and DATA_RETXM, a data retransmission message.
  • Unicast messages, or UCAST_MANAGMENT messages have the Sub-Message types: AM Accept Membership Message; RM Reject Membership message; HELLO_Uni a hello message with an ACK request; ACK an acknowledge message; CONGESTION, a rate based congestion message; and, HB the TRAM head bind message.
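  • The message and sub-message hierarchy listed above can be summarized as a lookup table. The Python sketch below uses the names from the text as plain strings, because the numeric codes carried in the MType and Sub-Type fields are not given here; the table name `SUB_TYPES` and the helper `is_valid` are hypothetical.

```python
# Message type -> permitted sub-message types, as listed for Fig. 11.
# Names are kept as strings; the on-the-wire numeric codes are not assumed.
SUB_TYPES = {
    "MCAST_MANAGEMENT": {"BEACON", "HELLO", "HA", "MS"},
    "MCAST_DATA": {"DATA", "DATA_RETXM"},
    "UCAST_MANAGMENT": {"AM", "RM", "HELLO_Uni", "ACK", "CONGESTION", "HB"},
}


def is_valid(mtype, subtype):
    """Return True if `subtype` is a listed sub-message of `mtype`."""
    return subtype in SUB_TYPES.get(mtype, set())
```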
  • Fig. 12 is a table showing the timers used by TRAM.
  • the T_BEACON timer is the inter beacon message interval timer, and in an exemplary embodiment of the invention is set to 1,000 milliseconds (ms).
  • The T_BEACON_FILLER timer is the inter beacon filler interval timer, and in an exemplary embodiment of the invention is set to 30 seconds (sec).
  • the T_ACK_INTERVAL is computed at run time based on the current rate of data transmission and the size of the configured acknowledgment window.
  • The T_HELLO timer is the inter Hello interval timer, and in an exemplary embodiment of the invention is set to one (1) per ACK interval.
  • the T_MS timer is the inter MS interval timer, and in an exemplary embodiment of the invention is set to 500 milliseconds (ms).
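  • The fixed timer values above, together with one plausible way of deriving T_ACK_INTERVAL, are sketched below in Python. The text only says that the ACK interval depends on the current data rate and the acknowledgment window size; the specific formula (time needed to send one window of packets), the `packet_bytes` parameter, and the function name are assumptions.

```python
# Exemplary timer values from the text, in milliseconds.
T_BEACON_MS = 1_000
T_BEACON_FILLER_MS = 30_000
T_MS_MS = 500


def ack_interval_ms(ack_window_packets, rate_bytes_per_s, packet_bytes):
    """Hypothetical T_ACK_INTERVAL: time to transmit one ACK window.

    Assumes the window is counted in packets and the rate in bytes/second.
    """
    if min(ack_window_packets, rate_bytes_per_s, packet_bytes) <= 0:
        raise ValueError("all arguments must be positive")
    packets_per_s = rate_bytes_per_s / packet_bytes
    return 1000.0 * ack_window_packets / packets_per_s
```

  • Under these assumptions, T_HELLO would simply reuse the same value, since one Hello is sent per ACK interval.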
  • Fig. 13 is a table showing counters used by TRAM.
  • N_ACK_MISSES is the number of ACK messages that can be missed before a head declares the member as non-responsive, and in an exemplary embodiment of the invention is set to a value of four (4).
  • N_HELLO_MISSES is the number of HELLO messages that a member has missed for the member to declare the head as non-responsive, and in an exemplary embodiment of the invention is set to a value of five (5).
  • N_HB_RETXM is the number of times an HB (head bind) message can be sent before the member tries another head, and in an exemplary embodiment of the invention is set to a value of three (3).
  • N_MS_RETXM is the number of times a MS member solicitation message needs to be sent before a head increases its TTL.
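  • The counters above can be read as simple miss thresholds. The Python sketch below shows how a head might apply N_ACK_MISSES to a member; the class `MemberLiveness` and the choice to declare the member non-responsive once the miss count reaches the limit (rather than exceeds it) are assumptions.

```python
# Exemplary counter values from the text.
N_ACK_MISSES = 4
N_HELLO_MISSES = 5
N_HB_RETXM = 3


class MemberLiveness:
    """Count consecutive missed ACKs for one member (illustrative only)."""

    def __init__(self, limit=N_ACK_MISSES):
        self.limit = limit
        self.misses = 0

    def ack_received(self):
        # Any ACK resets the consecutive-miss count.
        self.misses = 0

    def ack_missed(self):
        """Record a missed ACK; return True once the member is non-responsive."""
        self.misses += 1
        return self.misses >= self.limit
```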
  • Fig. 14 through Fig. 21 give the fields of the different messages used in TRAM.
  • the figures conventionally show 32 bits (bits 0-31) horizontally as a word. Successive 32 bit words are shown vertically. Each word is divided into 8 bit bytes, although some fields occupy two 8 bit bytes, or sixteen bits; and some occupy all 32 bits of a word.
  • All of the messages have the first word having the four byte fields: Ver # giving the version number of the software; MType giving the message type; Sub-Type giving the message sub-type; and FLAGS giving eight 1 bit flags, to be described hereinbelow.
  • all messages have the "Length of the message in bytes" in the first two bytes of the second word.
  • the other fields of each message are selected for the particular message, as shown in Fig. 14-Fig. 23.
  • the fields marked "Reserved" have not been assigned to a function.
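  • The common part of every message described above (four one-byte fields in the first word, then a two-byte length at the start of the second word) can be sketched with Python's `struct` module. Network byte order, the function names, and the example field values are assumptions; the remaining bytes of the second word are message-specific and are not modeled.

```python
import struct

# Ver #, MType, Sub-Type, FLAGS (one byte each), then a 16-bit length.
# "!" selects network (big-endian) byte order, which is assumed here.
_COMMON_HEADER = struct.Struct("!BBBBH")


def pack_common_header(version, mtype, subtype, flags, length):
    """Pack the fields shared by all TRAM messages into 6 bytes."""
    return _COMMON_HEADER.pack(version, mtype, subtype, flags, length)


def unpack_common_header(data):
    """Return (version, mtype, subtype, flags, length) from a message."""
    return _COMMON_HEADER.unpack_from(data)
```

  • For instance, `pack_common_header(1, 2, 3, 0x80, 20)` produces six bytes whose last two encode the length 20.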
  • Fig. 14 is a block diagram showing a multicast packet format for a sender Beacon Message.
  • Fig. 15 is a block diagram showing a multicast packet format for a Data Message.
  • Fig. 16 is a block diagram showing a multicast packet format for an HA Message.
  • Fig. 17 is a block diagram showing a multicast packet format for an MS Message.
  • Fig. 18 is a block diagram showing a multicast packet format for a Hello Message.
  • Fig. 19 is a block diagram of a unicast packet format for an ACK Message.
  • Fig. 20 is a block diagram of a unicast packet format for a Hello Uni Message.
  • Fig. 21 is a block diagram of a unicast packet format for a Head Bind Message.
  • Fig. 22 is a block diagram of a unicast packet format for an Accept Membership Message.
  • Fig. 23 is a block diagram of a unicast packet format for a Reject Membership Message.
  • Fig. 24 through Fig. 29 give the FLAG field for the indicated messages.
  • the eight (8) bits of the flag field are shown as bit 7 through bit 0. Each bit is shown separately. Arrows lead to an explanation of the purpose of the bit.
  • the bits which are not labeled have not been assigned a function, and so are reserved.
  • the denial of service category is concerned with issues related to how best a particular service can be denied/obstructed.
  • the integrity of the data is not at stake in this case.
  • TRAM can be severely affected by this. This can be easily accomplished by a rogue application that can flood the network with bogus multicast packets and thereby hamper the normal TRAM operation. TRAM can do nothing to prevent this and the least that can be done is to generate an event to the application when such a condition is predicted. Prediction of this condition is non-trivial, but one possible condition, in an exemplary embodiment of the invention, can be when the sender is performing a lot of retransmissions to all the members.
  • the sender authentication category is concerned with issues related to how the receiver TRAM can be assured that the multicast data message is actually originating from the sender and not from a rogue application. Possible ways to perform sender authentication are to:
  • the receiver authentication category is concerned with issues related to how the sender TRAM can be assured that only the authorized receivers are receiving the multicast data.
  • Possible solutions take a roundabout approach, as there is little that can be done to stop a rogue application from eavesdropping.
  • The roundabout approach is to encrypt the data so that rogue applications may not be able to decode and use the data. This involves key management and distribution to decrypt the data. To overcome the problem of a rogue application succeeding in decrypting the key, the keys need to be changed frequently.
  • TRAM protocol itself does not support any of the mentioned possibilities, and the applications using the TRAM transport should incorporate a security layer above the TRAM transport.
  • TRAM does not support full late joins and is limited to the extent of the availability of the required messages in the cache.
  • support for full recovery can be provided and is expensive in terms of supporting a large cache for each multicast session.
  • TRAM cannot scale when there are not enough receiving TRAMs that can support the required data cache or perform head duties.
  • NACK messages are unicast to the heads, and there is no NACK suppression toward the head. Depending on the membership count, NACK messages can contribute significantly to congestion through NACK implosion when the same message is lost by all members.
  • TRAM relies on the upper layers to provide security.
  • When a RxGroup-member requests retransmission of a message that is not found in its head's cache (a message that has been aged out of the cache), the following happens in an exemplary embodiment of the invention. Typically this can occur when a new member is accepted.
  • the RxGroup-head using the HELLO message informs the member of the unavailability of the message.
  • the TRAM at the receiver informs the application of the loss of packet and moves on. The application upon processing the event can decide to continue or abort.
  • TRAM drops all nodes that cannot keep up with the sender operating at its assigned minimum transmission rate. Also, the TRAM response to the sender's network being unable to support the data rate, in an exemplary embodiment of the invention, is to drop all receivers.
  • the application knows when all of the data has been received as follows. When the sender application makes a socket close call, a DATA_END Message is sent and the "close" call does not complete until the head responsibilities are complete. At the member side, upon detecting the
  • a RxGroup-head including the sender performs the repair by multicasting the required message with a local scope.
  • the head sends a request for a retransmission from its head, and informs the member of the pending retransmission request. Sanity check on the validity of the message being requested has to be performed with the help of the sequence number before sending a retransmission request.
  • the Multicast and Unicast ports to be used are identified as follows.
  • the unicast port details are included in the management messages.
  • the RxNodes and the Heads at the time of affiliating can inform the unicast port number in use.
  • the TTL scope computed is assumed to be symmetric, and the scope is monitored and repaired as follows. Mechanisms are in place to detect and correct TTL problems in situations where a head's retransmissions are no longer being received. This is done with the aid of Hello and ACK messages.
  • the sender after performing a slow start, settles to transmitting data within the minimum and maximum rates specified.
  • When congestion is reported, the sender starts operating at the minimum rate specified by the application.
  • the sender sets a bit (Congestion bit) in the flag space of every beacon message or data message that it generates to inform receivers of the condition.
  • the congestion bit only serves to inform the receiver community that the sender is responding to congestion.
  • The sender sets another bit (Prune bit) in the flag space of the data message to instruct the heads to prune in order to keep up with the transmission rate.
  • The head may adopt a simple strategy of pruning the member that has the largest number of unacknowledged data packets. If multiple members have not acknowledged the same packets, all of those members are pruned. If multiple members have the same number of unacknowledged packets but the packets involved are different, then the member that has the oldest unacknowledged message is pruned. If some form of monitoring and analysis is supported, then the head can monitor the retransmission rates of the members, or validate whether a member has gone offline, etc., to shortlist the malfunctioning members. If the cache level does reach Threshold, an analysis of the shortlisted members is performed to pick the receiver that needs to be pruned. The monitoring process is stopped when the sender indicates that the congestion has evaporated.
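  • The simple pruning strategy described above can be sketched in Python as follows. The function name `select_members_to_prune`, the representation of each member's state as a set of unacknowledged sequence numbers, and the alphabetical tie-break when two members share the same oldest unacknowledged message are assumptions made for this example.

```python
def select_members_to_prune(unacked):
    """Pick members to prune from {member: set of unacked sequence numbers}.

    1. Consider only the members with the most unacknowledged packets.
    2. Among those, prefer the one whose oldest unacked message is oldest.
    3. Members missing exactly the same packets are pruned together.
    """
    lagging = {m: set(s) for m, s in unacked.items() if s}
    if not lagging:
        return set()
    worst = max(len(s) for s in lagging.values())
    tied = {m: s for m, s in lagging.items() if len(s) == worst}
    oldest = min(min(s) for s in tied.values())
    # Hypothetical tie-break: alphabetically first member with that oldest gap.
    first = min(m for m, s in tied.items() if min(s) == oldest)
    return {m for m, s in tied.items() if s == tied[first]}
```

  • A head would call this only after the Prune bit has been seen; members that have acknowledged everything are never selected.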
  • the sender application may choose to specify total size of the data that is being transferred and the duration within which the transfer has to take place (for example, a typical file transfer application).
  • The sender TRAM can use these additional parameters to determine the average rate (Avg_rate) of data transmission that needs to be achieved to complete the data transmission within the specified duration.
  • the sender starts the data transmission at Avg_rate and then attempts to increase the rate based on slow start mechanism. Whenever congestion is reported, the sender analyses the current data transmission status. If the data transmission thus far has been above the Avg_rate, then the data transfer byte count will be more than the Avg_rate byte count.
  • the sender can afford to suspend data transmission to enable the congestion to evaporate and allow retransmissions to take place.
  • the sender can afford to suspend the transmission until the breakeven byte count point, or when the surplus byte count becomes less than or equal to zero (0).
  • The surplus byte count is given by: surplus byte count = (data transfer byte count) - (Avg_rate × elapsed transmission time), where elapsed transmission time = (current time - start time).
  • the sender resumes data transmission at the Avg_rate and the data packets will have Prune bit set in the flag space.
  • The actions performed in response to the Prune bit are the same as described earlier.
  • the advantage of this approach is that the sender can approximately cater to the maximum number of receivers that meet the reception criteria specified by the application.
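  • The Avg_rate bookkeeping described above can be illustrated with a small Python sketch. It assumes that Avg_rate is the total size divided by the allowed duration, and that the surplus is the number of bytes sent so far minus what Avg_rate alone would have sent in the elapsed time; the function names and units (bytes, seconds) are hypothetical.

```python
def avg_rate(total_bytes, duration_s):
    """Average rate needed to finish the transfer within the duration."""
    return total_bytes / duration_s


def surplus_bytes(bytes_sent, rate, start_time, current_time):
    """Bytes sent beyond what the average rate would have delivered so far."""
    elapsed = current_time - start_time
    return bytes_sent - rate * elapsed


def suspend_seconds(bytes_sent, rate, start_time, current_time):
    """How long the sender can pause before the surplus drops to zero."""
    surplus = surplus_bytes(bytes_sent, rate, start_time, current_time)
    # With no surplus the sender cannot afford to suspend at all.
    return max(0.0, surplus / rate)
```

  • For a 1,000,000-byte transfer allowed 100 seconds, a sender that has sent 150,000 bytes after 10 seconds is 50,000 bytes ahead and, under these assumptions, could pause for 5 seconds.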

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention relates to a method for detecting congestion in a computer network that uses two acknowledgment windows. Congestion is detected when the number of missing messages increases, as indicated by the two windows, and this congestion is reported to a repair head station, which retransmits the missing messages and can reduce or increase the transmission rate. The same method is applied to multicast transmission by establishing communication with a plurality of repair head stations.
PCT/US1999/014541 1998-06-30 1999-06-28 Congestion control in reliable multicast protocol WO2000001123A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
AU47229/99A AU4722999A (en) 1998-06-30 1999-06-28 Congestion control in reliable multicast protocol
EP99930769A EP1018248A1 (fr) 1998-06-30 1999-06-28 Congestion control in reliable multicast protocol

Applications Claiming Priority (12)

Application Number Priority Date Filing Date Title
US9133098P 1998-06-30 1998-06-30
US60/091,330 1998-06-30
US33666999A 1999-06-18 1999-06-18
US33667199A 1999-06-18 1999-06-18
US09/336,670 US6526022B1 (en) 1998-06-30 1999-06-18 Detecting congestion by comparing successive loss of packets in windows to provide congestion control in reliable multicast protocol
US09/336,660 US6507562B1 (en) 1998-06-30 1999-06-18 Dynamic optimization for receivers using distance between a repair head and a member station in a repair group for receivers having a closely knit topological arrangement to locate repair heads near the member stations which they serve in tree based repair in reliable multicast protocol
US09/336,671 1999-06-18
US09/336,669 1999-06-18
US09/336,670 1999-06-18
US09/336,659 US6505253B1 (en) 1998-06-30 1999-06-18 Multiple ACK windows providing congestion control in reliable multicast protocol
US09/336,659 1999-06-18
US09/336,660 1999-06-18

Publications (1)

Publication Number Publication Date
WO2000001123A1 true WO2000001123A1 (fr) 2000-01-06

Family

ID=27557402

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1999/014541 WO2000001123A1 (fr) 1998-06-30 1999-06-28 Congestion control in reliable multicast protocol

Country Status (3)

Country Link
EP (1) EP1018248A1 (fr)
AU (1) AU4722999A (fr)
WO (1) WO2000001123A1 (fr)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001049004A1 (fr) * 1999-12-28 2001-07-05 Sun Microsystems, Inc. Use of beacon message in a network for classifying and discarding messages
EP1178627A1 (fr) * 2000-08-02 2002-02-06 Hitachi Europe Limited Multicast transmission method
WO2002017103A2 (fr) * 2000-08-21 2002-02-28 Intel Corporation Method and apparatus for preventing starvation in a multi-node architecture
EP1524808A2 (fr) 2003-10-18 2005-04-20 Samsung Electronics Co., Ltd. Transmission rate control in a mobile ad hoc network
US6920110B2 (en) 2001-02-14 2005-07-19 Microsoft Corporation System and method for transferring data over a network
EP1256212B1 (fr) * 2000-02-16 2006-08-30 Microsoft Corporation System and method for transferring data over a network
US7139815B2 (en) 2000-02-16 2006-11-21 Microsoft Corporation System and method for transferring data over a network
WO2017052393A1 (fr) * 2015-09-25 2017-03-30 Intel Corporation Efficient error control techniques for TCP-based multicast networks
EP2647154A4 (fr) * 2010-12-03 2017-09-13 Nokia Technologies Oy Intra-cluster D2D retransmission with instantaneous link adaptation and adaptive number of retransmitters

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0454364A2 (fr) * 1990-04-27 1991-10-30 AT&T Corp. High-speed transport protocol with two windows
EP0648062A2 (fr) * 1993-09-08 1995-04-12 AT&T Corp. Method for adaptive control of windows and rates in networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SANJOY PAUL ET AL: "RELIABLE MULTICAST TRANSPORT PROTOCOL (RMTP)", IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, vol. 15, no. 3, 1 April 1997 (1997-04-01), pages 407 - 420, XP000683937, ISSN: 0733-8716 *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6658004B1 (en) 1999-12-28 2003-12-02 Sun Microsystems, Inc. Use of beacon message in a network for classifying and discarding messages
WO2001049004A1 (fr) * 1999-12-28 2001-07-05 Sun Microsystems, Inc. Use of beacon message in a network for classifying and discarding messages
GB2373694B (en) * 1999-12-28 2004-05-12 Sun Microsystems Inc Use of beacon message in a network for classifying and discarding messages
GB2373694A (en) * 1999-12-28 2002-09-25 Sun Microsystems Inc Use of beacon message in a network for classifying and discarding messages
US7139815B2 (en) 2000-02-16 2006-11-21 Microsoft Corporation System and method for transferring data over a network
US7437428B1 (en) 2000-02-16 2008-10-14 Microsoft Corporation System and method for transferring data over a network
EP1256212B1 (fr) * 2000-02-16 2006-08-30 Microsoft Corporation System and method for transferring data over a network
EP1178627A1 (fr) * 2000-08-02 2002-02-06 Hitachi Europe Limited Multicast transmission method
CN1329856C (zh) * 2000-08-21 2007-08-01 英特尔公司 多节点体系结构中防止饥饿的方法和装置
WO2002017103A2 (fr) * 2000-08-21 2002-02-28 Intel Corporation Method and apparatus for preventing starvation in a multi-node architecture
WO2002017103A3 (fr) * 2000-08-21 2003-08-21 Intel Corp Method and apparatus for preventing starvation in a multi-node architecture
US7436771B2 (en) 2001-02-14 2008-10-14 Microsoft Corporation System for refining network utilization and data block sizes in the transfer of data over a network
US7325068B2 (en) 2001-02-14 2008-01-29 Microsoft Corporation Method and system for managing data transfer over a network
US6920110B2 (en) 2001-02-14 2005-07-19 Microsoft Corporation System and method for transferring data over a network
US7502849B2 (en) 2001-02-14 2009-03-10 Microsoft Corporation System for transferring data over a network
US7522536B2 (en) 2001-02-14 2009-04-21 Microsoft Corporation Method for transferring data over a network
EP1524808A2 (fr) 2003-10-18 2005-04-20 Samsung Electronics Co., Ltd. Transmission rate control in a mobile ad hoc network
EP1524808A3 (fr) * 2003-10-18 2011-10-05 Samsung Electronics Co., Ltd. Transmission rate control in a mobile ad hoc network
EP2647154A4 (fr) * 2010-12-03 2017-09-13 Nokia Technologies Oy Intra-cluster D2D retransmission with instantaneous link adaptation and adaptive number of retransmitters
WO2017052393A1 (fr) * 2015-09-25 2017-03-30 Intel Corporation Efficient error control techniques for TCP-based multicast networks
US10554429B2 (en) 2015-09-25 2020-02-04 Intel Corporation Efficient error control techniques for TCP-based multicast networks

Also Published As

Publication number Publication date
AU4722999A (en) 2000-01-17
EP1018248A1 (fr) 2000-07-12

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG US UZ VN YU ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW SD SL SZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 1999930769

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 1999930769

Country of ref document: EP

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

WWW Wipo information: withdrawn in national office

Ref document number: 1999930769

Country of ref document: EP