WO2005107205A1 - Media stream correlation

Info

Publication number: WO2005107205A1
Application number: PCT/EP2005/004561
Authority: WO
Inventors: John Robert Elwell; Andrew Mark Hutton; Karl Klaghofer; Thomas Stach
Original assignee: Siemens Plc
Priority date: 2004-04-29
Filing date: 2005-04-26
Publication date: 2005-11-10
Also published as: GB0409536D0; GB2413726A

Abstract

There is described a system for rendering a real time media stream to a calling device in a packet based network, when a call from said calling device has been forked to a plurality of destinations and each of said destinations has transmitted a real time media stream to said calling device in response to said call. The calling device receives a call set up message and the message signal includes an identifier identifying the correct real time media stream to be rendered at the user terminal. The calling device compares this identifier with the identifiers included in the real time media streams received at the device from the destinations, and renders to the device a media stream having an identifier that correlates with that included in the call set up message.

Description

Media Stream Correlation

This invention relates to a method of rendering a real time media stream to a calling device in a packet based network.

The Real-time Transport Protocol (RTP) is used to transmit encoded real-time media streams, for example, audio and video streams, across Internet Protocol (IP) packet-switched networks. An RTP packet comprises a payload and a header. The payload comprises an encoded sample of the medium concerned and the header comprises control information, for example, a payload type indicator, a time-stamp and a packet sequence number. RTP packets are typically transmitted at regular intervals, allowing the medium concerned to be rendered to the recipient in a continuous manner. RTP is described in detail in IETF RFC 3550.
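By way of illustration only, the header fields mentioned above can be read from the 12-byte fixed RTP header defined in RFC 3550, which also carries the synchronisation source (SSRC) identifier discussed later in this description. The following is a minimal sketch; the names `RtpHeader` and `parse_rtp_header` are hypothetical, and header extensions and CSRC lists are ignored.

```python
import struct
from typing import NamedTuple


class RtpHeader(NamedTuple):
    """Selected fields of the RFC 3550 fixed header (hypothetical helper type)."""
    version: int
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Parse the 12-byte fixed RTP header at the start of a packet."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    # Network byte order: two single bytes, a 16-bit sequence number,
    # a 32-bit timestamp and a 32-bit SSRC.
    first, second, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    version = first >> 6           # top two bits of the first byte
    if version != 2:
        raise ValueError("unsupported RTP version: %d" % version)
    payload_type = second & 0x7F   # low seven bits; the top bit is the marker
    return RtpHeader(version, payload_type, seq, timestamp, ssrc)
```

For example, a packet whose first twelve bytes are built with `struct.pack("!BBHII", 0x80, 0, 7, 160, 0xDEADBEEF)` parses to version 2, payload type 0, sequence number 7, time-stamp 160 and SSRC 0xDEADBEEF.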

The well known Session Description Protocol (SDP) describes multimedia sessions for the purpose of session announcement, session invitation and other forms of multimedia session initiation. SDP is used to communicate the existence of a media session to the terminals that are to participate in the session and to convey sufficient information to the terminals to allow them to participate in the media session. In particular SDP is used to communicate to the terminals information concerning which medium to send, which encoding method to use, what packet rate to use, which IP address to transmit to and which port at that IP address to transmit to. The SDP allows a terminal to indicate a session comprising one or more media streams that it wishes to receive. The SDP is described in detail in IETF RFC 2327.

The Session Initiation Protocol (SIP), which is described in RFC 3261, is a signalling protocol for setting up, managing and tearing down voice, video and other multi-media sessions in packet based networks. SIP is designed to handle only these aspects of communication; other protocols, such as the Real-time Transport Protocol (RTP) mentioned above, are used for the actual data transport.

A SIP network is typically composed of four types of logical SIP entities, namely, User Agents (UA), Proxy Servers, Redirect Servers and Registrars.

User Agents (UA) are endpoint entities that initiate and terminate SIP sessions by exchanging requests and responses. Typical devices that have a UA function in a SIP network include PCs and IP telephones. A proxy server is an intermediary entity that acts as both a server and a client for making requests on behalf of other clients. Requests are serviced either internally or by passing them on to other servers. A proxy server may receive requests and forward them to another server (called a next-hop server), which has more precise location information about the callee.

A redirect server is a server that accepts a SIP request, maps the SIP address of the called party into a new address and returns it to its client, typically a proxy server. Registration servers are continually kept updated on the current locations of users.

The primary function of proxy and redirect servers is call routing, that is, the determination of the set of servers to traverse in order to complete the call. A proxy or redirect server can use any means at its disposal to determine the 'next-hop' server, including executing programs and consulting databases. SIP is a text-based protocol partly modelled on HTTP. There are two types of SIP messages, namely, requests and responses.

Request messages defined in the protocol include: 'INVITE', which is used to initiate a session or change session parameters; 'ACK', which is used to confirm that a session has been initiated; and 'BYE', which is used to terminate a session.

In a simple session establishment between two User Agent devices, the first device (belonging to the calling user) sends an INVITE request containing a Uniform Resource Identifier (URI) identifying the desired called user with which the calling user wishes to communicate. SIP proxy servers route this request to the second device at which the requested called user can be found. The second device alerts the called user to the request to communicate, and when the called user answers, the second device sends a positive response to the request.

The initial request message carries an SDP session description known as an "offer" describing the session that the first device would like to receive. The response carries an SDP session description known as an "answer" describing the session the second device would like to receive.

As soon as the call is answered, the second device also starts transmitting RTP packets to the address and port of the first device indicated in the SDP offer for the medium concerned. Typically, RTP packets from the second device will reach the first device before the SIP response message, since the latter travels through one or more SIP proxies, whereas the RTP packets do not. In the case of an audio media stream, the called user may start to speak almost immediately (e.g., by stating his or her name), and therefore it is important for the calling device to render received audio to the calling user as soon as possible, even before the SIP response message arrives. This helps to avoid an undesirable phenomenon known as "speech clipping", whereby some of the greeting is lost.

If more than one device is associated with the URL of a called user indicated in a SIP request, a SIP proxy processing the request can multi-cast the request to each of the devices associated with the URL in parallel. This is known as "forking". For example, a SIP request may be forked to a called user's fixed phone and cell phone. The called user may answer on any of these devices, in which case the requests to the other devices are cancelled. The calling device will normally receive RTP packets from the answering device followed by a SIP response message from that device.

A problem may arise when two devices to which a SIP request is forked are both used to answer the call at approximately the same time, for example, if a SIP request is forked to a called user's mobile phone and fixed phone, and the called user answers the mobile phone at about the same time as somebody else answers the fixed phone.

The proxy that forked the request will select the device from which it first receives a SIP response message indicating answer and will cancel the request for the other device. The calling device will receive two different streams of RTP packets, one from each of the two devices, although one stream will cease as a result of the proxy cancelling the request at the device concerned.
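The selection behaviour just described can be sketched as follows. This is an illustrative model only, not an implementation of a SIP proxy: the class name `ForkedRequest`, the method `on_answer` and the string branch labels are all hypothetical, and no SIP messages are actually built or sent.

```python
from typing import List, Optional


class ForkedRequest:
    """Tracks one forked request at the proxy (hypothetical model)."""

    def __init__(self, branches: List[str]) -> None:
        self.pending: List[str] = list(branches)  # destinations still being tried
        self.accepted: Optional[str] = None       # destination whose answer was accepted

    def on_answer(self, branch: str) -> List[str]:
        """Handle an answer response from a branch.

        Returns the branches whose requests should now be cancelled.
        An answer that arrives after another has been accepted, or from
        an unknown branch, changes nothing.
        """
        if self.accepted is not None or branch not in self.pending:
            return []
        self.accepted = branch
        to_cancel = [b for b in self.pending if b != branch]
        self.pending = []
        return to_cancel
```

With branches "fixed" and "mobile", an answer from "mobile" arriving first is accepted and "fixed" is returned for cancellation; a later answer from "fixed" is not accepted.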

The calling device can recognise that there are two different media streams, because of two different values in the "synchronisation source" (SSRC) identifier in the received RTP packets. It is important that the calling device renders to the user the RTP stream from the device whose SIP response message is accepted by the forking proxy.

The calling device might choose to render to the calling user the RTP stream that arrives first and to discard the other RTP stream when it arrives. On many occasions, this will be the correct choice because the first received data stream originates from the device whose SIP response message is accepted by the forking proxy. In such circumstances, the second received RTP stream will stop after a few packets.

However, occasionally this will be the wrong choice, because the first received media stream at the calling device does not originate from the device whose SIP response message is accepted by the forking proxy. In such circumstances, the calling device will find that the rendered stream stops after a few packets, in which case it should start rendering the other stream. Detecting that a stream has stopped can be a relatively long process (perhaps two or more packet inter-arrival intervals) because of the jitter (delay variation) occurring in the IP network. During this time a significant burst of speech from the non-selected answering user will have been rendered and speech from the selected answering user will have been discarded. In effect, this is speech clipping aggravated by the rendering of speech from another source. This is referred to as "aggravated speech clipping".
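The cost of this known approach can be illustrated with a minimal sketch of a silence detector. The function name `stream_has_stopped` and its parameters are hypothetical; the default threshold of two inter-arrival intervals follows the figure given above, while the 20 ms packet interval used in the example is an assumption and is not taken from this description.

```python
def stream_has_stopped(
    last_arrival_s: float,
    now_s: float,
    packet_interval_s: float,
    missed_intervals: float = 2.0,
) -> bool:
    """Return True once no packet has arrived for the given number of intervals.

    The receiver cannot declare the stream stopped any earlier, because a
    packet delayed by network jitter would look the same as a missing one.
    """
    silence_s = now_s - last_arrival_s
    return silence_s >= missed_intervals * packet_interval_s
```

Under the assumed 20 ms packet interval, the wrong stream would continue to be treated as live for at least 40 ms after its last packet, during which time packets of the correct stream are discarded.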

The present invention aims to alleviate this problem.

According to the invention there is provided a method of rendering a real time media stream to a calling device in a packet based network, when a call from said calling device has been forked to a plurality of destinations and each of said destinations has transmitted a real time media stream to said calling device in response to said call; the method comprising: receiving a call set up message at the user terminal, the message signal including an identifier identifying a correct real time media stream to be rendered at the calling device; comparing at the calling device the identifier included in the message signal with identifiers included in real time media streams received at the calling device from the destinations; and rendering to the calling device a media stream having an identifier that correlates with that included in the call set up message.

The above and further features of the invention are set forth with particularity in the appended claims and together with advantages thereof will become clearer from the consideration of the following detailed description of exemplary embodiments of the invention given with reference to the accompanying drawings, in which:

Figure 1 illustrates a packet based network and messages exchanged between devices in the network in an embodiment of the invention.

Referring now to Figure 1 of the drawings, a packet switched network 1 comprises a first SIP end point 2, a second SIP end point 3, a third SIP end point 4 and a proxy server 5. A first SIP user (not shown) uses the first SIP end point 2 to try to set up a voice call with a second SIP user (not shown). The first SIP end point 2 sends an Invite message containing the URL address of the second SIP user and an SDP offer containing the media capabilities of the first SIP end point 2. The Invite message is received at the proxy 5, which determines that both the second SIP end point 3 and the third SIP end point 4 are associated with the URL identified in the Invite message. The proxy 5 thus 'forks' the Invite message, transmitting the Invite message to the second SIP end point 3 and to the third SIP end point 4. As per normal for SIP session initiation, the proxy 5 also transmits a 'Trying'/100 message back to the first SIP end point 2 and likewise, both the second 3 and the third 4 SIP end points transmit 'Trying'/100 messages back to the proxy 5 following their receipt of the Invite message.

Furthermore, as is standard, each of the second 3 and third 4 SIP end points starts to 'Ring' to announce the call and each transmits a 'Ringing'/180 response to the proxy 5, which in turn forwards these responses to the first SIP end point 2. Following the reception of the first of these 'Ringing'/180 responses at the first SIP end point 2, a ringing tone is also audibly presented to the user of the end point 2.

In this example, the second SIP end point 3 is answered before the third SIP end point 4 is answered. When the second SIP end point 3 is answered (e.g. when a user picks up the receiver if the end point is an IP phone) a SIP 'OK'/200 response is transmitted from the second end point 3. This 'OK'/200 response includes an SDP answer which as per normal identifies the media capabilities of the second SIP end point 3, but which in addition, and unlike in current systems, also includes the "Synchronisation Source" (SSRC) identifier that is to be used for the RTP media stream that will subsequently emanate from the second SIP end point 3 during the requested session. SDP has the ability to add additional "attributes" that are not specified in RFC 2327. Such an attribute could be defined to carry the SSRC identifier.
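As an illustration of how such an attribute might be read at the calling device, the following minimal sketch extracts an SSRC identifier from an SDP answer. This description does not name the attribute or fix its syntax, so the attribute name `x-ssrc`, its decimal value format, the function `extract_ssrc` and the example addresses and ports are all assumptions made for the purpose of the example.

```python
from typing import Optional

# Hypothetical attribute name; the description only says that such an
# attribute "could be defined" and does not specify one.
SSRC_ATTRIBUTE = "x-ssrc"


def extract_ssrc(sdp: str) -> Optional[int]:
    """Return the SSRC carried in the assumed attribute line, or None if absent."""
    prefix = "a=" + SSRC_ATTRIBUTE + ":"
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            ssrc = int(line[len(prefix):].strip())
            if not 0 <= ssrc <= 0xFFFFFFFF:
                raise ValueError("SSRC must fit in 32 bits")
            return ssrc
    return None


# Example SDP answer using documentation-range addresses (illustrative only).
EXAMPLE_ANSWER = "\r\n".join([
    "v=0",
    "o=callee 2890844527 2890844527 IN IP4 192.0.2.4",
    "s=-",
    "c=IN IP4 192.0.2.4",
    "t=0 0",
    "m=audio 49172 RTP/AVP 0",
    "a=rtpmap:0 PCMU/8000",
    "a=x-ssrc:3735928559",
]) + "\r\n"
```

Calling `extract_ssrc(EXAMPLE_ANSWER)` yields 3735928559 (0xDEADBEEF), which the calling device can then compare with the SSRC field of incoming RTP packets.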

As is known, the SSRC identifier is carried in the header of each RTP packet in a given session, and is a random 32-bit number that is required to be globally unique to that particular RTP session.

After the second SIP end point 3 has transmitted its 'OK'/200 response containing the SDP answer, the SIP end point 3 begins transmitting its RTP audio stream to the first SIP end point 2. Each of the packets in this stream contains the same SSRC identifier as that included in the second SIP end point's 'OK'/200 response.

The third SIP end point 4 is answered after the second SIP end point 3. It too transmits a SIP 'OK'/200 response which includes an SDP answer identifying the media capabilities of the third SIP end point 4, and an SSRC identifier that is to be used for the RTP media stream that will emanate from the third SIP end point 4 during the requested session. Naturally, this SSRC identifier is different from that of the second SIP end point 3.

In this example, the media stream from the second SIP end point 3 arrives at the first SIP end point 2 before the media stream from the third SIP end point 4 does. Initially therefore, the media stream from the second SIP end point 3 is rendered by the first SIP end point 2, whilst the media stream from the third SIP end point 4 is ignored by the first SIP end point 2 when it arrives.

However, the SIP 'OK'/200 response from the third SIP end point 4 arrives at the proxy 5 before the SIP 'OK'/200 response from the second SIP end point 3 does. The proxy 5 accepts the SIP 'OK'/200 response from the third SIP end point 4, and forwards it to the first SIP end point 2. The proxy 5 then transmits a 'Cancel' message to the second SIP end point 3, to cancel the session request. On reception of the SIP 'OK'/200 response from the third SIP end point 4, the first SIP end point 2 compares the SSRC identifier included in the SDP answer with that contained in the media stream received from the second SIP end point 3 and determines that the SSRC identifiers do not match. The first SIP end point 2 compares the SSRC identifier included in the SIP 'OK'/200 response with that contained in the media stream received from the third SIP end point 4 and determines that the SSRC identifiers do match. The first SIP end point 2 discards the RTP stream from the second SIP end point 3 having the non-matching SSRC identifier and instead renders the stream from the third SIP end point 4 having the matching SSRC identifier.

Thus, the inclusion of the SSRC identifier in the SDP answer of the SIP response message allows a calling device to correlate that SSRC with those included in received media streams. If the SSRC identifier of the media stream initially rendered by the calling device does not match that in the SDP answer, this media stream may be discarded in favour of another received media stream that does have a matching SSRC identifier. The switching from the initial wrong choice of received media stream to the correct media stream may be made as soon as the SIP response including the SDP answer is received. This switching is thus performed more quickly than in known systems, where the switching can only be performed after it has been detected that the initial rendered media stream has ceased.
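The correlation performed at the calling device can be sketched as follows. This is a minimal model under stated assumptions rather than a definitive implementation: the class `StreamSelector` and its methods are hypothetical, packets are represented only by their SSRC identifier, and rendering is reduced to a yes/no decision per packet.

```python
from typing import Optional


class StreamSelector:
    """Decides which received RTP stream the calling device renders (hypothetical)."""

    def __init__(self) -> None:
        self.rendering: Optional[int] = None  # SSRC of the stream currently rendered
        self.answered: Optional[int] = None   # SSRC signalled in the accepted answer

    def on_rtp_packet(self, ssrc: int) -> bool:
        """Return True if a packet with this SSRC should be rendered."""
        if self.answered is not None:
            # After the answer arrives, only the matching stream is rendered.
            return ssrc == self.answered
        if self.rendering is None:
            # Before the answer arrives, render the first stream to arrive
            # so that the start of the greeting is not lost.
            self.rendering = ssrc
        return ssrc == self.rendering

    def on_answer(self, ssrc: int) -> None:
        """Record the SSRC from the SDP answer and switch streams immediately."""
        self.answered = ssrc
        self.rendering = ssrc
```

In the example of Figure 1, packets from the second SIP end point 3 are rendered first and those from the third SIP end point 4 are ignored; once `on_answer` is called with the SSRC of the third SIP end point 4, the decision reverses at once, without waiting for the first stream to fall silent.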

It is important to realise that in current systems, the source IP address and/or port signalled in the underlying network and transport layers could not be used in this way to indicate which RTP streams should be rendered, because some devices can send and receive on different addresses/ports and because of the impact of Network Address Translation (NAT) devices in the network.

Referring back to Figure 1, following the reception of the SIP 'OK'/200 response from the third SIP end point 4, the first SIP end point 2 transmits a SIP 'ACK' message to the third SIP end point 4. The first SIP end point 2 then begins to transmit its media stream to the third SIP end point 4. Following the reception of the SIP 'Cancel' message from the proxy 5, the second SIP end point 3 responds with an 'OK'/200 response.

Following the reception at the first SIP end point 2 of the SIP 'OK'/200 response containing the SDP answer from the second SIP end point 3, and prior to the final data packet from the second SIP end point 3 being received at the first SIP end point 2, a SIP 'ACK' message followed by a SIP 'BYE' message are sent from the first end point 2 to the second end point 3 and a SIP 'OK'/200 response is sent from the second end point 3 to the first end point 2.

The invention may find application in other systems, for example, when an INVITE request is forked to different destinations, one or more of which is in a circuit-switched network (e.g., a Public Switched Telephone Network) reachable via a gateway. Often, in the absence of an answer signal from the other network, the gateway does not know whether important audio information is being received (e.g., tones or announcements), and therefore must start transmitting RTP packets without waiting for an answer signal. In this case the calling device can receive several RTP streams before it receives a response to the SIP INVITE request and must choose to render one of them. However, it is important that it starts to render the correct RTP stream as soon as answer occurs (either from a device in the IP network or a device in another network). The presence of an SSRC identifier in the SIP response message can allow this to happen. Although the invention is described with respect to SDP and SIP, other alternative signalling protocols may be used in embodiments of the invention, for example, H.323 or other proprietary protocols.

Having thus described the present invention by reference to preferred embodiments it is to be well understood that the embodiments in question are exemplary only and that modifications and variations such as will occur to those possessed of appropriate knowledge and skills may be made without departure from the scope of the invention as set forth in the appended claims.

Claims

1. A method of rendering a real time media stream to a calling device in a packet based network, when a call from said calling device has been forked to a plurality of destinations and each of said destinations has transmitted a real time media stream to said calling device in response to said call; the method comprising: receiving a call set up message at the calling device, the message signal including an identifier identifying a correct real time media stream to be rendered at the calling device; comparing at the calling device the identifier included in the message signal with identifiers included in real time media streams received at the calling device from the destinations; and rendering to the calling device a media stream having an identifier that correlates with that included in the call set up message.

2. A method according to claim 1, the method comprising: discarding at the calling device an initially rendered media stream in response to determining that the identifier included in that media stream fails to correlate with the identifier in the call set up message.

3. A method according to claim 1 or 2 wherein the call set up message is a SIP message.

4. A method according to any preceding claim wherein the identifier included in the call set up message is included in an SDP session description.

5. A method according to any preceding claim wherein the identifier is an SSRC identifier.

6. A method according to any preceding claim wherein the call set up message is transmitted to the calling device via a network server device and is the first such message received at the network server device from any of the plurality of destinations to which the call from the calling device is forked.

7. An apparatus comprising means for implementing the method of any of claims 1 to 5.

8. A call signalling protocol for a packet based network adapted to implement the method of claims 1 to 7.