CN115102927A

CN115102927A - SIP (Session initiation protocol) talkback method, system and storage device for keeping video clear

Info

Publication number: CN115102927A
Application number: CN202210468252.1A
Authority: CN
Inventors: 庄宗辉; 叶智鑫; 卢刚
Original assignee: Xiamen Leelen Technology Co Ltd
Current assignee: Xiamen Leelen Technology Co Ltd
Priority date: 2022-04-29
Filing date: 2022-04-29
Publication date: 2022-09-23
Anticipated expiration: 2042-04-29
Also published as: CN115102927B

Abstract

The invention relates to an SIP (Session Initiation Protocol) talkback method, system and storage device for keeping video clear. The method comprises the following steps, performed after communication is established between an extension or mobile terminal and the host: the sending end encodes and sends an IDR frame to the receiving end every GOP time, and the receiving end decodes the IDR frames and outputs the corresponding images; when the receiving end detects that the RTP sequence numbers are discontinuous, or that two consecutive frames fail to decode during video decoding, the receiving end requests the sending end to retransmit the last IDR frame. When the receiving end detects packet loss, the method notifies the sending end to encode the previous frame as an IDR frame and send it as the next IDR frame. This avoids the conventional approach of repeatedly requesting retransmission of each lost RTP packet, lets the receiving end play a complete IDR frame directly, and gives the user a clear image.

Description

SIP (Session initiation protocol) talkback method, system and storage device for keeping video clear

Technical Field

The invention relates to the field of SIP (session initiation protocol) talkback, in particular to a SIP talkback method, a system and a storage device for keeping a video clear.

Background

In a building intercom system, a plurality of hosts and a plurality of extensions are installed in a local area network, SIP intercom software is installed on the user's mobile phone, and a host can call the extensions or the mobile phone over the network to carry out SIP intercom. In practice, the host can call several extensions and several mobile phones in the user's home at the same time. While the host rings, it needs to send early video to these extensions and mobile phones, which ring, receive the host video and display it.

In existing building intercom systems, network congestion causes snow, mosaics and similar artifacts to appear in the video seen by the user. The reason is that the receiving end sends RTCP NACK messages to the sending end, requesting retransmission of the corresponding RTP packets; when network congestion is severe, the large number of retransmission requests aggravates the congestion and produces more snow or mosaics, as shown in fig. 1.

The invention aims to design an SIP talkback method, a system and a storage device for keeping a video clear aiming at the problems in the prior art.

Disclosure of Invention

In view of the problems in the prior art, the present invention provides an SIP intercom method, system and storage device for keeping video clear, which can effectively solve the problems in the prior art.

The technical scheme of the invention is as follows:

an SIP talkback method for keeping video clear under the condition of network congestion, which comprises the following steps:

after communication between the extension or mobile terminal and the host, performing:

the sending end encodes and sends IDR frames to the receiving end every GOP time, and the receiving end decodes the IDR frames and outputs corresponding images;

when the receiving end detects that the RTP sequence number is discontinuous or two continuous frames fail to decode in the video decoding, the receiving end requests the sending end to retransmit the last IDR frame.

Further, the requesting the sending end to retransmit the last IDR frame includes:

the receiving end sends a packet loss retransmission message to the sending end;

and the sending end defines the last IDR frame as a next frame and sends the next frame to the receiving end.

Further, after the request for retransmission of the last IDR frame is sent to the sending end, the following steps are performed:

and after a delay of a second time, the sending end again encodes and sends an IDR frame to the receiving end every GOP time.

Further, after the communication between the extension or the mobile terminal and the host, performing:

the priority of transmitting audio is set to be greater than the priority of transmitting video.

Further, the setting of the priority of the transmission audio to be greater than the priority of the transmission video includes:

defining audio as a main axis and video as an auxiliary axis, and caching the audio and the video by a receiving end according to a first-in first-out principle;

and when the receiving end obtains the audio and/or the video every time, calculating the time difference value of the time stamp of the audio and the time stamp of the video in the current frame, and adjusting the playing speed of the video in the current frame according to the time difference value.

Further, the adjusting the playing speed of the video in the current frame according to the time difference value includes:

if the time difference is less than or equal to 250ms, the playing speed of the video in the current frame is not adjusted;

if the time difference value is greater than 250ms, the video is faster than the audio, and the buffer amount of the video is less than a first amount, slowing down or pausing the playing speed of the video in the current frame until the time difference value is less than or equal to 250 ms;

if the time difference value is greater than 250ms, the video is faster than the audio, and the buffer number of the video is greater than a first number, the playing speed of the video in the current frame is not adjusted;

if the time difference value is larger than 250ms and the video is slower than the audio, the playing speed of the video in the current frame is increased until the time difference value is smaller than or equal to 250 ms.

Further, the method is based on a one-call multi-SIP system fusing a local area network and a wide area network, the system comprising a host, a plurality of extensions and a plurality of mobile terminals, wherein the host is connected with the extensions through the local area network, and the host is connected with the mobile terminals through the wide area network.

Further, before the communication between the extension or the mobile terminal and the host, performing:

establishing connection between a host and a plurality of extension sets through a local area network, establishing connection between the host and a plurality of mobile terminals through a wide area network, and registering the host and the mobile terminals to an SIP server;

when any extension or any mobile terminal initiates answering, the host sends a call hang-up instruction to other extensions or other mobile terminals, and communication between the corresponding extension or mobile terminal and the host is established.

There is further provided an SIP intercom system for keeping video clear under the condition of network congestion, which comprises the following modules:

the encoding and sending module is used for encoding every GOP time by the sending end and sending the IDR frame to the receiving end, and the receiving end decodes the IDR frame and outputs a corresponding image;

and the request module is used for requesting the sending end to retransmit the last IDR frame when the receiving end detects that the RTP sequence number is discontinuous or the decoding of two continuous frames fails in the video decoding.

There is further provided a computer readable storage medium storing a computer program which, when executed by a processor, implements the SIP talkback method for keeping video clear in case of network congestion.

Accordingly, the present invention provides the following effects and/or advantages:

When the receiving end detects packet loss, the method notifies the sending end to encode the previous frame as an IDR frame and send it as the next IDR frame. This avoids the conventional approach of repeatedly requesting retransmission of each lost RTP packet, lets the receiving end play a complete IDR frame directly, and gives the user a clear image.

Because of network congestion and similar conditions, after retransmission of the last IDR frame has been requested, the sending end waits for the second time before again encoding and sending an IDR frame to the receiving end every GOP time. This prevents the further congestion that would result from the receiving end continually requesting IDR retransmission while the network is congested. In addition, since the network is currently congested, less data is sent and received during the second-time delay, which relieves the congestion.

In the method, the priority of transmitting the audio is set to be greater than the priority of transmitting the video. Because audio data is usually smaller than video data, the audio can be played directly and the video adjusted gradually until the timestamps of the two match.

The method is based on an SIP talkback system consisting of a local area network and a wide area network, only the host and the mobile terminal are registered to an SIP server, and the extension is not registered to the SIP server. The terminal in the local area network directly carries out data communication without passing through the SIP server, so that the communication pressure of the SIP server can be reduced, and the video conversation quality is improved.

It is to be understood that both the foregoing general description and the following detailed description of the present invention are exemplary and explanatory and are intended to provide further explanation of the invention as claimed.

Drawings

Fig. 1 is a video screenshot of a prior art SIP talkback method in the case of network congestion.

Fig. 2 is a flow chart of a method provided herein.

Fig. 3 is a video screenshot of the SIP talkback method provided by the present invention under the condition of network congestion.

Fig. 4 is a block diagram of an SIP intercom system provided by the present application.

Fig. 5 is a logic/timing diagram of the SIP intercom method provided in the present application.

Fig. 6 is a system block diagram of a conventional SIP intercom.

Fig. 7 is a logic/timing diagram of a conventional SIP intercom method.

Detailed Description

To facilitate understanding for those skilled in the art, the present invention will now be described in further detail with reference to the drawings. It should be understood that the steps mentioned in the embodiment, unless their order is specifically stated, may be performed in any order, or even simultaneously or partially simultaneously.

Referring to fig. 2, an SIP talkback method for keeping video clear under the condition of network congestion includes the following steps:

S1, connecting a host with a plurality of extensions through a local area network, connecting the host with a plurality of mobile terminals through a wide area network, and registering the host and the mobile terminals to an SIP server;

SIP is an IETF standard modeled on protocols such as SMTP and HTTP. It is used to set up, modify and terminate calls between users of an IP-based network. Providing telephony services also requires combining it with other standards and protocols: in particular, to ensure transport, to interconnect signaling with the existing telephone network, to ensure voice quality, to provide a directory, to authenticate users, and so on. A host refers to a terminal that initiates video or voice communication to multiple other users, terminals or extensions.

In this embodiment, only the host and the mobile terminals are registered to the SIP server; the extensions are not. Registering to the SIP server means that the host or mobile terminal periodically sends a registration request (REGISTER) to the network, reporting information such as its current IP address and user name. The SIP server thereafter keeps the information of the host or mobile terminal on record.

In this step, when a user initiates a call through the host, the host sends a call request over the SIP protocol, simultaneously calling the extensions in the local area network and the mobile terminals in the wide area network. During ringing, if the user of any extension or mobile terminal does not want to answer or does not want it to keep ringing, the user hangs up on that device, and the device sends a call cancel instruction (CANCEL) to the host. That device then stops ringing and the host no longer sends instructions or media data to it, while the other extensions and mobile terminals are unaffected and keep ringing. When any extension or mobile terminal answers, the host sends a call hang-up instruction to the other extensions and mobile terminals, and communication is established between the answering extension or mobile terminal and the host.

In this step, when the user answers the host's call on an extension or mobile terminal, the other extensions and mobile terminals no longer need to establish communication with the host, so the host sends them a call hang-up instruction and they stop ringing. Meanwhile, communication is established between the answering extension or mobile terminal and the host, and the user starts an audio or video conversation with the host.

Through the above steps, an extension or mobile terminal can be hung up during ringing without affecting the other devices, which keep ringing; when any called device answers, the host hangs up the other calls and carries out audio and video intercom with that device. In the prior art, by contrast, both the extensions and the mobile phones are registered to the SIP server, and the extensions' local area network has an outlet to the wide area network. Even when only extensions are present, communication data between an extension and the host must travel out to the SIP server and back into the local area network, which increases the complexity of communication between extension and host and reduces communication quality. In this embodiment the host and the extensions are connected through the local area network and no SIP server is needed to relay data inside it, so the extensions' information does not need to be registered to the SIP server: the host calls the extensions directly over the local area network, and calls the mobile terminals over the wide area network through the SIP server.
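The ringing behaviour described in this step can be sketched as a small state table kept by the host. This is a minimal illustration under stated assumptions, not the patented implementation: the class `RingGroup`, its method names and the state labels are hypothetical, and the real SIP signalling is reduced to plain method calls.

```python
class RingGroup:
    """Host-side view of one call that rings several devices at once (hypothetical)."""

    def __init__(self, callees):
        # Every called extension or mobile terminal starts out ringing.
        self.state = {name: "ringing" for name in callees}
        self.hangups_sent = []  # devices the host told to stop ringing

    def on_cancel(self, callee):
        """A device declined: stop that device only, the others keep ringing."""
        if self.state.get(callee) == "ringing":
            self.state[callee] = "cancelled"

    def on_answer(self, callee):
        """A device answered: connect it and hang up every other ringing device."""
        if self.state.get(callee) != "ringing":
            return False
        self.state[callee] = "connected"
        for other, status in self.state.items():
            if other != callee and status == "ringing":
                self.state[other] = "hung_up"
                self.hangups_sent.append(other)
        return True
```

With three called devices, a cancel from one leaves the other two ringing, and an answer from one of those two causes the host to hang up only the remaining one.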
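The difference between the two call paths can be illustrated with a short sketch. The names `Endpoint`, `registers_with_sip_server` and `call_path`, as well as the `"sip_server"` label, are assumptions introduced for illustration; the path returned is the signalling path only, since the source does not state how media to mobile terminals is routed.

```python
from dataclasses import dataclass

# Only these kinds of device register; extensions stay off the SIP server.
REGISTERED_KINDS = {"host", "mobile"}


@dataclass(frozen=True)
class Endpoint:
    name: str
    kind: str  # "host", "extension" or "mobile"


def registers_with_sip_server(endpoint: Endpoint) -> bool:
    """True for devices that send REGISTER in this embodiment."""
    return endpoint.kind in REGISTERED_KINDS


def call_path(host: Endpoint, callee: Endpoint) -> list:
    """Return the hops a call from the host takes to reach the callee."""
    if callee.kind == "extension":
        return [host.name, callee.name]  # direct, inside the local area network
    if callee.kind == "mobile":
        return [host.name, "sip_server", callee.name]  # via the wide area network
    raise ValueError(f"unsupported callee kind: {callee.kind}")
```

A call to an extension has two hops and never touches the server, which is the saving the paragraph above describes.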

Some optimization directions of the present embodiment are described below.

S2, the sending end encodes and sends IDR frames to the receiving end every GOP time, and the receiving end decodes the IDR frames and outputs corresponding images;

S3, when the receiving end detects that the RTP sequence number is discontinuous or that two consecutive frames fail to decode in the video decoding, the receiving end requests the sending end to retransmit the last IDR frame.

In this embodiment, in H.264, I frames are divided into ordinary I frames and IDR frames (a special kind of I frame). An IDR frame stops the accumulation of errors, because frames following an IDR frame cannot refer to frames preceding it; an ordinary I frame does not stop the accumulation of errors. An IDR frame must be an I frame, but an I frame is not necessarily an IDR frame. In H.264, pictures are organized in units of a sequence, a sequence being a segment of the coded picture stream that starts with an I frame and ends at the next I frame. The first picture of a sequence is called an IDR picture, and when the decoder receives an IDR frame it discards all reference frame queues. "GOP time" means the interval between two I frames. For video communication the GOP time is generally 1 to 2 seconds, and this embodiment uses 1 second.

In step S2, an IDR frame is encoded and sent to the receiving end every GOP time, so that the receiving end can parse the IDR frame first and output it directly, and the screen of the receiving end obtains a picture immediately, as shown in fig. 3. In the prior art, when network congestion occurs the receiving end sends an RTCP NACK message to the sending end and the sending end retransmits the corresponding RTP packet; when congestion is severe, the picture is filled with mosaics and similar artifacts, as shown in fig. 1. To avoid this, this embodiment performs step S3, which abandons the RTCP NACK request for retransmission of the corresponding RTP packet and instead requests the sending end to retransmit the previous IDR frame. One IDR frame contains a complete image, so when the receiving end detects that the RTP sequence numbers are discontinuous or that two consecutive frames fail to decode, it decodes the retransmitted IDR frame and shows a complete image again, as in fig. 3, without snow, mosaics or similar artifacts.
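In an H.264 bitstream the distinction between IDR and other frames is visible in the first byte of each NAL unit: the low five bits hold `nal_unit_type`, where 5 marks a coded slice of an IDR picture and 1 a coded slice of a non-IDR picture (which may still be intra-coded). The sketch below shows that check; the function names are illustrative, and it assumes a single NAL unit with the start code already removed, so RTP aggregation and fragmentation packets are not handled.

```python
NAL_TYPE_NON_IDR_SLICE = 1  # slice of a non-IDR picture (I, P or B)
NAL_TYPE_IDR_SLICE = 5      # slice of an IDR picture


def nal_unit_type(header_byte: int) -> int:
    """Extract nal_unit_type from the one-byte H.264 NAL unit header."""
    return header_byte & 0x1F


def is_idr_nal(nal: bytes) -> bool:
    """True when the NAL unit carries a slice of an IDR picture."""
    return len(nal) > 0 and nal_unit_type(nal[0]) == NAL_TYPE_IDR_SLICE
```

A receiver could use such a check to recognise the periodic or retransmitted IDR frame and output it first.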
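The two trigger conditions of step S3 can be sketched as follows. The class `LossDetector` and its method names are hypothetical; the sketch assumes packets arrive in order, so a reordered or duplicated packet would also be reported as a gap, and it accounts for the 16-bit wraparound of RTP sequence numbers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LossDetector:
    """Receiver-side trigger for an IDR retransmission request (hypothetical)."""
    last_seq: Optional[int] = None
    consecutive_decode_failures: int = 0

    def on_rtp_packet(self, seq: int) -> bool:
        """Return True when seq is not the expected successor (mod 2**16)."""
        gap = False
        if self.last_seq is not None:
            expected = (self.last_seq + 1) & 0xFFFF
            gap = seq != expected
        self.last_seq = seq
        return gap

    def on_decode_result(self, ok: bool) -> bool:
        """Return True when two consecutive frames have failed to decode."""
        if ok:
            self.consecutive_decode_failures = 0
            return False
        self.consecutive_decode_failures += 1
        if self.consecutive_decode_failures >= 2:
            self.consecutive_decode_failures = 0  # start counting afresh
            return True
        return False
```

Either method returning True would be the point at which the receiving end sends its retransmission request to the sending end.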

During the first 3 seconds after it starts sending video, the sending end sends one I frame every 1 s so that the receiving end can quickly output a stable video image. FIR requests from the receiving end are not processed during these 3 seconds.

After the first 3 seconds, when the sending end receives an RTCP PLI or FIR message from the receiving end, it notifies the application layer through an RTCP event so that the next frame is sent directly as an IDR frame, subject to a minimum interval of 300 ms between IDR frames. If an I frame has just been sent and a retransmission request for an I frame arrives from the other side within 300 ms, while P (B) frames are being sent, the sending end stops sending P (B) frames until the 300 ms have elapsed and then sends the I frame. However many I frame retransmission requests arrive during this period, they are handled together after the 300 ms and the I frame is sent only once.
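The request handling described in the two paragraphs above can be sketched as a small throttle on the sending end. The class and method names are hypothetical and times are integer milliseconds supplied by the caller. The sketch only decides when an IDR frame may be sent; it does not model pausing the P (B) frames in the meantime.

```python
from dataclasses import dataclass

STARTUP_WINDOW_MS = 3000    # requests are not processed during this window
MIN_IDR_INTERVAL_MS = 300   # minimum spacing between two IDR frames


@dataclass
class IdrRequestThrottle:
    start_ms: int           # when the sender began sending video
    last_idr_ms: int        # when the most recent I/IDR frame was sent
    pending: bool = False   # any number of requests collapses into one flag

    def on_request(self, now_ms: int) -> None:
        """Record a PLI/FIR request unless the startup window is still open."""
        if now_ms - self.start_ms < STARTUP_WINDOW_MS:
            return
        self.pending = True

    def on_idr_sent(self, now_ms: int) -> None:
        """Note a periodic IDR frame sent outside of this throttle."""
        self.last_idr_ms = now_ms

    def should_send_idr(self, now_ms: int) -> bool:
        """Ask before each frame: True means send an IDR frame now."""
        if not self.pending:
            return False
        if now_ms - self.last_idr_ms < MIN_IDR_INTERVAL_MS:
            return False  # too soon after the previous IDR frame
        self.pending = False
        self.last_idr_ms = now_ms
        return True
```

Two requests arriving 100 ms and 200 ms after an IDR frame therefore produce a single IDR frame once 300 ms have passed.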

Specifically, the requesting the sending end to retransmit the last IDR frame includes:

S3.1, the receiving end sends a packet loss retransmission message to the sending end;

S3.2, the sending end defines the last IDR frame as the next frame and sends it to the receiving end. In the prior art, after receiving an IDR frame the receiving end deletes all previously buffered frames; in this step the sending end redefines the previous IDR frame as the next IDR frame, so that it can send the previous IDR frame again at the request of the receiving end.
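Steps S3.1 and S3.2 can be sketched on the sending end as a cache of the most recent IDR frame. This is one possible reading of the text, not a confirmed implementation: the class `IdrResender` and its methods are hypothetical, frames are treated as opaque byte strings, and how later frames are encoded relative to the resent IDR frame is left out because the source does not detail it.

```python
class IdrResender:
    """Keeps the last IDR frame so it can be sent again as the next frame."""

    def __init__(self):
        self.last_idr = None           # most recent encoded IDR frame
        self.resend_requested = False  # set by a packet loss message (S3.1)

    def on_frame_encoded(self, frame: bytes, is_idr: bool) -> None:
        """Remember every IDR frame the encoder produces."""
        if is_idr:
            self.last_idr = frame

    def on_loss_message(self) -> None:
        """S3.1: the receiving end reported packet loss."""
        if self.last_idr is not None:
            self.resend_requested = True

    def select_next_frame(self, fresh_frame: bytes) -> bytes:
        """S3.2: send the cached IDR frame as the next frame when requested."""
        if self.resend_requested:
            self.resend_requested = False
            return self.last_idr
        return fresh_frame
```

After a loss message, the next call to `select_next_frame` returns the cached IDR frame once, and later calls return fresh frames again.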

Further, after requesting the sending end to retransmit the last IDR frame, the following steps are performed:

S4, after a delay of a second time, the sending end again encodes and sends an IDR frame to the receiving end every GOP time.

In this step, because the receiving end detected packet loss in step S3 and notified the sending end via RTCP to encode the previous frame as an IDR frame and send it as the next IDR frame, and because the network is congested, the sending end can lower the frequency at which it sends subsequent IDR frames once S3.2 has been executed, in effect extending the GOP time. This prevents the receiving end from continually requesting IDR retransmission and causing further congestion. The second time may be set to 10 to 15 s, and this embodiment uses 10 s. That is, step S4 waits 10 seconds before the sending end resumes step S2, encoding and sending an IDR frame to the receiving end every GOP time, with the receiving end decoding the IDR frames and outputting the corresponding pictures.

Further, under network congestion the video and audio parsed by the receiving end fall out of sync, and during video communication the user perceives that lip movement in the video does not match the sound.
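The timing rule of step S4 reduces to choosing between two delays. The function below is a minimal sketch with hypothetical names, using the 1 s GOP time and 10 s second time of this embodiment as defaults; it assumes the first periodic IDR frame is sent at the moment the delay ends, which the source does not state explicitly.

```python
GOP_TIME_MS = 1000       # periodic IDR interval used in this embodiment
SECOND_TIME_MS = 10_000  # delay after a retransmitted IDR frame (10 to 15 s)


def next_periodic_idr_ms(last_idr_ms: int,
                         last_was_retransmission: bool,
                         gop_time_ms: int = GOP_TIME_MS,
                         second_time_ms: int = SECOND_TIME_MS) -> int:
    """Return when the sending end should send its next periodic IDR frame."""
    delay = second_time_ms if last_was_retransmission else gop_time_ms
    return last_idr_ms + delay
```

For example, a periodic IDR frame at 2000 ms is followed by the next one at 3000 ms, whereas an IDR frame retransmitted at 2400 ms postpones the next periodic one to 12400 ms.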

After the communication between the corresponding extension or mobile terminal and the host is established, the following steps are executed:

S5, the priority of transmitting audio is set to be greater than the priority of transmitting video.

In this step the priority is audio > video. Because audio is guaranteed to be transmitted normally and with priority, the receiving end plays audio data directly when it arrives, and plays video data with reference to the audio timestamp. If video were used as the timestamp reference instead, a poor network would delay the audio throughout in order to match the video.

Specifically, the setting of the priority of the transmission audio to be higher than the priority of the transmission video includes:

S5.1, defining audio as the main axis and video as the auxiliary axis, and caching the audio and the video at the receiving end on a first-in first-out basis. In this embodiment, the audio and video are first buffered for 0.6 second, and the maximum buffer is set to 2 seconds, that is, 100 audio data items and 50 video data items are stored; the earliest data are deleted when the video data exceeds 10 percent of the buffer area.

S5.2, when the receiving end obtains the audio and/or the video every time, calculating the time difference value of the audio time stamp and the video time stamp in the current frame, and adjusting the playing speed of the video in the current frame according to the time difference value.
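The first-in first-out caching of S5.1 can be sketched with a bounded queue. The class `MediaFifo` is hypothetical. The pre-buffer depth is derived by scaling the item count by 0.6 s / 2 s, which is an assumption, and the sketch simply drops the oldest item on overflow rather than modelling the 10 percent rule, whose exact meaning the source leaves open.

```python
from collections import deque

MAX_BUFFER_MS = 2000    # maximum buffer duration in this embodiment
PREBUFFER_MS = 600      # initial buffering before playback starts
MAX_AUDIO_ITEMS = 100   # audio items held in 2 seconds
MAX_VIDEO_ITEMS = 50    # video items held in 2 seconds


class MediaFifo:
    """Bounded first-in first-out buffer for audio or video items."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = deque()
        # Assumed: item count needed to cover the 0.6 s pre-buffer.
        self.prebuffer_items = capacity * PREBUFFER_MS // MAX_BUFFER_MS

    def push(self, item) -> list:
        """Append an item and return any oldest items dropped on overflow."""
        self.items.append(item)
        dropped = []
        while len(self.items) > self.capacity:
            dropped.append(self.items.popleft())
        return dropped

    def ready(self) -> bool:
        """True once enough items are buffered to start playback."""
        return len(self.items) >= self.prebuffer_items

    def pop(self):
        """Remove and return the oldest item, or None when empty."""
        return self.items.popleft() if self.items else None
```

A receiver would keep one `MediaFifo(MAX_AUDIO_ITEMS)` and one `MediaFifo(MAX_VIDEO_ITEMS)` and compare the timestamps of the items at their heads, as in S5.2.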

Specifically, the adjusting the playing speed of the video in the current frame according to the time difference includes:

If the time difference is less than or equal to 250ms, the playing speed of the video in the current frame is not adjusted; this is because the user does not strongly perceive a lip-sync mismatch between video and audio whose time difference is 250ms or less. The time difference may be computed as video timestamp minus audio timestamp or as audio timestamp minus video timestamp.

If the time difference value is greater than 250ms, the video is faster than the audio, and the buffer amount of the video is less than a first amount, slowing down or pausing the playing speed of the video in the current frame until the time difference value is less than or equal to 250 ms; in this case the video is ahead of the audio, so the video must be played slowly or paused to bring audio and video back into sync.

If the time difference value is greater than 250ms, the video is faster than the audio, and the buffer amount of the video is greater than a first amount, not adjusting the playing speed of the video in the current frame; in this case more video than audio is buffered, and the video buffer is likely to shrink as video packets are lost to network congestion or similar causes, so the video is neither slowed down nor sped up and resynchronizes naturally while the network is congested.

If the time difference value is greater than 250ms and the video is slower than the audio, the playing speed of the video in the current frame is increased until the time difference value is less than or equal to 250 ms. When network congestion causes video and audio data to arrive out of step, the number of buffered video frames may exceed the preset number, so the video needs to be output and played faster.

In this embodiment the first number may be 50; in other embodiments it may be any number from 10 to 100.
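The four cases above can be written as one decision function. The names below are hypothetical. The sketch assumes that "video is faster than audio" means the video timestamp is ahead of the audio timestamp, and it treats a video buffer count exactly equal to the first number as "no adjustment", a boundary the source leaves unspecified.

```python
from enum import Enum

SYNC_THRESHOLD_MS = 250  # differences up to this value are left alone
FIRST_NUMBER = 50        # video buffer count threshold in this embodiment


class VideoAction(Enum):
    KEEP = "keep"                    # do not adjust the playing speed
    SLOW_OR_PAUSE = "slow_or_pause"  # video is ahead and the buffer is small
    SPEED_UP = "speed_up"            # video is behind the audio


def choose_video_action(audio_ts_ms: int, video_ts_ms: int,
                        video_buffered: int,
                        first_number: int = FIRST_NUMBER) -> VideoAction:
    """Pick the video playback adjustment, with audio as the main axis."""
    diff = video_ts_ms - audio_ts_ms
    if abs(diff) <= SYNC_THRESHOLD_MS:
        return VideoAction.KEEP
    if diff > 0:  # video ahead of audio
        if video_buffered < first_number:
            return VideoAction.SLOW_OR_PAUSE
        return VideoAction.KEEP  # large buffer: let it resynchronize naturally
    return VideoAction.SPEED_UP  # video behind audio
```

The function would be called each time the receiving end takes audio or video from its buffers, and the chosen action kept until the difference falls back to 250 ms or less.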

Referring to fig. 4, the method is based on a one-call multi-SIP system that merges a local area network and a wide area network, and includes a host, a plurality of extensions, and a plurality of mobile terminals, where the host establishes a connection with the plurality of extensions through the local area network, and the host establishes a connection with the plurality of mobile terminals through the wide area network. For the workflow of the one-call multi-SIP system merging a local area network and a wide area network, refer to fig. 5. For existing SIP systems and their workflow, refer to figs. 6 and 7.

The comparison shows that in this system the host communicates with the extensions directly, without passing through the SIP server. This removes the link in which host data is first transmitted to the SIP server and then on to the extension; the data goes straight from the host to the extension, which further improves the clarity and smoothness of the video.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

In the description of the present invention, it is to be understood that the terms "first", "second", and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or to imply that the number of technical features indicated are in fact significant. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.

In the present invention, unless otherwise explicitly stated or limited, the terms "mounted," "connected," "fixed," and the like are to be construed broadly, e.g., as being fixedly connected, detachably connected, or integrated; can be mechanically or electrically connected; either directly or indirectly through intervening media, either internally or in any other relationship. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.

In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above should not be understood to necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.

Claims

1. A SIP talkback method for keeping clear video under the condition of network congestion is characterized in that: the method comprises the following steps:

2. The SIP talkback method for keeping video clear under the condition of network congestion according to claim 1, characterized in that: the requesting the retransmission of the last IDR frame to the transmitting end includes:

3. The SIP talkback method for keeping video clear under the condition of network congestion according to claim 1, characterized in that: after the request for retransmitting the last IDR frame is sent to the sending end, the following steps are executed:

and after a delay of a second time, the sending end again encodes and sends an IDR frame to the receiving end every GOP time.

4. The SIP talkback method for keeping video clear under the condition of network congestion according to claim 1, characterized in that: after communication between the extension or mobile terminal and the host, performing:

5. The SIP talkback method for keeping video clear under the condition of network congestion according to claim 4, characterized in that: the setting of the priority of the transmission audio to be greater than the priority of the transmission video includes:

6. The SIP talkback method for keeping video clear under the condition of network congestion according to claim 5, characterized in that: the adjusting the playing speed of the video in the current frame according to the time difference comprises:

if the time difference value is greater than 250ms, the video is faster than the audio, and the buffer amount of the video is greater than a first amount, not adjusting the playing speed of the video in the current frame;

if the time difference value is greater than 250ms and the video is slower than the audio, the playing speed of the video in the current frame is increased until the time difference value is less than or equal to 250 ms.

7. The SIP talkback method for keeping video clear under the condition of network congestion according to claim 1, characterized in that: the method is based on a one-call multi-SIP system fusing a local area network and a wide area network, the system comprising a host, a plurality of extensions and a plurality of mobile terminals, wherein the host is connected with the extensions through the local area network, and the host is connected with the mobile terminals through the wide area network.

8. The SIP intercom method capable of keeping video clear under the condition of network congestion according to claim 7, wherein the SIP intercom method comprises the following steps: before communication between an extension or a mobile terminal and a host, performing:

9. An SIP intercom system for keeping video clear under the condition of network congestion, characterized in that: the system comprises the following modules:

the coding sending module is used for coding every GOP time by the sending end and sending the IDR frame to the receiving end, and the receiving end decodes the IDR frame and outputs a corresponding image;

10. A computer-readable storage medium, storing a computer program, wherein the computer program, when executed by a processor, implements a SIP talk back method for maintaining video clarity in a network congestion situation according to any of claims 1 to 9.