CN109194974B

CN109194974B - Media low-delay communication method and system for network video live broadcast

Info

Publication number: CN109194974B
Application number: CN201811136034.8A
Authority: CN
Inventors: 尹丹丹; 杨罡
Original assignee: Beijing Net Hi Tech Co Ltd
Current assignee: Beijing net Hi Tech Co., Ltd
Priority date: 2018-09-28
Filing date: 2018-09-28
Publication date: 2020-12-01
Anticipated expiration: 2038-09-28
Also published as: CN109194974A

Abstract

The invention provides a media low-delay communication method and a media low-delay communication system for network video live broadcast, wherein the method comprises the following steps: judging whether the scene change degree between the obtained current video frame image and the previous video frame image exceeds a preset scene change threshold value, respectively performing data sub-packaging on the current video frame and the obtained corresponding audio frame, adding corresponding scene marks in each video data packet and each audio data packet according to the judgment result of whether the scene change degree exceeds the preset scene change threshold value, buffering the video data packets and the audio data packets, adjusting the sending priorities of the video data packets and the audio data packets according to the scene marks, sending the video data packets and the audio data packets according to the sending priorities, and buffering the received video data packets and the audio data packets according to the scene marks by a receiver so as to realize audio and video data playing. The method can adjust the priority of audio and video transmission so as to send important data preferentially when network communication is delayed or unstable.

Description

Media low-delay communication method and system for network video live broadcast

Technical Field

The invention relates to the technical field of network communication, in particular to a media low-delay communication method and a media low-delay communication system for network video live broadcast.

Background

The global internet has expanded rapidly since the nineties of the last century and has become an important information infrastructure for the economic development and social progress of today's world. Since the rise of the internet, various applications that directly or indirectly utilize the internet have been developed to facilitate the daily life of people, and a communication method developed based on the internet, such as video conferencing and live webcasting, is performed by using the network, so that multiple participants at a relatively long distance can perform image and sound communication, and each participant can view images and sounds displayed by other participants, thereby greatly facilitating the process of information communication of the participants.

For example, when network video live broadcasting is performed by using the internet, a live broadcasting party can send an audio and video containing the playing content to a watching party, and the watching party does not need to go to a specific place such as a performance site and a competition site to watch a performance or a competition, but can send/receive audio and video information at any position capable of being connected with the network, so that network video live broadcasting is realized. Before a sender (namely a live broadcast party) sends audio and video to a receiver (namely a viewer), the audio and video are coded to obtain a frame-by-frame video frame and an audio frame, then the video frame and the audio frame are sequentially transmitted to the receiver through a network, and an audio and video file is obtained through decoding by the receiver and played.

In the process of implementing the network video live broadcast, communication between a live broadcast party and a viewer may be interfered due to the reasons of low network transmission speed or unstable network, and the like, so that a signal delay phenomenon occurs in the live broadcast process, images and sounds are discontinuous or asynchronous, and the viewer cannot normally view and listen to live broadcast contents in the signal delay process, thereby generating great negative effects on the video live broadcast process.

Disclosure of Invention

Objects of the invention

In order to overcome at least one defect in the prior art and make up for the content that a viewer cannot view and listen to the live network video due to poor network state, the invention provides the following technical scheme.

(II) technical scheme

As a first aspect of the invention, the invention discloses a media low-delay communication method for network video live broadcast, which comprises the following steps:

judging whether the scene change degree between the obtained current video frame image and the previous video frame image exceeds a preset scene change threshold value or not;

respectively performing data sub-packaging on the current video frame and the acquired corresponding audio frame, and adding corresponding scene marks in each video data packet and each audio data packet according to a judgment result of whether the scene change degree exceeds a preset scene change threshold value;

buffering the video data packet and the audio data packet, adjusting the sending priority of the video data packet and the audio data packet according to the scene mark, and sending the video data packet and the audio data packet according to the sending priority;

and the receiver buffers the received video data packet and the audio data packet according to the scene mark so as to realize the playing of the audio and video data.

In a possible implementation manner, the acquired video frame and the acquired audio frame are obtained by encoding audio and video recorded in real time.

In a possible implementation manner, whether the scene change degree exceeds a preset scene change threshold is judged through an inter-frame difference method and/or a background difference method.

In a possible implementation, the buffer size of the receiver is variable, which is adjusted depending on the scene marker.

In a possible implementation manner, before the separately packetizing the current video frame and the obtained corresponding audio frame, the method further includes: adding corresponding time marks in each video frame and each audio frame; and the number of the first and second electrodes,

after the receiving side buffers the received video data packet and the audio data packet according to the scene mark, the method further comprises the following steps: and limiting the time length of the data with high sending priority to be ahead of the time length of the data with low sending priority to be played according to the time identifier carried by each data packet and a preset time length threshold.

In a possible embodiment, the duration threshold is variable and is adjusted in dependence on the scene mark.

As a second aspect of the present invention, the present invention discloses a media low-delay communication system for network video live broadcast, comprising:

the scene judging module is used for judging whether the scene change degree between the obtained current video frame image and the previous video frame image exceeds a preset scene change threshold value or not;

the data sub-packaging module is used for respectively performing data sub-packaging on the current video frame and the acquired corresponding audio frame;

a scene identification module, configured to add, after the data packetization, a corresponding scene marker to each of the video data packet and the audio data packet according to a determination result of whether the scene change degree exceeds a preset scene change threshold;

a sending buffer module, configured to buffer the video data packet and the audio data packet after the corresponding scene tag is added, adjust sending priorities of the video data packet and the audio data packet according to the scene tag, and send the video data packet and the audio data packet according to the sending priorities;

and the receiving buffer module is used for buffering the received video data packet and the audio data packet according to the scene mark so as to realize the playing of audio and video data.

In one possible embodiment, the system further comprises:

and the coding and recording module is used for recording the audio and video in real time and coding the audio and video to obtain the acquired video frame and the acquired audio frame.

In a possible implementation manner, the scene determination module includes: the device comprises a first judging unit and/or a second judging unit, wherein the first judging unit is used for judging whether the scene change degree exceeds a preset scene change threshold value through an inter-frame difference method, and the second judging unit is used for judging whether the scene change degree exceeds the preset scene change threshold value through a background difference method.

In one possible implementation, the transmission buffer module includes: and the first adjusting unit is used for adjusting the buffer size of the sending buffer module according to the scene mark.

In one possible embodiment, the system further comprises:

the time identification module is used for adding corresponding time identifications in each video frame and each audio frame before the data subpackaging;

and the time difference adjusting module is used for limiting the time length of the data with high sending priority to be ahead of the time length of the data with low sending priority for playing according to the time identifier carried by each data packet and a preset time length threshold after the receiving buffer module buffers the data packets.

In a possible implementation manner, the duration threshold is variable, and the time difference adjusting module adjusts the duration threshold according to the scene mark.

(III) advantageous effects

The media low-delay communication method and the media low-delay communication system for the network video live broadcast can judge the video content in the audio and video containing the live broadcast content sent by a live broadcast party to adjust the priority of video transmission and audio transmission so as to send important data preferentially when the network video live broadcast is delayed or unstable, reduce the interference generated by network delay and ensure that a receiving party can not interrupt and leak the live broadcast key content to the maximum extent; meanwhile, the time identification is added in the audio and video data to limit the suspension time of the data packet which is suspended to be sent, so that the data packet with low priority is prevented from lagging too much.

Drawings

The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining and illustrating the present invention and should not be construed as limiting the scope of the present invention.

Fig. 1 is a schematic flow chart of a first embodiment of a media low-latency communication method for network video live broadcast provided by the invention.

Fig. 2 is a flowchart illustrating a second embodiment of the media low-latency communication method for network video live broadcast according to the present invention.

Fig. 3 is a block diagram of a first embodiment of a media low-latency communication system for network video live broadcast according to the present invention.

Fig. 4 is a block diagram of a second embodiment of the media low-latency communication system for network video live broadcast provided by the invention.

Detailed Description

In order to make the implementation objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be described in more detail below with reference to the accompanying drawings in the embodiments of the present invention.

It should be noted that: in the drawings, the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described are some embodiments of the present invention, not all embodiments, and features in embodiments and embodiments in the present application may be combined with each other without conflict. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In this document, "first", "second", and the like are used only for distinguishing one from another, and do not indicate their degree of importance, order, and the like.

The division of modules, units or components herein is merely a logical division, and other divisions may be possible in an actual implementation, for example, a plurality of modules and/or units may be combined or integrated in another system. Modules, units, or components described as separate parts may or may not be physically separate. The components displayed as cells may or may not be physical cells, and may be located in a specific place or distributed in grid cells. Therefore, some or all of the units can be selected according to actual needs to implement the scheme of the embodiment.

A first embodiment of the media low-latency communication method for network video live broadcast provided by the present invention is described in detail below with reference to fig. 1. The embodiment is mainly applied to network video live broadcast, and can judge the video content in the audio and video containing the network live broadcast content sent by a sender to adjust the priority of video transmission and audio transmission, so as to preferentially send important data when network communication is delayed or unstable, reduce interference generated by network delay, and ensure that a receiver can not interrupt and leak the key content of the network live broadcast to the greatest extent.

As shown in fig. 1, the media low-latency communication method provided in this embodiment includes the following steps:

step 100, judging whether the scene change degree between the obtained current video frame image and the previous video frame image exceeds a preset scene change threshold value.

One video frame is a static image, a plurality of video frames are combined to form a video, and the same principle is applied to audio frames. In the process of obtaining the video frame and the audio frame by encoding the audio and video, a same scene judgment mechanism is applied to the video frame, namely, the video frame Vn which needs to be sent to a receiving party at present and the image of the previous video frame Vn-1 are identified, the scene change degree of the two adjacent frames of images is obtained, and whether the scene change degree Sn exceeds a preset scene change threshold value St is judged.

If the scene change degree Sn exceeds St, it indicates that the video frame Vn and the video frame Vn-1 have a significant change in the scene. For example, the picture of the video frame Vn is the scene of the audience under the stage of the concert, and the picture of the video frame Vn-1 is the scene of the singer singing on the stage, in which case the scene change degree of the two frames of pictures exceeds the scene change threshold St, and it is determined that there is a significant change.

If the scene change degree Sn does not exceed St, it indicates that the video frame Vn and the video frame Vn-1 have not significantly changed in the scene. For example, the picture of the video frame Vn and the picture of the video frame Vn-1 are both scenes in which a singer sings on a stage, and in this case, the scene change degree of the two frames of pictures does not exceed the scene change threshold St, and it is determined that there is no significant change.

It should be noted that a segment of audio/video includes many video frames and audio frames, and except for the first video frame and the first audio frame, all subsequent video frames need to be compared with the previous frame for the above-mentioned scene change degree.

It should be further noted that, the comparison of the scene change degree with the current video frame image may be performed not only with the previous video frame image but also with the previous nth video frame image, where N >1, that is, may also be performed with the previous second frame image, the previous third frame image, and so on, and frame skipping is performed.

Step 200, performing data sub-packaging on the current video frame and the obtained corresponding audio frame respectively, and adding corresponding scene marks in each video data packet and each audio data packet according to a judgment result of whether the scene change degree exceeds a preset scene change threshold value.

After judging whether the pictures of the video frame Vn and the video frame Vn-1 have significant changes or not, respectively performing data sub-packaging on the video frame Vn and the audio frame An corresponding to the video frame Vn. When the internet is used for data transmission, the data is divided into a plurality of data packets and sent in batches, and a receiver can integrate the data packets after receiving the data packets to obtain complete data capable of being played.

After the video frame Vn and the audio frame An are subjected to data sub-packaging, a plurality of video data packets and a plurality of audio data packets are obtained, and scene marks are added in each video data packet and each audio data packet. The added scene mark depends on whether the scene change degree before non-packetization exceeds a preset scene change threshold, if the scene of the video frame Vn is significantly changed relative to the video frame Vn-1, the added scene mark in the data packet of the video frame Vn is different from the added scene mark in the data packet of the video frame Vn-1, and if the scene of the video frame Vn is the scene of the audience under the concert scene stage and the scene of the singer singing the song on the stage is taken as An example in the case of the picture of the video frame Vn-1, and the added scene mark in the data packets of the video frame Vn-1 and the audio frame An-1 is Sm + 1. If the picture of the video frame Vn and the picture of the video frame Vn-1 have no significant change, for example, both are scenes in which a singer sings on a stage, the scene marks added in the data packets of the video frame Vn and the audio frame An are the same as those of the video frame Vn-1 and the audio frame An-1, and are both Sm.

It is understood that the data packets may be UDP data packets or TCP data packets. UDP (User Datagram Protocol) is a connectionless Protocol, is used to provide a transaction-oriented simple unreliable information transfer service, can support network applications that need to transmit data between computers, does not guarantee the reliability of a data packet to be transmitted, is suitable for transmitting a small amount of data at a time, and has a high transmission speed. TCP (Transmission Control Protocol) is a connection-oriented Protocol, which includes a special Transmission guarantee mechanism and can guarantee the sending and receiving sequence of data, but needs to occupy system resources, and the Transmission speed is slower than UDP under the same condition.

And step 300, buffering the video data packet and the audio data packet, adjusting the sending priority of the video data packet and the audio data packet according to the scene mark, and sending the video data packet and the audio data packet according to the sending priority.

The sender buffers the data by sending a buffer pool. After the video frame Vn and the audio frame An are respectively subjected to data packetization, the video data packet is sent to a video sending buffer area of a sending buffer pool, and the audio data packet is sent to An audio sending buffer area of the sending buffer pool. The buffer pool is used for buffering data and transmitting the data after accumulating the data to a certain degree.

After the video data packet and the audio data packet are sent to the sending buffer pool, the scene mark of the data packet of the current video frame and the scene mark of the data packet of the previous video frame need to be detected to judge whether the adjacent video frames belong to the same scene. If the scene marks are the same, the current video frame and the previous video frame are the same scene, and if the scene marks are different, the current video frame and the previous video frame are different scenes, so that a method for adjusting the sending priority under the same scene and a method for adjusting the sending priority under different scenes are divided.

It can be understood that, since the audio data packet is also added with the scene mark, whether the scene mark of the data packet of the current audio frame is the same as the scene mark of the data packet of the previous audio frame can be also known.

Under the condition that the current video frame and the previous video frame are in the same scene, the live broadcast picture of the current video frame is basically the same as the live broadcast picture of the previous video frame, for example, the pictures of the two frames are both scenes in which a singer sings on a stage, the main content of live broadcast at the moment is singing voice (audio) of the singer but not the face of the singer and the background (video) of the stage, so that the voice and the sound are more than the live broadcast content contained in the pictures, the sending priority of the audio data packet is adjusted to be higher than the sending priority of the video data packet, namely, when the pictures of the current frame and the previous frame are judged to be in the same scene, the proportion of the video data packet and the audio data packet sent by the sending buffer pool inclines to the audio data packet.

For example, at the initial time, the transmission buffer pool transmits m packets every millisecond, where the ratio of audio and video packets is 1: 9, after the data packet of the first frame video frame and the data packet of the first frame audio frame are sent out, the data packet of the second frame video frame and the data packet of the second frame audio frame enter a sending buffer pool, and when it is detected that the second frame video frame and the first frame video frame have the same scene, the ratio of the audio and video data packets in the sending buffer pool is adjusted to 2: 8, when the receiving party receives the same number of data packets, the audio data packets account for more data packets.

If the network state is good, the receiving side can receive the video data packet and the audio data packet in synchronization even if the transmission duty of the audio data packet is increased. If the network state suddenly deteriorates, for example, the network speed is slow, the network is unstable, and the like, because the sending proportion of the audio data packet is increased, the proportion of the audio data packet received by a receiving party (a watching party) is also increased, and the audio data is buffered and played out in advance compared with the video data, the picture of the network live broadcasting concert window can be jammed, but the singing voice of a singer can be kept continuous as much as possible, so that the receiving party can be ensured to be free from interruption and missing the focus of the concert to the greatest extent.

Under the condition that the current video frame and the previous video frame are different scenes, the difference between the live broadcast picture of the current video frame and the live broadcast picture of the previous video frame is larger, for example, the scene of audience under a stage of a concert of the previous frame is changed into the scene of singing a song on the stage by a singer of the current frame, at the moment, the singer possibly has wonderful action performance while singing the song, so that the picture information is possibly more important than the sound information or the picture and the sound information are equally important. At this time, the sending priority of the video data packet is adjusted to be higher than or equal to the sending priority of the audio data packet, that is, when the picture of the current frame and the picture of the previous frame are judged to be different scenes, the proportion of the video data packet and the audio data packet sent by the sending buffer pool is inclined towards the video data packet, or the proportion is restored to the original proportion.

For example, after the data packet of the nth frame video frame and the data packet of the nth frame audio frame are sent out, the data packet of the (n + 1) th frame video frame and the data packet of the (n + 1) th frame audio frame enter the sending buffer pool, and it is detected that the scenes of the (n + 1) th frame video frame and the nth frame video frame are different, then the proportion of the audio and video data packets in the sending buffer pool is from 2: 8, adjusting to 0.5: 9.5, when the receiving party receives the same number of data packets, the video data packets account for more data packets; or the ratio of the audio and video data packets is restored to the original ratio of 1: and 9, balancing the two emission modes.

If the network state is good, the receiver can receive the video data packet and the audio data packet in synchronization even if the transmission ratio of the video data packet is increased with respect to the initial ratio. If the network state is poor, the transmission occupation ratio of the video data packet is improved, the occupation ratio of the video data packet received by a receiver (a viewer) is also improved, the video data is buffered and played in advance compared with the audio data, the receiver can see that the sound of a network live broadcast window is possibly blocked, but the content of the action performance of a singer on site can be kept continuous as much as possible, and the receiver can be ensured to be free from interruption and missing the key content of the site action performance of the singer to the greatest extent. When the audio and video data packets are restored to the original normal proportion in the sending buffer pool, the audio and video data packets received by the receiving party (the watching party) are relatively balanced, and both the picture and the sound can be jammed, but the sound and the picture are relatively compatible with each other compared with the condition with priority.

As can be seen from the above, the sending priority is actually the bandwidth allocation of data transmission.

In the whole network live broadcast process, no matter the network environment is good, the data packet in the sending buffer pool is sent out according to the priority after the same scene or different scenes are judged according to the scene mark and the priority is regulated.

And 400, buffering the received video data packet and audio data packet by the receiver according to the scene mark so as to realize audio and video data playing.

The receiving side buffers data by receiving the buffer pool. After a sender (live broadcast party) sends out audio and video data packets according to sending priority, a receiver (viewer) receives the audio and video data packets and puts the audio and video data packets into a receiving buffer pool, wherein the audio data packets are put into an audio receiving buffer area, and the video data packets are put into a video receiving buffer area. After the audio and video are buffered to a certain data volume, the data packets are decoded to obtain video frames and audio frames according to scene marks and other characteristics carried by the data packets, and the video frames and the audio frames are played frame by frame, so that a receiver can view and hear a concert, and internet network video live broadcast is realized.

Under the condition of good network state, the audio and video data packets sent by the sender can be quickly received, so that the receiver can buffer and play the audio and video data in time no matter the sending priority of the data packets, and the playing pictures and sound of the receiver are continuous.

Under the condition that the network state is not good, a sender sends out audio and video data packets according to the sending priority, a receiver buffers the audio and video data packets at the same time, under the same scene, the sending priority of the audio data packets is high, the audio data packets containing important data are sent preferentially, and the receiver can buffer the audio data first, so that the audio data can be played first, and the interference generated by network delay is reduced. Under different scenes, the video data packet is high in transmission priority, the video data packet containing important data is transmitted preferentially, and the receiver can buffer the video data first, so that the video data can be played first, and the interference generated by network delay is reduced. When the network is recovered to a good state, for a sender, the sender still sends out data packets according to the sending priority according to the identification of the scene, for a receiver, the buffering speed becomes high due to the good network environment, and the receiver generates sound and picture asynchronization due to the bad network state before, at this time, the part of the audio and video which is played backward due to the non-buffering completion can be cut off, and the audio and video content after synchronous playing is directly started from the playing point before the other party.

In one embodiment, the video frame and the audio frame acquired in step 100 are obtained by encoding audio and video recorded in real time.

The respective video frames and audio frames, which are raw data, are usually encoded from an audio-video file. The audio and video files are recorded in real time on site through recording equipment such as a camera. The recording party starts live broadcasting the scene picture and sound of the concert when the concert starts, the real-time recorded file is immediately encoded to obtain a video frame and an audio frame, and the subsequent scene change degree is judged in the encoding process.

When the scene change of the two frames of images is determined in step 100, the scene change determination of the current frame of image and the previous frame of image is performed in the encoding process.

In one embodiment, the encoding mode of the audio and video is determined according to the network state.

The coding mode comprises parameters such as video code rate, video resolution, video frame rate, audio code rate, audio sampling rate and the like.

The video coding rate is the number of data bits transmitted per unit time during data transmission, i.e. the sampling rate, and has a unit of kbps (kilobits per second). The larger the sampling rate in unit time is, the higher the precision is, the larger the data volume is, and the closer the processed file is to the original file. The audio rate is the same principle as the video rate, which is the sampling rate of the audio.

Resolution is a parameter used to measure how much data is in an image, and is usually expressed as ppi (Pixel per inch), with the higher the resolution, the sharper the video, and the larger the amount of data.

The Frame rate (Frame rate) is a measure for measuring the number of display Frames, and is expressed in FPS (Frames per Second). The higher the frame rate, the smoother and more vivid the picture, and the lower the frame rate, the picture will be stuck.

The sampling rate is how many signal samples are collected per second, and the higher the sampling frequency is, the more sample data are obtained in unit time, the more accurate the representation of the signal waveform is, and the more restored the tone quality and the tone of the audio are, the larger the data volume is.

It can be understood that when the network state is not good, the code rate, resolution, frame rate, sampling rate, etc. adopted during encoding can be set relatively low to compress and reduce the data volume, so that the receiving party can accurately receive the live broadcast content and simultaneously reduce the requirement on the network speed to ensure the normal transmission of the data. When the network state is good, the code rate, resolution, frame rate, sampling rate and the like adopted in the encoding process can be set to be normal or higher, so that the viewing experience of a viewer is improved.

In one embodiment, whether the scene change degree exceeds a preset scene change threshold is judged through an inter-frame difference method and/or a background difference method.

When determining whether there is a scene change in step 100, it may be determined whether there is a scene change in the two video frame images by an inter-frame difference method, or whether there is a scene change in the two video frame images by a background difference method, or it may be determined whether there is a scene change in the two video frame images by an inter-frame difference method and a background difference method at the same time to perform a double-layer determination, thereby further increasing the accuracy of the determination result.

The interframe difference method is to subtract pixel values of two adjacent frames or two images separated by several frames of images in a video stream to obtain an absolute value of the brightness difference of the two frames of images, and determine whether scenes of the two frames have significant changes by judging whether the absolute value is greater than a preset threshold value.

The background difference method is a method for detecting a moving object by comparing a current frame in an image sequence with a background reference model, and comprises the steps of carrying out difference operation on the current acquired image frame and the background image to obtain a gray scale image of a target motion region, carrying out thresholding on the gray scale image, taking an absolute value, and determining whether scenes of two frames have obvious changes or not by judging whether the absolute value is greater than a preset threshold value or not. Wherein the background image is updated according to the currently acquired image frame.

In one embodiment, the buffer size at the receiver is variable, which is adjusted according to the scene markers.

The receiving party comprises a receiving buffer pool, and the sizes of a video receiving buffer area and an audio receiving buffer area of the receiving buffer pool can be set to be invariable or variable. If the video receiving buffer and the audio receiving buffer are set to be variable in size, the basis for the change is the scene mark added to the data packet in step 200.

During the process of buffering and decoding the received audio and video data packets, the scene markers carried by the audio and video data packets are detected. If the scene marks carried by the data packets are detected to be the same, the scene marks represent that the scene of the two frames of video frames is not obviously changed, and at the moment, the buffering capacity of the receiving buffer pool can be properly reduced, which is equivalent to reducing the buffering time of the audio and video data, so that the received audio and video data can be played as soon as possible. Under the condition of bad network condition, the occupied space of the audio data packet is relatively small, and the quality of the video picture can be properly allowed to be poor under the same scene, so that the buffer capacity of the receiving buffer pool is properly reduced, the audio content is played as soon as possible, and the pause phenomenon in the live broadcasting process is reduced.

If the scene marks carried by the data packets are detected to be different, the scene change of the two frames of video frames is shown, at this time, the buffering capacity of the receiving buffer pool can be properly increased, which is equivalent to increasing the buffering time of the audio and video data, so that the received audio and video data is played after being buffered for a period. Under the condition of bad network conditions, because the occupied space of the video data packet is relatively large, the video needs to be buffered for a period of time and then played, so that the buffer capacity of the receiving buffer pool is properly increased, the situation that the video is immediately played after the buffering is finished for only one second is prevented, and the pause in the live broadcasting process is reduced.

In one embodiment, after sending a data packet, the sender waits for an acknowledgement signal fed back by the receiver, deletes the corresponding data packet in the buffer after receiving the acknowledgement signal, and automatically resends the data packet when the sender does not receive the acknowledgement signal for more than a set time.

After a sending buffer pool of a sender sends a data packet, if a receiver receives the data packet, a confirmation signal is fed back to the sender to indicate that the receiver receives the data packet, the sender can delete the data packet in the sending buffer pool, but if the data packet is lost, the sender judges that the sent data packet is lost when the sender does not receive the confirmation signal for more than a set time, and automatically sends the data packet again to try again to enable the receiver to receive the data packet.

A second embodiment of the media low-latency communication method for network video live broadcast provided by the present invention is described in detail below with reference to fig. 2. The embodiment can judge the video content in the audio and video containing the live network broadcast content sent by the sender to adjust the priority of video transmission and audio transmission, so as to preferentially send important data when the network communication is delayed or unstable, reduce the interference generated by network delay and ensure that the receiver can not interrupt and leak the live network broadcast key content to the greatest extent; meanwhile, the time identification is added in the audio and video data to limit the suspension time of the data packet which is suspended to be sent, so that the data packet with low priority is prevented from lagging too much.

As shown in fig. 2, the medium low-latency communication method provided in this embodiment includes the following steps:

step 100, judging whether the scene change degree between the obtained current video frame image and the previous video frame image exceeds a preset scene change threshold value, and adding corresponding time identifiers in each video frame and each audio frame.

In the process of obtaining the video frame and the audio frame by encoding the audio and video, a same scene judgment mechanism is applied to the video frame, namely, the video frame Vn which needs to be sent to a receiving party at present and the image of the previous video frame Vn-1 are identified, the scene change degree of the two adjacent frames of images is obtained, and whether the scene change degree Sn exceeds a preset scene change threshold value St is judged. If the scene change degree Sn exceeds St, it indicates that the video frame Vn and the video frame Vn-1 have a significant change in the scene. If the scene change degree Sn does not exceed St, it indicates that the video frame Vn and the video frame Vn-1 have not significantly changed in the scene.

It should be noted that, the comparison of the scene change degree with the current video frame image may be performed not only with the previous video frame image but also with the previous nth video frame image, N >1, i.e., with the previous second frame image, the previous third frame image, etc., the frame skipping is performed for comparison, the specific value of N may be set according to the actual situation, but is not set too large, otherwise, the situation that the scene change changes back within a short period of time may be missed.

Because the time for acquiring each frame of video frame is different, time identifiers are added to the video frames according to the acquisition time or clock signals, and the time contained in the time identifiers of the video frames is different. The audio frames have the same time identifiers as the synchronized video frames, so the time identifiers of the audio frames have different time instants. In particular, the time identification may be a time stamp. A timestamp (timestamp) is a complete, verifiable piece of data, usually a sequence of characters, that can represent a piece of data that existed before a particular time, uniquely identifying the time of the moment.

It can be understood that, time marks may also be added to the video frames and the audio frames, and then it is determined whether the scene change degree of the two video frame images exceeds the scene change threshold, and the execution sequence of the two frames has no mandatory requirement.

After judging whether the pictures of the video frame Vn and the video frame Vn-1 have significant changes or not, respectively performing data sub-packaging on the video frame Vn and the audio frame An corresponding to the video frame Vn.

After the video frame Vn and the audio frame An are subjected to data sub-packaging, a plurality of video data packets and a plurality of audio data packets are obtained, and scene marks are added in each video data packet and each audio data packet. If the scene of the video frame Vn has a significant change with respect to the video frame Vn-1, the scene tag added to the packet of the video frame Vn is different from the scene tag added to the packet of the video frame Vn-1, and if the scene of the video frame Vn has no significant change with respect to the video frame Vn-1, the same scene tag is added to the packet of the video frame Vn and the packet of the video frame Vn-1. The scene mark of the audio frame is from the video frame, and the scene mark of the video frame and the scene mark of the audio frame at the same time are the same.

It should be noted that, after the video frame and the audio frame carrying the time identifier are packetized, each data packet still carries the time identifier.

After the video frame Vn and the audio frame An are respectively subjected to data packetization, the video data packet is sent to a video sending buffer area of a sending buffer pool, and the audio data packet is sent to An audio sending buffer area of the sending buffer pool.

After the video data packet and the audio data packet are sent to the sending buffer pool, the scene mark of the data packet of the current video frame and the scene mark of the data packet of the previous video frame need to be detected, if the scene marks are the same, the video frame to which the current video data packet belongs and the video frame to which the previous video data packet belongs are the same scene, and if the scene marks are different, the scenes are different.

And under the condition that the current video frame and the previous video frame are in the same scene, adjusting the sending priority of the audio data packet to be higher than that of the video data packet, so that the bandwidth occupied by the audio data packet is properly increased in the bandwidth allocation of the sending data. Because the sending proportion of the audio data packet is improved, the proportion of the audio data packet received by a receiving party (a watching party) is also improved, and the audio data is buffered and played in advance compared with the video data, so that the singing voice of a singer can be kept continuous as much as possible during network live broadcasting, such as live broadcasting of a concert, and the receiving party can be ensured to have no interruption and no leakage of important singing contents to the greatest extent.

And under the condition that the current video frame and the previous video frame are different scenes, adjusting the sending priority of the video data packet to be higher than or equal to the sending priority of the audio data packet, so that the bandwidth occupied by the video data packet is properly increased in the bandwidth allocation of the sending data. Because the sending proportion of the video data packet is improved, the proportion of the video data packet received by a receiver (a viewer) is also improved, and the video data is buffered and played in advance compared with the audio data, so that the pictures of the concert can be kept as continuous as possible, and the receiver can be ensured to have no interruption and no leakage of important singing contents to the greatest extent.

And 400, buffering the received video data packet and audio data packet by the receiver according to the scene mark to realize audio and video data playing, and limiting the time length of the data with high sending priority to be ahead of the time length of the data with low sending priority to be played according to the time identifier carried by each data packet and a preset time length threshold.

The method comprises the steps that a sender sends data which are buffered in a sending buffer pool, a receiver receives a video data packet and an audio data packet sent by the sender, identifies and buffers the data packets according to scene marks, packet headers and other data carried by the data packets, and simultaneously, after the receiving buffer pool receives the data packets, the scene marks of the data packets of the current video frame and the scene marks of the data packets of the previous video frame are detected to judge whether adjacent video frames belong to the same scene. If the scene marks are the same, the current video frame and the previous video frame are the same scene, and if the scene marks are different, the current video frame and the previous video frame are different scenes. After the data buffering is finished, the time difference of audio and video playing needs to be judged, and the situation that the audio or the video is too much ahead of the other party is prevented. And after the data buffering is finished, decoding and playing the audio and video contents to realize the purpose of network video live broadcast.

When a sender sends a data packet with high priority preferentially, for example, when the sender sends an audio data packet with continuous priority, if the network condition is not good, the audio played by a receiver will lead the video, and if the network condition is in a bad state for a long time, in order to avoid that the played audio leads the video too much, before playing the audio, the time identifier of the audio data packet needs to be detected, if the time identifier of the audio data packet to be played in the receiving buffer pool is Tn, and the time identifier of the video data packet is also Tn, it indicates that the audio and video data packets are played synchronously, and there is no time difference between the audio and the video.

If the time stamp of the audio data packet to be played in the receiving buffer pool is Tn +10 and the time stamp of the video data packet is also Tn, the audio is advanced by the time difference of 10 units of the video, which is equivalent to being advanced by 10 frames. At this time, the advanced frame number needs to be compared with a preset time threshold, if the time threshold is 20, the advanced frame number does not exceed the time threshold, so that the advanced audio is allowed to be played, and the video can only be played in a stuck state due to the unsmooth transmission of the data packet in the network. If the time length threshold is 10 frames, the number of advanced frames reaches the limit of the time length threshold, and at this time, the audio cannot be continuously advanced and played without playing the video, that is, the time difference of the audio and video asynchronism cannot be enlarged, but the content of the video behind the audio part needs to be abandoned, and when the video data is buffered to Tn +10, the audio and the video are both directly played from the position with the time mark Tn + 10.

If the receiver detects that the scene mark of the received audio/video data packet changes during the audio advance playing process, it indicates that the scene of the frame is different from the previous frame, at this time, the video playing may catch up with a part of the audio advanced part, and the scene after the video playing does not change any more, so the audio may continue to increase the length ahead of the video.

If the network state is changed from bad to good and the audio and video data packets can be normally received and smoothly buffered when the audio is played before the video and the time length threshold is not exceeded, the content of the video behind the audio part can be abandoned and the audio and video data with the same time identification can be synchronously played, and the time difference between the audio and the video is zero.

In one embodiment, the duration threshold is variable, which is adjusted according to the scene markers.

The time length threshold used for judging that the advance amount of the data packet played in advance exceeds the set value is adjustable, and the adjustment is specifically carried out according to the scene mark of the received data packet.

If the scene mark of the current video frame data packet received by the receiving buffer pool is the same as the scene mark of the previous video frame data packet, that is, no scene change occurs, at this time, the sender will send more audio data, and the time length threshold value can be properly increased, so that the receiving buffer pool allows more audio to be played in advance when the complete video frame is not received in the late stage.

If the scene mark of the current video frame data packet received by the receiving buffer pool is different from the scene mark of the previous video frame data packet, namely the scene change occurs, at this moment, the sender can send more video data, and the time length threshold value can be properly reduced, so that the audio frequency does not exceed the video frequency too much, and the audio and the video can be played synchronously as much as possible.

The respective video frames and audio frames, which are raw data, are usually encoded from an audio-video file. The audio and video files are recorded in real time on site through recording equipment such as a camera.

The coding mode comprises parameters such as video code rate, video resolution, video frame rate, audio code rate, audio sampling rate and the like. It can be understood that when the network state is not good, the code rate, resolution, frame rate, sampling rate, etc. adopted during encoding can be set relatively low to compress and reduce the data volume, so that the receiving party can accurately receive the network live broadcast content and simultaneously reduce the requirement on the network speed to ensure the normal transmission of the data. When the network state is good, the code rate, resolution, frame rate, sampling rate and the like adopted in the encoding process can be set to be normal or higher, so that the viewing experience of a viewer is improved.

During the process of buffering and decoding the received audio and video data packets, the scene markers carried by the audio and video data packets are detected. If the scene marks carried by the data packets are detected to be the same, the scene marks indicate that the scene of the two frames of video frames is not obviously changed, at the moment, the buffer capacity of the receiving buffer pool can be properly reduced, under the condition that the network condition is not good, the audio content can be played as soon as possible by properly reducing the buffer capacity of the receiving buffer pool, and the pause phenomenon in the live broadcast process is reduced.

If the scene marks carried by the data packets are detected to be different, the scene marks of the two frames of video frames are obviously changed, the buffer capacity of the receiving buffer pool can be properly increased, and under the condition of bad network conditions, the buffer capacity of the receiving buffer pool is properly increased, so that the situation that the video which is buffered completely and is played for less than one second can be prevented from being played immediately, and the pause feeling in the live broadcasting process is reduced.

A first embodiment of the media low-latency communication system for webcast provided by the present invention is described in detail below with reference to fig. 3. The system is mainly applied to network video live broadcast, and can judge the video content in the audio and video containing the network live broadcast content sent by a sender to adjust the priority of video transmission and audio transmission so as to send important data preferentially when network communication is delayed or unstable, reduce interference generated by network delay and ensure that a receiver can not interrupt and leak the key content of the network live broadcast to the greatest extent.

As shown in fig. 3, the media low-latency communication system provided by this embodiment includes:

and the scene judging module is used for judging whether the acquired scene change degree between the current video frame image and the previous video frame image exceeds a preset scene change threshold value or not. And if the scene change threshold is exceeded, the two frames of scenes are judged to be different, and if the scene change threshold is not exceeded, the two frames of scenes are judged to be the same.

And the data sub-packaging module is used for respectively performing data sub-packaging on the current video frame judged by the scene judgment module and the acquired corresponding audio frame.

And the scene identification module is connected with the scene judgment module and is used for adding corresponding scene marks in each video data packet and each audio data packet according to the judgment result of whether the scene change degree exceeds a preset scene change threshold value after data packetization. If the scene judging module judges that the scenes of the two frames are the same, the scene marks added in the data packet are the same, and if the scene judging module judges that the scenes of the two frames are different, the scene marks added in the data packet are different.

And the sending buffer module is used for buffering the video data packet and the audio data packet after the corresponding scene mark is added, adjusting the sending priority of the video data packet and the audio data packet according to the scene mark, and sending the video data packet and the audio data packet according to the sending priority. The transmission buffer module includes a transmission buffer pool for buffering the video data packets and the audio data packets. If the scene marks of the two frames are the same, the priority of the audio data is judged to be high, the proportion of the audio data in the sending bandwidth of the sending buffer module is increased at the moment, and the sending quantity of the audio data packets is relatively increased when the data packets are sent. If the scene marks of the two frames are different, the priority of the video data is judged to be high, the ratio of the video data in the transmission bandwidth of the transmission buffer module is increased at the moment, and the transmission quantity of the audio data packet is relatively increased when the data packet is transmitted.

And the receiving buffer module is connected with the playing end of the receiving party and used for receiving the video data packet and the audio data packet sent by the sending buffer module and buffering the received video data packet and the received audio data packet according to the scene mark so as to realize the playing of the audio and video data. The receive buffer module includes a receive buffer pool for buffering the video data packets and the audio data packets. If the priority of the audio data packet is high, the receiving buffer module completes the buffering in advance and plays the audio in advance, so that the audio is preferentially played when the network condition is poor and the scene is the same. If the priority of the video data packet is high, the receiving buffer module finishes buffering in advance and plays the video in advance, so that the video is played preferentially when the network condition is poor and different scenes exist.

In fig. 3, the upper part of the dotted line is the transmission side, and the lower part of the dotted line is the reception side.

In one embodiment, the system further comprises: and the coding and recording module is connected with the video recording equipment and is used for recording the audio and video in real time and coding the audio and video to obtain the acquired video frame and the acquired audio frame.

In one embodiment, the scene determination module includes: the device comprises a first judging unit and/or a second judging unit, wherein the first judging unit is used for judging whether the scene change degree exceeds a preset scene change threshold value through an interframe difference method, and the second judging unit is used for judging whether the scene change degree exceeds the preset scene change threshold value through a background difference method.

In one embodiment, the transmit buffer module comprises: and the first adjusting unit is used for adjusting the buffer size of the sending buffer module according to the scene mark. The first adjusting unit is connected with the sending buffer pool. When the two frames of video frames are in the same scene, the first adjusting unit can appropriately adjust the buffer size of the sending buffer module to be small, and when the two frames of video frames are in different scenes, the first adjusting unit can appropriately adjust the buffer size of the sending buffer module to be large.

A second embodiment of the media low-latency communication system for webcast provided by the present invention is described in detail below with reference to fig. 4. The system can judge the video content in the audio and video containing the live network broadcast content sent by the sender to adjust the priority of video transmission and audio transmission, so as to send important data preferentially when the network communication is delayed or unstable, reduce the interference generated by network delay, and ensure that the receiver can not interrupt and leak the live network broadcast key content to the greatest extent. Meanwhile, the time identification is added in the audio and video data to limit the suspension time of the data packet which is suspended to be sent, so that the data packet with low priority is prevented from lagging too much.

As shown in fig. 4, the media low-latency communication system provided by this embodiment includes:

And the time identification module is used for adding corresponding time identifications in each video frame and each audio frame before data subpackaging, and the time identifications can adopt time stamps.

And the receiving buffer module is connected with the playing end of the receiving party and used for receiving the video data packet and the audio data packet sent by the sending buffer module and buffering the received video data packet and the received audio data packet according to the scene mark so as to realize the playing of the audio and video data. The receive buffer module includes a receive buffer pool for buffering the video data packets and the audio data packets.

And the time difference adjusting module is connected with the receiving buffer module and is used for limiting the data with high sending priority to be ahead of the playing time of the data with low sending priority according to the time identifier carried by each data packet and a preset time threshold after the receiving buffer module buffers the data packet, so as to prevent the data packet with low priority from lagging too much.

In fig. 4, the upper part of the dotted line is the transmission side, and the lower part of the dotted line is the reception side.

In one embodiment, the time duration threshold is variable, and the time difference adjustment module adjusts the time duration threshold according to the scene markers. When the two frames of video frames in the receiving buffer module are in the same scene, the time difference adjusting module can properly adjust the time length threshold value to be larger, and when the two frames of video frames in the receiving buffer module are in different scenes, the time difference adjusting module can properly adjust the time length threshold value to be smaller.

The specific settings of the scene determination module, the data packetization module, the scene identification module, the sending buffer module, the receiving buffer module, the encoding recording module, and the like in this embodiment can refer to the structural settings described in the first embodiment of the system, and are not described in detail any more.

The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims

1. A media low-delay communication method for network video live broadcast, comprising:

respectively performing data sub-packaging on the current video frame and the acquired corresponding audio frame, and adding corresponding scene marks in each video data packet and each audio data packet according to a judgment result of whether the scene change degree exceeds a preset scene change threshold, wherein when the scene change degree exceeds the preset scene change threshold, the scene marks added in the data packets of the current video frame and the audio frame are different from the scene marks added in the data packets of the previous video frame and the audio frame, and when the scene change degree does not exceed the preset scene change threshold, the scene marks added in the data packets of the current video frame and the audio frame are the same as the scene marks added in the data packets of the previous video frame and the audio frame;

buffering the video data packets and the audio data packets, adjusting transmission priorities of the video data packets and the audio data packets according to the scene tags, and transmitting the video data packets and the audio data packets according to the transmission priorities, wherein the transmission priority of the audio data packets is adjusted to be higher than the transmission priority of the video data packets when the scene tags of a current video frame and a previous video frame are the same, and the transmission priority of the video data packets is adjusted to be higher than the transmission priority of the audio data packets when the scene tags of the current video frame and the previous video frame are different;

2. The low-latency communication method according to claim 1, wherein the acquired video frames and the acquired audio frames are encoded by real-time recorded audio and video.

3. The low latency communication method according to claim 1, wherein whether the scene change degree exceeds the preset scene change threshold is determined by an inter-frame difference method and/or a background difference method.

4. A low latency communication method as claimed in claim 1, wherein a buffer size of a receiving side is variable, which is adjusted according to the scene mark.

5. A low-latency communication method according to any one of claims 1 to 4, wherein before the separately packetizing the current video frame and the obtained corresponding audio frame, further comprising: adding corresponding time marks in each video frame and each audio frame; and the number of the first and second electrodes,

6. A low latency communication method as claimed in claim 5, wherein the duration threshold is variable, adjusted in dependence on the scene markers.

7. A media low-delay communication system for webcast, comprising:

a scene identification module, configured to add, after the data packetization, a corresponding scene flag in each of the video data packet and the audio data packet according to a determination result of whether the scene change degree exceeds a preset scene change threshold, where when the scene change degree exceeds the preset scene change threshold, the scene flag added in the data packet of the current video frame and the audio frame is different from the scene flag added in the data packet of the previous video frame and the audio frame, and when the scene change degree does not exceed the preset scene change threshold, the scene flag added in the data packet of the current video frame and the audio frame is the same as the scene flag added in the data packet of the previous video frame and the audio frame;

a sending buffer module, configured to buffer the video data packet and the audio data packet after the corresponding scene tag is added, adjust sending priorities of the video data packet and the audio data packet according to the scene tag, and send the video data packet and the audio data packet according to the sending priorities, where the sending priority of the audio data packet is adjusted to be higher than the sending priority of the video data packet when the scene tags of the current video frame and the previous video frame are the same, and the sending priority of the video data packet is adjusted to be higher than the sending priority of the audio data packet when the scene tags of the current video frame and the previous video frame are different;

8. A low latency communication system as recited in claim 7, wherein the system further comprises:

9. The low-latency communication system of claim 7, wherein the scene decision module comprises: the device comprises a first judging unit and/or a second judging unit, wherein the first judging unit is used for judging whether the scene change degree exceeds a preset scene change threshold value through an inter-frame difference method, and the second judging unit is used for judging whether the scene change degree exceeds the preset scene change threshold value through a background difference method.

10. A low-latency communication system as recited in claim 7, wherein the transmit buffer module comprises: and the first adjusting unit is used for adjusting the buffer size of the sending buffer module according to the scene mark.