CN115002484A - Video encoding and decoding method for reducing time delay, video conference system and storage medium - Google Patents

Video encoding and decoding method for reducing time delay, video conference system and storage medium

Info

Publication number
CN115002484A
Authority
CN
China
Prior art keywords
prediction
frame
server
coding
pixel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210467485.XA
Other languages
Chinese (zh)
Inventor
吴剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Youmi Technology Shenzhen Co ltd
Original Assignee
Youmi Technology Shenzhen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Youmi Technology Shenzhen Co ltd filed Critical Youmi Technology Shenzhen Co ltd
Priority to CN202210467485.XA
Publication of CN115002484A
Legal status: Pending


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/593Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving spatial prediction techniques
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/174Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a slice, e.g. a line of blocks or a group of blocks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/177Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a group of pictures [GOP]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/15Conference systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a video coding and decoding method for reducing time delay, a video conference system and a readable storage medium. The H.264 coding standard is adopted to divide image frames into I, B and P frames; each I frame is divided into slices and coded slice by slice. Each slice is coded and compressed using H.264 intra-frame hierarchical prediction, i.e., during predictive coding the prediction result of the previous level serves as the prediction reference for the next level. After each slice image has been captured it is encoded and transmitted, and the slices are decoded as they arrive at the decoder. Reducing the average prediction distance effectively reduces the data size compared with the standard prediction method, improves the prediction performance of H.264, and allows predictive coding to be completed in a shorter time, thereby reducing delay.

Description

Video encoding and decoding method for reducing time delay, video conference system and storage medium
Technical Field
The present application relates to the technical field of video conference systems, and in particular, to a video encoding and decoding method, a video conference system, and a storage medium for reducing latency.
Background
A video conference system allows individuals or groups in two or more different places to distribute data such as static and dynamic images, voice, text and pictures to the computers of the various users over existing telecommunication transmission media, so that geographically dispersed users can exchange information through graphics, voice and other means as if they shared one place, enhancing both parties' comprehension of the content. At present, video conferencing is gradually developing toward multi-network cooperation and high definition. Relying on the Internet, video conferencing enables efficient, high-definition remote meetings and office work; it has unique advantages in improving users' communication efficiency, reducing enterprises' travel costs and improving management effectiveness, has already partially replaced business trips, and has become the newest mode of remote work. A video conference system comprises an MCU multipoint control unit (video conference server), conference room terminals, PC desktop terminals, a telephone access gateway (PSTN gateway), gateways and so on. The various terminals are connected to the MCU for centralized exchange, forming a video conference network.
According to the implementation technology, current video conference systems can be divided into three types. First, hardware-based systems: the effect is ideal and the quality is stable and reliable, but the cost of the hardware equipment is high and its use is not flexible enough, so in the past only a few large enterprises could afford them. Second, software-based systems: compared with hardware systems they are cheaper and more flexible and extensible, but such conference systems are built on a C/S (Client/Server) architecture; before use, the network and server must be configured, a client must be downloaded and installed, and the client must be upgraded from time to time. Third, Web-based systems: these adopt a B/S (Browser/Server) architecture, so users do not need to download and install a client. As long as the computer or mobile terminal in use has a browser, the user can open the corresponding web page and hold a video conference in the corresponding room, with no client to upgrade or maintain.
In one application scenario of the video conference system, the video subject needs to perform sign language interpretation through body movements, and these movements are rich and continuous, so the video conference system is required to have ultra-low delay and stable fluency. Although there is a relatively rich choice of video conference systems today, the delay of existing open-source video conference systems is basically between 280 and 500 ms, which makes it difficult to meet this scenario's requirements.
Disclosure of Invention
The invention mainly aims to provide a video coding and decoding method for reducing time delay, a video conference system and a storage medium, so as to solve the technical problem of the high delay of conventional video conference systems.
In order to solve the above technical problem, the present invention provides a video encoding and decoding method for reducing delay, wherein the method comprises:
dividing image frames into I, B and P frames according to the H.264 coding standard, dividing each I frame into individual slices, and coding each I frame slice by slice;
coding and compressing the slices using H.264 intra-frame hierarchical prediction, i.e., during predictive coding the prediction result of the previous level serves as the prediction reference for the next level;
after each slice image has been captured, encoding and transmitting it, and decoding the slices as they arrive at the decoder.
Further, the dividing each I frame into individual slices and coding each I frame slice by slice includes:
subjecting only one slice in each frame to predictive coding compression with an intra-frame prediction algorithm, while the remaining slices are compressed with an inter-frame prediction algorithm.
Further, the coding and compressing of the slices using H.264 intra-frame hierarchical prediction, i.e., using the prediction result of the previous level as the prediction reference for the next level during predictive coding, includes:
performing intra-frame prediction of the prediction pixels at fixed positions from the reference pixels; after predictive coding of a previous-level prediction pixel is completed, that pixel becomes a reference pixel and provides a reference for the adjacent un-coded pixels, which are then intra-predicted.
Further, the coding and compressing of the slices using H.264 intra-frame hierarchical prediction, i.e., using the prediction result of the previous level as the prediction reference for the next level during predictive coding, includes:
if the currently coded basic block is 2N×2N (N = 2, 4, 8, 16, …), let Di,j denote the distance between the pixel located at (i, j) and its reference pixel;
Drow_j = Σ_{i=1..2N} Di,j
Dtotal = Σ_{j=1..2N} Drow_j
where Drow denotes the prediction distance of each row of pixels and Dtotal denotes the total prediction distance of the data block; the total prediction distance Dtotal equals the sum of the prediction distances Drow of all rows.
Further, the coding and compressing of the slices using H.264 intra-frame hierarchical prediction, i.e., using the prediction result of the previous level as the prediction reference for the next level during predictive coding, includes:
defining an average prediction distance Dpred, obtained by adding up the distances between all current pixels and their reference pixels and then taking the average.
The prediction pixels are divided into two levels: one pixel is selected every other row and every other column as a first-level prediction pixel. The total first-level prediction distance equals the sum of the first-level pixel distances over each row that contains first-level pixels:
Dlevel_1 = Σ_{j=1..N} Drow_j
The prediction distances of the first-level prediction pixels in each such row are 1, 3, 5, …, 2N−1 respectively, so the prediction distance Drow_j of that row is the sum over these points:
Drow_j = 1 + 3 + 5 + … + (2N−1) = N²
The second-level prediction distance Dlevel_2 equals the sum of the prediction distances of the 3 second-level prediction points corresponding to each of the 2N first-level prediction points, expressed as:
Dlevel_2 = 2N × 3 = 6N
Therefore the total prediction distance Dtotal_b of the entire prediction pixel block under this hierarchical prediction method is:
Dtotal_b = Dlevel_1 + Dlevel_2 = N³ + 6N
and the average prediction distance Dpred_b is:
Dpred_b = Dtotal_b / (2N × 2N) = (N³ + 6N) / (4N²)
the reduction of the average prediction distance can effectively reduce the data size of the standard prediction method, improve the prediction performance of H.264, and enable the prediction coding to be completed in shorter time so as to realize the reduction of delay.
Based on the same inventive concept, in another aspect the present invention provides a video conference system, the system at least comprising:
at least one client, where a P2P full-connection architecture is adopted among multiple clients; and a server connected to and interacting with each client. A computer program is stored and run on the server and the clients, and when executed by the server and the clients it implements the following steps:
dividing image frames into I, B and P frames according to the H.264 coding standard, dividing each I frame into a plurality of slices, and coding each I frame slice by slice;
coding and compressing the slices using H.264 intra-frame hierarchical prediction, i.e., during predictive coding the prediction result of the previous level serves as the prediction reference for the next level;
after each slice image has been captured, encoding and transmitting it, and decoding the slices as they arrive at the decoder.
Specifically, the server side comprises a web server and a signaling server. The web server responds to the client's page, registration, login and other functional requests in the system, and the signaling server provides the information exchange function between clients.
Further, the server further comprises a NAT traversal server for implementing data interaction between the web server and the signaling server across different local area networks through the internet.
Optionally, the web server is an SFU server, and when the number of the clients reaches a preset threshold, an SFU architecture is adopted between the plurality of clients and the SFU server.
Based on the same inventive concept, in another aspect of the present invention, a computer-readable storage medium is provided, on which a reduced latency video codec program is stored, which when executed by a processor implements the steps of the reduced latency video codec method as claimed in the appended claims.
The technical solution of the invention has the following beneficial effects:
The video coding and decoding method, video conference system and storage medium for reducing time delay apply a method of splitting frames before coding to the original coding and decoding algorithm, so that coding no longer starts only after a whole frame of image has been captured, as in the standard algorithm; instead, each frame is divided into individual slices and then coded slice by slice. Since the data volume of a slice is much smaller than that of a whole frame, the large difference between I-frame and P-frame data volumes that arises when coding in units of frames does not occur for slices. The data volume can thus be kept stable within a relatively small range, and the encoding buffer can be reduced accordingly. Each slice image is transmitted after being coded, and once the slices reach the decoder, the decoder only needs to buffer a few slice images before decoding, rather than buffering several whole frames as decoding references, as in the standard algorithm. After coding is split into slices, the buffer size can be reduced by more than half compared with the original algorithm, effectively reducing the delay caused by coding and decoding buffering.
Drawings
Fig. 1 is a schematic hardware configuration diagram of a mobile terminal implementing various embodiments of the present invention;
fig. 2 is a diagram of a communication network architecture according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for reducing latency in video encoding and decoding according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a GOP sequence structure of seven frame pictures provided by the embodiment of the invention;
FIG. 5 is a diagram of an encoding framework for a slice-partitioned frame according to an embodiment of the present invention;
FIG. 6 is a diagram of a standard H.264 intra-frame horizontal predictive coding structure according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a hierarchical predictive coding structure according to an embodiment of the present invention;
fig. 8 is a block diagram of a hardware structure of a first video codec system with reduced delay according to an embodiment of the present invention;
fig. 9 is a block diagram of a second hardware architecture of a video codec system with reduced latency according to an embodiment of the present invention;
fig. 10 is a diagram of a Mesh network topology provided by an embodiment of the present invention;
FIG. 11 is a diagram of an SFU network topology provided by an embodiment of the present invention;
the implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
In the following description, suffixes such as "module", "component", or "unit" used to denote elements are used only for facilitating the explanation of the present invention, and have no specific meaning in itself. Thus, "module", "component" or "unit" may be used mixedly.
The terminal may be implemented in various forms, for example, the terminal described in the present invention may include a mobile terminal such as a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a Personal Digital Assistant (PDA), a Portable Media Player (PMP), a navigation device, a wearable device, a smart band, a pedometer, and the like, and a fixed terminal such as a digital TV, a desktop computer, and the like.
The following description will be given by way of example of a mobile terminal, and it will be understood by those skilled in the art that the construction according to the embodiment of the present invention can be applied to a fixed type terminal, in addition to elements particularly used for mobile purposes.
Referring to fig. 1, which is a schematic diagram of a hardware structure of a mobile terminal for implementing various embodiments of the present invention, the mobile terminal 100 may include: RF (radio frequency) unit 101, WiFi module 102, audio output unit 103, a/V (audio/video) input unit 104, sensor 105, display unit 106, user input unit 107, interface unit 108, memory 109, processor 110, and power supply 111. Those skilled in the art will appreciate that the mobile terminal architecture shown in fig. 1 is not intended to be limiting of mobile terminals, which may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The following describes each component of the mobile terminal in detail with reference to fig. 1:
the radio frequency unit 101 may be configured to receive and transmit signals during information transmission and reception or during a call, and specifically, receive downlink information of a base station and then process the downlink information to the processor 110; in addition, the uplink data is transmitted to the base station. Typically, radio frequency unit 101 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like. In addition, the radio frequency unit 101 can also communicate with the network and other devices through quick payment of public transportation fees. The above-mentioned rapid payment of public transportation cost may use any communication standard or protocol, including but not limited to GSM (global system for mobile communications), GPRS (general packet radio service), CDMA2000(code division multiple access 2000), WCDMA (wideband code division multiple access), TD-SCDMA (time division-synchronous code division multiple access), FDD-LTE (frequency division duplex-long term evolution), TDD-LTE (time division duplex-long term evolution), and the like.
WiFi belongs to short-distance wireless transmission technology, and the mobile terminal can help a user to receive and send emails, browse pages, access streaming media and the like through the WiFi module 102, and provides wireless broadband internet access for the user. Although fig. 1 shows the WiFi module 102, it is understood that it does not belong to the essential constitution of the mobile terminal, and may be omitted entirely as needed within the scope not changing the essence of the invention.
The audio output unit 103 may convert audio data received by the radio frequency unit 101 or the WiFi module 102 or stored in the memory 109 into an audio signal and output as sound when the mobile terminal 100 is in a call signal reception mode, a call mode, a recording mode, a voice recognition mode, a broadcast reception mode, or the like. Also, the audio output unit 103 may also provide audio output related to a specific function performed by the mobile terminal 100 (e.g., a call signal reception sound, a message reception sound, etc.). The audio output unit 103 may include a speaker, a buzzer, and the like.
The A/V input unit 104 is used to receive audio or video signals. The A/V input unit 104 may include a graphics processor (GPU) 1041 and a microphone 1042; the graphics processor 1041 processes image data of still pictures or video obtained by an image capture device (e.g., a camera) in a video capture mode or an image capture mode. The processed image frames may be displayed on the display unit 106. The image frames processed by the graphics processor 1041 may be stored in the memory 109 (or other storage medium) or transmitted via the radio frequency unit 101 or the WiFi module 102. The microphone 1042 may receive sound (audio data) in a phone call mode, a recording mode, a voice recognition mode, or the like, and can process such sound into audio data. In the phone call mode, the processed audio (voice) data may be converted into a format transmittable to a mobile communication base station via the radio frequency unit 101 and output. The microphone 1042 may implement various types of noise cancellation (or suppression) algorithms to cancel (or suppress) noise or interference generated in the course of receiving and transmitting audio signals.
The mobile terminal 100 also includes at least one sensor 105, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor includes an ambient light sensor that can adjust the brightness of the display panel 1061 according to the brightness of ambient light, and a proximity sensor that can turn off the display panel 1061 and/or the backlight when the mobile terminal 100 moves to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications of recognizing the posture of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the mobile phone, the description is omitted here.
The display unit 106 is used to display information input by a user or information provided to the user. The display unit 106 may include a display panel 1061, and the display panel 1061 may be configured in the form of a Liquid Crystal Display (LCD), an organic light-emitting diode (OLED), or the like.
The user input unit 107 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the mobile terminal. Specifically, the user input unit 107 may include a touch panel 1071 and other input devices 1072. The touch panel 1071, also referred to as a touch screen, may collect a touch operation performed by a user on or near the touch panel 1071 (e.g., an operation performed by the user on or near the touch panel 1071 using a finger, a stylus, or any other suitable object or accessory), and drive a corresponding connection device according to a predetermined program. The touch panel 1071 may include two parts of a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 110, and can receive and execute commands sent by the processor 110. In addition, the touch panel 1071 may be implemented in various types, such as a resistive type, a capacitive type, an infrared ray, and a surface acoustic wave. The user input unit 107 may include other input devices 1072 in addition to the touch panel 1071. In particular, other input devices 1072 may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like, without limitation.
Further, the touch panel 1071 may cover the display panel 1061, and when the touch panel 1071 detects a touch operation thereon or nearby, the touch panel 1071 transmits the touch operation to the processor 110 to determine the type of the touch event, and then the processor 110 provides a corresponding visual output on the display panel 1061 according to the type of the touch event. Although the touch panel 1071 and the display panel 1061 are shown in fig. 1 as two separate components to implement the input and output functions of the mobile terminal, in some embodiments, the touch panel 1071 and the display panel 1061 may be integrated to implement the input and output functions of the mobile terminal, and is not limited herein.
The interface unit 108 serves as an interface through which at least one external device is connected to the mobile terminal 100. For example, the external device may include a wired or wireless headset port, an external power supply (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The interface unit 108 may be used to receive input (e.g., data information, power, etc.) from external devices and transmit the received input to one or more elements within the mobile terminal 100 or may be used to transmit data between the mobile terminal 100 and external devices.
The memory 109 may be used to store software programs as well as various data. The memory 109 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. Further, the memory 109 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The processor 110 is the control center of the mobile terminal. It connects the various parts of the entire mobile terminal using various interfaces and lines, and performs the various functions of the mobile terminal and processes data by running or executing software programs and/or modules stored in the memory 109 and calling data stored in the memory 109, thereby monitoring the mobile terminal as a whole. The processor 110 may include one or more processing units; preferably, the processor 110 may integrate an application processor, which mainly handles the operating system, user interface, application programs and the like, and a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor may also not be integrated into the processor 110.
The mobile terminal 100 may further include a power supply 111 (e.g., a battery) for supplying power to various components, and preferably, the power supply 111 may be logically connected to the processor 110 via a power management system, so as to manage charging, discharging, and power consumption management functions via the power management system.
Although not shown in fig. 1, the mobile terminal 100 may further include a bluetooth module or the like, which is not described in detail herein.
In order to facilitate understanding of the embodiments of the present invention, a communication network system on which the mobile terminal of the present invention is based is described below.
Referring to fig. 2, fig. 2 is an architecture diagram of a communication network system according to an embodiment of the present invention, where the communication network system is an LTE system of a universal mobile telecommunications technology, and the LTE system includes a UE (user equipment) 201, an E-UTRAN (evolved UMTS terrestrial radio access network) 202, an EPC (evolved packet core) 203, and an IP service 204 of an operator, which are in communication connection in sequence.
Specifically, the UE201 may be the terminal 100 described above, and is not described herein again.
The E-UTRAN202 includes eNodeB2021 and other eNodeBs 2022, among others. Among them, the eNodeB2021 may be connected with other eNodeB2022 through backhaul (e.g., X2 interface), the eNodeB2021 is connected to the EPC203, and the eNodeB2021 may provide the UE201 access to the EPC 203.
The EPC 203 may include an MME (mobility management entity) 2031, an HSS (home subscriber server) 2032, other MMEs 2033, an SGW (serving gateway) 2034, a PGW (PDN gateway, packet data network gateway) 2035, a PCRF (policy and charging rules function entity) 2036, and the like. The MME 2031 is a control node that handles signaling between the UE 201 and the EPC 203 and provides bearer and connection management. The HSS 2032 provides registers for managing functions such as the home location register (not shown) and holds subscriber-specific information about service characteristics, data rates, and the like. All user data may be sent through the SGW 2034; the PGW 2035 may provide IP address assignment for the UE 201 and other functions; and the PCRF 2036 is the policy and charging control decision point for traffic data flows and IP bearer resources, selecting and providing available policy and charging control decisions for the policy and charging enforcement function (not shown).
IP services 204 may include the internet, intranets, IMS (IP multimedia subsystem), or other IP services, among others.
Although the LTE system is described above as an example, it should be understood by those skilled in the art that the present invention is not limited to the LTE system, but may also be applied to other wireless communication systems such as GSM, CDMA2000, WCDMA, TD-SCDMA, and future new network systems.
The embodiments of the method of the present invention are proposed based on the hardware structure of the mobile terminal 100 and the communication network system.
Example 1
As shown in fig. 3, an embodiment of the present invention provides a video encoding and decoding method for reducing latency, where the method includes:
s101, dividing an image frame into I, B, P frames by adopting an H.264 coding standard, dividing each I frame into individual strips, and coding each I frame by taking the strips as a unit;
H.264, the most mainstream video coding and decoding standard, has at least the following advantages: (1) H.264 performs block-level coding instead of frame-level coding. (2) The pixel blocks within a frame undergo spatial prediction, transform, quantization and entropy coding from their original state, achieving compression of spatially redundant information. (3) Using motion prediction and motion compensation algorithms, adjacent frames can exist in the buffer at the same time, so block data within a frame can provide references for coding subsequent blocks, and coding only needs to record the differences between corresponding blocks in adjacent frames, greatly reducing the amount of coded data. (4) Some residual blocks may exist in a coded video frame, and H.264 codes these residual blocks so as to better ensure the overall accuracy of the coding information. However, to obtain a high compression ratio and smooth playback, H.264 sets up a buffer holding several frames of images, which also introduces a certain delay in real-time audio/video communication; to further reduce latency, the delay generated at this stage should be minimized.
The delay generated in the encoding and decoding stages is the main source of delay in a video conference system. Since current computers generally perform well, the delay caused by the encoding and decoding operations themselves is very small; the main delay at this stage is caused by the buffers. The buffers are divided into an encoding buffer and a decoding buffer. The encoding/decoding buffer temporarily stores a certain amount of data and effectively smooths fluctuations in the current bit rate. The size of the buffer directly determines the range of bit-rate fluctuation it can absorb: the larger the buffer, the more bit-rate fluctuation it can withstand and the better its fault tolerance, but the larger the buffer, the greater the delay it causes.
A piece of video is actually made up of individual image frames. When each frame of image is converted into data, there is a huge amount of redundancy in the information. Classified by type, this redundancy is generally divided into spatial redundancy and temporal redundancy. To compress these two kinds of redundant information as much as possible, H.264 coding first divides the image frames into I, B and P frames.
I frame: also called an intra-frame (Intra Picture). Only spatial redundancy is compressed, so after encoding the I frame still carries all the information of the frame image; when decoding and reconstructing it, no other frame needs to be referenced, which is why the encoded I frame has the largest data size of the several frame types. Because the existence of I frames forces H.264 to set up a larger buffer, the H.264 optimization in the technical solution of the present application aims to solve the delay problem caused by the buffer being too large due to I frames.
P frame: also called a forward-predicted frame (Predictive frame). During encoding it refers not only to data in the current frame but also to previously encoded I-frame or B-frame data, achieving both spatial and temporal redundancy compression. Adjacent frames typically contain a large amount of redundant data, so P frames compress more.
B frame: also called a bi-directionally predicted frame (Bi-directionally interpolated prediction frame), i.e. the data of a B frame in a video sequence is coded with reference to the information of the frames preceding and following it. However, B frames are not usually used in real-time audio/video communication, because they require reference to a subsequent frame, which increases delay.
A GOP refers to a segment of continuous code stream that starts at one I frame and runs to the next I frame. The GOP contains parameter sets that record information about the frames to be coded. When the decoder decodes, it uses the I frame as a marker to establish the parameter set and reference frame data; whenever an I frame is encountered, the previous parameter set and data are cleared and rebuilt. Fig. 4 shows a GOP sequence containing seven pictures, from which the corresponding data volumes within a GOP can be seen.
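As a rough illustration of this decoder behaviour, the sketch below walks a hypothetical frame-type sequence and rebuilds the decoder's parameter set whenever an I frame is encountered; the seven-frame pattern used here is only an assumed example, not the exact sequence of fig. 4.

```typescript
// Hypothetical frame-type sequence (the exact pattern of fig. 4 is not given in the text).
type FrameType = "I" | "P" | "B";
const gop: FrameType[] = ["I", "B", "B", "P", "B", "B", "P"];

interface DecoderState {
  parameterSet: string | null;   // stands in for SPS/PPS-like data
  referenceFrames: number[];     // indices of frames usable as references
}

function decodeSequence(frames: FrameType[]): void {
  let state: DecoderState = { parameterSet: null, referenceFrames: [] };
  frames.forEach((type, index) => {
    if (type === "I") {
      // An I frame acts as a marker: the previous parameter set and reference
      // data are cleared and rebuilt from this frame alone.
      state = { parameterSet: `params@frame${index}`, referenceFrames: [index] };
    } else {
      // P/B frames are decoded against the currently buffered references.
      state.referenceFrames.push(index);
    }
    console.log(`frame ${index} (${type}) decoded with`, state.parameterSet);
  });
}

decodeSequence(gop);
```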
Specifically, the dividing of each I frame into individual slices and coding of each I frame slice by slice includes: subjecting only one slice in each frame to predictive coding compression with an intra-frame prediction algorithm, while the remaining slices are compressed with an inter-frame prediction algorithm.
As shown in fig. 5, the frames in the original coding and decoding algorithm are split before coding, so that coding no longer starts only after a whole frame of image has been captured, as in the standard algorithm; instead, each frame is divided into individual slices (Slice 1, Slice 2, Slice 3 … Slice N) and coded slice by slice. In this structure, only one slice (the I-Slice) in each frame uses the intra-frame prediction algorithm, and the remaining slices (P-Slices) use the inter-frame prediction algorithm, as sketched below. With this frame-partitioning algorithm, coding is carried out as soon as each slice has been captured. Since the data volume of a slice is much smaller than that of a whole frame, the large difference between I-frame and P-frame data volumes that arises when coding in units of frames does not occur for slices. The data volume can thus be kept stable within a relatively small range, and the encoding buffer can be reduced accordingly.
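A minimal sketch of this slice-partitioning step is given below. It marks exactly one slice per frame as the I-Slice and "encodes" each slice as soon as it has been captured; how the intra-slice position is chosen across successive frames (here it simply rotates) is an assumption made for illustration, not something the method specifies.

```typescript
type SliceType = "I-Slice" | "P-Slice";

interface Slice {
  frameIndex: number;
  sliceIndex: number;
  type: SliceType;
}

// Assign slice types for one frame: exactly one intra-coded slice,
// the remaining slices are inter-coded.
function partitionFrame(frameIndex: number, slicesPerFrame: number): Slice[] {
  // Assumption: the position of the I-Slice rotates from frame to frame
  // so the intra refresh is spread over time.
  const intraPosition = frameIndex % slicesPerFrame;
  return Array.from({ length: slicesPerFrame }, (_, sliceIndex) => ({
    frameIndex,
    sliceIndex,
    type: sliceIndex === intraPosition ? "I-Slice" : "P-Slice",
  }));
}

// "Encode and send" a slice as soon as its rows have been captured,
// instead of waiting for the whole frame.
function onSliceCaptured(slice: Slice): void {
  const mode = slice.type === "I-Slice" ? "intra prediction" : "inter prediction";
  console.log(
    `frame ${slice.frameIndex}, slice ${slice.sliceIndex}: encode with ${mode}, then transmit`
  );
}

for (const slice of partitionFrame(0, 4)) {
  onSliceCaptured(slice);
}
```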
S102, coding and compressing the slices using H.264 intra-frame hierarchical prediction, i.e., during predictive coding the prediction result of the previous level serves as the prediction reference for the next level;
H.264 intra-frame predictive coding can effectively prevent video aliasing. In the standard H.264 intra-frame coding process, intra prediction is generally divided into 4×4 luminance component prediction, 16×16 luminance component prediction, and chrominance component prediction.
In standard intra-frame prediction, assume prediction in the horizontal direction as shown in fig. 6: the four levels of pixels, Dpred 1-4 in the figure, are all referenced to the black pixels, so their distances to the reference pixels are 1-4 respectively. This prediction method is nearly exhaustive, predictive coding takes too long, and much time is wasted. If, on the other hand, the next level can use the prediction result of the previous level as its prediction reference, the prediction distance can be greatly reduced by applying a hierarchical prediction structure to the original prediction pixels. As shown in fig. 7, in this structure the prediction pixels at fixed positions (grey points) are intra-predicted from the reference pixels (black); after the grey points have been predictively coded, these prediction pixels become reference pixels and provide references for the adjacent un-coded pixels (white points), which are then intra-predicted. Under this structure, the distance from most prediction points to their reference points is significantly reduced.
Specifically, coding and compressing the slices using H.264 intra-frame hierarchical prediction, i.e., using the prediction result of the previous level as the prediction reference for the next level during predictive coding, includes:
performing intra-frame prediction of the prediction pixels at fixed positions from the reference pixels; after predictive coding of a previous-level prediction pixel is completed, that pixel becomes a reference pixel and provides a reference for the adjacent un-coded pixels, which are then intra-predicted.
Specifically, coding and compressing the slices using H.264 intra-frame hierarchical prediction, i.e., using the prediction result of the previous level as the prediction reference for the next level during predictive coding, includes:
if the currently coded basic block is 2N×2N (N = 2, 4, 8, 16, …), let Di,j denote the distance between the pixel located at (i, j) and its reference pixel;
Drow_j = Σ_{i=1..2N} Di,j
Dtotal = Σ_{j=1..2N} Drow_j
where Drow denotes the prediction distance of each row of pixels and Dtotal denotes the total prediction distance of the data block; the total prediction distance Dtotal equals the sum of the prediction distances Drow of all rows.
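To make the Drow/Dtotal definitions concrete, the following sketch sums the distances for plain horizontal intra prediction over a 2N×2N block. It assumes, as suggested by the fig. 6 description, that the pixel in column i of a row sits at distance i + 1 from the reference column on its left; this distance model is a reconstruction for illustration, not a formula quoted from the text.

```typescript
// Distance model (assumed from the fig. 6 description): the pixel in column i
// of a row is predicted from the reference column on the left, at distance i + 1.
function standardDistance(i: number, _j: number): number {
  return i + 1;
}

// Drow: prediction distance of one row; Dtotal: total over the 2N x 2N block.
function standardTotals(N: number): { dRow: number; dTotal: number; dPred: number } {
  const width = 2 * N;
  let dRow = 0;
  for (let i = 0; i < width; i++) {
    dRow += standardDistance(i, 0);
  }
  const dTotal = width * dRow;            // every row has the same distances
  const dPred = dTotal / (width * width); // average distance per pixel
  return { dRow, dTotal, dPred };
}

// For N = 8 (a 16x16 block): dRow = 136, dTotal = 2176, dPred = 8.5.
console.log(standardTotals(8));
```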
Further, coding and compressing the slices using H.264 intra-frame hierarchical prediction, i.e., using the prediction result of the previous level as the prediction reference for the next level during predictive coding, includes:
defining an average prediction distance Dpred, obtained by adding up the distances between all current pixels and their reference pixels and then taking the average.
The prediction pixels are divided into two levels: one pixel is selected every other row and every other column as a first-level prediction pixel. The total first-level prediction distance equals the sum of the first-level pixel distances over each row that contains first-level pixels:
Dlevel_1 = Σ_{j=1..N} Drow_j
The prediction distances of the first-level prediction pixels in each such row are 1, 3, 5, …, 2N−1 respectively, so the prediction distance Drow_j of that row is the sum over these points:
Drow_j = 1 + 3 + 5 + … + (2N−1) = N²
The second-level prediction distance Dlevel_2 equals the sum of the prediction distances of the 3 second-level prediction points corresponding to each of the 2N first-level prediction points, expressed as:
Dlevel_2 = 2N × 3 = 6N
Therefore the total prediction distance Dtotal_b of the entire prediction pixel block under this hierarchical prediction method is:
Dtotal_b = Dlevel_1 + Dlevel_2 = N³ + 6N
and the average prediction distance Dpred_b is:
Dpred_b = Dtotal_b / (2N × 2N) = (N³ + 6N) / (4N²)
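The closed-form expressions above can be evaluated directly. The sketch below compares the hierarchical average prediction distance Dpred_b = (N³ + 6N) / (4N²) stated here with the standard horizontal-prediction average of (2N + 1) / 2 that follows from the distance model assumed in the previous sketch.

```typescript
// Average prediction distance of the hierarchical scheme, from the formulas above:
// Dtotal_b = N^3 + 6N over a 2N x 2N block.
function hierarchicalAvgDistance(N: number): number {
  return (Math.pow(N, 3) + 6 * N) / (4 * N * N);
}

// Average distance of standard horizontal prediction under the assumed
// "distance = column index + 1" model: (1 + 2 + ... + 2N) / 2N = (2N + 1) / 2.
function standardAvgDistance(N: number): number {
  return (2 * N + 1) / 2;
}

for (const N of [2, 4, 8, 16]) {
  const std = standardAvgDistance(N);
  const hier = hierarchicalAvgDistance(N);
  console.log(
    `block ${2 * N}x${2 * N}: standard avg ≈ ${std.toFixed(2)}, hierarchical avg ≈ ${hier.toFixed(2)}`
  );
}
// For a 16x16 block (N = 8) this gives roughly 8.50 vs 2.19, i.e. a much
// shorter average prediction distance under the hierarchical structure.
```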
Reducing the average prediction distance effectively reduces the data size compared with the standard prediction method, improves the prediction performance of H.264, and allows predictive coding to be completed in a shorter time, thereby reducing delay.
S103, after each slice image has been captured, encoding and transmitting it, and decoding the slices as they arrive at the decoder.
Each slice image is transmitted after being coded, and once the slices reach the decoder, the decoder only needs to buffer a few slice images before decoding the data, rather than buffering several whole frames as decoding references, as in the standard algorithm. After coding is split into slices, the buffer size can be reduced by more than half compared with the original algorithm, effectively reducing the delay caused by coding and decoding buffering.
Example 2
As shown in fig. 8, an embodiment of the present invention further provides a video coding and decoding system for reducing delay, the system at least comprising:
at least one client, where a P2P full-connection architecture is adopted among multiple clients; and a server connected to and interacting with each client. A computer program is stored and run on the server and the clients, and when executed by the server and the clients it implements the following steps:
dividing image frames into I, B and P frames according to the H.264 coding standard, dividing each I frame into individual slices, and coding each I frame slice by slice;
coding and compressing the slices using H.264 intra-frame hierarchical prediction, i.e., during predictive coding the prediction result of the previous level serves as the prediction reference for the next level;
after each slice image has been captured, encoding and transmitting it, and decoding the slices as they arrive at the decoder.
Specifically, the server side comprises a web server and a signaling server. The web server responds to the client's page, registration, login and other functional requests in the system, and the signaling server provides the information exchange function between clients.
As shown in fig. 9, the server side further includes a NAT traversal server for implementing data interaction between the web server and the signaling server across different local area networks through the internet. In a common audio/video communication scenario, the communicating terminals are usually not in the same local area network; to reach a device through the internet, the NAT devices in between must be traversed, so a NAT traversal server also has to be configured in order to establish P2P-based connections. Servers commonly used for NAT traversal include STUN, TURN and ICE servers.
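As a small illustration of how a browser client might be pointed at such NAT traversal servers when it creates a peer connection, consider the following sketch; the STUN/TURN addresses and credentials are placeholders, not values used by this system.

```typescript
// Placeholder STUN/TURN addresses and credentials -- assumptions for illustration only.
const iceConfig: RTCConfiguration = {
  iceServers: [
    { urls: "stun:stun.example.org:3478" },
    {
      urls: "turn:turn.example.org:3478",
      username: "conference-user",
      credential: "conference-secret",
    },
  ],
};

// Each P2P link between two clients is an RTCPeerConnection that gathers
// ICE candidates through the configured STUN/TURN servers.
const pc = new RTCPeerConnection(iceConfig);
pc.onicecandidate = (event) => {
  if (event.candidate) {
    // In a real system the candidate would be relayed to the remote client
    // via the signaling server.
    console.log("local ICE candidate:", event.candidate.candidate);
  }
};
```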
Specifically, the client runs in a browser web page, which is much simpler to implement than a client under a C/S architecture. WebRTC provides a JavaScript interface, and a developer only needs to call that interface. Because audio and video streams are involved, HTML5 video and audio support is required for the video conference application. HTML5 controls the page function tags of the video conference, CSS3 controls the styles of the interface elements to make the picture more attractive, and JavaScript implements the interactive functions of the interface.
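A minimal sketch of the browser-side capture described here: it requests the camera and microphone through the WebRTC JavaScript interface and attaches the local stream to an HTML5 video tag. The element id is an assumption for illustration.

```typescript
// Assumes the conference page contains <video id="localVideo" autoplay muted></video>.
async function startLocalPreview(): Promise<MediaStream> {
  const stream = await navigator.mediaDevices.getUserMedia({ video: true, audio: true });
  const videoElement = document.getElementById("localVideo") as HTMLVideoElement;
  videoElement.srcObject = stream;          // show the local camera in the page
  return stream;                            // later added to the RTCPeerConnection(s)
}

startLocalPreview().catch((err) => console.error("could not open camera/microphone:", err));
```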
In most scenarios only a few users take part in a video chat, and in such scenarios a peer-to-peer connection can be established between two clients: the media data is transmitted directly between the two clients without any participation of the server. This network topology, which relies entirely on P2P, is called the Mesh architecture. Under this architecture, every participant in the video conference needs to establish a connection with every other client, as shown in fig. 10. Suppose there are 4 clients in the current conference room; each client then has to establish connections with the remaining 3 clients, forming P2P connection channels. Taking client A as an example, after it establishes a connection with client B, two data streams are formed: one pushes the streaming media data of client A to client B, and the other receives the streaming media data from client B. After client A has established connections with the other 3 clients, it must push its streaming media data to all 3 of them while simultaneously receiving their streaming media data, so every client, like client A, carries 6 data streams. Clearly, the P2P full-connection approach requires no central server, its logic is simple, and it is easy to implement. However, as the number of users grows, the CPU and bandwidth consumption of each client becomes severe; the more users there are, the worse the video conference quality, until the system can no longer be used normally. This architecture can therefore only be used for video calls with a small number of participants.
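The stream counts described for the Mesh architecture can be written down directly: with n participants, each client maintains n − 1 peer connections, and every connection carries one outgoing and one incoming media stream. A small sketch:

```typescript
// Per-client load under the full-mesh (P2P) topology.
function meshLoad(participants: number): {
  connectionsPerClient: number;
  streamsPerClient: number;
  totalConnections: number;
} {
  const connectionsPerClient = participants - 1;
  return {
    connectionsPerClient,
    streamsPerClient: 2 * connectionsPerClient,                  // one push + one pull per peer
    totalConnections: (participants * connectionsPerClient) / 2, // each P2P link counted once
  };
}

// With 4 clients in the room: 3 connections and 6 media streams per client.
console.log(meshLoad(4));
```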
Optionally, the web server is an SFU server, and when the number of the clients reaches a preset threshold, an SFU architecture is adopted between the plurality of clients and the SFU server.
The SFU server acts as the routing and forwarding node of the video conference system: it receives the media stream data of each client and pushes the data on to the other clients. For a 4-person video conference, each client sends its own media stream to the SFU server and also receives the other 3 media streams from the SFU server; the specific architecture is shown in fig. 11. Clearly, the number of streams under the SFU-based video conference architecture is reduced by nearly half compared with the P2P full-connection architecture: each client sends only one push stream but downloads 3 pull streams. An SFU-based video conference system can therefore carry roughly twice the conference load of one based on P2P full connection.
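For comparison, under the SFU topology each client keeps a single link to the server, sends one push stream and pulls one stream per remote participant; a corresponding sketch:

```typescript
// Per-client load when all media is relayed through the SFU server.
function sfuLoad(participants: number): {
  pushStreams: number;
  pullStreams: number;
  streamsPerClient: number;
} {
  const pushStreams = 1;                 // the client's own media, sent once to the SFU
  const pullStreams = participants - 1;  // one downloaded stream per remote participant
  return { pushStreams, pullStreams, streamsPerClient: pushStreams + pullStreams };
}

// For a 4-person conference: 1 push + 3 pull = 4 streams per client,
// versus 6 streams per client in the full-mesh case above.
console.log(sfuLoad(4));
```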
In a video conference system based on the Mesh architecture, the quality of the video conference was tested starting from two users. Testing showed that as the number of video conference users increases, the video starts to stutter, and the main cause of the stutter is that the video frame rate becomes too low. The threshold number of participating clients at which the SFU network connection architecture is adopted therefore depends on the upper limit of the number of clients under the Mesh architecture. Under the Mesh architecture, as the number of participating clients increases, client memory and CPU usage become severe and the video frame rate drops too low, so the system must switch to the SFU architecture. The results of one such test are shown in the following table:
[Table: memory usage, CPU utilization and video frame rate versus the number of conference participants; the numerical test data are not reproduced in the text.]
As can be seen from the table, as the number of people in the video conference room increases, memory usage keeps rising and CPU utilization keeps rising as well; when there are fewer than 5 participants the CPU utilization grows relatively quickly, but as the number of people increases the growth rate slows down. This is because the video encoding/decoding and the pixel display control are mainly handled by the CPU. When the number of user terminals in the conference increases, the number of decoded code streams increases correspondingly; to keep the video fluent, the video resolution is actively reduced, so the CPU resources consumed by pixel display control grow only slowly. Historically, the frame rate of silent films was generally between 16 and 24 frames per second, i.e., video appears continuous to the human eye above 16 FPS, but it is widely accepted that a video stream should maintain a frame rate above 24 FPS. The experimental data show that when there are more than 4 participating clients the video frame rate falls below 24 FPS, so the maximum number of participating clients carried in the P2P direct-connection mode based on the Mesh architecture cannot exceed 4. The participant threshold for switching to the SFU architecture is therefore set here to 5: when there are only 2-4 participants, the P2P direct-connection mode is used for the conference, but when there are 5 or more participants, the SFU architecture conference mode must be used.
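The resulting switching rule can be expressed as a small helper. The threshold of 5 participants comes from the test results above; the function and constant names are only illustrative.

```typescript
type ConferenceArchitecture = "Mesh (P2P direct connection)" | "SFU (server forwarding)";

// Threshold derived from the frame-rate tests: up to 4 clients keep the
// frame rate above 24 FPS in full-mesh mode, so 5 or more switches to SFU.
const SFU_PARTICIPANT_THRESHOLD = 5;

function chooseArchitecture(participants: number): ConferenceArchitecture {
  return participants >= SFU_PARTICIPANT_THRESHOLD
    ? "SFU (server forwarding)"
    : "Mesh (P2P direct connection)";
}

console.log(chooseArchitecture(3)); // "Mesh (P2P direct connection)"
console.log(chooseArchitecture(6)); // "SFU (server forwarding)"
```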
Example 3
An embodiment of the present invention further provides a computer-readable storage medium, where a video codec program for reducing latency is stored on the computer-readable storage medium, and when the video codec program for reducing latency is executed by a processor, the steps of the video codec method for reducing latency as claimed in the claims are implemented.
The video coding and decoding method, video conference system and storage medium for reducing time delay of the embodiments of the invention apply a method of splitting frames before coding to the original coding and decoding algorithm, so that coding no longer starts only after a whole frame of image has been captured, as in the standard algorithm; instead, each frame is divided into a number of slices and then coded slice by slice. Since the data volume of a slice is much smaller than that of a whole frame, the large difference between I-frame and P-frame data volumes that arises when coding in units of frames does not occur for slices. The data volume can thus be kept stable within a relatively small range, and the encoding buffer can be reduced accordingly. Each slice image is transmitted after being coded, and once the slices reach the decoder, the decoder only needs to buffer a few slice images before decoding the data, rather than buffering several whole frames as decoding references, as in the standard algorithm. After coding is split into slices, the buffer size can be reduced by more than half compared with the original algorithm, effectively reducing the delay caused by coding and decoding buffering.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the description of the foregoing embodiments, it is clear to those skilled in the art that the method of the foregoing embodiments may be implemented by software plus a necessary general hardware platform, and certainly may also be implemented by hardware, but in many cases, the former is a better implementation. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A method for reducing latency in video encoding and decoding, the method comprising:
dividing an image frame into I, B and P frames by adopting the H.264 coding standard, dividing each I frame into individual slices, and encoding each I frame in units of slices;
encoding and compressing the slices by H.264 intra-frame hierarchical prediction, namely, the prediction result of the previous level or above is used as the prediction reference during predictive coding;
and encoding and transmitting each slice image as soon as its acquisition is completed, and decoding the slice images synchronously after they reach the decoder.
2. The method of claim 1, wherein the dividing each I frame into individual slices and encoding each I frame in units of slices comprises:
only one slice in each frame is subjected to predictive coding compression using an intra-frame prediction algorithm, and the remaining slices are subjected to predictive coding compression using an inter-frame prediction algorithm.
3. The method of claim 1, wherein the encoding and compressing the slices by H.264 intra-frame hierarchical prediction, namely using the prediction result of the previous level or above as the prediction reference during predictive coding, comprises:
performing intra-frame prediction on prediction pixels at fixed positions according to the reference pixels; and after the predictive coding of a previous-level prediction pixel is completed, turning the previous-level prediction pixel into a reference pixel that provides a reference for the adjacent un-coded pixels, so that intra-frame prediction is performed on them.
4. The method of claim 1, wherein the encoding and compressing the slices by H.264 intra-frame hierarchical prediction, namely using the prediction result of the previous level or above as the prediction reference during predictive coding, comprises:
if the current coded basic block is 2N×2N (N = 2, 4, 8, 16, …), D_(i,j) represents the distance between the pixel located at (i, j) and its reference pixel;
D_row_j = Σ_{i=1}^{2N} D_(i,j)
D_total = Σ_{j=1}^{2N} D_row_j
wherein D_row_j represents the predicted distance of the j-th row of pixels and D_total represents the total predicted distance of the data block; the total predicted distance D_total is equal to the sum of the predicted distances D_row_j of all rows.
5. The method of claim 4, wherein the encoding and compressing the slices by H.264 intra-frame hierarchical prediction (HPP), namely using the prediction result of the previous level or above as the prediction reference during predictive coding, comprises:
setting an average prediction distance D_pred, which is obtained by summing the distances between all current pixels and their reference pixels and then averaging;
dividing the prediction pixels into two levels, wherein one pixel point is selected as a first-level prediction pixel at intervals of one row and one column; the total first-level prediction distance is equal to the sum of the row prediction distances of the rows in which the first-level pixels are located, with the summation formula:
D_level_1 = Σ_{j=1}^{N} D_row_j
the prediction distances of the first-level prediction pixels in each such row are 1, 3, 5, …, 2N−1 respectively, so the prediction distance D_row_j of that row is the sum of these values:
D_row_j = 1 + 3 + 5 + … + (2N−1) = N²
the second-level prediction distance D_level_2 is equal to the sum of the prediction distances of the 3 second-level prediction points corresponding to each of the 2N first-level prediction points, expressed by the formula:
D_level_2 = 2N × 3 = 6N;
therefore, the prediction distance D_total_b of the entire prediction pixel block using this hierarchical prediction method is represented by the following equation:
D_total_b = D_level_1 + D_level_2 = N³ + 6N;
the average predicted distance Dpred _ b is given by:
D pred_b =D total_b /(2N×2N)=(N 3 +6N)/4N 2
6. A video conference system, the system at least comprising:
a plurality of clients, wherein a P2P full-connection architecture is adopted among the clients; and a server connected to and interacting with each client; wherein a computer program is stored and run on the server and the clients, and when executed by the server and the clients, the computer program implements the following steps:
dividing an image frame into I, B and P frames by adopting the H.264 coding standard, dividing each I frame into individual slices, and encoding each I frame in units of slices;
encoding and compressing the slices by H.264 intra-frame hierarchical prediction, namely, the prediction result of the previous level or above is used as the prediction reference during predictive coding;
and encoding and transmitting each slice image as soon as its acquisition is completed, and decoding the slice images synchronously after they reach the decoder.
7. The video conference system of claim 6, wherein the server side comprises a web server and a signaling server, the web server being configured to respond to page, registration, login and other function requests of the clients in the system, and the signaling server being configured to provide an information exchange function for the clients.
8. The video conference system of claim 6, wherein the server side further comprises a NAT traversal server, configured to implement data interaction of the web server and the signaling server across different local area networks through the Internet.
9. The video conference system of claim 6, wherein the web server is an SFU server, and when the number of clients reaches a preset threshold, an SFU architecture is adopted between the plurality of clients and the SFU server.
10. A computer-readable storage medium, having stored thereon a reduced-latency video codec program, which when executed by a processor, implements the steps of the reduced-latency video codec method according to any one of claims 1 to 5.
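The prediction-distance quantities defined in claims 4 and 5 can be checked numerically. The following Python sketch is provided for illustration only and is not part of the claimed method; the function name, the dictionary keys and the example block size N = 8 are assumptions, while the closed-form expressions follow the formulas given in claims 4 and 5.

def hierarchical_prediction_distances(n):
    """Evaluate the prediction-distance quantities of claims 4-5 for a 2N x 2N block.

    Illustrative only: the closed-form expressions are taken from the claims;
    this function and its return keys are not part of the patent.
    """
    # First-level distance: N rows of first-level pixels, each row summing the
    # odd distances 1, 3, ..., 2N-1, which equals N**2 per row.
    d_row = sum(2 * k - 1 for k in range(1, n + 1))      # = N**2
    d_level_1 = n * d_row                                 # = N**3
    # Second-level distance as given in claim 5.
    d_level_2 = 2 * n * 3                                 # = 6N
    d_total_b = d_level_1 + d_level_2                     # = N**3 + 6N
    d_pred_b = d_total_b / (2 * n * 2 * n)                # = (N**3 + 6N) / (4N**2)
    return {"D_level_1": d_level_1, "D_level_2": d_level_2,
            "D_total_b": d_total_b, "D_pred_b": d_pred_b}


if __name__ == "__main__":
    # For N = 8 (a 16x16 block): D_total_b = 560 and D_pred_b = 560 / 256 ≈ 2.19
    print(hierarchical_prediction_distances(8))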
CN202210467485.XA 2022-04-29 2022-04-29 Video encoding and decoding method for reducing time delay, video conference system and storage medium Pending CN115002484A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210467485.XA CN115002484A (en) 2022-04-29 2022-04-29 Video encoding and decoding method for reducing time delay, video conference system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210467485.XA CN115002484A (en) 2022-04-29 2022-04-29 Video encoding and decoding method for reducing time delay, video conference system and storage medium

Publications (1)

Publication Number Publication Date
CN115002484A true CN115002484A (en) 2022-09-02

Family

ID=83024607

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210467485.XA Pending CN115002484A (en) 2022-04-29 2022-04-29 Video encoding and decoding method for reducing time delay, video conference system and storage medium

Country Status (1)

Country Link
CN (1) CN115002484A (en)

Similar Documents

Publication Publication Date Title
US9172979B2 (en) Experience or “sentio” codecs, and methods and systems for improving QoE and encoding based on QoE experiences
US9024997B2 (en) Virtual presence via mobile
WO2019184643A1 (en) Video coding code rate control method, apparatus and device, and storage medium
CN103338346B (en) A kind of method and system that realize multimedia digital meeting
CN203352696U (en) Multimedia digit conference system
CN112423076B (en) Audio screen-throwing synchronous control method, equipment and computer readable storage medium
CN114610253A (en) Screen projection method and equipment
JP2007150921A (en) Communication terminal, communication system and display method of communication terminal
WO2012021174A2 (en) EXPERIENCE OR "SENTIO" CODECS, AND METHODS AND SYSTEMS FOR IMPROVING QoE AND ENCODING BASED ON QoE EXPERIENCES
CN112114767B (en) Screen-throwing frame rate control method, equipment and computer readable storage medium
US20210320810A1 (en) Volumetric conversational services using network edge
CN103051864A (en) Mobile video conference method and system thereof
US20220239920A1 (en) Video processing method, related apparatus, storage medium, and program product
CN103517072A (en) Video communication method and video communication equipment
CN112433690B (en) Data processing method, terminal and computer readable storage medium
WO2018059209A1 (en) Multi-party video conference control method, apparatus and conference video server
CN113596231B (en) Screen-throwing display control method, device and computer readable storage medium
CN104349108B (en) Communication means, system and radio phone terminal based on radio phone terminal
CN112203126B (en) Screen projection method, screen projection device and storage medium
CN112866804B (en) Screen projection self-adaption method, mobile terminal and readable storage medium
WO2021082479A1 (en) Method and device for adjusting attribute of video stream
CN106506326B (en) Video call method, terminal and system
CN117176901A (en) Method and system for realizing remote audio/video conference
US20140099039A1 (en) Image processing device, image processing method, and image processing system
WO2023011408A1 (en) Multi-window video communication method, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination