CN112040234B - Video encoding method, video decoding method, video encoding device, video decoding device, electronic equipment and storage medium - Google Patents

Info

Publication number: CN112040234B
Application number: CN202011213490.5A
Authority: CN (China)
Other versions: CN112040234A
Other languages: Chinese (zh)
Prior art keywords: image, decoded, sub, image group, encoded
Inventors: 张文杰, 豆修鑫, 宋嘉文, 徐琴琴, 樊鸿飞, 蔡媛
Current/Original Assignee: Beijing Kingsoft Cloud Network Technology Co Ltd
Legal status: Active (application filed by Beijing Kingsoft Cloud Network Technology Co Ltd; priority to CN202011213490.5A; published as CN112040234A; granted and published as CN112040234B)

Classifications

    • H04N19/573: Motion compensation with multiple frame prediction using two or more reference frames in a given prediction direction
    • H04N19/177: Adaptive coding characterised by the coding unit, the unit being a group of pictures [GOP]
    • H04N19/577: Motion compensation with bidirectional frame interpolation, i.e. using B-pictures
    • H04N19/70: Coding characterised by syntax aspects related to video coding, e.g. related to compression standards
    • H04N21/2343: Processing of video elementary streams involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N21/4402: Processing of video elementary streams involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/6437: Real-time Transport Protocol [RTP]

Abstract

The application provides a video encoding method, a video decoding method, a video encoding device, a video decoding device, an electronic device and a storage medium. The video encoding method includes: acquiring a group of pictures to be encoded of a video to be encoded, where the group comprises a first sub-group of pictures to be encoded and a second sub-group of pictures to be encoded; performing intra-refresh encoding on the first sub-group, where each picture of the first sub-group is divided into a plurality of coding regions, the coding regions of the first sub-group collectively comprise dirty regions and non-dirty regions, and a dirty region is a region encoded with reference to regions in image groups other than the one it belongs to; and encoding the second sub-group, where each picture of the second sub-group is encoded with reference only to the non-dirty regions of the first sub-group. The method and device address the poor timeliness of real-time audio/video communication caused by excessive codec delay.

Description

Video encoding method, video decoding method, video encoding device, video decoding device, electronic equipment and storage medium
Technical Field
The present application relates to the field of communications technologies, and in particular, to a video encoding method and apparatus, a video decoding method and apparatus, an electronic device, and a storage medium.
Background
In a real-time audio/video communication (RTC) scenario, latency is an important technical indicator. When optimizing latency, both cost control and user experience (i.e., subjective video quality) need to be considered.
In the related art, RTC low-delay technology reduces the coding delay at the acquisition end by adopting the LDP coding configuration. Fig. 1 shows the frame types of LDP coding and the reference relationships between frames. As shown in Fig. 1, an arrow indicates a reference relationship: taking the POC=3 frame as an example, it is coded with reference to frame 0 (its long-term reference frame) and frame 2 (its short-term reference frame), and it is in turn referenced by frame 4. In conventional LDP coding, each P frame uses both a long-term and a short-term reference frame.
However, because of the dependency chain between reference frames in this coding method, decoding any image frame requires all preceding image frames to be decoded first, causing a decoding delay of up to one GOP of frames.
To overcome the above problem, an ultra-low-delay reference frame configuration as shown in Fig. 2 may be adopted, in which each P frame references only its corresponding I frame and no longer references its preceding frame, i.e., each frame retains only the long-term reference frame and discards the short-term reference frame. Compared with the existing LDP mode, this scheme removes the dependency of each P frame on its predecessors during encoding and decoding, reducing codec delay.
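To make the two configurations concrete, the following minimal Python sketch contrasts the reference rules; it is illustrative only, and the function names and the POC indexing within one GOP are assumptions, not taken from the patent or any codec API:

```python
# Illustrative sketch (not from the patent): POC 0 is the GOP's I frame,
# all other frames are P frames.

def ldp_refs(poc):
    """Conventional LDP: each P frame keeps a long-term reference
    (the I frame, POC 0) and a short-term reference (the previous frame)."""
    return [] if poc == 0 else sorted({0, poc - 1})

def ultra_low_delay_refs(poc):
    """Ultra-low-delay configuration of Fig. 2: each P frame keeps only
    the long-term reference and drops the short-term one."""
    return [] if poc == 0 else [0]

assert ldp_refs(3) == [0, 2]           # the POC=3 example above
assert ultra_low_delay_refs(3) == [0]  # frame 3 now depends only on the I frame
```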
However, in the above ultra-low-delay reference frame configuration, the encoded I frame is usually several times (typically about 2-10 times) the size of a P frame, and an oversized I frame can cause a transmission delay of tens or even hundreds of milliseconds, which seriously affects the timeliness of RTC. Meanwhile, the large bitrate gap between I frames and P frames causes rate fluctuation that requires a large buffer, so the buffering delay is also high.
Therefore, the codec schemes in the related art suffer from poor real-time audio/video communication timeliness caused by excessive delay (e.g., transmission delay and buffering delay).
Disclosure of Invention
The application provides a video encoding method and apparatus, a video decoding method and apparatus, an electronic device, and a storage medium, to at least solve the problem of poor real-time audio/video communication timeliness caused by excessive codec delay in the related art.
According to an aspect of an embodiment of the present application, there is provided a video encoding method, including: acquiring a to-be-encoded image group of a to-be-encoded video, wherein the to-be-encoded image group comprises a first to-be-encoded sub image group and a second to-be-encoded sub image group; performing intra-frame refreshing coding on the first sub-image group to be coded, wherein each first sub-image to be coded in the first sub-image group to be coded is divided into a plurality of coding regions, all the coding regions of the first sub-image group to be coded comprise a dirty region and a non-dirty region, and the dirty region is a region coded by referring to regions in other image groups except the image group to which the dirty region belongs; and encoding the second sub image group to be encoded, wherein each second sub image to be encoded in the second sub image group to be encoded only refers to the non-dirty region in the first sub image group to be encoded.
According to another aspect of embodiments of the present application, there is provided a video decoding method including: acquiring an image group to be decoded of a video to be decoded, wherein the image group to be decoded comprises a first sub image group to be decoded and a second sub image group to be decoded; performing intra-frame refresh decoding on the first sub image group to be decoded, wherein each first sub image to be decoded in the first sub image group to be decoded is divided into a plurality of decoding areas, all the decoding areas of the first sub image group to be decoded include a dirty area and a non-dirty area, and the dirty area is an area decoded by referring to areas in other image groups except the image group to which the dirty area belongs; and decoding the second sub image group to be decoded, wherein each second sub image to be decoded in the second sub image group to be decoded only refers to the non-dirty area in the first sub image group to be decoded for decoding.
According to another aspect of embodiments of the present application, there is provided a video encoding apparatus including: the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a to-be-encoded image group of a to-be-encoded video, and the to-be-encoded image group comprises a first to-be-encoded sub image group and a second to-be-encoded sub image group; the first encoding unit is used for carrying out intra-refresh encoding on the first sub-image group to be encoded, wherein each first sub-image to be encoded in the first sub-image group to be encoded is divided into a plurality of encoding regions, all the encoding regions of the first sub-image group to be encoded comprise dirty regions and non-dirty regions, and the dirty regions are regions which are encoded by referring to regions in other image groups except the image group to which the dirty regions belong; a second encoding unit, configured to encode the second sub image group to be encoded, where each second sub image to be encoded in the second sub image group to be encoded refers to only the non-dirty region in the first sub image group to be encoded for encoding.
Optionally, the first encoding unit includes: the first encoding module is used for carrying out intra-frame encoding on an intra-frame encoding area in a first image to be encoded, wherein the first image to be encoded is a sub-image in the first sub-image group to be encoded; a second encoding module, configured to, when the first image to be encoded contains the dirty region and the first image to be encoded is a first image of the first sub image group to be encoded, encode the dirty region in the first image to be encoded by using a first encoding reference frame in a previous image group of the image group to be encoded as a reference; and a third encoding module, configured to, when the first image to be encoded contains the dirty region and the first image to be encoded is not the first image of the first sub image group to be encoded, encode the dirty region in the first image to be encoded with reference to a previous image of the first image to be encoded and a second encoding reference frame in a previous image group of the image group to be encoded.
Optionally, the first encoding unit further includes: a fourth encoding module, configured to, when the first image to be encoded includes a clean region, encode the clean region in the first image to be encoded by using the non-dirty region in the first image of the first sub image group to be encoded and the non-dirty region in the previous image of the first image to be encoded as references.
Optionally, the first encoding unit includes: and the fifth encoding module is used for encoding a plurality of encoding areas contained in a second image to be encoded in parallel, wherein the second image to be encoded is a sub-image in the first sub-image group to be encoded.
Optionally, the video to be encoded is a video corresponding to a first client in real-time communication, and the apparatus further includes: and the transmission unit is used for transmitting a target video stream to the second client through the real-time communication connection between the first client and the second client after the image group to be coded of the video to be coded is obtained, wherein the target video stream comprises a sub-video stream obtained by coding the first sub-image group to be coded and a sub-video stream obtained by coding the second sub-image group to be coded.
According to still another aspect of embodiments of the present application, there is provided a video decoding apparatus including: a first acquisition unit, configured to acquire a to-be-decoded image group of a to-be-decoded video, wherein the to-be-decoded image group comprises a first to-be-decoded sub image group and a second to-be-decoded sub image group; a first decoding unit, configured to perform intra refresh decoding on the first sub image group to be decoded, wherein each first sub image to be decoded in the first sub image group to be decoded is divided into a plurality of decoding areas, all the decoding areas of the first sub image group to be decoded contain dirty areas and non-dirty areas, and the dirty areas are areas decoded by referring to areas in other image groups except the image group to which the dirty areas belong; a second decoding unit, configured to decode the second sub image group to be decoded, where each second sub image to be decoded in the second sub image group to be decoded refers to only the non-dirty region in the first sub image group to be decoded for decoding.
Optionally, the first decoding unit includes: the first decoding module is used for carrying out intra-frame decoding on an intra-frame decoding area in a first image to be decoded, wherein the first image to be decoded is a sub-image in the first sub-image group to be decoded; a second decoding module, configured to, when the first image to be decoded includes the dirty region and the first image to be decoded is a first image of the first sub image group to be decoded, use a first decoding reference frame in a previous image group of the image group to be decoded as a reference to decode the dirty region in the first image to be decoded; and a third decoding module, configured to, when the first image to be decoded contains the dirty region and the first image to be decoded is not the first image of the first sub image group to be decoded, decode the dirty region in the first image to be decoded by using a previous image of the first image to be decoded and a second decoded reference frame in a previous image group of the image group to be decoded as references.
Optionally, the first decoding unit further includes: a fourth decoding module, configured to, when the first image to be decoded includes a clean region, use the non-dirty region in the first image of the first sub image group to be decoded and the non-dirty region in the previous image of the first image to be decoded as references to decode the clean region in the first image to be decoded.
Optionally, the first decoding unit includes: and the fifth decoding module is used for decoding a plurality of decoding areas contained in a second image to be decoded in parallel, wherein the second image to be decoded is a sub-image in the first sub-image group to be decoded.
Optionally, the video to be decoded is a video corresponding to a first client in real-time communication, and the first obtaining unit includes: and the acquisition module is used for acquiring the image group to be decoded, which is transmitted by the first client through the real-time communication connection between the first client and the second client.
Optionally, the obtaining module includes: the detection submodule is used for detecting joining operation executed on the second client, wherein the joining operation is used for joining real-time communication among a plurality of first clients; and the receiving submodule is used for responding to the joining operation and receiving the video to be decoded transmitted by a target client in the plurality of first clients, wherein the video to be decoded is a video starting from the current moment in the video corresponding to the target client.
Optionally, the apparatus further comprises: and the second acquisition unit is used for acquiring a target reference frame corresponding to the video to be decoded after detecting the joining operation executed on the second client, wherein the target reference frame is an image frame which is positioned before a starting frame in an image group in which the starting frame of the video to be decoded is positioned and is referred to by at least one image frame in the video to be decoded.
According to another aspect of the embodiments of the present application, there is also provided an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with each other through the communication bus; wherein the memory is used for storing the computer program; a processor for performing the method steps in any of the above embodiments by running the computer program stored on the memory.
According to a further aspect of the embodiments of the present application, there is also provided a computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to perform the method steps of any of the above embodiments when the computer program is executed.
In the embodiments of the application, a mode combining intra refresh with an ultra-low-delay reference frame configuration is adopted: a group of pictures to be encoded of a video to be encoded is acquired, the group comprising a first sub-group and a second sub-group of pictures to be encoded; intra-refresh encoding is performed on the first sub-group, where each picture of the first sub-group is divided into a plurality of coding regions, the coding regions of the first sub-group collectively comprise dirty regions and non-dirty regions, and a dirty region is a region encoded with reference to regions in image groups other than the one it belongs to; and the second sub-group is encoded, where each picture of the second sub-group references only the non-dirty regions of the first sub-group. Because intra refresh embeds the I frame into a plurality of P frames or B frames, the code rate can be kept stable, achieving the technical effects of reducing transmission delay and buffering delay and improving the timeliness of audio/video communication, and solving the related-art problem of poor real-time audio/video communication timeliness caused by excessive codec delay.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
FIG. 1 is a schematic diagram of an alternative LDP coding mode;
FIG. 2 is a schematic diagram of an alternative ultra-low-delay reference frame configuration;
FIG. 3 is a schematic diagram of a hardware environment for an alternative video encoding method according to an embodiment of the present application;
FIG. 4 is a flow chart illustrating an alternative video encoding method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an alternative video encoding flow according to an embodiment of the present application;
FIG. 6 is a schematic diagram of an alternative video encoding flow according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an alternative RTC full link flow according to an embodiment of the present application;
FIG. 8 is a flow chart illustrating an alternative video decoding method according to an embodiment of the present application;
FIG. 9 is a flow chart illustrating an alternative low-delay encoding method according to an embodiment of the present application;
FIG. 10 is a block diagram of an alternative video encoding apparatus according to an embodiment of the present application;
FIG. 11 is a block diagram of an alternative video decoding apparatus according to an embodiment of the present application;
FIG. 12 is a block diagram of an alternative electronic device according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the embodiments of the present application better understood, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, some of the nouns or terms appearing in the description of the embodiments of the present application are explained below:
1. Video coding: a method of converting a file in an original video format into a file in another video format using compression techniques. Common video codec standards include H.264, H.265, AVS, AV1, and the like.
2. RTC (Real-time Communications): its most typical applications are co-streaming live broadcast, real-time audio/video calls, video conferencing, interactive online education, etc. In terms of functional flow, RTC includes many links, such as an acquisition end (acquisition, preprocessing, and encoding), a transmission end (transmission from the acquisition end to a server, between servers, and from a server to a playback end), and a playback end (decoding, buffering, and rendering).
3. Delay: an important index in network transmission, characterizing the time required for data to travel from one end point to another, generally measured in milliseconds or seconds. Delay in RTC generally refers to the time interval from the start of video acquisition at the acquisition end to the completion of video rendering at the playback end.
4. TCP (Transmission Control Protocol)/UDP (User Datagram Protocol): the two most common underlying network transport protocols, both used to send data packets over the Internet but working in different ways. TCP features reliable but slower transmission; UDP features high speed and low latency at the cost of potential packet loss. RTC scenarios usually select a UDP-based transport protocol, such as SRT (Secure Reliable Transport, a video streaming protocol) or QUIC (Quick UDP Internet Connection).
5. Network jitter: because of packet size, network routing, and other factors, packet delay cannot be guaranteed to be constant; the variation between the delays of different packets is called jitter. That is, the phenomenon that packet delay fluctuates between small and large values is called jitter.
6. Coding delay: the delay generated by the encoding process, i.e., the time from the input of a video frame (image frame) to the output of the code stream produced after encoding completes.
7. LDP (Low Delay P-frame) coding: the first frame in each GOP is coded as an I frame and all following frames are coded as P frames; when coding each P frame, only pictures preceding that frame in playback order are referenced. By avoiding backward references, the coding/decoding order stays consistent with the display order, which reduces codec delay. Besides the LDP configuration, video coding also offers All-Intra (all I frame) and Random-Access coding configurations.
8. Coded frame types, generally classified into 3 types: an I frame (intra-coded frame), also called a key frame, serves as a random access point in a video stream; it is coded by intra-frame prediction without referencing other frames, and generally has high coding quality but low compression efficiency. A P frame (predictive coded frame) is coded with reference to a preceding I frame or other preceding P frames, using inter prediction or a combination of intra and inter prediction, with higher compression efficiency. A B frame (bidirectional predictive coded frame) can be predictively coded with reference to both preceding and following frames, with the highest compression efficiency.
9. GOP (Group Of Pictures): in video coding, a GOP is a sequence of multiple consecutive coded frames, used to aid random access during decoding; typically each GOP begins with an I frame.
10. POC (Picture Order Count): represents the display order of the source video frames when encoding video.
11. Breathing effect: a video-coding artifact in which the video quality gradually deteriorates within a GOP and suddenly improves at the start of the next GOP, or suddenly deteriorates at the start of a GOP and gradually improves within it.
12. Slice/Tile: each frame can be divided into a plurality of Slices or Tiles; each Slice or Tile represents a block, which may be a rectangular or irregular image region, and each block can be coded independently.
13. PIR (Periodic Intra Refresh): by embedding intra-coded Tiles or Slices into P frames or B frames, the effect of replacing I frames is achieved while keeping the code rate stable.
14. CMOS (Complementary Metal Oxide Semiconductor): a technique for fabricating large-scale integrated circuit chips, or a chip fabricated using such a technique.
15. Video coding distortion: the difference between the original video before encoding and the video after encoding.
16. Video coding bitrate: the number of bits per second of the encoded video, typically in kbps (kilobits per second).
17. Lagrange coefficient: a parameter for balancing video distortion against video bitrate in video coding.
18. Original pixel/predicted pixel/residual: the original pixel is the pixel value before video coding; the predicted pixel is the pixel value obtained by intra or inter prediction during coding; the residual is the difference between the original pixel and the predicted pixel.
19. VR (Virtual Reality): a technology that comprehensively uses a computer graphics system and various interface devices (display, control, etc.) to provide an immersive experience in an interactive three-dimensional environment generated on a computer.
According to an aspect of an embodiment of the present application, there is provided a video encoding method. Alternatively, in this embodiment, the video encoding method described above may be applied to a hardware environment formed by an encoding end (encoding device, first device) 302, a decoding end (decoding device, second device) 304, and a playing device 306 as shown in fig. 3. As shown in fig. 3, the encoding end 302 is connected to the decoding end 304 through a network, and a database may be provided on the encoding end 302 (and/or the decoding end 304) or independent of the encoding end 302 (and/or the decoding end 304) for providing a data storage service for the encoding end 302 (and/or the decoding end 304). The decoding end 304 and the playing device 306 may be two devices that are independently arranged, or may be the same device, which is not limited in this embodiment.
As shown in fig. 3, the encoding end 302 may be configured to encode an input video to be transmitted (e.g., a video to be encoded, a second audio/video) to obtain a corresponding video code stream, and transmit the video code stream to the decoding end 304 through a network; the decoding end 304 may be configured to decode the received video code stream to obtain a corresponding video (or a video frame), and play the obtained video (or the video frame) through the playing device 306.
Such networks may include, but are not limited to, wired and wireless networks. The encoding end 302 and the decoding end 304 may be terminal devices or servers, including but not limited to at least one of the following: a PC, a mobile phone, a tablet, a VR device, etc. The video encoding method of the embodiment of the present application may be executed by the encoding end 302, which may be a terminal device or a server; when executed by a terminal device, it may also be executed by a client installed on that device.
Taking the video encoding method in the present embodiment executed by the encoding end 302 as an example, fig. 4 is a schematic flowchart of an alternative video encoding method according to an embodiment of the present application, and as shown in fig. 4, the flowchart of the method may include the following steps:
step S402, acquiring a group of images to be coded of a video to be coded, wherein the group of images to be coded comprises a first group of sub images to be coded and a second group of sub images to be coded.
The video encoding method in this embodiment may be applied to scenes with video transmission requirements, such as live broadcast, RTC, VR, and the like, where the video may be live broadcast video, real-time audio and video, panoramic video, and the like, and this is not limited in this embodiment.
The encoding device may encode a video to be encoded, and the video to be encoded may be a video to be transmitted to the decoding end and played by the playing device. The video to be encoded may contain a plurality of groups of pictures, each group of pictures may contain a plurality of image frames, and the POC of the video frames within a group of pictures may be consecutively numbered starting from 0. The group of pictures to be currently encoded in the video to be encoded may be a group of pictures to be encoded. The image group to be encoded can be divided into a first sub-image group to be encoded by intra refresh encoding and a second sub-image group to be encoded by non-intra refresh encoding.
For example, a current group of pictures to be encoded includes 9 video frames, and according to the playing order of the video frames, the POC of the 9 video frames is: 0, 1, 2, 3, 4, 5, 6, 7, 8, where the first 4 video frames may be a first sub-group of pictures to be encoded, and the last 5 video frames may be a second sub-group of pictures to be encoded.
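As an illustration, the split can be sketched as follows; this is a minimal Python sketch with hypothetical helper names, not the patent's implementation:

```python
# Minimal sketch of the GOP split described above (hypothetical helper).

def split_gop(pocs, refresh_period):
    """Split a group of pictures into the sub-group that carries the
    intra refresh (the first `refresh_period` frames) and the remaining
    normally coded frames."""
    first_sub_group = pocs[:refresh_period]   # intra-refresh frames
    second_sub_group = pocs[refresh_period:]  # ordinary P frames
    return first_sub_group, second_sub_group

# The 9-frame example: POC 0-3 form the first sub-group, POC 4-8 the second.
first, second = split_gop(list(range(9)), refresh_period=4)
assert first == [0, 1, 2, 3]
assert second == [4, 5, 6, 7, 8]
```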
Step S404, performing intra refresh encoding on the first sub image group to be encoded, where each first sub image to be encoded in the first sub image group to be encoded is divided into a plurality of encoding regions, all the encoding regions of the first sub image group to be encoded include a dirty region and a non-dirty region, and the dirty region is a region encoded with reference to a region in another image group other than the image group to which the dirty region belongs.
The first sub image group to be coded can be coded by adopting an intra refresh coding mode. Each first sub-image to be encoded (image frame) in the first sub-image group to be encoded may be divided into a plurality of encoding regions (Slice/Tile), and the dividing manner of different first sub-images to be encoded may be the same. All the encoding regions of the first sub-image group to be encoded include a dirty region and a non-dirty region, and the dirty region is a region encoded with reference to a region in another image group (for example, a previous image group of the image group to be encoded) other than the image group to which the dirty region belongs.
For an image frame, the way it is divided may differ according to the intra refresh direction and the intra refresh mode. The intra refresh direction may be vertical or horizontal, or may follow an irregular non-rectangular region, and can be adaptively adjusted according to the video content. Correspondingly, the image may be divided into regions longitudinally, transversely, or irregularly. The number of coding regions an image frame is divided into may be the same or different depending on the division manner. This embodiment does not limit the image frame division manner, the intra refresh manner, and the like.
The intra refresh means: a complete I frame image of one frame is divided into a plurality of I blocks to be embedded into a P frame or a B frame for coding. Fig. 5 shows an example of one-cycle intra refresh, and as shown in fig. 5, the refresh completion period is 4, i.e., each frame is divided into 4 blocks, and 4 frames complete one round of intra refresh. After intra refresh is used, the first frame in a GOP may not be an I-frame, all frames in fig. 5 are P-frames, and each frame is divided into a forced intra-coded region and a P-frame coded mode region.
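Under the Fig. 5 layout (a refresh that advances one block per frame with a period of 4), the region labelling inside one refresh round can be sketched as follows; the layout direction and the names are illustrative assumptions:

```python
# Sketch of how regions inside one refresh period might be labelled,
# assuming the Fig. 5 layout: frame k forces block k to be intra-coded.

def label_regions(frame_idx, num_blocks=4):
    labels = []
    for block in range(num_blocks):
        if block == frame_idx:
            labels.append("intra")  # forced intra-coded region this frame
        elif block < frame_idx:
            labels.append("clean")  # already refreshed earlier in the period
        else:
            labels.append("dirty")  # not yet refreshed; may reference prior GOPs
    return labels

for k in range(4):
    print(k + 1, label_regions(k))
# 1 ['intra', 'dirty', 'dirty', 'dirty']
# 2 ['clean', 'intra', 'dirty', 'dirty']
# 3 ['clean', 'clean', 'intra', 'dirty']
# 4 ['clean', 'clean', 'clean', 'intra']
```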
It should be noted that embedding I blocks in a P frame or B frame does not mean partitioning an I frame into blocks and using its data to replace blocks of the P or B frame; rather, it means that some blocks (regions) of the P or B frame are themselves encoded as intra (I) blocks.
In an alternative encoding scheme without intra refresh, frame 1 of a GOP is an I frame whose 4 blocks are all I blocks, while frames 2, 3 and 4 are P or B frames whose 4 blocks are all non-I blocks. In intra refresh mode, by contrast, the 4 I blocks are distributed across 4 frames: a portion of each of frames 1, 2, 3 and 4 consists of I blocks and the remainder of non-I blocks, so none of these frames is, strictly speaking, an I frame, a P frame or a B frame.
It should be noted that periodic intra refresh is applied to GOPs after the first; the first frame at the very start of video encoding still needs to be a complete I frame.
Referring to Fig. 5, each frame within one refresh period is divided into a forced intra-coded region and regions coded in P-frame mode. The clean region and the dirty region are both P-frame-mode regions: a clean region may only be coded with reference to the non-dirty regions (intra-coded regions and clean regions) of frames located before it in the same GOP, while a dirty region may be coded with reference to all regions of previous frames.
As shown in Fig. 5, intra coding starts at the left-1 block of the 1st frame of the GOP, and the remaining blocks are coded in P-frame mode (which may select either intra or inter coding) with reference to a frame of the previous GOP. In the 2nd frame, the left-2 block is intra-coded and the remaining blocks are coded in P-frame mode: the left-1 block (a clean region) may only reference the left-1 block of the 1st frame, while the left-3 and left-4 blocks (dirty regions) may reference all blocks of the 1st frame and a frame of the previous GOP. The 3rd and 4th frames follow by analogy: a clean region within each frame references the non-dirty regions of previous frames within the GOP.
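The reference rule just described can be sketched as follows; this is illustrative Python under the same Fig. 5 assumptions, and `region_label` and `allowed_refs` are hypothetical helpers:

```python
# Sketch of the clean/dirty reference rule (refresh period 4).

def region_label(frame_idx, block):
    if block == frame_idx:
        return "intra"
    return "clean" if block < frame_idx else "dirty"

def allowed_refs(frame_idx, block, num_blocks=4):
    """Return the (frame, block) pairs within the same GOP that this block
    may reference, plus a flag for whether a previous-GOP frame is allowed."""
    label = region_label(frame_idx, block)
    if label == "intra":
        return [], False          # intra-coded: no inter references
    in_gop = []
    for f in range(frame_idx):    # only frames before the current one
        for b in range(num_blocks):
            if label == "dirty" or region_label(f, b) != "dirty":
                in_gop.append((f, b))
    return in_gop, label == "dirty"

# Frame 2 (index 1) in Fig. 5: its clean left-1 block may reference only
# the left-1 (intra) block of frame 1, and never the previous GOP.
assert allowed_refs(1, 0) == ([(0, 0)], False)
```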
Step S406, encoding a second sub image group to be encoded, where each second sub image to be encoded in the second sub image group to be encoded refers to only a non-dirty region in the first sub image group to be encoded.
The second sub image group to be encoded may include one or more second sub images to be encoded, and when encoding one second sub image to be encoded, the encoding may be performed with reference to the non-dirty region in the first sub image group to be encoded and the image frame located before the non-dirty region.
For example, as shown in Fig. 5, after the first 4 frames have completed a round of intra refresh, the frames from the 5th onward are normal P frames, and each may be encoded with reference to all non-dirty regions of the first 4 frames as well as the preceding normal P frame (if one precedes it).
Optionally, in this embodiment, an ultra-low delay encoding configuration may be adopted, that is, a non-dirty region in the first sub-image group to be encoded is used as an I-frame, and the second sub-image to be encoded refers to only the non-dirty region in the first sub-image group to be encoded for encoding.
Optionally, the non-dirty region may include at least one of: an intra-coded region, a clean region. A second sub-image may be encoded with reference to all non-dirty regions, to only the intra-coded regions, or to only the clean regions; the specific reference configuration may be selected as needed, which is not limited in this embodiment.
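A minimal sketch of these configurable reference modes, under the same Fig. 5 assumptions (refresh period equal to the block count, hypothetical names, illustrative only):

```python
# Sketch of the configurable reference set for frames of the second sub-group.

def region_label(frame_idx, block):
    if block == frame_idx:
        return "intra"
    return "clean" if block < frame_idx else "dirty"

def second_subgroup_refs(mode="all_non_dirty", num_blocks=4):
    """Regions of the refresh sub-group that a normal P frame may use;
    here the refresh round spans `num_blocks` frames, as in Fig. 5."""
    wanted = {
        "all_non_dirty": {"intra", "clean"},
        "intra_only": {"intra"},
        "clean_only": {"clean"},
    }[mode]
    return [(f, b)
            for f in range(num_blocks)   # the 4 frames of the refresh round
            for b in range(num_blocks)
            if region_label(f, b) in wanted]

print(len(second_subgroup_refs("all_non_dirty")))  # 10 non-dirty regions
print(len(second_subgroup_refs("intra_only")))     # 4 intra regions
```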
As an alternative implementation, each second sub-image to be encoded in the second sub-image group to be encoded may be divided into a plurality of first regions, and each first region may be encoded with reference to only a non-dirty region in the first sub-image group to be encoded. Optionally, the plurality of first regions in each second sub-image may then be encoded (and, correspondingly, decoded) in parallel.
It should be noted that the encoding method provided in this embodiment may be applied in certain scenarios, for example when network transmission delay is high or during a multi-party real-time call. The encoder may switch encoding schemes when a switching condition is satisfied, for example switching from this scheme to another scheme with higher image quality, where the switching condition may be that the network transmission delay falls below a threshold or that a multi-party real-time call becomes a two-party real-time call; this is not limited in this embodiment.
Through steps S402 to S406, a group of pictures to be encoded of a video to be encoded is acquired, the group comprising a first sub-group and a second sub-group of pictures to be encoded; intra-refresh encoding is performed on the first sub-group, where each picture of the first sub-group is divided into a plurality of coding regions, the coding regions of the first sub-group collectively comprise dirty regions and non-dirty regions, and a dirty region is a region encoded with reference to regions in image groups other than the one it belongs to; and the second sub-group is encoded, where each picture of the second sub-group references only the non-dirty regions of the first sub-group. This solves the related-art problem of poor real-time audio/video communication timeliness caused by excessive codec delay, reducing transmission and buffering delay and improving the timeliness of audio/video communication.
As an alternative embodiment, the intra refresh encoding the first sub image group to be encoded includes:
s11, intra-frame coding the intra-frame coding area in the first image to be coded, wherein the first image to be coded is a sub-image in the first sub-image group to be coded;
s12, in case that the first image to be encoded contains a dirty region and the first image to be encoded is the first image of the first sub image group to be encoded, encoding the dirty region in the first image to be encoded with reference to the first encoding reference frame in the previous image group of the image group to be encoded;
s13, in case that the first image to be encoded contains a dirty region and the first image to be encoded is not the first image of the first sub-group of images to be encoded, encoding the dirty region in the first image to be encoded with reference to an image preceding the first image to be encoded and the second encoding reference frame in the group of images preceding the group of images to be encoded.
A first sub-image in the first sub-image group to be encoded, for example the first image to be encoded, may include at least one of: an intra-coded region, a dirty region, a clean region. Different types of regions can be encoded in different ways.
For the intra-frame coding region, the coding apparatus may perform intra-frame coding on the intra-frame coding region, and the intra-frame coding manner may refer to the related art, which is not limited in this embodiment.
For the dirty region, when the first image to be encoded is at a different position in the image group to be encoded, the image frame or region referred by the dirty region contained in the first image to be encoded may be different. For example, if the first image to be encoded contains a dirty region and the first image to be encoded is the first image of the first sub-image group to be encoded, the dirty region in the first image to be encoded may be encoded with reference to a certain frame (e.g., the first encoding reference frame) in the previous image group of the image group to be encoded. For another example, if the first image to be encoded contains a dirty region and the first image to be encoded is not the first image of the first sub-image group to be encoded, the dirty region in the first image to be encoded may be encoded with reference to a previous image of the first image to be encoded and a certain frame (e.g., the second encoding reference frame) in a previous image group of the image group to be encoded.
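The two dirty-region cases of S12/S13 can be sketched as follows; the names are hypothetical and the reference frames are represented as plain labels:

```python
# Minimal sketch of dirty-region reference selection per S12/S13.

def dirty_region_refs(is_first_of_subgroup, prev_frame, prev_gop_ref):
    if is_first_of_subgroup:
        # First image of the refresh sub-group: only a reference frame
        # of the previous GOP is used (S12).
        return [prev_gop_ref]
    # Otherwise: the previous image plus a previous-GOP reference frame (S13).
    return [prev_frame, prev_gop_ref]

assert dirty_region_refs(True, "prev_frame", "prev_gop_ref") == ["prev_gop_ref"]
assert dirty_region_refs(False, "prev_frame", "prev_gop_ref") == [
    "prev_frame", "prev_gop_ref"]
```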
By the embodiment, the intra-coding region and the dirty regions at different positions are coded by adopting different coding configurations, so that the coding efficiency can be improved, and the accuracy and the rationality of the coding and decoding process can be improved.
As an alternative embodiment, the intra refresh encoding the first sub image group to be encoded further includes:
s21, in the case that the first image to be encoded contains a clean region, encoding the clean region in the first image to be encoded with reference to a non-dirty region in the first image of the first sub-image group to be encoded and a non-dirty region in a previous image of the first image to be encoded.
If the first image to be encoded contains a clean region, the clean region contained in the first image to be encoded can be encoded by referring to a non-dirty region in the first image of the first sub image group to be encoded and a non-dirty region in the previous image of the first image to be encoded.
For example, as shown in Fig. 6, the 1st, 2nd and 3rd blocks of the 4th frame may be encoded with reference to the 1st, 2nd and 3rd blocks of the 1st and 3rd frames.
By the embodiment, the clean area in one frame is coded by only referring to the first frame in the image group and the non-dirty area in the previous frame, so that the coding efficiency can be improved.
As an alternative embodiment, the encoding the first sub image group to be encoded includes:
s31, encoding a plurality of encoding regions contained in a second image to be encoded in parallel, wherein the second image to be encoded is a sub-image in the first sub-image group to be encoded.
Each first sub-image to be encoded in the first sub-image group to be encoded may include a plurality of encoding regions; when a clean region in the first sub-image group to be encoded is encoded, all non-dirty regions located before it in the first sub-image group (including non-dirty regions in the same frame) may be referenced for encoding.
For example, as shown in Fig. 5, when encoding the second clean region in the 3rd frame, the 1st block of the 1st frame, the 1st and 2nd blocks of the 2nd frame, and the 1st block of the 3rd frame may be referenced for encoding.
Optionally, in this embodiment, when each first sub-image to be encoded is encoded, all the coding regions in that sub-image (e.g., the 4 blocks in Fig. 5) may be encoded in parallel, i.e., the intra-coded region, the clean region and the dirty region may be encoded at the same time.
To guarantee this, a clean region within one refresh period (e.g., the first sub-image group to be encoded) may be configured to be encoded only with reference to non-dirty regions in image frames that precede its own frame within the same image group, which ensures that the plurality of coding regions contained in each first sub-image to be encoded can be encoded and decoded in parallel.
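Because under this constraint no region of a frame references another region of the same frame, the regions of one frame can be processed concurrently. A minimal sketch, assuming a thread-pool model and a placeholder `encode_region` (both assumptions, not the patent's implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def encode_region(region):
    # Placeholder: run intra, clean-region or dirty-region encoding here.
    return ("bitstream-of", region)

def encode_frame_parallel(regions):
    """Encode all regions of one frame concurrently; valid because no
    region references another region of the same frame."""
    with ThreadPoolExecutor(max_workers=len(regions)) as pool:
        return list(pool.map(encode_region, regions))

print(encode_frame_parallel(["intra", "clean", "dirty", "dirty"]))
```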
By the embodiment, the efficiency of video coding and decoding can be improved by carrying out parallel coding and decoding on a plurality of areas in one image frame.
As an alternative embodiment, after acquiring a group of pictures to be encoded of a video to be encoded, the method further includes:
and S41, transmitting a target video stream to the second client through the real-time communication connection between the first client and the second client, wherein the target video stream comprises a sub-video stream obtained by encoding the first sub-image group to be encoded and a sub-video stream obtained by encoding the second sub-image group to be encoded.
The video to be encoded may be a video corresponding to the first client in real-time communication (RTC). The real-time communication may be, but is not limited to, at least one of: co-streaming live broadcast, a real-time audio/video call, a video conference, interactive online education, etc. The video to be encoded is not limited in this embodiment.
It should be noted that RTC requires lower latency than ordinary live broadcast. One specific application of RTC is co-streaming (mic-connect) in live-broadcast scenes, i.e., low-latency live broadcast. Ordinary live broadcast generally adopts the TCP protocol and distributes content via a CDN (content delivery network), producing delays of several seconds or even more than ten seconds, so interaction between the anchor and the audience is limited to short text messages and the like. Co-streaming live broadcast connects via the UDP protocol and transmits content in real time, so the anchor and audience can interact through audio/video co-streaming and communicate in real time, with delay generally as low as hundreds of milliseconds.
For an RTC scenario, the multiple devices participating in real-time communication may be communicatively connected via a server over a network. The RTC full-link flow is shown in Fig. 7; the links involved can be divided into three ends: the acquisition end, the transmission end, and the playing end. A given device acts as an acquisition end when it captures audio and video, and as a playing end when it receives, decodes and plays a video stream.
Optionally, in this embodiment, a first client may be run on the first device, a second client may be run on the second device, and a real-time communication connection may be established between the first client and the second client. When the client is used as the acquisition end, each client can acquire audio and video by calling audio and video acquisition equipment (such as a built-in camera and an external camera) on or external to the corresponding equipment. When the client is used as a playing end, each client can receive the video stream transmitted by the opposite end, decode the video stream, and play the decoded audio and video through a display component (e.g., a screen) of the corresponding device.
In the process of real-time communication, the first device may perform audio and video acquisition through a target acquisition device (e.g., a camera) corresponding to the first client, so as to obtain a to-be-encoded video to be transmitted. The video to be encoded may belong to real-time audio and video acquired by a target acquisition device. For the acquired real-time audio and video, the first device (the encoder on the first device) may encode the real-time audio and video to obtain a target video stream.
In the real-time communication process, delay arises in the following three aspects:
(1) delay at the acquisition end, including the time consumed by CMOS imaging and color format conversion, the time consumed by preprocessing of image content such as beautification and denoising, and the delay caused by encoding time;
(2) delay at the transmission end, i.e., the transmission delay of the full link from the acquisition end to the server and on to the playing end, whose influencing factors include the size of the transmitted data, the transport protocol, the transmission network environment, etc.;
(3) delay at the playing end, including the video decoding delay, the buffering delay used to absorb network jitter, the rendering delay of the playing device, etc.
For the RTC low-delay mode, the coding delay of the acquisition end is reduced through LDP coding configuration; the delay of a transmission end is reduced by replacing a TCP protocol with a customized UDP protocol (such as QUIC); the delay of the playing end is reduced by optimizing the size of the dynamic buffer of the playing end.
By adopting the existing LDP coding and decoding mode, because the dependency relationship exists between the coded and decoded reference frames, in the decoding process, when one image frame is decoded, all the previous image frames need to be decoded first, and the decoding delay of one GOP frame number is caused to be maximum. However, the reference frame configuration scheme with ultra-low delay coding as shown in fig. 2 has the problem of high transmission delay and buffering delay.
However, simply reducing the size of the I frame severely increases its distortion, producing a more pronounced breathing artifact. Moreover, since all P frames within a GOP of the ultra-low delay reference frame scheme reference only the I frame, increased I-frame distortion enlarges the residual of every subsequent frame and seriously degrades the coding quality of all P frames.
Optionally, in this embodiment, a coding and decoding manner combining intra refresh with ultra-low delay reference frames is adopted for the RTC scene. Intra refresh replaces a single I frame by embedding intra-coded Tiles or Slices into several P frames or B frames, which keeps the code rate stable, achieves low-delay performance, and also prevents error propagation.
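For illustration only, the following Python sketch shows one way such a refresh schedule could be organized, assuming a vertical, column-based refresh in which exactly one column is intra-coded per frame; all identifiers are hypothetical and do not correspond to any real codec API.

```python
# Illustrative only: a vertical intra-refresh schedule in which one column is
# intra-coded per frame, so that after `refresh_period` frames every column
# has been refreshed once. Region names (intra/clean/dirty) follow the
# terminology of this document.

def intra_refresh_schedule(num_columns: int, refresh_period: int):
    """Yield, for each frame of the refresh part, the type of every column."""
    assert num_columns == refresh_period, "simplest case: one column per frame"
    for frame_idx in range(refresh_period):
        regions = []
        for col in range(num_columns):
            if col == frame_idx:
                regions.append("intra")  # intra-coded (refreshed) in this frame
            elif col < frame_idx:
                regions.append("clean")  # already refreshed earlier this cycle
            else:
                regions.append("dirty")  # still depends on the previous GOP
        yield frame_idx + 1, regions

for frame_no, regions in intra_refresh_schedule(num_columns=4, refresh_period=4):
    print(f"frame {frame_no}: {regions}")
# frame 1: ['intra', 'dirty', 'dirty', 'dirty']
# ...
# frame 4: ['clean', 'clean', 'clean', 'intra']
```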
In the target video stream obtained by encoding, different image frames of the video to be encoded correspond to different sub-video streams. The first device may encode the first sub image group to be encoded using the above encoding method to obtain a corresponding sub-video stream, and likewise encode the second sub image group; the target video stream may thus include the sub-video stream obtained by encoding the first sub image group to be encoded and the sub-video stream obtained by encoding the second sub image group to be encoded.
It should be noted that the image frames in the image group to be encoded may be encoded in sequence according to the playing order, and a copy of each frame may be stored in the buffer for reference by other image frames during encoding. Once the video code stream corresponding to one image frame has been obtained, the first device can send it directly to the second device over the real-time communication connection, without waiting for the encoding or decoding results of other image frames.
The video code stream corresponding to each image frame may carry reference indication information, which may indicate the coding mode of the current frame or of each of its regions, for example whether a region is an intra-coded region, a clean region, or a dirty region, and which frame or region it references for coding; it may also indicate other coding parameters, which is not limited in this embodiment.
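As a sketch of what such reference indication information might carry, the following Python record is one possible representation; the field set is an assumption made for illustration, not a defined bitstream syntax.

```python
# Hypothetical per-frame metadata for the reference indication information
# described above.
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Tuple

class RegionType(Enum):
    INTRA = "intra"  # intra-coded region
    CLEAN = "clean"  # refreshed region; may reference non-dirty areas only
    DIRTY = "dirty"  # references a region in another (previous) image group

@dataclass
class ReferenceIndication:
    frame_index: int                                # position within the GOP
    region_types: List[RegionType]                  # coding mode per region
    # region_index -> list of (frame_index, region_index) pairs it references
    references: Dict[int, List[Tuple[int, int]]] = field(default_factory=dict)

# Example: frame 3 of fig. 6 (blocks 1-2 clean, block 3 intra, block 4 dirty)
frame3 = ReferenceIndication(
    frame_index=3,
    region_types=[RegionType.CLEAN, RegionType.CLEAN,
                  RegionType.INTRA, RegionType.DIRTY],
)
```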
The first device may transmit the target video stream to the second client over the real-time communication connection between the first client and the second client. For example, the first device may transmit the target video stream to a server over the network, and the server forwards it to the second device (the second client).
Correspondingly, the second device may receive the target video stream transmitted by the first client over the real-time communication connection between the two clients, for example via the server.
With this embodiment, applying the coding modes of intra refresh and ultra-low delay reference frames to a real-time communication scene ensures the timeliness of real-time audio and video.
According to another aspect of the embodiments of the present application, there is also provided a video decoding method. Optionally, in this embodiment, the video decoding method may be applied to a hardware environment formed by the encoding end 302, the decoding end 304 and the playing device 306 as shown in fig. 3; this environment has already been described and is not repeated here.
The video decoding method of this embodiment may be performed by the decoding end 304, which may be a terminal device (e.g., the second device); the method may also be executed by a client installed on that terminal device. Taking the decoding end 304 (the second device) as the executing party, fig. 8 is a schematic flowchart of an alternative video decoding method according to this embodiment. As shown in fig. 8, the flow may include the following steps:
step S802, acquiring an image group to be decoded of a video to be decoded, wherein the image group to be decoded comprises a first sub image group to be decoded and a second sub image group to be decoded.
The video decoding method in this embodiment may be used to decode a video code stream obtained by encoding a group of pictures to be encoded with any of the above video encoding methods. The decoding device (the decoding end, i.e., the second device) may obtain over the network the video code stream transmitted by the encoding device, that is, the image group to be decoded of the video to be decoded, which may be divided into a first sub image group to be decoded using intra refresh and a second sub image group to be decoded without intra refresh. The division is similar to that described above and is not repeated here.
Step S804, performing intra refresh decoding on the first sub image group to be decoded, wherein each first sub image to be decoded in the first sub image group to be decoded is divided into a plurality of decoding areas, all the decoding areas of the first sub image group to be decoded include a dirty area and a non-dirty area, and the dirty area is an area decoded with reference to an area in another image group other than the image group to which it belongs.
The first sub image group to be decoded is decoded in the intra refresh decoding mode. Each first sub-image to be decoded in the group is divided into a plurality of decoding areas; taken together, these areas include a dirty area and a non-dirty area, where the dirty area references an area in another image group (other than its own) for decoding. The division of the decoding areas is similar to that of the encoding areas and is not repeated here.
The decoding apparatus may determine a first reference relationship corresponding to the first sub image group to be decoded, which indicates the decoding manner of each decoding region and the image frame or region it references. The first reference relationship may be carried in the video stream corresponding to the first sub image group to be decoded, derived according to convention, or represented by pre-stored reference indication information. According to the first reference relationship, the decoding device can decode each first sub-image to be decoded in the group to obtain the corresponding image frame. The decoding process corresponds to the encoding process and is not repeated here.
Step S806, decoding the second sub image group to be decoded, wherein each second sub image to be decoded in the second sub image group to be decoded references only a non-dirty region in the first sub image group to be decoded for decoding.
The decoding device may adopt the ultra-low delay decoding configuration: the non-dirty region in the first sub image group to be decoded plays the role of the I frame, and the second sub image group to be decoded references only that non-dirty region for decoding.
Optionally, the non-dirty region may include at least one of: an intra-decoded area and a clean area. Decoding of the second sub-images may reference all non-dirty regions, only the intra-decoded regions, or only the clean regions; the specific reference configuration may be selected as needed and is not limited in this embodiment.
As an alternative implementation, each second sub-image to be decoded in the second sub image group to be decoded may be divided into a plurality of second regions, each of which may be decoded with reference only to the non-dirty region in the first sub image group to be decoded. Optionally, the plurality of second regions in each second sub-image to be decoded may be decoded in parallel.
Through steps S802 to S806, the image group to be decoded of the video to be decoded is acquired, comprising a first sub image group to be decoded and a second sub image group to be decoded; intra refresh decoding is performed on the first sub image group, in which each first sub-image to be decoded is divided into a plurality of decoding areas that together contain dirty and non-dirty areas, a dirty area being one decoded with reference to an area in another image group; and the second sub image group is decoded, with each second sub-image to be decoded referencing only the non-dirty area in the first sub image group for decoding. This solves the timeliness problem of real-time audio/video communication caused by the excessive delay of the coding and decoding modes in the related art, reducing transmission delay and buffering delay and improving the timeliness of audio and video communication.
As an alternative embodiment, the intra refresh decoding of the first sub image group to be decoded includes:
s51, performing intra decoding on an intra decoding area in a first image to be decoded, where the first image to be decoded is a sub-image in the first sub-image group to be decoded;
s52, when the first image to be decoded contains the dirty region and is the first image of the first sub image group to be decoded, decoding the dirty region in the first image to be decoded with a first decoding reference frame in the previous image group of the image group to be decoded as a reference;
s53, when the first image to be decoded contains the dirty region and is not the first image of the first sub image group to be decoded, decoding the dirty region in the first image to be decoded with the image preceding the first image to be decoded and a second decoding reference frame in the previous image group of the image group to be decoded as references.
The first image to be decoded, i.e. a sub-image in the first sub image group to be decoded, may for example include at least one of the following: an intra-decoded area, a dirty area, and a clean area. Different types of regions may be decoded in different ways.
The decoding method for the intra-frame decoding region and the dirty region may be similar to the encoding method for the intra-frame coding region and the dirty region, and is not described herein again.
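The reference-selection rules S51 to S53 can be condensed into a short sketch. The following Python function is illustrative only: addressing pictures by a (gop_index, frame_index) pair and the reference-frame labels are assumptions, not part of any specified syntax.

```python
# Illustrative summary of rules S51-S53 for dirty regions.

def dirty_region_references(gop_index: int, frame_index: int) -> list:
    """Return the references allowed for a dirty region of this picture."""
    if frame_index == 0:
        # S52: first picture of the refresh part -> the first decoding
        # reference frame of the previous image group.
        return [(gop_index - 1, "first_decoding_reference_frame")]
    # S53: any later picture -> its immediately preceding picture plus the
    # second decoding reference frame of the previous image group.
    return [(gop_index, frame_index - 1),
            (gop_index - 1, "second_decoding_reference_frame")]
```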
As an alternative embodiment, the intra refresh decoding of the first sub image group to be decoded further includes:
s61, when the first image to be decoded contains a clean region, decoding the clean region in the first image to be decoded with the non-dirty region in the first image of the first sub image group to be decoded and the non-dirty region in the image preceding the first image to be decoded as references.
That is, if the first image to be decoded contains a clean region, that clean region may be decoded with reference to the non-dirty region in the first image of the first sub image group to be decoded and the non-dirty region in the image preceding the first image to be decoded.
As an alternative embodiment, the decoding the first sub image group to be decoded includes:
s71, decoding a plurality of decoding areas included in a second image to be decoded in parallel, where the second image to be decoded is a sub-image in the first sub-image group to be decoded.
When each first sub-image to be decoded is decoded, all of its decoding regions (e.g., the 4 blocks in fig. 5) can be decoded in parallel, i.e. an intra-decoded region, a clean region, and a dirty region can be decoded at the same time. The parallel decoding process is similar to the parallel encoding process and is not repeated here.
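A minimal sketch of such parallel region decoding follows, assuming the regions of one picture can be decoded independently once their reference data is available; decode_region is a hypothetical stand-in for the actual region decoder.

```python
# Minimal sketch: decode the regions of one intra-refresh picture in parallel.
from concurrent.futures import ThreadPoolExecutor

def decode_region(region: dict) -> str:
    # placeholder for intra, clean-region, or dirty-region decoding
    return f"decoded {region['type']} region {region['index']}"

def decode_picture_parallel(regions: list) -> list:
    with ThreadPoolExecutor(max_workers=len(regions)) as pool:
        return list(pool.map(decode_region, regions))

regions = [{"index": i + 1, "type": t}
           for i, t in enumerate(["intra", "clean", "clean", "dirty"])]
print(decode_picture_parallel(regions))
```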
As an alternative embodiment, the obtaining a group of pictures to be decoded of a video to be decoded includes:
and S81, acquiring the video to be decoded transmitted by the first client, wherein the video to be decoded is transmitted over the real-time communication connection between the first client and the second client and comprises the image group to be decoded.
The video to be decoded may be the video corresponding to the first client in real-time communication. For example, the first client and the second client may establish a real-time communication connection in the manner described above, over which the second client receives the video stream transmitted by the first client, i.e. the video stream corresponding to the video to be decoded.
As an alternative embodiment, the obtaining the video to be decoded transmitted by the first client includes:
s91, detecting a joining operation performed on the second client, wherein the joining operation is used to join a real-time call among a plurality of first clients;
and S92, in response to the joining operation, receiving the video to be decoded transmitted by a target client among the plurality of first clients, wherein the video to be decoded is the portion of the video corresponding to the target client that starts from the current moment.
The video to be decoded is acquired after the second client joins a real-time call with the first client. If the real-time call is between only the first client and the second client, one complete session runs from the start of the call to its end, and each client's video can be encoded and decoded image group by image group; the encoding and decoding processes are similar to those described above and are not repeated here.
Optionally, the real-time call may also involve the second client and a plurality of first clients, among whom the call was already in progress before the second client joined. The second device may detect, through a touch screen or another input component, a joining operation performed on the second client for joining this ongoing call among the first clients; for example, a user may click an entry for joining the real-time call in a multi-user chat session.
In response to the joining operation, the second client may jump to the real-time call interface and simultaneously obtain from the server the video stream of each first client from the current time (the time of joining the real-time communication) onward; receiving, parsing, and playing the video streams of different first clients are independent processes.
The video to be decoded may be the video of a target client among the plurality of first clients; the second device receives this video (as a video stream), which starts from the current moment within the video corresponding to the target client. Since the target client was already in the call before the second client joined, the video corresponding to the target client is no shorter than the video to be decoded.
It should be noted that, since encoding and decoding proceed image group by image group, the acquisition side, the server side and/or the playing side may store, for each client, at least the current image group (or its corresponding video code stream), or at least the current image group together with its previous image group, or at least the image frames within the refresh cycle of the current image group; this is not limited in this embodiment.
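One possible realization of this buffering policy is sketched below: a per-client buffer retains the code streams of the current image group and, for example, its predecessor, so that a late joiner can fetch the reference data it needs. The class and its storage layout are assumptions for illustration.

```python
# Assumed storage layout: keep the code streams of the current image group
# and, e.g., its predecessor, per client.
from collections import deque

class GopBuffer:
    def __init__(self, keep_groups: int = 2):
        self._groups = deque(maxlen=keep_groups)  # older groups drop off

    def start_new_group(self, gop_index: int) -> None:
        self._groups.append({"gop_index": gop_index, "frames": []})

    def add_frame_stream(self, frame_stream: bytes) -> None:
        self._groups[-1]["frames"].append(frame_stream)

    def frames_for_join(self, start_frame: int) -> list:
        """Streams of the current group a late joiner needs, up to and
        including the 0-based frame at which it joins."""
        return self._groups[-1]["frames"][:start_frame + 1]
```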
After the video to be decoded has been decoded, the second device may display (play) it in a target area of the real-time communication interface; the display areas of the videos decoded from different first clients may be non-overlapping.
It should be noted that, since encoding and decoding proceed frame by frame, once the image frames of the video to be decoded are obtained, the second client may play them in their playing order. Decoding the video stream and playing the video may proceed simultaneously: as soon as an image frame is decoded, it can be played if it is the next frame due in the playing order.
With this embodiment, when joining an existing real-time call, only the opposite end's video stream from the joining moment onward is received, which reduces network resource usage, improves coding and decoding efficiency, and further ensures the timeliness of audio and video communication.
As an optional embodiment, after detecting the join operation performed on the second client, the method further includes:
s101, acquiring a target reference frame corresponding to the video to be decoded, wherein the target reference frame is an image frame that lies in the image group containing the starting frame of the video to be decoded, precedes the starting frame, and is referenced by at least one image frame in the video to be decoded.
Because the second client joins the real-time call at some arbitrary moment, and given the coding mode adopted, decoding the frames of every GOP except the first may require one or more image frames preceding the starting frame of the video to be decoded.
For example, if the starting frame belongs to the intra refresh portion, its dirty region is decoded with reference to at least one frame in the GOP preceding the current GOP, and its clean regions are decoded with reference to at least the frame immediately preceding the starting frame in the current GOP (or one or more regions of that frame). If the starting frame does not belong to the intra refresh portion, it is decoded with reference to at least a non-dirty region of the intra refresh portion of the current GOP.
Because image frames in every GOP other than the first reference frames in the preceding GOP, decoding an image frame in some GOP in the middle would in principle require acquiring and decoding all image frames from the first GOP up to the current time, which causes great delay.
Optionally, in this embodiment, to reduce decoding delay and improve the timeliness of audio/video playing, the second client may decode based only on the image frames in the image group containing the starting frame of the video to be decoded, and may leave the dirty regions therein undecoded.
If the starting frame of the video to be decoded is the first image frame of its image group, the second client can start decoding directly from the starting frame without referencing other frames. For example, as shown in fig. 6, if the starting frame is the 1st frame, its 1st block is an intra-coded block that the decoder can decode directly; the 2nd, 3rd and 4th blocks are dirty regions, which would need to reference a frame in the previous group of pictures and may therefore be left undecoded.
If the starting frame of the video to be decoded is not the first image frame of its image group, the second client may obtain from the server side the target reference frame corresponding to the video to be decoded. The target reference frame lies in the same image group as the starting frame, precedes it, and is referenced by at least one image frame in the video to be decoded; it may arrive as an undecoded video stream, which the second client decodes to obtain the image frame. The target reference frame may be a complete image frame or only the portion corresponding to a non-dirty region of the target image frame.
For example, as shown in fig. 6, if the starting frame is the 3rd frame, the 1st and 2nd blocks of the 3rd frame are clean areas that must be decoded with reference to the 1st block of the 1st frame and the 1st and 2nd blocks of the 2nd frame; the decoding end can therefore acquire the video streams of the 1st and 2nd frames, or at least the streams of the 1st block of the 1st frame and the 1st and 2nd blocks of the 2nd frame.
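The following sketch works through this fig. 6 example; the block-level reference table is transcribed from the description above, and the exact block-to-block mapping is an assumption made for illustration.

```python
# Worked sketch of the fig. 6 example: a viewer joins at frame 3, whose clean
# blocks reference block 1 of frame 1 and blocks 1-2 of frame 2.

CLEAN_REFS = {  # (frame, block) -> (frame, block) pairs it references
    (3, 1): [(1, 1), (2, 1)],
    (3, 2): [(2, 1), (2, 2)],
}

def required_streams(start_frame: int) -> list:
    """Every earlier (frame, block) the start frame's clean blocks need."""
    needed = set()
    for (frame, _block), refs in CLEAN_REFS.items():
        if frame == start_frame:
            needed.update(refs)
    return sorted(needed)

print(required_streams(3))  # [(1, 1), (2, 1), (2, 2)]
```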
The target reference frame and the video to be decoded may be received simultaneously (for example, when the server actively pushes the target reference frame) or successively (for example, when the second device fetches the target reference frame after identifying it); this is not limited in this embodiment.
It should be noted that although the image groups on the encoding and decoding sides correspond to each other, after encoding, transmission, and decoding, an image frame on the encoding side and its counterpart on the decoding side match each other but are not necessarily identical.
With this embodiment, acquiring the reference frames on which the video to be decoded depends ensures the accuracy of video encoding and decoding, while transmitting only the referenced image frames within the same image group reduces network resource usage and the volume of data sent to the client.
The video encoding method and the video decoding method in the embodiments of the present application are explained below with reference to alternative examples.
The low-delay coding scheme provided in this example is RTC-oriented: it reduces the delay of the acquisition end, the transmission end, and the playing end by optimizing the coding configuration, thereby reducing the delay of the full link (for example, across modules such as the audio/video codec, server transmission and interaction, and the anti-jitter buffer in fig. 7).
In this example, an ultra-low delay reference frame mode is obtained by changing the reference-frame relationships used for encoding, which improves the parallelism of encoding and decoding and thereby reduces the encoding delay at the acquisition end and the decoding delay at the playing end. The ultra-low delay reference frame scheme is combined with periodic intra refresh to reduce the code rate of the I frame, cutting the transmission delay caused by an oversized I frame and reducing the code-rate fluctuation between I and P frames, which in turn reduces the buffering delay at the playing end.
As shown in fig. 6, the GOP may be divided into an intra refresh portion and an other portion. Each frame of the intra refresh portion is encoded in the low-delay manner; for example, frame 3 may reference both frame 2 and frame 1 (subject to the constraint that a clean area may only reference earlier non-dirty areas). The frames after the intra refresh portion form the other portion and adopt the ultra-low delay coding mode: each such frame does not reference its immediately preceding frame, but only the non-dirty areas of the intra refresh portion (referencing only the 4th frame, or frames 1-4 simultaneously, but in either case only their non-dirty areas).
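A compact sketch of this combined reference structure, assuming the first refresh_period frames of a GOP form the intra refresh portion and frames are indexed from 1; the function is illustrative, not a normative specification.

```python
# Illustrative reference structure of fig. 6: the intra refresh part uses
# low-delay coding (may reference earlier refresh frames); later frames use
# ultra-low delay and reference only the refresh part's non-dirty areas.

def candidate_references(frame_no: int, refresh_period: int = 4) -> list:
    """Return the 1-based indices of frames this frame may reference."""
    if frame_no <= refresh_period:
        # Intra refresh part: all earlier refresh frames (clean areas are
        # further limited to non-dirty references).
        return list(range(1, frame_no))
    # Other part: only the refresh part, never the immediately preceding
    # non-refresh frame.
    return list(range(1, refresh_period + 1))

for f in range(1, 8):
    print(f, "->", candidate_references(f))
# e.g. frame 6 -> [1, 2, 3, 4]: only the refresh part's non-dirty areas,
# not frame 5
```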
Combining intra refresh with the ultra-low delay reference structure reduces the video coding and decoding delay, shrinks the I frame, and reduces the code-rate fluctuation between I and P frames, thereby lowering transmission delay and buffering delay.
It should be noted that the completion period of the intra refresh is related to the overall delay. The longer the period, the closer the coding delay comes to the LDP low-delay mode and the larger it becomes, but the closer the code rates of the I and P frames get, so the smaller the subsequent transmission and playback delays. Conversely, a shorter period lowers the encoding delay but may enlarge the subsequent transmission and playback delays. The refresh completion period is usually set to 2-5 frames based on empirical values, and it can be adaptively adjusted according to the actual conditions of the whole RTC link, the content or complexity of the video to be encoded, or the GOP interval of the actual encoding scene.
The intra refresh direction may be adaptively adjusted according to the video content: it may be vertical (as shown in figs. 5 and 6), horizontal, or an irregular non-rectangular area.
As shown in fig. 9, the flow of the low-delay encoding method provided in this example may include the following steps:
Step S902, establishing a real-time communication connection between a first client on the first device and a second client on the second device.
Step S904, encoding the audio and video collected by the acquisition end to obtain the corresponding video code stream, and transmitting the video code stream to the playing end through the communication connection.
During real-time communication, either of the first and second devices can act as the acquisition end or the playing end: when the first device acts as the acquisition end, the second device acts as the playing end, and vice versa.
The acquisition end captures audio and video to obtain the corresponding streams. After certain data preprocessing, it performs audio and video encoding; the encoding component may be an encoder, and the encoding may proceed image group by image group.
The encoder may encode the video to be encoded using the method shown in fig. 6 to obtain the corresponding video stream, which may carry the reference indication information of each video frame's reference frames. After each video frame of an image group has been encoded into its video stream, the acquisition end can transmit the stream to the playing end over the real-time communication connection.
Step S906, the playing end decodes the received video stream to obtain the corresponding audio/video and plays the decoded audio/video through the player.
After receiving the video stream transmitted by the opposite end, the playing end can determine the reference frames of each video frame from the reference indication information in the stream, decode the stream according to these reference relationships and the corresponding decoding modes to obtain the audio and video, and play the result through the player.
In this example, optimizing the encoding configuration reduces the encoding and decoding delay as well as the transmission and decoding-buffer delays, comprehensively lowering the delay at the acquisition, transmission, and playing ends of the RTC full link, so that the requirements of low-delay scenarios such as RTC can be met.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., a ROM (Read-Only Memory)/RAM (Random Access Memory), a magnetic disk, an optical disk) and includes several instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the methods according to the embodiments of the present application.
According to still another aspect of embodiments of the present application, there is provided a video encoding apparatus for implementing the above-described video encoding method. Fig. 10 is a block diagram of an alternative video encoding apparatus according to an embodiment of the present application, and as shown in fig. 10, the apparatus may include:
an obtaining unit 1002, configured to obtain a group of pictures to be encoded of a video to be encoded, where the group of pictures to be encoded includes a first group of sub-pictures to be encoded and a second group of sub-pictures to be encoded;
a first encoding unit 1004, connected to the obtaining unit 1002, configured to perform intra refresh encoding on a first sub-image group to be encoded, where each first sub-image to be encoded in the first sub-image group to be encoded is divided into multiple encoding regions, all the encoding regions of the first sub-image group to be encoded include a dirty region and a non-dirty region, and the dirty region is a region encoded by referring to a region in another image group other than the image group to which the dirty region belongs;
a second encoding unit 1006, connected to the first encoding unit 1004, is configured to encode the second sub-group of pictures to be encoded, where each second sub-picture to be encoded in the second sub-group of pictures is encoded with reference to only a non-dirty region in the first sub-group of pictures to be encoded.
It should be noted that the obtaining unit 1002 in this embodiment may be configured to perform the step S402, the first encoding unit 1004 in this embodiment may be configured to perform the step S404, and the second encoding unit 1006 in this embodiment may be configured to perform the step S406.
Through these modules, the image group to be encoded of the video to be encoded is acquired, comprising a first sub image group to be encoded and a second sub image group to be encoded; intra refresh encoding is performed on the first sub image group, in which each first sub-image to be encoded is divided into a plurality of coding regions that together contain dirty and non-dirty regions, a dirty region being one encoded with reference to a region in another image group; and the second sub image group is encoded, with each second sub-image to be encoded referencing only the non-dirty region in the first sub image group. This solves the timeliness problem of real-time audio/video communication caused by the excessive delay of the coding and decoding modes in the related art, reducing transmission delay and buffering delay and improving the timeliness of audio and video communication.
As an alternative embodiment, the first encoding unit 1004 includes:
the first encoding module is used for carrying out intra-frame encoding on an intra-frame encoding area in a first image to be encoded, wherein the first image to be encoded is a sub-image in a first sub-image group to be encoded;
the second coding module is used for coding the dirty area in the first image to be coded by taking a first coding reference frame in a previous image group of the image group to be coded as a reference under the condition that the first image to be coded contains the dirty area and the first image to be coded is the first image of the first sub image group to be coded;
and the third encoding module is used for encoding the dirty area in the first image to be encoded by taking a previous image of the first image to be encoded and a second encoding reference frame in a previous image group of the image group to be encoded as references under the condition that the first image to be encoded contains the dirty area and the first image to be encoded is not the first image of the first sub image group to be encoded.
As an alternative embodiment, the first encoding unit 1004 further includes:
and the fourth encoding module is used for encoding the clean area in the first image to be encoded by taking the non-dirty area in the first image of the first sub image group to be encoded and the non-dirty area in the previous image of the first image to be encoded as references under the condition that the first image to be encoded contains the clean area.
As an alternative embodiment, the first encoding unit 1004 includes:
and the fifth encoding module is used for carrying out parallel encoding on a plurality of encoding areas contained in the second image to be encoded, wherein the second image to be encoded is a sub-image in the first sub-image group to be encoded.
Optionally, in this embodiment, the video to be encoded is a video corresponding to the first client in real-time communication.
As an alternative embodiment, the apparatus further comprises:
the transmission unit is used for transmitting a target video stream to a second client through real-time communication connection between a first client and the second client after acquiring the to-be-encoded image group of the to-be-encoded video, wherein the target video stream comprises a sub-video stream obtained by encoding the first to-be-encoded sub-image group and a sub-video stream obtained by encoding the second to-be-encoded sub-image group.
According to still another aspect of embodiments of the present application, there is provided a video decoding apparatus for implementing the above-described video decoding method. Fig. 11 is a block diagram illustrating an alternative video decoding apparatus according to an embodiment of the present application, where as shown in fig. 11, the apparatus may include:
a first obtaining unit 1102, configured to obtain a to-be-decoded image group of a to-be-decoded video, where the to-be-decoded image group includes a first to-be-decoded sub image group and a second to-be-decoded sub image group;
a first decoding unit 1104, connected to the first obtaining unit 1102, configured to perform intra refresh decoding on the first sub image group to be decoded, where each first sub image to be decoded in the first sub image group to be decoded is divided into multiple decoding regions, all the decoding regions of the first sub image group to be decoded include a dirty region and a non-dirty region, and the dirty region is a region decoded with reference to a region in another image group other than the image group to which the dirty region belongs;
the second decoding unit 1006, connected to the first decoding unit 1104, is configured to decode the second sub image group to be decoded, where each second sub image to be decoded in the second sub image group to be decoded refers to only a non-dirty region in the first sub image group to be decoded for encoding.
It should be noted that the first obtaining unit 1102 in this embodiment may be configured to execute step S802, the first decoding unit 1104 may be configured to perform step S804, and the second decoding unit 1106 may be configured to perform step S806.
Acquiring an image group to be decoded of a video to be decoded by the module, wherein the image group to be decoded comprises a first sub image group to be decoded and a second sub image group to be decoded; performing intra-frame refreshing decoding on a first to-be-decoded sub image group, wherein each first to-be-decoded sub image in the first to-be-decoded sub image group is divided into a plurality of decoding areas, all the decoding areas of the first to-be-decoded sub image group contain dirty areas and non-dirty areas, and the dirty areas are areas which are decoded by referring to areas in other image groups except the image group to which the dirty areas belong; and decoding the second sub image group to be decoded, wherein each second sub image to be decoded in the second sub image group to be decoded only refers to a non-dirty area in the first sub image group to be decoded for encoding, so that the problem of real-time audio and video communication timeliness caused by overlarge time delay in a coding and decoding mode in the related technology is solved, transmission delay and buffering delay are reduced, and the audio and video communication timeliness is improved.
As an alternative embodiment, the first decoding unit 1104 includes:
the first decoding module is used for carrying out intra-frame decoding on an intra-frame decoding area in a first image to be decoded, wherein the first image to be decoded is a sub-image in a first sub-image group to be decoded;
the second decoding module is used for taking a first decoding reference frame in a previous image group of the image group to be decoded as a reference and decoding the dirty area in the first image to be decoded under the condition that the first image to be decoded contains the dirty area and the first image to be decoded is the first image of the first sub image group to be decoded;
and the third decoding module is used for decoding the dirty area in the first image to be decoded by taking a previous image of the first image to be decoded and a second decoding reference frame in the previous image group of the image group to be decoded as references under the condition that the first image to be decoded contains the dirty area and the first image to be decoded is not the first image of the first sub image group to be decoded.
As an alternative embodiment, the first decoding unit 1104 further includes:
and the fourth decoding module is used for decoding the clean area in the first image to be decoded by taking the non-dirty area in the first image of the first sub image group to be decoded and the non-dirty area in the previous image of the first image to be decoded as references under the condition that the first image to be decoded contains the clean area.
As an alternative embodiment, the first decoding unit 1104 includes:
and the fifth decoding module is used for decoding a plurality of decoding areas contained in the second image to be decoded in parallel, wherein the second image to be decoded is a sub-image in the first sub-image group to be decoded.
Optionally, in this embodiment, the video to be decoded is a video corresponding to the first client in real-time communication.
As an alternative embodiment, the first obtaining unit 1102 includes:
the acquisition module is used for acquiring the image group to be decoded, which is transmitted by the first client through the real-time communication connection between the first client and the second client.
As an alternative embodiment, the obtaining module includes:
the detection submodule is used for detecting joining operation executed on the second client, wherein the joining operation is used for joining real-time communication among the plurality of first clients;
and the receiving submodule is used for responding to the joining operation and receiving the video to be decoded transmitted by a target client in the plurality of first clients, wherein the video to be decoded is the video starting from the current moment in the video corresponding to the target client.
As an alternative embodiment, the apparatus further comprises:
and the second acquisition unit is used for acquiring a target reference frame corresponding to the video to be decoded after the joining operation executed on the second client is detected, wherein the target reference frame is an image frame which is positioned before the starting frame and is referred to by at least one image frame in the video to be decoded in the image group where the starting frame of the video to be decoded is positioned.
According to another aspect of the embodiments of the present application, there is also provided a real-time communication system, comprising a first device running a first client and a second device running a second client, the two devices being connected through a real-time communication connection. The first device may comprise (or be) any of the video encoding apparatuses provided in the embodiments of the present application, and the second device may comprise (or be) any of the video decoding apparatuses provided in the embodiments of the present application.
It should be noted here that the above modules share the examples and application scenarios of their corresponding steps, but are not limited to the disclosure of the above embodiments. These modules, as part of the apparatus, may run in a hardware environment as shown in fig. 3, where the hardware environment includes a network environment, and may be implemented in software or in hardware.
According to yet another aspect of the embodiments of the present application, there is also provided an electronic device for implementing the above-mentioned video encoding method and/or video decoding method, which may be a server, a terminal, or a combination thereof.
Fig. 12 is a block diagram of an alternative electronic device according to an embodiment of the present application. As shown in fig. 12, it includes a processor 1202, a communication interface 1204, a memory 1206, and a communication bus 1208, through which the processor 1202, the communication interface 1204, and the memory 1206 communicate with each other, where,
a memory 1206 for storing a computer program;
the processor 1202, when executing the computer program stored in the memory 1206, performs the following steps:
s1, acquiring a to-be-encoded image group of a to-be-encoded video, wherein the to-be-encoded image group comprises a first to-be-encoded sub image group and a second to-be-encoded sub image group;
s2, performing intra refresh encoding on the first sub image group to be encoded, wherein each first sub image to be encoded in the first sub image group to be encoded is divided into a plurality of encoding regions, all the encoding regions of the first sub image group to be encoded include a dirty region and a non-dirty region, and the dirty region is a region encoded with reference to a region in another image group other than the image group to which the dirty region belongs;
s3, encoding the second sub-picture group to be encoded, wherein each second sub-picture to be encoded in the second sub-picture group to be encoded refers to only the non-dirty region in the first sub-picture group to be encoded.
Optionally, the processor 1202, when executing the computer program stored in the memory 1206, implements the following steps:
s1, acquiring a to-be-decoded image group of a to-be-decoded video, wherein the to-be-decoded image group comprises a first to-be-decoded sub image group and a second to-be-decoded sub image group;
s2, performing intra refresh decoding on the first sub image group to be decoded, wherein each first sub image to be decoded in the first sub image group to be decoded is divided into a plurality of decoding areas, all the decoding areas of the first sub image group to be decoded include a dirty area and a non-dirty area, and the dirty area is an area decoded with reference to an area in another image group other than the image group to which the dirty area belongs;
s3, decoding the second sub image group to be decoded, wherein each second sub image to be decoded in the second sub image group to be decoded references only the non-dirty region in the first sub image group to be decoded for decoding.
Alternatively, in this embodiment, the communication bus may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 12, but this is not intended to represent only one bus or type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The memory may include RAM, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory. Alternatively, the memory may be at least one memory device located remotely from the processor.
As an example, the memory 1206 may include, but is not limited to, the obtaining unit 1002, the first encoding unit 1004, and the second encoding unit 1006 of the video encoding apparatus, and may further include other module units of the video encoding apparatus, which are not described again in this example.
As another example, the memory 1206 may include, but is not limited to, the first obtaining unit 1102, the first decoding unit 1104, and the second decoding unit 1106 of the video decoding apparatus, and may further include other module units of the video decoding apparatus, which are not described again in this example.
The processor may be a general-purpose processor, and may include but is not limited to: a CPU (Central Processing Unit), an NP (Network Processor), and the like; but also a DSP (Digital Signal Processing), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments, and this embodiment is not described herein again.
It can be understood by those skilled in the art that the structure shown in fig. 12 is only illustrative. The device implementing the video encoding method and/or the video decoding method may be a terminal device such as a smart phone (e.g., an Android or iOS phone), a tablet computer, a palmtop computer, or a Mobile Internet Device (MID) or PAD. Fig. 12 does not limit the structure of the electronic device; for example, the terminal device may include more or fewer components (e.g., a network interface, a display device) than shown in fig. 12, or have a different configuration.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disk, ROM, RAM, magnetic or optical disk, and the like.
According to still another aspect of an embodiment of the present application, there is also provided a storage medium. Optionally, in this embodiment, the storage medium may store program code for executing the above video encoding method and/or video decoding method.
Optionally, in this embodiment, the storage medium may be located on at least one of a plurality of network devices in a network shown in the above embodiment.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps:
s1, acquiring a to-be-encoded image group of a to-be-encoded video, wherein the to-be-encoded image group comprises a first to-be-encoded sub image group and a second to-be-encoded sub image group;
s2, performing intra refresh encoding on the first sub image group to be encoded, wherein each first sub image to be encoded in the first sub image group to be encoded is divided into a plurality of encoding regions, all the encoding regions of the first sub image group to be encoded include a dirty region and a non-dirty region, and the dirty region is a region encoded with reference to a region in another image group other than the image group to which the dirty region belongs;
s3, encoding the second sub-picture group to be encoded, wherein each second sub-picture to be encoded in the second sub-picture group to be encoded refers to only the non-dirty region in the first sub-picture group to be encoded.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps:
s1, acquiring a to-be-decoded image group of a to-be-decoded video, wherein the to-be-decoded image group comprises a first to-be-decoded sub image group and a second to-be-decoded sub image group;
s2, performing intra refresh decoding on the first sub image group to be decoded, wherein each first sub image to be decoded in the first sub image group to be decoded is divided into a plurality of decoding areas, all the decoding areas of the first sub image group to be decoded include a dirty area and a non-dirty area, and the dirty area is an area decoded with reference to an area in another image group other than the image group to which the dirty area belongs;
s3, decoding the second sub image group to be decoded, wherein each second sub image to be decoded in the second sub image group to be decoded references only the non-dirty region in the first sub image group to be decoded for decoding.
Optionally, the specific example in this embodiment may refer to the example described in the above embodiment, which is not described again in this embodiment.
Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing program codes, such as a U disk, a ROM, a RAM, a removable hard disk, a magnetic disk, or an optical disk.
According to yet another aspect of an embodiment of the present application, there is also provided a computer program product or a computer program comprising computer instructions stored in a computer readable storage medium; the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method steps of any of the embodiments described above.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or a part of or all or part of the technical solution contributing to the prior art may be embodied in the form of a software product stored in a storage medium, and including instructions for causing one or more computer devices (which may be personal computers, servers, network devices, or the like) to execute all or part of the steps of the method described in the embodiments of the present application.
In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, and may also be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution provided in the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The foregoing is only a preferred embodiment of the present application and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims (14)

1. A video encoding method, comprising:
acquiring a to-be-encoded image group of a to-be-encoded video, wherein the to-be-encoded image group comprises a first to-be-encoded sub image group and a second to-be-encoded sub image group;
performing intra-frame refreshing coding on the first sub-image group to be coded, wherein each first sub-image to be coded in the first sub-image group to be coded is divided into a plurality of coding regions, all the coding regions of the first sub-image group to be coded comprise a dirty region and a non-dirty region, and the dirty region is a region coded by referring to regions in other image groups except the image group to which the dirty region belongs;
encoding the second sub image group to be encoded, wherein each second sub image to be encoded in the second sub image group to be encoded only refers to the non-dirty region in the first sub image group to be encoded;
wherein the intra refresh coding the first sub image group to be coded includes: carrying out intra-frame coding on an intra-frame coding area in a first image to be coded, wherein the first image to be coded is a sub-image in the first sub-image group to be coded; when the first image to be coded contains the dirty region and the first image to be coded is the first image of the first sub image group to be coded, coding the dirty region in the first image to be coded by taking a first coding reference frame in a previous image group of the image group to be coded as a reference; and under the condition that the first image to be coded contains the dirty area and the first image to be coded is not the first image of the first sub image group to be coded, coding the dirty area in the first image to be coded by taking a previous image of the first image to be coded and a second coding reference frame in the previous image group of the image group to be coded as references.
2. The method according to claim 1, wherein the performing intra refresh encoding on the first sub image group to be encoded further comprises:
when the first image to be encoded contains a clean area, encoding the clean area in the first image to be encoded with reference to the non-dirty area in the first image of the first sub image group to be encoded and the non-dirty area in the previous image of the first image to be encoded.
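By contrast with the dirty region, claim 2's clean-area rule keeps every reference inside the current image group. A self-contained sketch under the same assumptions (gop_frames is a hypothetical list of the group's images):

```python
from typing import List


def clean_region_references(i: int, gop_frames: List) -> List:
    """References for a clean area of the i-th image (i > 0) of the first
    sub image group: the group's first image and the immediately preceding
    image; per claim 2, only their non-dirty areas may be used."""
    return [gop_frames[0], gop_frames[i - 1]]
```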
3. The method according to claim 1, wherein the encoding of the first sub image group to be encoded comprises:
encoding, in parallel, a plurality of coding regions contained in a second image to be encoded, wherein the second image to be encoded is a sub image in the first sub image group to be encoded.
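Because the coding regions of one image are treated as independent here, claim 3's parallel encoding can be sketched with a plain thread pool; encode_region is a hypothetical stand-in for a real per-region encoder:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List


def encode_region(region) -> bytes:
    # Placeholder: a real implementation would emit a compressed
    # bitstream for this single coding region.
    return b""


def encode_image_regions(regions: Iterable) -> List[bytes]:
    # Dispatch every coding region of the image to a worker thread and
    # collect the per-region bitstreams in their original order.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(encode_region, regions))
```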
4. The method according to any one of claims 1 to 3, wherein the video to be encoded is a video corresponding to a first client in real-time communication, and after the acquiring of the image group to be encoded of the video to be encoded, the method further comprises:
transmitting a target video stream to a second client through a real-time communication connection between the first client and the second client, wherein the target video stream comprises a sub video stream obtained by encoding the first sub image group to be encoded and a sub video stream obtained by encoding the second sub image group to be encoded.
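A minimal sketch of claim 4's transmission step follows. The framing (a 1-byte sub-stream id plus a 4-byte length prefix) and the send callable are illustrative assumptions, not a claimed wire format:

```python
import struct
from typing import Callable


def packetize(substream_id: int, payload: bytes) -> bytes:
    # Tag each sub video stream so the receiver can tell them apart.
    return struct.pack(">BI", substream_id, len(payload)) + payload


def transmit_target_stream(send: Callable[[bytes], None],
                           first_substream: bytes,
                           second_substream: bytes) -> None:
    send(packetize(1, first_substream))   # intra-refresh sub image group
    send(packetize(2, second_substream))  # second sub image group
```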
5. A video decoding method, comprising:
acquiring an image group to be decoded of a video to be decoded, wherein the image group to be decoded comprises a first sub image group to be decoded and a second sub image group to be decoded;
performing intra refresh decoding on the first sub image group to be decoded, wherein each first sub image to be decoded in the first sub image group to be decoded is divided into a plurality of decoding regions, all the decoding regions of the first sub image group to be decoded comprise a dirty region and a non-dirty region, and the dirty region is a region decoded with reference to regions in image groups other than the image group to which the dirty region belongs;
decoding the second sub image group to be decoded, wherein each second sub image to be decoded in the second sub image group to be decoded references only the non-dirty region in the first sub image group to be decoded;
wherein the performing intra refresh decoding on the first sub image group to be decoded comprises: performing intra-frame decoding on an intra-frame decoding region in a first image to be decoded, wherein the first image to be decoded is a sub image in the first sub image group to be decoded; when the first image to be decoded contains the dirty region and the first image to be decoded is the first image of the first sub image group to be decoded, decoding the dirty region in the first image to be decoded with reference to a first decoding reference frame in a previous image group of the image group to be decoded; and when the first image to be decoded contains the dirty region and the first image to be decoded is not the first image of the first sub image group to be decoded, decoding the dirty region in the first image to be decoded with reference to a previous image of the first image to be decoded and a second decoding reference frame in the previous image group of the image group to be decoded.
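The decode-side reference rules mirror claims 1 to 3; the distinctive obligation is that predictions for the second sub image group never touch a dirty region. A hypothetical guard (region.kind is an assumed attribute, not the patent's data model) could read:

```python
def check_second_subgroup_reference(region) -> None:
    # Per claim 5, a second sub image to be decoded may reference only
    # non-dirty regions of the first sub image group.
    if getattr(region, "kind", None) == "dirty":
        raise ValueError("second sub image group must not reference a "
                         "dirty region of the first sub image group")
```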
6. The method according to claim 5, wherein the performing intra refresh decoding on the first sub image group to be decoded further comprises:
when the first image to be decoded contains a clean area, decoding the clean area in the first image to be decoded with reference to the non-dirty area in the first image of the first sub image group to be decoded and the non-dirty area in the previous image of the first image to be decoded.
7. The method according to claim 5, wherein the decoding of the first sub image group to be decoded comprises:
decoding, in parallel, a plurality of decoding regions contained in a second image to be decoded, wherein the second image to be decoded is a sub image in the first sub image group to be decoded.
8. The method according to any one of claims 5 to 7, wherein the video to be decoded is a video corresponding to a first client in real-time communication, and the acquiring of the image group to be decoded of the video to be decoded comprises:
acquiring the video to be decoded transmitted by the first client, wherein the video to be decoded is transmitted through a real-time communication connection between the first client and a second client, and the video to be decoded comprises the image group to be decoded.
9. The method according to claim 8, wherein the acquiring of the video to be decoded transmitted by the first client comprises:
detecting a joining operation performed on the second client, wherein the joining operation is used for joining a real-time communication session among a plurality of first clients;
receiving, in response to the joining operation, the video to be decoded transmitted by a target client among the plurality of first clients, wherein the video to be decoded is the portion of the video corresponding to the target client that starts from the current moment.
10. The method according to claim 9, wherein after the detecting of the joining operation performed on the second client, the method further comprises:
acquiring a target reference frame corresponding to the video to be decoded, wherein the target reference frame is an image frame that precedes a starting frame of the video to be decoded within the image group containing the starting frame and that is referenced by at least one image frame in the video to be decoded.
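Claims 9 and 10 describe the mid-stream join path: a late joiner can decode from the current moment once it also holds the earlier frames its first frames reference. A hypothetical sketch, assuming the reference structure outlined for claims 1 and 2:

```python
from typing import List, Sequence


def target_reference_indices(start_index: int) -> List[int]:
    """Indices, within the start frame's image group, of earlier frames
    that frames from start_index onward may reference (an assumption
    mirroring the sketches for claims 1 and 2)."""
    needed = {0}                      # the group's first image
    if start_index > 0:
        needed.add(start_index - 1)   # the image just before the start
    return sorted(i for i in needed if i < start_index)


def fetch_target_reference_frames(gop_frames: Sequence,
                                  start_index: int) -> List:
    # Frames the joining client must obtain before decoding can begin.
    return [gop_frames[i] for i in target_reference_indices(start_index)]
```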
11. A video encoding apparatus, comprising:
an acquisition unit, configured to acquire an image group to be encoded of a video to be encoded, wherein the image group to be encoded comprises a first sub image group to be encoded and a second sub image group to be encoded;
a first encoding unit, configured to perform intra refresh encoding on the first sub image group to be encoded, wherein each first sub image to be encoded in the first sub image group to be encoded is divided into a plurality of coding regions, all the coding regions of the first sub image group to be encoded comprise a dirty region and a non-dirty region, and the dirty region is a region encoded with reference to regions in image groups other than the image group to which the dirty region belongs;
a second encoding unit, configured to encode the second sub image group to be encoded, wherein each second sub image to be encoded in the second sub image group to be encoded references only the non-dirty region in the first sub image group to be encoded;
wherein the first encoding unit comprises: a first encoding module, configured to perform intra-frame encoding on an intra-frame coding region in a first image to be encoded, wherein the first image to be encoded is a sub image in the first sub image group to be encoded; a second encoding module, configured to, when the first image to be encoded contains the dirty region and the first image to be encoded is the first image of the first sub image group to be encoded, encode the dirty region in the first image to be encoded with reference to a first encoding reference frame in a previous image group of the image group to be encoded; and a third encoding module, configured to, when the first image to be encoded contains the dirty region and the first image to be encoded is not the first image of the first sub image group to be encoded, encode the dirty region in the first image to be encoded with reference to a previous image of the first image to be encoded and a second encoding reference frame in the previous image group of the image group to be encoded.
12. A video decoding apparatus, comprising:
a first acquisition unit, configured to acquire an image group to be decoded of a video to be decoded, wherein the image group to be decoded comprises a first sub image group to be decoded and a second sub image group to be decoded;
a first decoding unit, configured to perform intra refresh decoding on the first sub image group to be decoded, wherein each first sub image to be decoded in the first sub image group to be decoded is divided into a plurality of decoding regions, all the decoding regions of the first sub image group to be decoded comprise a dirty region and a non-dirty region, and the dirty region is a region decoded with reference to regions in image groups other than the image group to which the dirty region belongs;
a second decoding unit, configured to decode the second sub image group to be decoded, wherein each second sub image to be decoded in the second sub image group to be decoded references only the non-dirty region in the first sub image group to be decoded;
wherein the first decoding unit comprises: a first decoding module, configured to perform intra-frame decoding on an intra-frame decoding region in a first image to be decoded, wherein the first image to be decoded is a sub image in the first sub image group to be decoded; a second decoding module, configured to, when the first image to be decoded contains the dirty region and the first image to be decoded is the first image of the first sub image group to be decoded, decode the dirty region in the first image to be decoded with reference to a first decoding reference frame in a previous image group of the image group to be decoded; and a third decoding module, configured to, when the first image to be decoded contains the dirty region and the first image to be decoded is not the first image of the first sub image group to be decoded, decode the dirty region in the first image to be decoded with reference to a previous image of the first image to be decoded and a second decoding reference frame in the previous image group of the image group to be decoded.
13. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another via the communication bus;
the memory is configured to store a computer program; and
the processor is configured to perform the method steps of any one of claims 1 to 4, or the method steps of any one of claims 5 to 10, by running the computer program stored in the memory.
14. A computer-readable storage medium having a computer program stored therein, wherein the computer program, when executed, performs the method steps of any one of claims 1 to 4 or the method steps of any one of claims 5 to 10.
CN202011213490.5A 2020-11-04 2020-11-04 Video encoding method, video decoding method, video encoding device, video decoding device, electronic equipment and storage medium Active CN112040234B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011213490.5A CN112040234B (en) 2020-11-04 2020-11-04 Video encoding method, video decoding method, video encoding device, video decoding device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112040234A CN112040234A (en) 2020-12-04
CN112040234B (en) 2021-01-29

Family

ID=73573423

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011213490.5A Active CN112040234B (en) 2020-11-04 2020-11-04 Video encoding method, video decoding method, video encoding device, video decoding device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112040234B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023184467A1 (en) * 2022-04-01 2023-10-05 Intel Corporation Method and system of video processing with low latency bitstream distribution
CN115604481B (en) * 2022-11-28 2023-04-18 成都索贝数码科技股份有限公司 Method, device and system for improving parallelism of encoding and decoding and transmission

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102217315A (en) * 2008-11-12 2011-10-12 Thomson Licensing I-frame de-flickering for GOP-parallel multi-thread video encoding
CN103026709A (en) * 2010-07-28 2013-04-03 Qualcomm Incorporated Coding of inter prediction modes and of reference picture list indexes for video coding
CN106534871A (en) * 2016-12-06 2017-03-22 Peking University Coding method and decoding method for video codec

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Key Coding Techniques of H.266/VVC (Part 5): The AI, RA, and LD Coding Structures; 若忘即安; https://blog.csdn.net/weixin_46969363/article/details/109340590; 2020-10-28; pp. 1-4 *
Motion Adaptive Intra Refresh for the H.264 Video Coding Standard; R. M. Schreier et al.; IEEE Transactions on Consumer Electronics; IEEE; 2006-03-13; Vol. 52, No. 1; pp. 249-253 *

Similar Documents

Publication Title
US10630938B2 (en) Techniques for managing visual compositions for a multimedia conference call
CN112040233B (en) Video encoding method, video decoding method, video encoding device, video decoding device, electronic device, and storage medium
Nightingale et al. HEVStream: a framework for streaming and evaluation of high efficiency video coding (HEVC) content in loss-prone networks
CN1242623C (en) Video coding
CN112333448B (en) Video encoding method and apparatus, video decoding method and apparatus, electronic device, and storage medium
KR101859155B1 (en) Tuning video compression for high frame rate and variable frame rate capture
US8804821B2 (en) Adaptive video processing of an interactive environment
CN112351285B (en) Video encoding method, video decoding method, video encoding device, video decoding device, electronic equipment and storage medium
WO2006104556A2 (en) Split screen multimedia video conferencing
CN110896486B (en) Method and apparatus for encoding and decoding using high-level syntax architecture
KR20140085492A (en) Signaling of state information for a decoded picture buffer and reference picture lists
TW200427335A (en) Picture coding method
CN112040234B (en) Video encoding method, video decoding method, video encoding device, video decoding device, electronic equipment and storage medium
CN114600468A (en) Combining video streams with metadata in a composite video stream
JP2021521752A (en) Hierarchical tiles
US20160182911A1 (en) De-juddering techniques for coded video
KR20090046812A (en) Video encoding
CN112040232B (en) Real-time communication transmission method and device and real-time communication processing method and device
CN106998328B (en) Video transmission method and device
Skupin et al. Packet level video quality evaluation of extensive H. 264/AVC and SVC transmission simulation
US20140289369A1 (en) Cloud-based system for flash content streaming
CN112351284B (en) Video encoding method and apparatus, video decoding method and apparatus, electronic device, and storage medium
US20180352240A1 (en) Generalized Temporal Sub-Layering Frame Work
CN117354524B (en) Method, device, equipment and computer medium for testing coding performance of encoder
EP1739970A1 (en) Method for encoding and transmission of real-time video conference data

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant