CN113259673B

CN113259673B - Scalable video coding method, apparatus, device and storage medium

Info

Publication number: CN113259673B
Application number: CN202110755288.3A
Authority: CN
Inventors: 焦华龙
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2021-07-05
Filing date: 2021-07-05
Publication date: 2021-10-15
Anticipated expiration: 2041-07-05
Also published as: CN113259673A

Abstract

The application provides a method, a device, equipment and a storage medium for encoding a telescopic video, wherein the method comprises the following steps: acquiring a video sequence; acquiring the code rate ratio of the time domain layering layer number N and the N-1 layer of a video sequence in a time domain reference unit respectively, wherein N is an integer greater than 1; determining a time domain reference structure of a video sequence according to the number of time domain layering layers and the code rate ratio of each N-1 layer in a time domain reference unit; and coding the video sequence according to the time domain reference structure to obtain an output code stream. Video compression efficiency and video fluency can thus be balanced.

Description

Scalable video coding method, apparatus, device and storage medium

Technical Field

Embodiments of the present disclosure relate to video processing technologies, and in particular, to a scalable video coding method, apparatus, device, and storage medium.

Background

In a multi-party real-time video communication scene, different networks and different terminal devices exist, so that the requirements of the different networks or the different terminal devices on video quality are different, for example, when a network is used for transmitting video, because the network bandwidth limits video transmission, when the network bandwidth is smaller, only basic video signals are required to be transmitted, whether enhanced video signals are transmitted or not is determined according to the actual network condition, and the video quality is enhanced. Under the background, the scalable video coding technology is utilized to realize one-time coding to generate video compression code streams with different frame rates and resolutions, and then the amount of video information to be transmitted is selected according to different network bandwidths, different display screens and different terminal decoding capabilities, so as to realize the self-adaptive adjustment of the video quality. To enable an encoding technique for decoding video data of different frame rates, resolutions, and image qualities from a single code stream. Scalable Video Coding (SVC) is based on the h.264 Advanced Video Coding (AVC) standard and the h.265 High Efficiency Video Coding (HEVC), and utilizes various High Efficiency algorithm tools of the AVC and HEVC codecs, so that the encoded Video generated by encoding is temporally and spatially Scalable, and is Scalable in Video quality, and can generate videos with different frame rates, resolutions, or quality levels.

Scalability, i.e., scalability, of SVC mainly includes: temporal scalability, spatial scalability, and quality scalability, where temporal scalability is crucial. Currently, when temporal scalability of a video sequence is performed, a coding device uses a fixed number of temporal layers, and once the number of temporal layers is determined, the temporal reference structure is fixed. The current time domain reference structure mainly exchanges a long reference distance in the time domain for the anti-packet loss effect, however, when the reference distance is too large, the code rate of an image frame becomes large, that is, the number of consumed code words becomes large, and further the video compression efficiency is low, and when the reference distance is too small, although the number of consumed code words becomes small, the video compression efficiency is high, the packet loss situation is easy to occur, so that the video smoothness is poor, and therefore, how to balance the video compression efficiency and the video smoothness is a technical problem to be solved urgently in the present application.

Disclosure of Invention

The application provides a scalable video coding method, a scalable video coding device, a scalable video coding apparatus and a scalable video coding storage medium, so that video compression efficiency and video smoothness can be balanced.

In a first aspect, a scalable video coding method is provided, including: acquiring a video sequence; acquiring the code rate ratio of the time domain layering layer number N and the N-1 layer of a video sequence in a time domain reference unit respectively, wherein N is an integer greater than 1; determining a time domain reference structure of a video sequence according to the number of time domain layering layers and the code rate ratio of each N-1 layer in a time domain reference unit; and coding the video sequence according to the time domain reference structure to obtain an output code stream.

In a second aspect, a scalable video coding apparatus is provided, including: the device comprises a first acquisition module, a second acquisition module, a determination module and a coding module, wherein the first acquisition module is used for acquiring a video sequence; the second acquisition module is used for acquiring the code rate ratio of the time domain layering layer number N and the N-1 layer of the video sequence in a time domain reference unit; the determining module is used for determining a time domain reference structure of the video sequence according to the time domain layering layer number and the code rate ratio of each of the N-1 layers in one time domain reference unit; and the coding module is used for coding the video sequence according to the time domain reference structure so as to obtain an output code stream.

In a third aspect, a terminal device is provided, including: a processor and a memory, the memory being configured to store a computer program, the processor being configured to invoke and execute the computer program stored in the memory to perform a method as in the first aspect or its implementations.

In a fourth aspect, there is provided a computer readable storage medium for storing a computer program for causing a computer to perform the method as in the first aspect or its implementations.

In a fifth aspect, there is provided a computer program product comprising computer program instructions to cause a computer to perform the method as in the first aspect or its implementations.

A sixth aspect provides a computer program for causing a computer to perform a method as in the first aspect or implementations thereof.

Through the technical scheme of this application, terminal equipment can set up the time domain layering number of piles and the code rate of each layer and account for than, prescribe a limit to the code rate of each layer promptly, that is to say, to different terminal equipment or different networks that different terminal equipment adopted, can set up different time domain layering number of piles and the code rate of each layer and account for than, for example: for a terminal device with better performance, the code rate of the base layer can be set to be higher than the code rate of the base layer of the terminal device with poorer performance, and the terminal device with better performance can consume more code words, so that the video smoothness of the terminal device can be ensured. For a terminal device with poor performance, the ratio of the base layer code rate that can be set for the terminal device is lower than the ratio of the base layer code rate that is set for the terminal device with better performance, because the terminal device with better performance cannot consume more codewords, it is necessary to ensure that the compression efficiency of such a terminal device is lower, and in short, by limiting the ratio of the code rates of each layer, the video compression efficiency and the video smoothness can be balanced.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

Fig. 1 is a schematic diagram of a multi-party real-time video communication scenario provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of a typical video encoder;

FIG. 3 is a schematic diagram of a 3-layer time domain reference structure;

fig. 4 is a flowchart of a scalable video coding method according to an embodiment of the present application;

fig. 5 is a schematic diagram of a time domain reference structure provided in an embodiment of the present application;

fig. 6 is a schematic diagram of another time domain reference structure provided in an embodiment of the present application;

FIG. 7 is a diagram illustrating a time-domain reference structure according to an embodiment of the present application;

FIG. 8 is a diagram illustrating a time-domain reference structure provided in an embodiment of the present application;

fig. 9 is a schematic diagram of an apparatus for scalable video coding according to an embodiment of the present application;

fig. 10 is a schematic block diagram of a terminal device provided in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The application scenario of the technical solution of the present application may be a multi-party real-time video communication scenario, but is not limited thereto. Fig. 1 is a schematic view of a multi-party real-time video communication scenario provided in an embodiment of the present application, and as shown in fig. 1, a multi-party real-time video communication system includes: various terminal devices 110 and servers 120, wherein terminal devices 110 and servers 120 are connected via a network.

It should be understood that when each of the terminal devices 110 performs real-time video communication, any one of the terminal devices 110 performs one-time encoding by using a scalable video encoding technique to generate video compressed code streams with different frame rates and resolutions, and then sends the video compressed code streams to the server 120, and then the server 120 selects the amount of video information to be transmitted according to different network bandwidths, different display screens, and the decoding capabilities of other terminal devices, and sends corresponding video information to the other terminal devices.

It should be noted that the terminal devices 110 of each party may also directly transmit video information, for example: when each terminal device 110 performs real-time video communication, any terminal device 110 uses a scalable video coding technology to realize one-time coding to generate video compression code streams with different frame rates and resolutions, then selects the amount of video information to be transmitted according to different network bandwidths, different display screens and the decoding capabilities of other terminal devices, and sends corresponding video information to other terminal devices.

It should be understood that any of the terminal devices 110 described above may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a wearable device, and the like, and the present application is not limited thereto. The server 130 may be an independent physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers, which is not limited in this application.

As described above, in a multi-party real-time video communication scenario, different networks and different terminal devices exist, so that the requirements of the different networks or different terminal devices on video quality are different, for example, when a network is used to transmit video, since the network bandwidth limits video transmission, when the network bandwidth is small, only a basic video signal is transmitted, and whether an enhanced video signal is transmitted or not is determined according to the actual network condition, so that the video quality is enhanced. Under the background, the scalable video coding technology is utilized to realize one-time coding to generate video compression code streams with different frame rates and resolutions, and then the amount of video information to be transmitted is selected according to different network bandwidths, different display screens and different terminal decoding capabilities, so as to realize the self-adaptive adjustment of the video quality. To enable an encoding technique for decoding video data of different frame rates, resolutions, and image qualities from a single code stream. SVC is based on the h.264 AVC standard, h.265 HEVC, and utilizes various efficient algorithm tools of AVC and HEVC codecs, and is scalable in terms of temporal, spatial, and video quality of coded video generated by coding, and can generate video of different frame rates, resolutions, or quality levels.

Scalability, i.e., scalability, of SVC mainly includes: temporal, spatial and quality scalability. Wherein, the time gradable representation subcode stream contains video information with reduced play frame rate. The spatially scalable representation of the sub-stream contains video information with reduced spatial resolution of the image. Quality scalable means that the sub-stream provides the same spatial resolution as the full stream, but at a lower quality.

It should be understood that the sub-streams mentioned above are streams corresponding to each temporal hierarchy, or streams corresponding to each spatial hierarchy, or streams corresponding to each quality hierarchy. The complete code stream is a code stream corresponding to the entire video sequence.

Video coding is a process of compressing an original video (i.e., the video sequence) into a minimum number of bits while ensuring that a decoded reconstructed video has a certain playing quality. Fig. 2 is a schematic diagram of a typical video encoder, and as shown in fig. 2, the typical video encoder is generally mainly divided into the following basic units, i.e., a time domain model, a spatial domain model and an entropy encoder. The time domain model inputs an uncompressed video sequence, and uses the correlation between adjacent image frames in the sequence to remove the time domain redundancy, which is usually implemented by establishing a predicted image for the current frame. The time domain model outputs a residual image obtained by the difference between the current image and the predicted image, and model parameters such as a motion vector and the like. The input of the spatial domain model is a reference image output in the front, and the function of the spatial domain model is to transform and quantize the pixel value of the residual image by utilizing the correlation of adjacent pixel points in the residual image and remove spatial redundancy. After quantization of the transform coefficients, only a small number of key coefficient values are retained as input data. The output parameters of the temporal and spatial domain models are compressed by an entropy coder to remove statistical redundancy in the data. The compressed encoded output mainly contains parameters such as encoded motion vectors, residuals and the like and header information.

The SVC encoder encodes video into a plurality of spatial layers, a plurality of temporal layers, and a plurality of quality layers. The temporal scalability technology can support multiple video playback frame rates through a single code stream, and a video stream supporting temporal scalability can be decomposed into a base layer and one or more enhancement layers in the temporal domain.

The current video coding structure may be: an All Intra (All Intra) coding structure, a Low Latency (LP) coding structure, and a Random Access (RA) coding structure.

In the LP coding structure, only the first frame image is coded in an intra-frame manner, and even if an Instantaneous Decoding Refresh (IDR) frame is formed, each subsequent frame is coded as a general P and B frame (GPB). The low-delay coding structure is further divided into two coding structures, namely LDB and LDP, according to whether the type of the subsequent frame is a B frame or a P frame. It should be noted that, at present, image frames are classified into three categories, I frames, P frames, and B frames, according to prediction types, where all coding units in an I frame are coded only by using intra-frame prediction. Coding units in P frames may be encoded using intra prediction or uni-directional inter prediction. Coding units in B frames may be encoded using intra prediction or bi-directional inter prediction. That is, a P frame has only one reference list, while a B frame has two reference lists.

When the LP coding structure is adopted, the playing sequence and the coding sequence of the video sequence are consistent. Therefore, each frame of image only refers to the reconstructed frame of which the playing sequence is before the current coding image, the video sequence is coded and decoded according to the playing sequence, the coding and decoding of the image of which the coding sequence is behind the current image and the playing sequence is in front are not required to be waited, the time delay is relatively smaller, and the low-delay structure is named from the low-delay structure, is mainly designed for interactive real-time communication and is suitable for scenes with higher time delay requirements, such as live broadcast, video call and the like.

As described above, SVC can solve the problem of inconsistent video quality due to different terminal devices and different network conditions, and this is most widely adopted in temporal layering in terms of implementation complexity, cost performance, and compatibility. The layering of a video sequence in the time domain is usually determined by a preset number of time domain layering layers, and then a coded video sequence is referred according to the time domain reference structure, so that the coded code stream is transmitted to a server or a receiving end, and the server performs secondary distribution or the receiving end performs selective decoding and other processing. It should be understood that the temporal reference structure here, i.e. the structure of the video sequence after temporal layering, is called a temporal reference structure because each image frame may be a reference image frame of other image frames. The quality of the selection of the time domain reference structure seriously affects the definition, the fluency and other key indexes of the video. Fig. 3 is a schematic diagram of a 3-layer temporal reference structure, where pn (m) denotes an image frame located at the nth layer and having a video timing of the mth frame, as shown in fig. 3. Except that the 0 th frame is an IDR frame, the other frames are P frames, that is, the frames all adopt a unidirectional prediction mode, and the direction pointed by the arrow in fig. 3 represents a reference image frame, for example: the reference image frame of P3 (1) is IDR (0), the 3-layer temporal reference structure is composed of a plurality of temporal reference units, and the structures of these temporal reference units are identical, as shown in fig. 3, IDR (0), P3 (1), P2 (2), and P3 (3) constitute one temporal reference unit, and P1 (4), P3 (5), P2 (6), and P3 (7) constitute another temporal reference unit, and the structures of these two temporal reference units are identical.

In a low-latency lossy network environment, only the image frame loss in the base layer, i.e., layer 1 in fig. 3, needs to wait for the next key frame (i.e., I frame) to arrive, while the image frames in other enhancement layers, e.g., layer 2 or layer 3, are lost and do not need to be retransmitted and wait for the next key frame. This property is inherently packet loss resistant, for example: in the traditional IPPP coding, a key frame is waited for any 1 frame loss, and when the image frames in the base layer are separated by N frames, the probability of the image frame loss in the base layer is 1/N of the image frame loss in the IPPP coding, that is, the temporal reference structure has a good anti-packet loss characteristic. When Forward Error Correction (FEC) protection is selectively performed, only the base layer needs to be performed, so that effective utilization of bandwidth is greatly increased, and delay caused by a large amount of retransmission is reduced. However, the current time domain reference structure mainly uses a longer reference distance in the time domain to replace the effect of packet loss resistance, and when the reference distance is too large, the code rate of an image frame becomes large, that is, the number of consumed code words becomes large, thereby causing the problem of low video compression efficiency. For example: when Quantization Parameter (QP) =30, the relationship of codewords consumed at different reference distances is shown in table 1:

TABLE 1

Reference distance	1	2	3	4	6	8
							1	1	1.436	1.719	1.974	2.404	2.704
2		1	1.197	1.374	1.652	1.864
							3			1	1.148	1.547	1.573
4				1	1.218	1.53

The horizontal and vertical terms in the table are reference distances, which form reference distance pairs, and the number corresponding to each reference distance pair represents the relationship between the consumed code words at two reference distances, for example: row 2, column 3 number 1.436 in table 1 indicates that the codeword consumed for a reference distance of 2 is 1.436 times the codeword consumed for a reference distance of 1. The row 2, column 4 number 1.719 of table 1 indicates that the codeword consumed for the reference distance of 3 is 1.719 times the codeword consumed for the reference distance of 1.

Assuming that the codeword consumed by each P frame is 1 in IPPP mode, 4P frames consume 4 codewords, wherein the reference distance between adjacent P frames is 1 in IPPP mode, and in summary, the 4P frames consume 4 codewords in IPPP mode 1= 4. Assume that a 3-layer reference structure as shown in fig. 3 is employed for these 4P frames, for example: the 4P frames are P1 (4), P3 (5), P2 (6), P3 (7) in fig. 3, and the reference image frame of P1 (4) is IDR (0), and the reference distance of P1 (4) to IDR (0) is 4-0= 4; the reference image frame of P3 (5) is P1 (4), the reference distance of P3 (5) from P1 (4) is 5-4= 1; the reference image frame of P2 (6) is P1 (4), the reference distance of P2 (6) from P1 (4) is 6-4= 2; the reference image frame of P3 (7) is P2 (6), and the reference distance of P3 (7) from P2 (6) is 7-6= 1. Referring to table 1, when the reference distance is 4, it consumes 1.974 times as many codewords as it consumes when the reference distance is 1. When the reference distance is 2, it consumes 1.436 times as many codewords as it consumes when the reference distance is 1. Then when the 3-layer reference structure shown in fig. 3 is adopted, the codewords consumed by the 4P frames are 1.974+1.436+2 × 1=5.41, and the codewords consumed by the 3-layer reference structure shown in fig. 3 are 5.41/4 ≈ 1.35 of the codewords consumed in IPPP mode, that is, 35% more codewords are consumed by SVC than IPPP mode, thereby causing a problem of low video compression efficiency.

When the reference distance is too small, although the consumed code words are reduced, the video compression efficiency is higher, the packet loss is easy to occur, so that the video smoothness is poorer, and therefore how to balance the video compression efficiency and the video smoothness is a technical problem to be solved urgently in the present application.

In order to solve the above technical problem, in the present application, the number of time domain layers and the bit rate ratio of each layer may be set, that is, the bit rate ratio of each layer is limited, so that the video compression efficiency and the video smoothness may be balanced.

It should be understood that the code rate is the number of data bits transmitted per unit time during data transmission, and the code rate is also called bit rate, which indicates how many bits are needed to represent the video data after compression coding per second, i.e. the amount of data after compressing the image displayed per second. Therefore, the consumed code words can also be viewed from the perspective of code rate, that is, the more code words consumed, the larger the code rate. Conversely, the fewer codewords consumed, the smaller the code rate.

The scheme of the application will be explained in detail as follows:

fig. 4 is a flowchart of a scalable video coding method provided by an embodiment of the present application, where the method may be performed by a coding device, and the coding device may be any terminal device shown in fig. 1, but is not limited thereto, and as shown in fig. 4, the method includes the following steps:

s410: a video sequence is acquired.

S420: the code rate ratio of the time domain layering layer number N and the N-1 layer of the video sequence in a time domain reference unit is obtained, wherein N is an integer larger than 1.

S430: and determining the time domain reference structure of the video sequence according to the time domain layering layer number and the code rate ratio of the N-1 layers in one time domain reference unit.

S440: and coding the video sequence according to the time domain reference structure to obtain an output code stream.

It should be understood that the video sequence may be captured by a camera of the terminal device, and the video sequence may be referred to as a video or an image sequence, which is not limited in this application.

As described above, in SVC technology, a terminal device may perform temporal classification on a video sequence, where the number of temporal classification layers may be 2 layers, 3 layers, or more, which is not limited in this application. For example: as shown in fig. 3, the video sequence has 3 temporal layering levels.

It should be understood that the time domain reference structure is composed of a plurality of time domain reference units, and the structures of the time domain reference units are identical, as shown in fig. 3, IDR (0), P3 (1), P2 (2), and P3 (3) constitute one time domain reference unit, and P1 (4), P3 (5), P2 (6), and P3 (7) constitute another time domain reference unit, and the structures of the two time domain reference units are identical.

Optionally, when the video sequence is divided into N layers in the time domain, the terminal device only needs to obtain a code rate ratio of any N-1 layers in the N time domain layers in one time domain reference unit, but is not limited thereto. For example: assuming that the video sequence is divided into N layers, the terminal device may obtain the ratio of the code rates of the layer 1, the layer 2 to the N-1 in one temporal reference unit, or the terminal device may obtain the ratio of the code rates of the layer 2, the layer 3 to the N in one temporal reference unit.

It should be understood that the terminal device may also obtain N layers, that is, the ratio of the code rates of all the time domain layers in one time domain reference unit, because the sum of the ratio of the code rates of the N layers in one time domain reference unit is 1, and it is the number of image frames corresponding to the i +1 th layer of each image frame in the i-th layer to be determined when the terminal device subsequently determines the time domain reference structure, that is, N-1 unknown quantities are determined, therefore, the N-1 unknown quantities may be determined only by satisfying N-1 conditions, and based on this, in practical application, the terminal device only needs to utilize the ratio of the code rates of any N-1 layers in the N layers in one time domain reference unit.

It should be understood that for any one of the above N-1 layers, the ratio of the code rates in one temporal reference unit refers to the ratio of the sum of the code rates of all image frames in the layer to the sum of the code rates of all image frames in all layers in the temporal reference unit.

For example, as shown in fig. 3, assuming that the code rate of each P frame is 1 in IPPP mode, the sum of the code rates of 4P frames is 4, wherein the reference distance between adjacent P frames is 1 in IPPP mode, and in summary, the sum of the code rates of the 4P frames is 4 × 1=4 in IPPP mode. Assume that a 3-layer reference structure as shown in fig. 3 is employed for these 4P frames, for example: the 4P frames are P1 (4), P3 (5), P2 (6), P3 (7) in fig. 3, and the reference image frame of P1 (4) is IDR (0), and the reference distance of P1 (4) to IDR (0) is 4-0= 4; the reference image frame of P3 (5) is P1 (4), the reference distance of P3 (5) from P1 (4) is 5-4= 1; the reference image frame of P2 (6) is P1 (4), the reference distance of P2 (6) from P1 (4) is 6-4= 2; the reference image frame of P3 (7) is P2 (6), and the reference distance of P3 (7) from P2 (6) is 7-6= 1. Referring to table 1, when the reference distance is 4, the code rate is 1.974 times the code rate when the reference distance is 1. When the reference distance is 2, its code rate is 1.436 times that when the reference distance is 1. Then when the 3-layer reference structure shown in fig. 3 is adopted, the sum of the code rates of the 4P frames is 1.974+1.436+2 × 1=5.41, and in the temporal reference unit formed by P1 (4), P3 (5), P2 (6) and P3 (7), the code rate of the 0 th layer is 1.974, and therefore, the code rate of the 0 th layer in the temporal reference unit is 1.974/5.41 ≈ 0.36.

It should be noted that, for different terminal devices, different time domain layering numbers and code rate ratios of the layers may be set, for example: for a terminal device with better performance, the ratio of the base layer code rate set for the terminal device can be higher than that of the base layer code rate of the terminal device with poorer performance, and because the terminal device with better performance can consume more code words, the video fluency of the terminal device can be ensured. For a terminal device with poor performance, the base layer bit rate that can be set for the terminal device is lower than the base layer bit rate that is set for the terminal device with better performance, because the terminal device with better performance cannot consume more codewords, it is necessary to ensure that the compression efficiency of such terminal device is lower.

Similarly, for different networks used by each terminal device, different code rate ratios may also be set for the terminal devices, for example: if the terminal device a uses a wireless fidelity (wireless fidelity) network and the terminal device B uses a 5G mobile network, a lower base layer bit rate ratio may be set for the terminal device a and a higher base layer bit rate ratio may be set for the terminal device B. It should be understood that, since the parameters mainly involved in a temporal reference structure are the number of temporal layering layers and the number of image frames corresponding to the i +1 th layer of each image frame in the i-th layer in any temporal reference unit in the temporal reference structure, i =1,2 … … N, where N represents the number of layers of the temporal reference structure, and the 1 st layer is the base layer. And the number of the time domain layering layers is configured, so that the terminal equipment determines that the number of the image frames corresponding to the (i + 1) th layer of each image frame in the ith layer in any one time domain reference unit is the greatest in determining the time domain reference structure. It should be understood that, for any image frame in the ith layer, the number of image frames corresponding to the image frame in the (i + 1) th layer refers to the number of image frames corresponding to the image frame in the (i + 1) th layer, that is, the number of image frames in the (i + 1) th layer that can be inserted between the image frame and the image frame subsequent to the image frame in the ith layer. For example, as in the temporal reference structure shown in fig. 3, P1 (4) is an image frame located at layer 1, the image frame following the image frame is P1 (8), and the image frame located at layer 2 and that can be inserted into P1 (4) and P1 (8) is P2 (6), and thus, the number of image frames corresponding to P1 (4) at layer 2 is 1.

Optionally, the terminal device may determine, in two realizable manners, the number of image frames corresponding to each image frame in the i-th layer in one time-domain reference unit in the i + 1-th layer, but is not limited to this: the implementation mode is as follows: and the terminal equipment determines the number of image frames corresponding to each image frame in the ith layer in the ith +1 layer in one time domain reference unit according to the number of time domain layering layers and the code rate ratio of the N-1 layers in one time domain reference unit. The second implementation mode: the terminal device determines the number of image frames corresponding to the (i + 1) th layer of each image frame in the ith layer in one time domain reference unit before determining the number of image frames corresponding to the (i + 1) th layer of each image frame in the ith layer in one time domain reference unit according to the number of time domain layering layers and the code rate ratio of the N-1 layers in the time domain reference unit, and the lower limit value of the code rate of each image frame in the N-1 layer can be obtained for the time domain reference unit, wherein the lower limit value of the code rate is the lower limit value of the code rate of each image frame relative to the image frame in the highest enhancement layer. Correspondingly, the terminal equipment can determine the number of image frames corresponding to each image frame in the ith layer in one time domain reference unit in the (i + 1) th layer according to the code rate lower limit value of each image frame in the N-1 layer, the number of time domain layering layers and the code rate ratio of the N-1 layers in one time domain reference unit.

That is, the limitation of the lower limit of the code rate is not involved in the first implementation, but is involved in the second implementation.

The following is a detailed description of the first implementation:

the terminal equipment is assumed to acquire the number N of time-domain layering layers and the ratio of code rates of the 1 st layer to the N-1 st layer in one time-domain reference unit, and the code rate of any image frame in the N-th layer is assumed to be 1, and the code rate of each image frame in the 1 st layer relative to any image frame in the N-th layer is assumed to be

The code rate of layer 1 in a time domain reference unit is

In the time domain reference unit, each image frame in the layer 1 corresponds to the image in the layer 2The number of frames is

Similarly, assume that each image frame in layer 2 is a code rate relative to any image frame in layer N is

The code rate of layer 2 in a time domain reference unit is

In the time domain reference unit, the number of image frames corresponding to each image frame in the layer 2 in the layer 3 is equal to

And so on, assuming that each image frame in the N-1 th layer is a code rate relative to any image frame in the N-1 th layer is

The code rate of the layer N-1 in a time domain reference unit is

In the time domain reference unit, the number of image frames corresponding to each image frame in the N-1 th layer in the N layer is

. Then it can be calculated by the following formula

，

……

：

Wherein the content of the first and second substances,

indicating the number of image frames included in the ith layer in the above time-domain reference unit,

。

it should be understood that for the above N formulas, it is known

，

……

Then it can be determined

，

……

To do so

，

……

As long as it satisfies more than 0.

The following detailed description is directed to implementation two:

it should be understood that the code rate lower limit of an image frame referred to in this application refers to a lower limit of the code rate of the image frame, and the code rate of the image frame is the code rate of the image frame relative to the image frame in the highest enhancement layer. Of course, the code rate may also be a code rate relative to other image frames, for example: the code rate is relative to the image frame in the second last layer, as long as the reference values relative to the code rates of all the image frames are uniform, which is not limited in the present application.

The code rate of layer 1 in a time domain reference unit is

In the time domain reference unit, the number of image frames corresponding to each image frame in the layer 1 in the layer 2 is equal to

The lower limit value of the code rate of each image frame in the layer 1 is

The code rate of layer 2 in a time domain reference unit is

The lower limit value of the code rate of each image frame in the layer 2 is

The code rate of the layer N-1 in a time domain reference unit is

The lower limit value of the code rate of each image frame in the N-1 layer is

. Then it can be calculated by the following formula

，

……

：

And the following conditions need to be satisfied:

。

wherein the content of the first and second substances,

。

it should be understood that for the above N formulas, it is known

，

……

Then it can be determined

，

……

To do so

，

……

As long as the above conditions are satisfied.

The following determination is made for the case where the number of time-domain layered layers is 2

The process of (a) is exemplified: wherein the content of the first and second substances,

is shown inThe number of image frames corresponding to each image frame in the layer 1 in the time domain reference unit in the layer 2 is as follows:

illustratively, it is assumed that the terminal device has acquired 2 temporal layering levels, and that the code rate of any image frame in the 2 nd level is 1, and the code rate of each image frame in the 1 st level relative to any image frame in the 2 nd level is

The code rate of layer 1 in a time domain reference unit is

The lower limit value of the code rate of each image frame in the layer 1 is

Then, it can be calculated by the following formula

：

Wherein, in the step (A),

namely, it is

Based on which it can be determined

And the following conditions need to be satisfied:

that is to say

Then, then

Due to the fact that

Take on values of integers and, therefore,

suppose that

Then, then

Wherein, in the step (A),

indicating a rounded up symbol.

The following determination is made for the case where the number of time-domain layered layers is 3

And

indicating the number of image frames corresponding to each image frame in layer 2 in layer 1 in the time domain reference unit,

representing the number of image frames corresponding to each image frame in the layer 2 in the time domain reference unit in the layer 3:

illustratively, assume that the terminal device acquires the number of time domain layering layers 3, andassuming that the code rate of any one image frame in the layer 3 is 1, the code rate of each image frame in the layer 1 relative to any one image frame in the layer 3 is

The code rate of layer 1 in a time domain reference unit is

The lower limit value of the code rate of each image frame in the layer 1 is

The code rate of each image frame in layer 2 relative to any image frame in layer 3 is

The code rate of layer 2 in a time domain reference unit is

The lower limit value of the code rate of each image frame in the layer 2 is

Then, it can be calculated by the following formula

And

：

wherein the content of the first and second substances,

namely, it is

And the following conditions need to be satisfied:

and therefore, the first and second electrodes are,

based on the above,

due to the fact that

And therefore, the first and second electrodes are,

and due to

And therefore, the first and second electrodes are,

therefore, it is

Finally, finally

。

Due to the fact that

And therefore, the first and second electrodes are,

substituting it into the formula

In (1) obtaining

Due to the fact that

I.e. by

And due to

Therefore, it is

Finally, finally

。

Suppose that

Then finally determined

，

。

Further, after the terminal device determines the number of image frames corresponding to the i +1 th layer of each image frame in the i-th layer in one temporal reference unit, the terminal device may determine the temporal reference structure.

Exemplarily, fig. 5 is a schematic diagram of a temporal reference structure provided by an embodiment of the present application, and as shown in fig. 5, the temporal reference structure includes a 3-layer temporal reference structure, and the number of image frames corresponding to each image frame in the layer 1 in a temporal reference unit at the layer 2 is 2, for example: the image frames of the layer 2 corresponding to the layer 1P 1 (6) are P2 (8) and P2 (10), the arrow direction refers to the reference image frame, such as the reference image frame of P2 (8) is P1 (6), and the number of image frames corresponding to the layer 3 in each image frame of the layer 2 in one temporal reference unit is 1, for example: the image frame of the 3 rd layer corresponding to the image frame P2 (8) of the 2 nd layer is P2 (9).

Exemplarily, fig. 6 is a schematic diagram of another time-domain reference structure provided by an embodiment of the present application, and as shown in fig. 6, the time-domain reference structure has a 2-layer time-domain reference structure, and the number of image frames corresponding to each image frame in the layer 1 in a time-domain reference unit at the layer 2 is 2, for example: the image frames of the 2 nd layer corresponding to the P1 (3) of the 1 st layer are P2 (4) and P2 (5), and the arrow direction refers to a reference image frame, such as the reference image frame of P2 (4) is P1 (3).

It should be noted that, the present application does not limit the terminal device to determine the time domain reference structure according to the number of image frames corresponding to the (i + 1) th layer of each image frame in the ith layer in one time domain reference unit.

Further, after determining the time-domain reference structure of the video sequence, the terminal device may determine whether intra-frame prediction or inter-frame prediction is specifically adopted for each image frame to perform image prediction, so as to obtain prediction information of a coding unit (i.e., an image block) in each image frame, and then subtract the prediction information from an original signal of the coding unit, so as to obtain a residual signal. After prediction, the amplitude of the residual signal is much smaller than that of the original signal, and transformation and quantization operations are further performed on the residual signal. And obtaining a transformation quantization coefficient after transformation quantization. And finally, coding the quantization coefficient and other indication information in the coding by an entropy coding technology to obtain a code stream.

To sum up, in the present application, the terminal device may set the number of time-domain layers and the ratio of code rates of each layer, that is, the ratio of code rates of each layer is defined, that is, different time-domain layers and ratios of code rates of each layer may be set for different terminal devices or different networks adopted by different terminal devices, for example: for a terminal device with better performance, the code rate of the base layer can be set to be higher than the code rate of the base layer of the terminal device with poorer performance, and the terminal device with better performance can consume more code words, so that the video smoothness of the terminal device can be ensured. For a terminal device with poor performance, the ratio of the base layer code rate that can be set for the terminal device is lower than the ratio of the base layer code rate that is set for the terminal device with better performance, because the terminal device with better performance cannot consume more codewords, it is necessary to ensure that the compression efficiency of such a terminal device is lower, and in short, by limiting the ratio of the code rates of each layer, the video compression efficiency and the video smoothness can be balanced. Furthermore, the method and the device can also specify a code rate lower limit value of the image frame, so that the video fluency can be further ensured.

It should be understood that in the actual image coding, the terminal device may code each image frame with a certain code rate to avoid the problem of low compression efficiency, and the determination of the code rate of each image frame will be described below:

alternatively, the terminal device may determine the code rate of each image frame in the N-1 layer according to the time domain reference structure and the ratio of the code rates of the N-1 layer in one time domain reference unit, where the code rate is the code rate of each image frame relative to the image frame in the highest enhancement layer, as described above. Of course, the code rate may also be a code rate relative to other image frames, for example: the code rate is relative to the image frame in the second last layer, as long as the reference values relative to the code rates of all the image frames are uniform, which is not limited in the present application. Further, when the terminal device performs coding of the video sequence, for each image frame in the N-1 layer, the coding rate of the image frame may be used for coding.

Optionally, the terminal device may calculate according to the following formula

，

……

：

Wherein the content of the first and second substances,

. In the formula

，

……

Is known, and

，

……

has been determined by the terminal device, and therefore, by substituting these already-determined parameters into the above formula, the result can be obtained

，

……

。

For example, if the number of time-domain layers is 2, the code rate of the base layer in one time-domain reference unit is equal to

The number of image frames in the enhancement layer corresponding to the image frames in the base layer in one time domain reference unit is

Code rate of each image frame in the base layer

。

For example, if the number of time-domain layers is 3, the code rate of the base layer in one time-domain reference unit is equal to

The code rate of layer 2 in a time domain reference unit is

The number of image frames in the base layer in the 2 nd layer in one time domain reference unit corresponds to the image frame number

The number of image frames in the layer 2 in one time domain reference unit corresponds to the number of image frames in the layer 3

Code rate of each image frame in the base layer

Code rate of each image frame in layer 2

。

In summary, in the present application, the terminal device may determine a code rate of each image frame in the time-domain reference structure, so as to perform encoding according to the respective code rates of the image frames.

Optionally, the terminal device may further determine whether there is a requirement for packet loss resistance, if there is no requirement for packet loss resistance, the time domain reference structure is not adjusted, and if there is a requirement for packet loss resistance, the time domain reference structure is adjusted. The terminal device may adjust the time domain reference structure in the following adjustment manner, but is not limited thereto: assuming that the number of image frames corresponding to the image frame in the i +1 th layer in an i-th layer in a temporal reference unit is greater than 1, for a last image frame in a plurality of image frames in the i +1 th layer corresponding to the image frame in the i-th layer, a reference image frame of the last image frame may be adjusted to be an image frame in the i-th layer. For example: when the time domain is divided into two layers and has better high network packet loss resistance, a certain Forward Error Correction (FEC) is added to the layer 1, and a time domain reference structure as shown in fig. 7 is selected; without the requirement of packet loss resistance, in order to maximize the image quality and the compression efficiency, the temporal reference structure shown in fig. 6 is adopted, and compared to the temporal reference structure shown in fig. 6, each image frame in the layer 1 in the temporal reference structure of fig. 7 is an image frame in the layer 1 of the reference image frame of the last image frame corresponding to the layer 2, for example: the reference image frame of P2 (2) is P1 (0), and the reference image frame of P2 (5) is P1 (3). For another example: if the time domain is divided into three layers and has better high network packet loss resistance, then a certain FEC is added into the layer 1, and the time domain reference structure shown in the figure 8 is selected; without the requirement of packet loss resistance, in order to maximize the image quality and the compression efficiency, the temporal reference structure shown in fig. 5 is adopted, and compared to the temporal reference structure shown in fig. 5, each image frame in the layer 1 in the temporal reference structure of fig. 8 is an image frame in the layer 1 of the reference image frame of the last image frame corresponding to the layer 2, for example: the reference image frame of P2 (4) is P1 (0), and the reference image frame of P2 (10) is P1 (6).

To sum up, in the present application, the terminal device may determine whether there is a requirement for packet loss resistance, if there is no requirement for packet loss resistance, the time domain reference structure is not adjusted, and if there is a requirement for packet loss resistance, the time domain reference structure is adjusted, so that while video compression efficiency and video smoothness are balanced, packet loss resistance can be better achieved, that is, video smoothness is better achieved.

Fig. 9 is a schematic diagram of a scalable video coding apparatus according to an embodiment of the present application, as shown in fig. 9, the scalable video coding apparatus includes:

a first obtaining module 901, configured to obtain a video sequence.

A second obtaining module 902, configured to obtain a code rate ratio of a time domain reference unit of each of the time domain layering layer number N and the N-1 layer of the video sequence.

A first determining module 903, configured to determine a time domain reference structure of the video sequence according to the number of time domain layering layers and a ratio of code rates of N-1 layers in a time domain reference unit.

And an encoding module 904, configured to encode the video sequence according to the temporal reference structure to obtain an output code stream.

Optionally, the first determining module 903 is specifically configured to: determining the number of image frames corresponding to the (i + 1) th layer of each image frame in the ith layer in one time domain reference unit according to the number of time domain layering layers and the code rate ratio of the N-1 layers in one time domain reference unit, wherein i =1,2 … … N, wherein N represents the number of layers of a time domain reference structure, and the 1 st layer is a basic layer. And determining a time domain reference structure of the video sequence according to the number of the image frames corresponding to the (i + 1) th layer.

Optionally, the scalable video encoding apparatus further includes: a third obtaining module 904, configured to obtain, for a time domain reference unit, a lower limit value of a code rate of each image frame in the N-1 layer before the first determining module 903 determines, according to the number of time domain layering layers and a ratio of code rates of the N-1 layers in each time domain reference unit, the number of image frames corresponding to the i +1 th layer of each image frame in the i-th layer in one time domain reference unit, where the lower limit value of the code rate is a lower limit value of the code rate of each image frame relative to the image frame in the highest enhancement layer. Correspondingly, the first determining module 903 is specifically configured to: and determining the number of image frames corresponding to each image frame in the ith layer in the (i + 1) th layer in one time domain reference unit according to the code rate lower limit value of each image frame in the N-1 layer, the number of time domain layering layers and the code rate ratio of the N-1 layers in one time domain reference unit.

Optionally, the first determining module 903 is specifically configured to: if the number of time domain layering layers is 2, the lower limit value of the code rate of each image frame in the base layer is

The code rate of the base layer in a time domain reference unit is

Then, the number of image frames in the base layer in one temporal reference unit corresponds to the number of image frames in the enhancement layer

。

Optionally, the first determining module 903 is specifically configured to: if the number of time domain layering layers is 3, the lower limit value of the code rate of each image frame in the base layer is

The lower limit value of the code rate of each image frame in the layer 2 is

The code rate of the base layer in a time domain reference unit is

The code rate of layer 2 in a time domain reference unit is

Then, the number of image frames in the base layer in one time domain reference unit corresponds to the image frame number in the layer 2

。

Optionally, the scalable video encoding apparatus further includes: a second determining module 905, configured to, after the first determining module 903 determines the time-domain reference structure of the video sequence according to the number of time-domain layering layers and the bit rate ratio of each of the N-1 layers in one time-domain reference unit, determine a bit rate of each image frame in the N-1 layer according to the time-domain reference structure and the bit rate ratio of each of the N-1 layers in one time-domain reference unit, where the bit rate is a bit rate of each image frame relative to an image frame in a highest enhancement layer. Correspondingly, the encoding module 904 is specifically configured to: and coding the video sequence according to the time domain reference structure and the code rate of each image frame in the N-1 layer to obtain an output code stream.

Optionally, the second determining module 905 is specifically configured to: if the number of time domain layering layers is 2, the code rate of the base layer in one time domain reference unit is

Code rate of each image frame in the base layer

。

Optionally, the second determining module 905 is specifically configured to: if the number of time domain layering layers is 3, the code rate of the base layer in one time domain reference unit is equal to

The code rate of layer 2 in a time domain reference unit is

Code rate of each image frame in the base layer

Code rate of each image frame in layer 2

。

Optionally, the scalable video encoding apparatus further includes: a judging module 906 and an adjusting module 907, where the judging module 906 is configured to judge whether there is a packet loss resistance requirement after the first determining module 903 determines a time domain reference structure of the video sequence according to the number of time domain layering layers and a code rate ratio of N-1 layers in one time domain reference unit. The adjusting module 907 is configured to adjust the time domain reference structure if there is a packet loss tolerance requirement.

It is to be understood that apparatus embodiments and method embodiments may correspond to one another and that similar descriptions may refer to method embodiments. To avoid repetition, further description is omitted here. Specifically, the apparatus shown in fig. 9 may execute the method embodiment corresponding to fig. 4, and the foregoing and other operations and/or functions of each module in the apparatus are respectively for implementing corresponding flows in each method in fig. 4, and are not described herein again for brevity.

The apparatus of the embodiments of the present application is described above in connection with the drawings from the perspective of functional modules. It should be understood that the functional modules may be implemented by hardware, by instructions in software, or by a combination of hardware and software modules. Specifically, the steps of the method embodiments in the present application may be implemented by integrated logic circuits of hardware in a processor and/or instructions in the form of software, and the steps of the method disclosed in conjunction with the embodiments in the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. Alternatively, the software modules may be located in random access memory, flash memory, read only memory, programmable read only memory, electrically erasable programmable memory, registers, and the like, as is well known in the art. The storage medium is located in a memory, and a processor reads information in the memory and completes the steps in the above method embodiments in combination with hardware thereof.

As shown in fig. 10, the terminal device may include:

a memory 1010 and a processor 1020, the memory 1010 being adapted to store a computer program and to transfer the program code to the processor 1020. In other words, the processor 1020 can call and run the computer program from the memory 1010 to implement the method in the embodiment of the present application.

For example, the processor 1020 may be configured to perform the above-described method embodiments according to instructions in the computer program.

In some embodiments of the present application, the processor 1020 may include, but is not limited to:

general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, and the like.

In some embodiments of the present application, the memory 1010 includes, but is not limited to:

volatile memory and/or non-volatile memory. The non-volatile Memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash Memory. Volatile Memory can be Random Access Memory (RAM), which acts as external cache Memory. By way of example, but not limitation, many forms of RAM are available, such as Static random access memory (Static RAM, SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic random access memory (Synchronous DRAM, SDRAM), Double Data Rate Synchronous Dynamic random access memory (DDR SDRAM), Enhanced Synchronous SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).

In some embodiments of the present application, the computer program can be partitioned into one or more modules that are stored in the memory 1010 and executed by the processor 1020 to perform the methods provided herein. The one or more modules may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program in the terminal device.

As shown in fig. 10, the terminal device may further include:

a transceiver 1030, the transceiver 1030 being connectable to the processor 1020 or the memory 1010.

The processor 1020 may control the transceiver 1030 to communicate with other devices, and specifically, may transmit information or data to the other devices or receive information or data transmitted by the other devices. The transceiver 1030 may include a transmitter and a receiver. The transceiver 1030 may further include an antenna, and the number of antennas may be one or more.

It should be understood that the various components in the terminal device are connected by a bus system, wherein the bus system includes a power bus, a control bus and a status signal bus in addition to a data bus.

The present application also provides a computer storage medium having stored thereon a computer program which, when executed by a computer, enables the computer to perform the method of the above-described method embodiments. In other words, the present application also provides a computer program product containing instructions, which when executed by a computer, cause the computer to execute the method of the above method embodiments.

When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions described in accordance with the embodiments of the present application occur, in whole or in part, when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wire (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that includes one or more of the available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a Digital Video Disk (DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.

Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the module is merely a logical division, and other divisions may be realized in practice, for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.

Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. For example, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module.

The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and all the changes or substitutions should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method for scalable video coding, comprising:

acquiring a video sequence;

acquiring the code rate ratio of the time domain layering layer number N and the N-1 layers of the video sequence in a time domain reference unit, wherein N is an integer greater than 1;

for the time domain reference unit, acquiring a code rate lower limit value of each image frame in the N-1 layers, wherein the code rate lower limit value is the code rate lower limit value of each image frame relative to the image frame in the highest enhancement layer;

determining the number of image frames corresponding to each image frame in the ith layer in one time domain reference unit in the i +1 th layer according to the code rate lower limit value of each image frame in the N-1 layers, the number N of the time domain layering layers and the code rate ratio of the N-1 layers in one time domain reference unit, wherein i =1,2 … … N-1, and the 1 st layer is a basic layer;

determining a time domain reference structure of the video sequence according to the number of image frames corresponding to each image frame in the ith layer in the ith time domain reference unit in the (i + 1) th layer;

and coding the video sequence according to the time domain reference structure to obtain an output code stream.

2. The method according to claim 1, wherein the determining the number of image frames corresponding to each image frame in the i-th layer in one temporal reference unit at the i + 1-th layer according to the lower limit value of the code rate of each image frame in the N-1 layers, the number N of temporal layering layers, and the ratio of the code rates of the N-1 layers in the temporal reference unit respectively comprises:

if the number of time domain layering layers N is 2, the lower limit value of the code rate of each image frame in the base layer is

The code rate of the base layer in the time domain reference unit is

Then, the number of image frames in the base layer in the one temporal reference unit corresponds to the number of image frames in the enhancement layer

。

3. The method according to claim 1, wherein the determining the number of image frames corresponding to each image frame in the i-th layer in one temporal reference unit at the i + 1-th layer according to the lower limit value of the code rate of each image frame in the N-1 layers, the number N of temporal layering layers, and the ratio of the code rates of the N-1 layers in the temporal reference unit respectively comprises:

if the number of time domain layering layers N is 3, the lower limit value of the code rate of each image frame in the base layer is

The lower limit value of the code rate of each image frame in the layer 2 is

The code rate of the base layer in the time domain reference unit is

The code rate of the layer 2 in the time domain reference unit is

Then, the number of image frames in the base layer in the one temporal reference unit corresponding to the image frame in the layer 2 is the same as the number of image frames in the one temporal reference unit corresponding to the image frame in the base layer

The number of image frames in the layer 2 in the one temporal reference unit corresponds to the number of image frames in the layer 3

。

4. The method according to any one of claims 1-3, wherein said determining the temporal reference structure of the video sequence according to the number of image frames corresponding to the i +1 th layer for each image frame in the i-th layer in the one temporal reference unit further comprises:

determining a code rate of each image frame in the N-1 layers according to the time domain reference structure and the code rate ratio of each of the N-1 layers in one time domain reference unit, wherein the code rate is the code rate of each image frame relative to the image frame in the highest enhancement layer;

correspondingly, the encoding the video sequence according to the temporal reference structure to obtain an output code stream includes:

and coding the video sequence according to the time domain reference structure and the code rate of each image frame in the N-1 layers to obtain an output code stream.

5. The method of claim 4, wherein the determining the code rate for each image frame in the N-1 layers according to the temporal reference structure and the code rate ratio of each of the N-1 layers in one temporal reference unit comprises:

if the time domain layering layer number N is 2, the code rate of the base layer in the time domain reference unit is equal to

The number of image frames in the enhancement layer corresponding to the image frames in the base layer in the one temporal reference unit is

Code rate of each image frame in the base layer

。

6. The method of claim 4, wherein the determining the code rate for each image frame in the N-1 layers according to the temporal reference structure and the code rate ratio of each of the N-1 layers in one temporal reference unit comprises:

if the time domain layering layer number N is 3, the code rate of the base layer in the time domain reference unit is equal to

The code rate of the layer 2 in the time domain reference unit is

The number of image frames in the base layer in the one time domain reference unit corresponds to the number of image frames in the 2 nd layer

Code rate of each image frame in the base layer

Code rate of each image frame in the layer 2

。

7. The method according to any one of claims 1-3, wherein said determining the temporal reference structure of the video sequence according to the number of image frames corresponding to the i +1 th layer for each image frame in the i-th layer in the one temporal reference unit further comprises:

judging whether a packet loss resistance requirement exists or not;

and if the packet loss resistance requirement exists, adjusting the time domain reference structure.

8. A scalable video encoding apparatus, comprising:

the first acquisition module is used for acquiring a video sequence;

a second obtaining module, configured to obtain a code rate ratio of a time domain layering layer number N and N-1 layers of the video sequence in a time domain reference unit;

a third obtaining module, configured to obtain, for the time-domain reference unit, a code rate lower limit value of each image frame in the N-1 layers, where the code rate lower limit value is a code rate lower limit value of each image frame relative to an image frame in a highest enhancement layer;

a first determining module, configured to determine, according to a code rate lower limit value of each image frame in the N-1 layers, the number N of temporal layering layers, and a code rate ratio of each of the N-1 layers in a temporal reference unit, a number of image frames corresponding to an i +1 th layer of each image frame in an i-th layer in the temporal reference unit, where i =1,2 … … N-1, and the 1 st layer is a base layer; determining a time domain reference structure of the video sequence according to the number of image frames corresponding to each image frame in the ith layer in the ith time domain reference unit in the (i + 1) th layer;

and the coding module is used for coding the video sequence according to the time domain reference structure so as to obtain an output code stream.

9. An encoding device, characterized by comprising:

a processor and a memory for storing a computer program, the processor for invoking and executing the computer program stored in the memory to perform the method of any one of claims 1 to 7.

10. A computer-readable storage medium for storing a computer program which causes a computer to perform the method of any one of claims 1 to 7.