CN116567256A - Hierarchical coding method, hierarchical coding device, hierarchical coding equipment and storage medium - Google Patents

Hierarchical coding method, hierarchical coding device, hierarchical coding equipment and storage medium

Info

Publication number
CN116567256A
CN116567256A (application number CN202210103030.XA)
Authority
CN
China
Prior art keywords
decoding, network, fluctuation, coding, layered
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210103030.XA
Other languages
Chinese (zh)
Inventor
陈思佳
曹洪彬
张佳
黄永铖
曹健
杨小祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210103030.XA priority Critical patent/CN116567256A/en
Publication of CN116567256A publication Critical patent/CN116567256A/en
Pending legal-status Critical Current


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/30Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/08Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/12Protocols specially adapted for proprietary or special-purpose networking environments, e.g. medical networks, sensor networks, networks in vehicles or remote metering networks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/20Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding

Abstract

The application provides a layered coding method, apparatus, device and storage medium, which can be applied to various scenarios such as cloud technology, artificial intelligence, intelligent traffic, driving assistance and video. The layered coding method comprises the following steps: a cloud server acquires decoding capability information of a terminal device, wherein the decoding capability information comprises the decoding chip types included in the terminal device; when a decoding chip supporting the layered decoding mode exists in the terminal device, the cloud server detects the network according to the decoding chip type to obtain fluctuation information of the network; and the cloud server determines parameters corresponding to the layered coding mode according to the fluctuation information of the network and the decoding capability information of the terminal device, and encodes the target video using those parameters. In this way an optimal layered coding configuration is selected according to the network fluctuation condition and the decoding capability of the terminal device, improving the coding quality.

Description

Hierarchical coding method, hierarchical coding device, hierarchical coding equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of image processing, in particular to a layered coding method, a layered coding device, layered coding equipment and a storage medium.
Background
Layered coding, also known as scalable video coding (SVC), is a form of video coding in which a video signal is divided into multiple layers and encoded into multiple mutually dependent code streams: one is the base-layer code stream and the others are enhancement-layer code streams. When bandwidth is insufficient, only the base-layer code stream is transmitted and decoded, at the cost of reduced video quality. As bandwidth increases, enhancement-layer code streams can additionally be transmitted and decoded to improve the decoding quality of the video.
Currently, in order to adapt to different network conditions, a cloud server encodes in the layered coding mode by default and uses default layered coding parameters. In some cases, however, layered coding with default parameters results in poor coding quality.
Disclosure of Invention
The application provides a layered coding method, a layered coding device, layered coding equipment and a storage medium, so as to improve the match between the layered coding parameters and the terminal device and the network, thereby improving the coding quality.
In a first aspect, the present application provides a hierarchical encoding method, applied to a cloud server, including:
acquiring decoding capability information of terminal equipment, wherein the decoding capability information comprises decoding chip types included in the terminal equipment;
Detecting a network to obtain fluctuation information of the network when determining that a decoding chip supporting a layered decoding mode exists in the terminal equipment according to the decoding chip type;
and determining parameters corresponding to the layered coding mode according to the fluctuation information of the network and the decoding capability information of the terminal equipment, and coding the target video by using the parameters corresponding to the layered coding mode.
In a second aspect, the present application provides a layered coding apparatus, applied to a cloud server, including:
an obtaining unit, configured to obtain decoding capability information of a terminal device, where the decoding capability information includes a decoding chip type included in the terminal device;
the network detection unit is used for detecting a network to obtain fluctuation information of the network when determining that a decoding chip supporting a layered decoding mode exists in the terminal equipment according to the decoding chip type;
and the coding unit is used for determining parameters corresponding to the layered coding mode according to the fluctuation information of the network and the decoding capability information of the terminal equipment, and coding the target video by using the parameters corresponding to the layered coding mode.
In a third aspect, an electronic device is provided, comprising: a processor and a memory for storing a computer program, the processor being for invoking and running the computer program stored in the memory to perform the method of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided for storing a computer program that causes a computer to perform the method of the first aspect.
In a fifth aspect, a chip is provided for implementing the method in any one of the above first aspects or each implementation thereof. Specifically, the chip includes: a processor for calling and running a computer program from a memory, causing a device on which the chip is mounted to perform the method as in any one of the first aspects or implementations thereof.
In a sixth aspect, there is provided a computer program product comprising computer program instructions for causing a computer to perform the method of any one of the above aspects or implementations thereof.
In a seventh aspect, there is provided a computer program which, when run on a computer, causes the computer to perform the method of any one of the above-described first aspects or implementations thereof.
In summary, in the present application, a cloud server acquires decoding capability information of a terminal device, where the decoding capability information includes a decoding chip type included in the terminal device; detecting a network according to the type of the decoding chip when the decoding chip supporting the layered decoding mode exists in the terminal equipment, so as to obtain fluctuation information of the network; and determining parameters corresponding to the layered coding mode according to the fluctuation information of the network and the decoding capability information of the terminal equipment, and coding the target video by using the parameters corresponding to the layered coding mode. Namely, according to the embodiment of the application, the optimal hierarchical coding configuration is selected according to the network fluctuation condition and the decoding capability of the terminal equipment, so that the coding quality is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic view of an application scenario according to an embodiment of the present application;
FIG. 2 is a schematic block diagram of a video encoder provided by an embodiment of the present application;
FIG. 3 is a schematic block diagram of a video decoder provided by an embodiment of the present application;
FIG. 4 is a flowchart of a layered coding method according to an embodiment of the present application;
FIG. 5 is a diagram of the number of encoded reference frames according to an embodiment of the present application;
FIG. 6 is a diagram of the number of encoded reference frames according to another embodiment of the present application;
FIG. 7 is a flowchart of a hierarchical encoding method according to an embodiment of the present application;
FIG. 8 is an interactive flow chart of a layered coding method according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a layered coding apparatus according to an embodiment of the present disclosure;
fig. 10 is a schematic block diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
First, related concepts related to the embodiments of the present application will be described:
frame rate: the frame rate is a definition in the field of images, and refers to the number of frames per second transmitted by a picture, and in colloquial terms, to the number of pictures of an animation or video. The larger the frame rate, the smoother the picture.
Layered coding (SVC): a form of video coding that encodes a video signal into a layered structure. When bandwidth is insufficient, only the base-layer code stream is transmitted and decoded, at the cost of reduced video quality; as bandwidth increases, enhancement-layer code streams can be transmitted and decoded to improve the decoding quality of the video.
Base layer: video frames at the base layer cannot be discarded in layered coding and may be used as references for subsequent video frames; discarding them can result in decoding failure or a corrupted (garbled) picture.
Enhancement layer: video frames at the enhancement layer can be discarded at will in layered coding; they are not used as references for subsequent video frames, so discarding them has no adverse effect on an SVC-capable decoder.
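For illustration only, the following sketch (not part of the patent) shows one common way of arranging frames into a base layer and a single droppable enhancement layer in a dyadic temporal-SVC structure; the layer arrangement used by an actual encoder depends on its configuration.

```python
# Hypothetical two-layer temporal SVC arrangement: even-indexed frames form the
# base layer (never droppable, usable as references), odd-indexed frames form the
# enhancement layer (droppable without breaking later references).

def temporal_layer(frame_index: int) -> str:
    """Return the layer of a frame in a simple 2-layer dyadic temporal structure."""
    return "base" if frame_index % 2 == 0 else "enhancement"

# Under low bandwidth only base-layer frames 0, 2, 4, ... are sent; frames 1, 3, 5, ...
# can be dropped without affecting decoding of the remaining frames.
print([temporal_layer(i) for i in range(6)])
# ['base', 'enhancement', 'base', 'enhancement', 'base', 'enhancement']
```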
Fig. 1 is a schematic view of an application scenario according to an embodiment of the present application, including a cloud server 101 and a terminal device 102. The cloud server 101 may be understood as an encoding device, and the terminal device 102 may be understood as a decoding device.
The cloud server 101 is configured to encode (may be understood as compressing) video data to generate a code stream, and transmit the code stream to the terminal device 102.
The cloud server 101 of the present embodiment may be understood as a device with a video encoding function, and the terminal device 102 may be understood as a device with a video decoding function; that is, the cloud server 101 and the terminal device 102 in the embodiments of the present application cover a wide range of devices, such as smart phones, desktop computers, mobile computing devices, notebook (e.g., laptop) computers, tablet computers, set-top boxes, televisions, cameras, display devices, digital media players, video game consoles, vehicle-mounted computers, and the like.
In some embodiments, cloud server 101 may transmit encoded video data (e.g., a code stream) to terminal device 102 via channel 103. Channel 103 may comprise one or more media and/or devices capable of transmitting encoded video data from cloud server 101 to terminal device 102.
In one example, channel 103 includes one or more communication media that enable cloud server 101 to transmit encoded video data directly to terminal device 102 in real-time. In this example, cloud server 101 may modulate the encoded video data according to a communication standard and transmit the modulated video data to terminal device 102. Where the communication medium comprises a wireless communication medium, such as a radio frequency spectrum, the communication medium may optionally also comprise a wired communication medium, such as one or more physical transmission lines.
In another example, channel 103 comprises a storage medium that may store video data encoded by cloud server 101. Storage media include a variety of locally accessed data storage media such as compact discs, DVDs, flash memory, and the like. In this example, the terminal device 102 may obtain encoded video data from the storage medium.
In another example, channel 103 may comprise a storage server that may store video data encoded by cloud server 101. In this example, the terminal device 102 may download the stored encoded video data from the storage server. Alternatively, the storage server may be, for example, a web server (e.g., for a website) or a File Transfer Protocol (FTP) server that stores the encoded video data and transmits it to the terminal device 102.
In some embodiments, cloud server 101 includes a video encoder and an output interface. The output interface may comprise, among other things, a modulator/demodulator (modem) and/or a transmitter.
In some embodiments, cloud server 101 may include a video source in addition to a video encoder and an output interface.
The video source may include at least one of a video capture device (e.g., a video camera), a video archive, a video input interface for receiving video data from a video content provider, a computer graphics system for generating video data.
A video encoder encodes video data from a video source to produce a bitstream. The video data may include one or more pictures (pictures) or sequences of pictures (sequence of pictures). The code stream contains encoded information of the image or image sequence in the form of a bit stream. The encoded information may include encoded image data and associated data. The associated data may include a sequence parameter set (sequence parameter set, SPS for short), a picture parameter set (picture parameter set, PPS for short), and other syntax structures. An SPS may contain parameters that apply to one or more sequences. PPS may contain parameters that apply to one or more pictures. A syntax structure refers to a set of zero or more syntax elements arranged in a specified order in a bitstream.
The video encoder transmits the encoded video data directly to the terminal device 102 via an output interface. The encoded video data may also be stored on a storage medium or a storage server for subsequent reading by the terminal device 102.
In some embodiments, the terminal device 102 includes an input interface and a video decoder.
In some embodiments, the terminal device 102 may include a display in addition to an input interface and a video decoder.
Wherein the input interface comprises a receiver and/or a modem. The input interface may receive encoded video data over a channel.
The video decoder is used for decoding the encoded video data to obtain decoded video data, and transmitting the decoded video data to the display device.
The display device displays the decoded video data. The display means may be integrated with the terminal device or external to the terminal device. The display device may include a variety of display devices, such as a Liquid Crystal Display (LCD), a plasma display, an Organic Light Emitting Diode (OLED) display, or other types of display devices.
Alternatively, the cloud server 101 may be one or more. When the cloud server 101 is plural, there are at least two servers for providing different services, and/or there are at least two servers for providing the same service, for example, providing the same service in a load balancing manner, which is not limited in the embodiment of the present application.
Alternatively, the cloud server 101 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Network, content distribution networks), and basic cloud computing services such as big data and artificial intelligence platforms. Cloud server 101 may also become a node of the blockchain.
In some embodiments, the cloud server 101 is a cloud server with powerful computing resources, and is characterized by high virtualization and high distribution.
In some embodiments, the present application may be applied to the field of image encoding and decoding, the field of video encoding and decoding, the field of hardware video encoding and decoding, the field of dedicated-circuit video encoding and decoding, the field of real-time video encoding and decoding, and the like. For example, the schemes of the present application may be incorporated into audio/video coding standards, such as the H.264/advanced video coding (AVC) standard, the H.265/high efficiency video coding (HEVC) standard, and the H.266/versatile video coding (VVC) standard. Alternatively, the schemes of the present application may operate in conjunction with other proprietary or industry standards, including ITU-T H.261, ISO/IEC MPEG-1 Visual, ITU-T H.262 or ISO/IEC MPEG-2 Visual, ITU-T H.263, ISO/IEC MPEG-4 Visual, and ITU-T H.264 (also known as ISO/IEC MPEG-4 AVC), including its Scalable Video Coding (SVC) and Multiview Video Coding (MVC) extensions. It should be understood that the techniques of this application are not limited to any particular codec standard or technique.
The following describes a video coding framework according to an embodiment of the present application.
Fig. 2 is a schematic block diagram of a video encoder provided by an embodiment of the present application. It should be appreciated that the video encoder 200 may be used for lossy compression of images (lossy compression) and may also be used for lossless compression of images (lossless compression). The lossless compression may be visual lossless compression (visually lossless compression) or mathematical lossless compression (mathematically lossless compression).
The video encoder 200 may be applied to image data in luminance and chrominance (YCbCr, YUV) format.
For example, the video encoder 200 reads video data and, for each frame of image in the video data, divides the frame into a number of coding tree units (CTUs). In some examples, a CTU may also be referred to as a "tree block", a "largest coding unit" (LCU) or a "coding tree block" (CTB). Each CTU may be associated with a block of pixels of equal size within the image. Each pixel may correspond to one luminance (luma) sample and two chrominance (chroma) samples; thus, each CTU may be associated with one block of luma samples and two blocks of chroma samples. The size of a CTU is, for example, 128×128, 64×64 or 32×32. A CTU may be further divided into several coding units (CUs), where a CU may be a rectangular or square block. A CU may be further divided into prediction units (PUs) and transform units (TUs), so that coding, prediction and transform are separated and processing is more flexible. In one example, CTUs are divided into CUs in a quadtree manner, and CUs are divided into TUs and PUs in a quadtree manner.
Video encoders and video decoders may support various PU sizes. Assuming that the size of a particular CU is 2N×2N, video encoders and video decoders may support PU sizes of 2N×2N or N×N for intra prediction, and symmetric PUs of 2N×2N, 2N×N, N×2N, N×N or similar sizes for inter prediction. Video encoders and video decoders may also support asymmetric PUs of 2N×nU, 2N×nD, nL×2N and nR×2N for inter prediction.
In some embodiments, as shown in fig. 2, the video encoder 200 may include: a prediction unit 210, a residual unit 220, a transform/quantization unit 230, an inverse transform/quantization unit 240, a reconstruction unit 250, a loop filtering unit 260, a decoded image buffer 270, and an entropy encoding unit 280. It should be noted that video encoder 200 may include more, fewer, or different functional components.
Alternatively, in this application, a current block (current block) may be referred to as a current Coding Unit (CU) or a current Prediction Unit (PU), or the like. The prediction block may also be referred to as a prediction image block or an image prediction block, and the reconstructed image block may also be referred to as a reconstructed block or an image reconstructed image block.
In some embodiments, prediction unit 210 includes an inter prediction unit 211 and an intra estimation unit 212. Because of the strong correlation between adjacent pixels in a frame of video, intra-prediction methods are used in video coding techniques to eliminate spatial redundancy between adjacent pixels. Because of the strong similarity between adjacent frames in video, the inter-frame prediction method is used in the video coding and decoding technology to eliminate the time redundancy between adjacent frames, thereby improving the coding efficiency.
The inter prediction unit 211 may be used for inter prediction. Inter prediction may refer to image information of different frames: it uses motion information to find a reference block in a reference frame and generates a prediction block from the reference block, so as to eliminate temporal redundancy. The frames used for inter prediction may be P frames (forward-predicted frames) and/or B frames (bidirectionally predicted frames). The motion information includes the reference frame list in which the reference frame is located, a reference frame index, and a motion vector. The motion vector may be of integer-pixel or sub-pixel precision; if the motion vector is of sub-pixel precision, interpolation filtering in the reference frame is required to produce the required sub-pixel block. The integer-pixel or sub-pixel block in the reference frame found according to the motion vector is referred to as the reference block. Some techniques use the reference block directly as the prediction block, while others further process the reference block to generate the prediction block; further processing on the basis of the reference block can also be understood as taking the reference block as a prediction block and then processing it to generate a new prediction block.
The most commonly used inter prediction methods at present include the geometric partitioning mode (GPM) in the VVC video codec standard and angular weighted prediction (AWP) in the AVS3 video codec standard. These two inter prediction modes share a common principle.
The intra estimation unit 212 predicts pixel information within the current code image block for eliminating spatial redundancy by referring to only information of the same frame image. The frames used for intra prediction may be key frames.
The intra prediction modes used by HEVC are Planar mode (Planar), DC, and 33 angular modes, for a total of 35 prediction modes. The intra modes used by VVC are Planar, DC and 65 angular modes, for a total of 67 prediction modes. The intra modes used by AVS3 are DC, plane, bilinear and 63 angular modes, for a total of 66 prediction modes.
In some embodiments, intra-estimation unit 212 may be implemented using intra-block copy techniques and intra-string copy techniques.
Residual unit 220 may generate a residual block of the CU based on the pixel block of the CU and the prediction block of the PU of the CU. For example, residual unit 220 may generate a residual block of the CU such that each sample in the residual block has a value equal to the difference between: samples in pixel blocks of a CU, and corresponding samples in prediction blocks of PUs of the CU.
The transform/quantization unit 230 may quantize the transform coefficients. Transform/quantization unit 230 may quantize transform coefficients associated with TUs of a CU based on Quantization Parameter (QP) values associated with the CU. The video encoder 200 may adjust the degree of quantization applied to the transform coefficients associated with the CU by adjusting the QP value associated with the CU.
The inverse transform/quantization unit 240 may apply inverse quantization and inverse transform, respectively, to the quantized transform coefficients to reconstruct a residual block from the quantized transform coefficients.
The reconstruction unit 250 may add samples of the reconstructed residual block to corresponding samples of one or more prediction blocks generated by the prediction unit 210 to generate a reconstructed image block associated with the TU. In this way, reconstructing sample blocks for each TU of the CU, video encoder 200 may reconstruct pixel blocks of the CU.
Loop filtering unit 260 may perform a deblocking filtering operation to reduce blocking artifacts of pixel blocks associated with the CU.
In some embodiments, the loop filtering unit 260 includes a deblocking filtering unit for deblocking artifacts and a sample adaptive compensation/adaptive loop filtering (SAO/ALF) unit for removing ringing effects.
The decoded image buffer 270 may store reconstructed pixel blocks. Inter prediction unit 211 may use the reference image containing the reconstructed pixel block to perform inter prediction on PUs of other images. In addition, intra estimation unit 212 may use the reconstructed pixel blocks in decoded image buffer 270 to perform intra prediction on other PUs in the same image as the CU.
The entropy encoding unit 280 may receive the quantized transform coefficients from the transform/quantization unit 230. Entropy encoding unit 280 may perform one or more entropy encoding operations on the quantized transform coefficients to generate entropy encoded data.
Fig. 3 is a schematic block diagram of a video decoder provided by an embodiment of the present application.
As shown in fig. 3, the video decoder 300 includes: an entropy decoding unit 310, a prediction unit 320, an inverse quantization/transformation unit 330, a reconstruction unit 340, a loop filtering unit 350, and a decoded image buffer 360. It should be noted that the video decoder 300 may include more, fewer, or different functional components.
The video decoder 300 may receive the bitstream. The entropy decoding unit 310 may parse the bitstream to extract syntax elements from the bitstream. As part of parsing the bitstream, the entropy decoding unit 310 may parse entropy-encoded syntax elements in the bitstream. The prediction unit 320, the inverse quantization/transformation unit 330, the reconstruction unit 340, and the loop filtering unit 350 may decode video data according to syntax elements extracted from a bitstream, i.e., generate decoded video data.
In some embodiments, prediction unit 320 includes an intra prediction unit 322 and an inter prediction unit 321.
Intra prediction unit 322 may perform intra prediction to generate a prediction block for the PU. Intra-prediction unit 322 may use an intra-prediction mode to generate a prediction block for the PU based on pixel blocks of spatially-neighboring PUs. Intra-prediction unit 322 may also determine an intra-prediction mode for the PU based on one or more syntax elements parsed from the bitstream.
The inter prediction unit 321 may construct a first reference picture list (list 0) and a second reference picture list (list 1) according to syntax elements parsed from the bitstream. Furthermore, if the PU uses inter prediction encoding, entropy decoding unit 310 may parse the motion information of the PU. Inter prediction unit 321 may determine one or more reference blocks of the PU from the motion information of the PU. Inter prediction unit 321 may generate a prediction block of a PU from one or more reference blocks of the PU.
The inverse quantization/transform unit 330 may inverse quantize (i.e., dequantize) transform coefficients associated with the TUs. Inverse quantization/transform unit 330 may determine the degree of quantization using QP values associated with the CUs of the TUs.
After inverse quantizing the transform coefficients, inverse quantization/transform unit 330 may apply one or more inverse transforms to the inverse quantized transform coefficients in order to generate a residual block associated with the TU.
Reconstruction unit 340 uses the residual blocks associated with the TUs of the CU and the prediction blocks of the PUs of the CU to reconstruct the pixel blocks of the CU. For example, the reconstruction unit 340 may add samples of the residual block to corresponding samples of the prediction block to reconstruct a pixel block of the CU, resulting in a reconstructed image block.
Loop filtering unit 350 may perform a deblocking filtering operation to reduce blocking artifacts of pixel blocks associated with the CU.
The video decoder 300 may store the reconstructed image of the CU in a decoded image buffer 360. The video decoder 300 may use the reconstructed image in the decoded image buffer 360 as a reference image for subsequent prediction or may transmit the reconstructed image to a display device for presentation.
The basic flow of video encoding and decoding is as follows. At the encoding end, a frame of image is divided into blocks, and for the current block the prediction unit 210 generates a prediction block of the current block using intra prediction or inter prediction. The residual unit 220 may calculate a residual block, also referred to as residual information, based on the difference between the prediction block and the original block of the current block. The residual block is transformed and quantized by the transform/quantization unit 230 and the like, so that information to which human eyes are insensitive can be removed to eliminate visual redundancy. Optionally, the residual block before transform and quantization by the transform/quantization unit 230 may be referred to as the time-domain residual block, and the residual block after transform and quantization may be referred to as the frequency residual block or frequency-domain residual block. The entropy encoding unit 280 receives the quantized transform coefficients output by the transform/quantization unit 230 and may entropy-encode them to output a code stream. For example, the entropy encoding unit 280 may eliminate character redundancy according to a target context model and probability information of the binary code stream.
At the decoding end, the entropy decoding unit 310 may parse the code stream to obtain prediction information, a quantized coefficient matrix and the like for the current block, and the prediction unit 320 generates a prediction block of the current block using intra prediction or inter prediction based on the prediction information. The inverse quantization/transform unit 330 performs inverse quantization and inverse transform on the quantized coefficient matrix obtained from the code stream to obtain a residual block. The reconstruction unit 340 adds the prediction block and the residual block to obtain a reconstructed block. The reconstructed blocks constitute a reconstructed image, and the loop filtering unit 350 performs loop filtering on the reconstructed image, either per image or per block, to obtain a decoded image. The encoding end also needs to obtain the decoded image by operations similar to those of the decoding end. The decoded image may also be referred to as a reconstructed image, and it may serve as a reference frame for inter prediction of subsequent frames.
The block division information determined by the encoding end, as well as mode information or parameter information for prediction, transform, quantization, entropy coding, loop filtering and so on, is carried in the code stream when necessary. The decoding end parses the code stream and, according to the information already available, determines the same block division information and the same prediction, transform, quantization, entropy coding, loop filtering and other mode or parameter information as the encoding end, so that the decoded image obtained at the encoding end is guaranteed to be identical to the decoded image obtained at the decoding end.
The foregoing is a basic flow of a video codec under a block-based hybrid coding framework, and as technology advances, some modules or steps of the framework or flow may be optimized.
In some embodiments, embodiments of the present invention may be applied to a variety of scenarios, including, but not limited to, cloud technology (e.g., cloud gaming), artificial intelligence, intelligent transportation, assisted driving, and the like.
In some embodiments, the methods of the embodiments of the present application may be applied to device-cloud collaborative coding. Device-cloud collaborative coding refers to a scheme in which the cloud and the terminal compress video by coding collaboratively. Because the computing power of the video content producer (the cloud) and of the video content consumer (the terminal) differ, a relatively complex video compression task can be completed cooperatively by the two ends: the cloud's resources and powerful computing (e.g., encoding) capability are exploited, the amount of data transmitted over the network is reduced, and the terminal's computing (e.g., decoding) capability is also used effectively. The method can be used in scenarios such as cloud gaming.
In some embodiments, video coding is coordinated, and optimal coding configuration and coding strategy are selected according to the coding and decoding capability of the intelligent terminal and in combination with the game type and the user network type.
The cloud end cooperative protocol refers to a unified protocol for data interaction between a cloud server and an intelligent terminal.
The intelligent terminal cooperative interface is an intelligent terminal software and hardware module interface, and can effectively interact with the intelligent terminal through the interface, configure video coding and rendering parameters and acquire real-time operation performance of hardware.
Decoding performance refers to the highest decoding frame rate and the single-frame decoding delay supported for a given video resolution under a particular decoding protocol. The video resolutions are defined as: 360p, 576p, 720p, 1080p, 2k, 4k. The video frame rates are defined as: 30 fps, 40 fps, 50 fps, 60 fps, 90 fps, 120 fps.
The definition of the video resolution and video frame rate of the terminal device is shown in tables 1 and 2.
Table 1 definition of video resolution of terminal device
Video resolution Enumeration definition
360p 0x1
576p 0x2
720p 0x4
1080p 0x8
2k 0x10
4k 0x20
Table 2 definition of video resolution and video frame rate for terminal equipment
Optionally, the decoding performance supported by the terminal device is given in the form of a triple: the first element is the enumeration definition of the video resolution, the second element is the enumeration definition of the video frame rate, and the third element is the single-frame decoding delay at that video resolution and frame rate. For example, for H.264 decoding on device A, a single-frame decoding delay of 10 ms at 720p@60fps is denoted as (4, 8, 10).
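As a purely illustrative sketch, the resolution flags below follow Table 1 of this description, while the frame-rate flag values are assumptions (Table 2 is not reproduced here); the helper function is hypothetical and only shows how such a triple could be assembled.

```python
# Bit-flag enumerations: resolution values are taken from Table 1; frame-rate
# values are assumed to follow the same pattern, since Table 2 is not shown.
RESOLUTION_ENUM = {"360p": 0x1, "576p": 0x2, "720p": 0x4,
                   "1080p": 0x8, "2k": 0x10, "4k": 0x20}
FRAME_RATE_ENUM = {30: 0x1, 40: 0x2, 50: 0x4, 60: 0x8, 90: 0x10, 120: 0x20}  # assumed

def decoding_performance_triple(resolution: str, fps: int, delay_ms: int):
    """Pack one supported decoding performance point as (resolution, frame rate, delay)."""
    return (RESOLUTION_ENUM[resolution], FRAME_RATE_ENUM[fps], delay_ms)

# Device A, H.264 decoding, 720p@60fps with a 10 ms single-frame delay:
print(decoding_performance_triple("720p", 60, 10))  # (4, 8, 10)
```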
The video coding collaborative optimization scheme is that a cloud server determines a coding function set to be started according to game types and network conditions, and then determines the optimal coding configuration of the current equipment through equipment types and coding capacities reported by an intelligent terminal.
In some embodiments, the terminal device decoding capability data structure requirements are shown in table 3.
Table 3 decoding capability data structure requirements for terminal devices
The cloud server determines, according to the decoding capability of the intelligent terminal and in combination with the game type and the user's network condition, the optimal codec configuration of the current device, such as the decoding protocol, decoding resolution and video frame rate, as well as the codec strategy, such as the number of video coding reference frames and whether SVC is enabled.
The following describes the technical solutions of the embodiments of the present application in detail through some embodiments. The following embodiments may be combined with each other, and some embodiments may not be repeated for the same or similar concepts or processes.
First, an encoding end will be described as an example.
Fig. 4 is a flowchart of a layered coding method according to an embodiment of the present application, where the method may be applied to the cloud server 101 shown in fig. 1 or to the encoder shown in fig. 2. The following description will take an execution subject as a cloud server as an example.
As shown in fig. 4, the embodiment of the present application includes the following steps:
s401, decoding capability information of the terminal equipment is acquired.
The terminal device of the embodiments of the present application may be understood as a decoding device comprising at least one decoding chip, which in some embodiments is also referred to as decoder. After receiving the code stream, the terminal equipment decodes the code stream by a decoding chip in the terminal equipment to obtain a corresponding video.
At present, default layered coding parameters are used for layered coding regardless of the network and the terminal device. The more coding layers there are, the lower the coding efficiency and the worse the quality; adopting a default layered coding configuration under differing network conditions and terminal device capabilities can therefore reduce the coding quality.
To solve the above technical problem, in the embodiments of the present application an optimal layered coding configuration is selected according to the network fluctuation condition and the decoding capability of the terminal device, so as to improve the coding quality.
The decoding capability information of the terminal device includes decoding chip types included in the terminal device, such as an H264 decoding chip, an HEVC decoding chip, and the like. Some decoding chips support a layered decoding mode, that is, have the capability of performing layered decoding on a code stream obtained by encoding in a layered encoding mode. Some decoding chips do not support a layered decoding mode, i.e., do not have the capability of performing layered decoding on a code stream obtained by encoding in a layered encoding mode.
The embodiment of the application can determine whether the decoding chip supports a layered decoding mode through the type of the decoding chip.
In some embodiments, the decoding capability information of the terminal device may further include a maximum decoding reference frame number supported by the terminal device, in addition to the above decoding chip type.
In some embodiments, the maximum number of decoding reference frames is also referred to as the maximum decoding reference buffer size supported by the terminal device. For example, when the maximum number of decoding reference frames supported by the terminal device is M, it indicates that at most M reference frames can be buffered in the decoding reference buffer, that is, the maximum decoding reference buffer size supported by the terminal device is M.
As can be seen from the above, the terminal device may include a plurality of decoding chips supporting the layered decoding mode, and each decoding chip corresponds to its own maximum number of decoding reference frames. In the embodiments of the present application, the maximum number of decoding reference frames corresponding to the decoding chip that supports the layered decoding mode and has the best decoding performance in the terminal device is taken as the maximum number of decoding reference frames supported by the terminal device. For example, the coding performance of HEVC is better than that of H.264; if the terminal device includes both an HEVC decoding chip and an H.264 decoding chip that support the layered decoding mode, the HEVC decoding chip is preferentially selected as the optimal decoding chip, and the maximum number of decoding reference frames corresponding to the HEVC decoding chip is used as the maximum number of decoding reference frames of the terminal device.
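A minimal sketch of this selection rule is given below, assuming a simple preference order in which HEVC ranks above H.264 as stated in the example; the chip descriptions and data layout are illustrative and not part of any defined interface.

```python
# Hypothetical chip descriptions; preference order assumes HEVC decodes better than H.264.
CODEC_PREFERENCE = ["HEVC", "H264"]  # best first

def select_max_ref_frames(chips):
    """chips: list of dicts like {"type": "HEVC", "svc": True, "max_ref_frames": 6}.
    Return the maximum decoding reference frame count of the best SVC-capable chip,
    or None if no chip supports the layered decoding mode."""
    svc_chips = [c for c in chips if c["svc"] and c["type"] in CODEC_PREFERENCE]
    if not svc_chips:
        return None
    best = min(svc_chips, key=lambda c: CODEC_PREFERENCE.index(c["type"]))
    return best["max_ref_frames"]

print(select_max_ref_frames([
    {"type": "H264", "svc": True, "max_ref_frames": 4},
    {"type": "HEVC", "svc": True, "max_ref_frames": 6},
]))  # 6: the HEVC chip is preferred
```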
The method for acquiring the decoding capability information of the terminal device in the embodiment of the present application includes, but is not limited to, the following several methods:
in one mode, decoding capability information of the terminal device is obtained by accessing a hardware interface of the terminal device.
In a second mode, a first decoding capability request is sent to the terminal device, and first response information sent by the terminal device according to the first decoding capability request is received, wherein the first decoding capability request is used for requesting decoding parameters of the terminal device, and the first response information comprises decoding capability information of the terminal device.
Optionally, the decoding information includes all types of decoding chips included in the terminal device, that is, includes a chip type supporting the layered decoding mode and a chip type not supporting the layered decoding mode. Correspondingly, the maximum decoding reference frame number included in the decoding information is the maximum decoding reference frame number corresponding to all decoding chips.
Optionally, the decoding information includes a chip type supporting the layered decoding mode in the terminal device, and does not include a chip type not supporting the layered decoding mode. Correspondingly, the maximum decoding reference frame number included in the decoding information is the maximum decoding reference frame number corresponding to the chip supporting the layered decoding mode.
In the third mode, if the decoding capability information includes the maximum number of decoding reference frames supported by the terminal device, the maximum number of decoding reference frames supported by the terminal device may be obtained through the following steps S401-A1 to S401-A3:
S401-A1, sending a second decoding capability request to the terminal equipment, wherein the second decoding capability request comprises video code streams with different reference frame numbers, and the second decoding capability request is used for requesting a decoding result of the terminal equipment about the video code streams with different reference frame numbers;
S401-A2, receiving second response information sent by the terminal equipment, wherein the second response information comprises decoding results of video code streams of different reference frame numbers of the terminal equipment;
S401-A3, determining the maximum decoding reference frame number supported by the terminal equipment according to the decoding result.
In the third mode, the cloud server sends video code streams encoded with different numbers of reference frames to the terminal device, and the terminal device decodes and renders each of them. When the maximum number of decoding reference frames supported by the terminal device is greater than or equal to the number of reference frames of a video code stream, the terminal device can decode that code stream normally; when it is smaller, the terminal device cannot decode it normally. Therefore, the maximum number of decoding reference frames supported by the terminal device can be obtained by judging whether the terminal device can normally decode video code streams with different numbers of reference frames (that is, whether a decoding error or a corrupted picture occurs). For example, the cloud server sends video code stream 1 and video code stream 2 to the terminal device, where the number of coding reference frames of video code stream 1 is 6 and that of video code stream 2 is 7. If the terminal device can decode video code stream 1 normally but cannot decode video code stream 2 normally, it can be determined that the maximum number of decoding reference frames supported by the terminal device is 6.
The present application requires that the maximum number of decoded reference frames supported by the terminal device is greater than or equal to 2.
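As a hedged sketch of the probing logic in steps S401-A1 to S401-A3, the cloud server can try streams encoded with increasing reference-frame counts and take the largest count that the terminal decodes without error or picture corruption; the callback name and candidate counts below are hypothetical.

```python
# Probe the maximum number of decoding reference frames supported by a terminal.
# `decodes_ok(n)` stands in for the decode result reported back by the terminal
# for a test stream encoded with n reference frames.

def probe_max_ref_frames(decodes_ok, candidates=(2, 3, 4, 5, 6, 7, 8)):
    """Return the largest reference-frame count the terminal decodes normally."""
    supported = [n for n in candidates if decodes_ok(n)]
    return max(supported) if supported else None

# Example matching the text: a 6-reference-frame stream decodes, a 7-frame one does not.
print(probe_max_ref_frames(lambda n: n <= 6))  # 6
```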
According to the above method, after the decoding capability information of the terminal device is obtained, the following S402 is performed.
S402, detecting the network according to the type of the decoding chip when the decoding chip supporting the layered decoding mode exists in the terminal equipment, and obtaining the fluctuation information of the network.
According to the above method, the decoding capability information of the terminal device is obtained, where the decoding capability information includes the decoding chip types included in the terminal device. According to the decoding chip types included in the terminal device, it is first judged whether a decoding chip supporting the layered decoding mode exists in the terminal device. When it is determined that no decoding chip supporting the layered decoding mode exists in the terminal device, it is determined that the current target video to be encoded is not encoded in the layered coding mode.
And detecting the network only when the decoding chip supporting the layered decoding mode exists in the terminal equipment, so as to obtain the fluctuation condition of the current network.
The embodiments of the present application use a period of time as a network detection window, for example 30 seconds; that is, the fluctuation information of the network is collected once every 30 seconds.
The embodiment of the application does not limit the fluctuation information of the network.
In some embodiments, the fluctuation information of the network includes at least one of a number of detected fluctuation of the network within a current detection window, a fluctuation variance, a key frame (e.g., I-frame) application number, a maximum fluctuation value, and a minimum fluctuation value.
It should be noted that, in the embodiments of the present application, the maximum fluctuation value and the minimum fluctuation value are not simply the largest and smallest fluctuation values that the network reaches once or a few times within the current detection window. Rather, a candidate maximum fluctuation value is recorded as the maximum fluctuation value of the network only when the number of times it is reached within the current detection window is greater than a first preset number of times. Similarly, a candidate minimum fluctuation value is recorded as the minimum fluctuation value of the network only when the number of times it is reached within the current detection window is greater than a second preset number of times. Optionally, the first preset number of times is the same as the second preset number of times.
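The following is an illustrative sketch (not the patent's implementation) of aggregating per-window fluctuation statistics from the fluctuation values observed in one detection window; the repeat threshold corresponds to the preset number of times described above, and its value here is an assumption.

```python
from collections import Counter

def window_fluctuation_stats(samples, repeat_threshold=2):
    """samples: fluctuation values observed in one ~30-second detection window.
    A value only counts as the window's max/min fluctuation value if it occurs
    more than `repeat_threshold` times (the preset number of times)."""
    if not samples:
        return None
    mean = sum(samples) / len(samples)
    variance = sum((s - mean) ** 2 for s in samples) / len(samples)
    repeated = [v for v, c in Counter(samples).items() if c > repeat_threshold]
    return {
        "fluctuation_count": len(samples),
        "fluctuation_variance": variance,
        "max_fluctuation": max(repeated) if repeated else None,
        "min_fluctuation": min(repeated) if repeated else None,
    }
```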
S403, determining parameters corresponding to the layered coding mode according to the fluctuation information of the network and the decoding capability information of the terminal equipment, and coding the target video by using the parameters corresponding to the layered coding mode.
In the embodiment of the application, before encoding the target video, the decoding information of the terminal equipment and the fluctuation information of the network are obtained first, and then parameters corresponding to the layered encoding mode are determined according to the decoding information of the terminal equipment and the fluctuation information of the network, so as to obtain layered encoding parameters adapting to the decoding capability of the terminal equipment and the fluctuation condition of the network. When the layered coding parameters which are adaptive to the decoding capability of the terminal equipment and the fluctuation condition of the network are used for coding, the coding quality can be improved.
In some embodiments, before performing S403 described above, it is further required to determine whether to encode the target video using a layered coding scheme according to the fluctuation information of the network. When it is determined that the target video is encoded using the layered coding scheme, the step S403 is performed, and parameters corresponding to the layered coding scheme are determined according to the fluctuation information of the network and the decoding capability information of the terminal device, and the target video is encoded using the parameters corresponding to the layered coding scheme.
That is, before S403, the method in the embodiment of the present application further includes the following step a:
and step A, determining whether to encode the target video by using a layered encoding mode according to the fluctuation information of the network.
Because the encoding process of the layered coding mode is complex, its coding efficiency is low; moreover, when the network is stable the cloud server transmits all of the code streams obtained in the layered coding mode to the terminal device, and the total size of these code streams (for example, at 30 fps) is the same as that of the code stream obtained in a non-layered coding mode. Therefore, the layered coding mode is not adopted when the network is stable, which reduces coding complexity, and is adopted when the network is unstable.
The implementation manner of the step A includes, but is not limited to the following:
In the first mode, if the fluctuation information includes at least one of the fluctuation times, the fluctuation variance, the maximum fluctuation value and the minimum fluctuation value of the network, step A includes the following steps:
Step A-11: if at least one of the fluctuation times, the fluctuation variance, the maximum fluctuation value and the minimum fluctuation value of the network is greater than or equal to its corresponding preset threshold, determine that the target video is encoded in the layered coding mode.
Step A-12: if the fluctuation times, the fluctuation variance, the maximum fluctuation value and the minimum fluctuation value of the network are all smaller than their corresponding preset thresholds, determine that the target video is not encoded in the layered coding mode.
In the first mode, when at least one of the fluctuation times, the fluctuation variance, the maximum fluctuation value and the minimum fluctuation value of the network is larger than or equal to a corresponding preset threshold value, the current network is unstable, and at the moment, the target video is determined to be encoded by adopting a layered encoding mode.
For example, in the current detection window, when the fluctuation frequency of the network is greater than or equal to a preset threshold corresponding to the fluctuation frequency, and/or the fluctuation variance of the network is greater than or equal to a preset threshold corresponding to the fluctuation variance, and/or the maximum fluctuation value of the network is greater than or equal to a preset threshold corresponding to the maximum fluctuation value, and/or the minimum fluctuation value of the network is greater than or equal to a preset threshold corresponding to the minimum fluctuation value, it is determined that the hierarchical coding mode is adopted to code the target video.
In the current detection window, if the fluctuation times, the fluctuation variance, the maximum fluctuation value and the minimum fluctuation value of the network are all smaller than their corresponding preset thresholds, it indicates that the current network is stable, and it is determined that the target video is not encoded in the layered coding mode.
For example, if the fluctuation information includes the fluctuation frequency and the fluctuation variance of the network, when the fluctuation frequency of the network is smaller than a preset threshold corresponding to the fluctuation frequency and the fluctuation variance of the network is smaller than a preset threshold corresponding to the fluctuation variance, the current network is stable, and it is determined that the target video is not encoded by adopting the hierarchical encoding mode.
For another example, if the fluctuation information includes the fluctuation frequency, the fluctuation variance and the maximum fluctuation value of the network, when the fluctuation frequency of the network is smaller than a preset threshold corresponding to the fluctuation frequency, the fluctuation variance of the network is smaller than a preset threshold corresponding to the fluctuation variance, and the maximum fluctuation value of the network is smaller than a preset threshold corresponding to the maximum fluctuation value, the current network is stable, and it is determined that the target video is not encoded by adopting the layered encoding mode.
That is, if the fluctuation information of the network includes one or more of the fluctuation times, the fluctuation variance, the maximum fluctuation value and the minimum fluctuation value of the network, and all of the included items are smaller than their corresponding preset thresholds, the current network is considered stable, and it is determined that the layered coding mode is not adopted to encode the target video.
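As a concrete illustration of the first mode, the following sketch checks each available fluctuation statistic against its own preset threshold. The field names, threshold values and the helper function are assumptions introduced here for illustration only and do not come from the embodiment.

```python
from dataclasses import dataclass

@dataclass
class FluctuationInfo:
    count: int        # number of network fluctuations in the current detection window
    variance: float   # fluctuation variance
    max_value: float  # maximum fluctuation value
    min_value: float  # minimum fluctuation value

# Example thresholds; the embodiment does not fix their concrete values.
THRESHOLDS = {"count": 5, "variance": 0.2, "max_value": 0.5, "min_value": 0.1}

def should_enable_layered_coding(info: FluctuationInfo) -> bool:
    """Return True (network unstable, use layered coding) if any metric reaches
    its preset threshold; return False only when all metrics stay below their
    thresholds (network stable)."""
    return (info.count >= THRESHOLDS["count"]
            or info.variance >= THRESHOLDS["variance"]
            or info.max_value >= THRESHOLDS["max_value"]
            or info.min_value >= THRESHOLDS["min_value"])
```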
In a second mode, if the fluctuation information includes the number of key frame applications of the network, the step a includes the following steps:
step A-21, if the difference value between the number of key frame applications detected in the current detection window and the number of key frame applications detected in the last detection window is larger than or equal to a preset value, determining to encode the target video by adopting a layered encoding mode;
and step A-22, if the difference between the number of the key frame applications detected in the current detection window and the number of the key frame applications detected in the last detection window is smaller than a preset value, determining that a layered coding mode is not adopted to code the target video.
The specific magnitude of the preset value is not limited in the embodiment of the application, as long as it is greater than 0. Optionally, the preset value may be a fixed threshold.
In the second mode, if the number of key frame applications detected in the current detection window increases significantly, it indicates that the network is congested and the current network environment is unstable, so layered coding is prepared to be turned on automatically; otherwise, the network environment is regarded as normal, layered coding does not need to be turned on, and the network fluctuation condition continues to be monitored.
Specifically, comparing the number a of key frame applications detected in the current detection window with the number b of key frame applications detected in the last detection window, if the difference between the number a of key frame applications detected in the current detection window and the number b of key frame applications detected in the last detection window is greater than or equal to a preset value, indicating that the current network is unstable, and determining to encode the target video by adopting a layered encoding mode. If the difference between the key frame application number a detected in the current detection window and the key frame application number b detected in the last detection window is smaller than a preset value, the current network is stable, and it is determined that the target video is not encoded in a layered encoding mode.
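The second mode can be sketched as a comparison between two consecutive detection windows; the function name and the example preset value below are hypothetical.

```python
def keyframe_jump_detected(current_requests: int, previous_requests: int,
                           preset_value: int = 3) -> bool:
    """Mode two: if the number of key frame applications grows by at least the
    preset value from the previous detection window to the current one, the
    network is treated as congested and layered coding is enabled."""
    return (current_requests - previous_requests) >= preset_value
```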
According to the embodiment of the application, the opening and closing of the layered coding mode is automatically controlled through real-time detection of the network state, so that the self-adaptive selection of the layered coding mode is realized, and the coding quality and efficiency are improved.
In the embodiment of the present application, when it is determined that the hierarchical coding mode is adopted to code the target video according to the fluctuation information of the network, the step S403 is executed.
In some embodiments, if the decoding capability information of the terminal device further includes the maximum number of decoding reference frames supported by the terminal device, and the parameters corresponding to the layered coding manner include the number of layers, the step S403 includes the following steps S403-A1 and S403-A2:
S403-A1, determining the initial layering quantity corresponding to the layering coding mode according to the minimum fluctuation value of the network and the stable value of the network.
In the embodiment of the present application, when the current network environment state is determined to be unstable according to the above steps and the layered coding needs to be started, the initial layer number corresponding to the layered coding mode is determined according to the detection result of the network fluctuation condition, that is, the network fluctuation information. Specifically, according to the minimum fluctuation value of the network and the stable value of the network, the initial layering quantity corresponding to the layering coding mode is determined.
The method for determining the initial layering quantity corresponding to the layering coding mode is not limited according to the minimum fluctuation value of the network and the stable value of the network.
In one possible implementation, a value obtained by rounding a ratio between a stable value of the network and a minimum fluctuation value of the network is determined as an initial hierarchical number corresponding to the hierarchical coding mode. For example, if the ratio between the stable value of the network and the minimum fluctuation value of the network is 2, it is determined that the initial number of layers corresponding to the layered coding mode is 2.
In another possible implementation manner, if the minimum fluctuation value of the network is greater than or equal to the ratio of the stable value to N and less than the ratio of the stable value to N-1, determining the initial number of layers corresponding to the layered coding mode as N, where N is a positive integer greater than 1.
For example, if the minimum fluctuation value of the network >= the network stable value / 2, the worst network environment loses half of the bandwidth, that is, half of the video frames need to be discarded to keep data transmission smooth. In this case two-layer layered coding is adopted: half of the video frames are placed in the base layer and cannot be discarded, and the other half are placed in the enhancement layer and can be discarded flexibly according to the network condition.
For another example, if the network stable value / 2 > the minimum fluctuation value of the network >= the network stable value / 3, the worst network environment loses close to 2/3 of the bandwidth, that is, 2/3 of the video frames need to be discarded to keep data transmission smooth. In this case three-layer layered coding is adopted: 1/3 of the video frames are placed in the base layer and cannot be discarded, and 2/3 are placed in the enhancement layers and can be discarded flexibly according to the network condition.
The greater the number of layers corresponding to the layered coding mode, the lower the coding efficiency and quality. To ensure coding efficiency and the image quality of the coded video, the number of layers cannot be too large, i.e., N cannot exceed a preset value.
That is, when the initial number of layers N corresponding to the layered coding manner determined by the method is greater than a preset value, it is indicated that the network fluctuation is obvious, and at this time, the cloud server may reduce the current coding frame rate, that is, set to a preset coding frame rate, where the preset coding frame rate is smaller than the current coding frame rate, and then apply the adaptive layered coding strategy provided by the embodiment of the present application to ensure the image quality of the coded video.
The method of the embodiment of the application can adaptively select the layering quantity according to the network condition so as to adapt to different network jitter or network congestion conditions and provide different quantities of discardable frames.
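A minimal sketch of S403-A1 under the second implementation described above (choosing N such that the minimum fluctuation value lies between the stable value divided by N and the stable value divided by N-1), including the cap on N. The function name, the example cap of 4 layers, the clamping behaviour and the returned flag are assumptions.

```python
import math

def initial_layer_count(stable_value: float, min_fluctuation: float,
                        max_layers: int = 4) -> tuple:
    """Smallest N >= 2 with min_fluctuation >= stable_value / N, i.e. roughly
    N = ceil(stable_value / min_fluctuation). Returns (N, reduce_frame_rate):
    when the computed N exceeds the preset cap, N is clamped (an assumption)
    and the caller is told to lower the current coding frame rate to a preset
    coding frame rate."""
    n = max(2, math.ceil(stable_value / min_fluctuation))
    if n > max_layers:
        return max_layers, True
    return n, False
```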
S403-A2, determining the final layering quantity corresponding to the layering coding mode according to the initial layering quantity and the maximum decoding reference frame quantity.
The above-determined initial hierarchical quantity considers the fluctuation condition of the network, but does not consider the decoding performance of the terminal device, and when the above-determined initial hierarchical quantity is directly used for hierarchical coding, there may be a situation that the decoding capability of the terminal device does not meet the requirement of the initial hierarchical quantity.
To solve this problem, the embodiment of the application adjusts the determined initial number of layers according to the maximum number of decoding reference frames of the terminal device, so as to obtain the final number of layers corresponding to the layered coding mode.
That is, the cloud server can select a proper layering number according to the decoding capability of the terminal device, fully utilize the decoding capability of the terminal device, and improve the quality of layering coding.
In the embodiment of the present application, according to the initial layer number and the maximum decoding reference frame number, the ways of determining the final layer number corresponding to the layer coding manner include, but are not limited to, the following:
In one mode, if the maximum number of decoded reference frames is greater than or equal to the initial number of layers, the initial number of layers is determined to be the final number of layers.
If the maximum number of decoded reference frames is greater than or equal to the initial number of layers, the decoding capability of the terminal device is sufficient to decode the encoded code stream of the initial number of layers, so the determined initial number of layers may be determined as the final number of layers.
In the second mode, if the maximum number of decoded reference frames is smaller than the initial number of layers, the maximum number of decoded reference frames is determined as the final number of layers.
In the second mode, if the maximum decoding reference frame number of the terminal device is smaller than the initial layering number, it is indicated that the decoding capability of the terminal device cannot correctly decode the code stream encoded by the initial layering number. At this time, the above-determined initial number of layers needs to be adjusted, for example, the initial number of layers is adjusted to be less than or equal to the number of maximum decoding reference frames of the terminal device, and the adjusted value is determined to be the final number of layers corresponding to the layered coding scheme.
In one possible implementation, if the maximum number of decoded reference frames is less than the initial number of layers, the maximum number of decoded reference frames is directly determined as the final number of layers.
For example, for double layer coding, the maximum number of decoded reference frames for the terminal device > =2 is required, for triple layer coding, the maximum number of decoded reference frames for the terminal device > =3 is required, and so on. If the maximum number of decoded reference frames of the terminal device does not meet the above-determined requirement of the initial number of layers, for example, the initial number of layers is 3, but the maximum number of decoded reference frames of the terminal device is 2, the initial number of layers is automatically reduced to a configuration supported by the maximum number of decoded reference frames, that is, the determined final number of layers is 2.
According to the embodiment of the application, the final layering quantity corresponding to the layering coding mode can be determined according to the minimum fluctuation value of the network and the maximum decoding reference frame quantity of the terminal equipment, so that when the target video is coded, the target video can be coded by adopting the final layering quantity, and a layering coded code stream is obtained. When the terminal equipment receives the code stream of the layered coding, the terminal equipment can realize the accurate decoding of the code stream, thereby improving the coding and decoding quality of the video.
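Putting the two modes of S403-A2 together amounts to clamping the network-derived layer count by the device capability; the following helper is an illustrative sketch.

```python
def final_layer_count(initial_layers: int, max_decode_ref_frames: int) -> int:
    """Mode one: keep the initial layer count when the terminal can reference at
    least that many decoded frames; mode two: otherwise fall back to the maximum
    number of decoding reference frames the terminal supports."""
    return min(initial_layers, max_decode_ref_frames)
```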
In some embodiments, if the parameters corresponding to the layered coding scheme include the number of coding reference frames, S403 includes the following S403-B:
S403-B, determining the number of coding reference frames according to the final layering number and the maximum decoding reference frame number.
In this embodiment of the present application, if the parameters corresponding to the layered coding manner include the number of coding reference frames, the number of coding reference frames may be determined according to the determined final number of layers corresponding to the layered coding manner and the maximum number of decoding reference frames of the terminal device. That is, in the embodiment of the present application, when the decoding speed of the terminal device is too slow, the problem can be alleviated by discarding enhancement frames.
The implementation of S403-B includes, but is not limited to, the following:
In the first mode, if the layered coding mode is not turned on, the number of coding reference frames is determined to be a positive integer less than or equal to the maximum number of decoding reference frames. If the layered coding mode is turned on, the number of coding reference frames is determined to be a positive integer less than or equal to the ratio between the maximum number of decoding reference frames and the final number of layers.
For example, for two-layer coding, when SVC is not turned on, the number of coding reference frames <= the maximum number of decoding reference frames; when SVC is turned on, the encoder needs to be reset and the number of coding reference frames is adjusted so that it is <= (maximum number of decoding reference frames / 2).
Since increasing the number of reference frames can improve coding quality, it is recommended that the number of reference frames be set as large as possible, whether SVC is on or off, subject to external factors such as coding time consumption. For example, if the maximum decoding reference frame buffer of the terminal device holds 4 frames, the recommended number of coding reference frames is 4 when SVC is not turned on and 2 when SVC is turned on; the specific reference relationship is shown in fig. 5, where the gray frames are discardable enhancement frames.
For three-layer coding, when SVC is not turned on, the number of coding reference frames <= the maximum number of decoding reference frames; when SVC is turned on, the encoder needs to be reset and the number of coding reference frames is adjusted so that it is <= (maximum number of decoding reference frames / 3). For example, if the maximum number of decoding reference frames of the terminal device is 6 frames, the number of coding reference frames is set to at most 6 when SVC is not turned on and at most 2 when SVC is turned on; the specific reference relationship is shown in fig. 6, where the gray frames are discardable enhancement frames.
In the second mode, regardless of whether the layered coding mode is turned on, the number of coding reference frames is determined to be a positive integer less than or equal to the ratio between the maximum number of decoding reference frames and the final number of layers. In this way, the encoder does not need to be reset when SVC is turned on temporarily; only the reference frame relationship needs to be modified.
It should be noted that the first and second modes are merely examples of determining the number of coding reference frames. Multi-frame layered coding can be implemented in many ways, as long as the number of discardable frames (frames not referenced by base-layer frames) meets the coding requirement and the farthest distance between a reference frame and the current frame is <= the maximum decoding reference frame buffer size.
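The first mode of S403-B can be sketched as follows; the function signature is hypothetical, and the example values simply reproduce the 4-frame and 6-frame cases above.

```python
def coding_reference_frames(max_decode_ref_frames: int, final_layers: int,
                            svc_enabled: bool) -> int:
    """Without SVC, use up to the terminal's maximum number of decoding
    reference frames; with SVC, divide that budget by the final layer count so
    that the reference relationship still fits the decoder's buffer."""
    if svc_enabled:
        return max(1, max_decode_ref_frames // final_layers)
    return max(1, max_decode_ref_frames)

# Example values matching the text: a 4-frame buffer with two layers gives
# 4 references with SVC off and 2 with SVC on; a 6-frame buffer with three
# layers gives 6 and 2, respectively.
```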
As can be seen from the foregoing, in the embodiments of the present application, layered coding can be flexibly and automatically turned on/off according to network conditions, and the number of layers is adaptively selected to adapt to different network jitter or network congestion conditions, so as to provide different numbers of discardable frames.
In addition, the cloud server can select proper layering quantity according to the decoding capability of the terminal equipment, fully utilize the decoding capability of the terminal equipment and improve the quality of layering coding. When the terminal has too slow decoding speed, the phenomenon can be relieved by discarding the enhancement frames.
According to the layered coding method provided by the embodiment of the application, the cloud server acquires decoding capability information of the terminal equipment, wherein the decoding capability information comprises decoding chip types included in the terminal equipment; detecting a network according to the type of the decoding chip when the decoding chip supporting the layered decoding mode exists in the terminal equipment, so as to obtain fluctuation information of the network; and determining parameters corresponding to the layered coding mode according to the fluctuation information of the network and the decoding capability information of the terminal equipment, and coding the target video by using the parameters corresponding to the layered coding mode. Namely, according to the embodiment of the application, the optimal hierarchical coding configuration is selected according to the network fluctuation condition and the decoding capability of the terminal equipment, so that the coding quality is improved.
Fig. 7 is a flowchart of a layered coding method according to a specific embodiment of the present application, including:
S701, decoding capability information of the terminal device is acquired.
Wherein the decoding capability information comprises a decoding chip type included in the terminal device.
S702, judging whether a decoding chip supporting a layered decoding mode exists in the terminal equipment.
If the terminal device has a decoding chip supporting the layered decoding method, the following S703 is executed, and if the terminal device has no decoding chip supporting the layered decoding method, the following S706 is executed.
S703, detecting the network to obtain the fluctuation information of the network.
The fluctuation information of the network comprises at least one of the fluctuation times, the fluctuation variance, the key frame application number, the maximum fluctuation value and the minimum fluctuation value of the network detected in the current detection window.
S704, judging whether the network fluctuation is serious.
If it is determined that the current network fluctuation is serious, the following step S705 is executed, and if it is determined that the current network fluctuation is not serious, the above step S703 is executed again, and the monitoring of the network is continued.
The process of determining whether the current network fluctuation is serious may refer to the related description of the step a, which is not repeated herein.
And if the current network fluctuation is serious, determining to start a layered coding mode to code the target video. And if the current network fluctuation is not serious, not starting the layered coding mode.
S705, determining the final layering quantity and the coding reference frame quantity corresponding to the layering coding mode according to the fluctuation information of the network and the maximum decoding reference frame quantity of the terminal equipment.
S706, the layered coding mode is not supported.
According to the layered coding method provided by the embodiment of the application, the cloud server acquires the decoding capability information of the terminal equipment and the fluctuation information of the network, and determines the final layered quantity and the coding reference frame quantity corresponding to the layered coding mode according to the decoding capability information of the terminal equipment and the fluctuation information of the network, so that the network condition and the decoding capability of the terminal equipment are fully considered, and the coding quality is further improved.
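Putting the steps of fig. 7 together, one possible control loop is sketched below; it reuses the helper functions sketched earlier, and every object interface and attribute name (terminal, network_monitor, encoder, stable_value, and so on) is a hypothetical placeholder rather than part of the embodiment.

```python
def layered_coding_control_loop(terminal, network_monitor, encoder):
    """End-to-end sketch of the flow in fig. 7 under the stated assumptions."""
    caps = terminal.get_decoding_capability()               # S701
    if not caps.has_layered_decoding_chip:                  # S702
        return None                                         # S706: layered coding not supported
    while True:
        info = network_monitor.detect()                     # S703: fluctuation statistics
        if not should_enable_layered_coding(info):          # S704: fluctuation not serious
            continue                                        # keep monitoring the network
        # S705: derive the final layer count and the coding reference frame count.
        # info is assumed to also carry the network's stable value.
        layers, reduce_fps = initial_layer_count(info.stable_value, info.min_value)
        layers = final_layer_count(layers, caps.max_decode_ref_frames)
        refs = coding_reference_frames(caps.max_decode_ref_frames, layers, svc_enabled=True)
        return encoder.configure(layers=layers, reference_frames=refs,
                                 reduce_frame_rate=reduce_fps)
```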
Fig. 8 is an interaction flowchart of a layered coding method according to a specific embodiment of the present application, including:
S801, the cloud server sends a decoding capability request to the terminal device through the client.
In case 1, the decoding capability request is the first decoding capability request, and the first decoding capability request is used to request a decoding parameter of the terminal device. That is, the cloud server transmits a first decoding capability request to the terminal device through the client, and the terminal device receives the first decoding capability request and then transmits first response information to the cloud server through the client, wherein the first response information includes decoding capability information of the terminal device.
In case 2, if the decoding capability information includes the maximum number of decoding reference frames supported by the terminal device, the decoding capability request may be a second decoding capability request in the foregoing embodiment, where the second decoding capability request is used to request a decoding result of the terminal device with respect to the video code streams with different reference frame numbers, and the second decoding capability request includes the video code streams with different reference frame numbers. That is, the cloud server transmits a second decoding capability request including video code streams of different reference frame numbers to the terminal device through the client. After receiving the second decoding capability request, the terminal equipment analyzes the second decoding capability request to obtain video code streams with different reference frame numbers, and decodes the video code streams with different reference frame numbers respectively to obtain decoding results of the terminal equipment about the video code streams with different reference frame numbers. And then, the terminal equipment sends the decoding results of the video code streams of the terminal equipment about the different reference frame numbers to the cloud server through the client, and the cloud server determines the maximum decoding reference frame number supported by the terminal equipment according to the decoding results of the video code streams of the terminal equipment about the different reference frame numbers.
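Case 2 can be sketched as a probing loop in which the cloud server keeps the largest reference frame count that the terminal decodes correctly; the client API, the helper that builds the test bitstreams, and the candidate counts are all hypothetical.

```python
def probe_max_decode_reference_frames(client, make_test_bitstream,
                                      candidate_counts=(2, 3, 4, 6, 8)) -> int:
    """Send short test bitstreams encoded with increasing reference frame
    counts (the second decoding capability request) and return the largest
    count the terminal reports it decoded correctly."""
    supported = 1
    for n in sorted(candidate_counts):
        bitstream = make_test_bitstream(reference_frames=n)
        result = client.request_decode(bitstream)   # terminal returns a decoding result
        if not result.decoded_ok:
            break
        supported = n
    return supported
```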
The cloud server initiates a request to the terminal equipment through the START client to acquire decoding capability information of the terminal equipment.
For example, the code of the decoding capability request is shown in Table 4:
TABLE 4
In some embodiments, if the decoding capability request is the second decoding capability request, the decoding capability request may further include video code streams with different reference frame numbers.
S802, the terminal equipment returns response information to the cloud server through the client.
Alternatively, the response information may be the first response or the second response.
In some embodiments, the response information returned by the terminal device may include several examples as follows.
In example 1, if the terminal device supports all decoding protocols, the returned response information is shown in table 5:
TABLE 5
In example 2, if the terminal device supports only a partial decoding protocol, the returned response information is shown in table 6:
TABLE 6
Example 3, if the terminal device does not support the partial decoding protocol, the returned response information is shown in table 7:
TABLE 7
Example 4, if the terminal device capability information request fails, the returned response information is shown in table 8:
TABLE 8
It should be noted that tables 4 to 8 are merely examples; in the embodiment of the present application, the format of the decoding capability request and of the response information returned by the terminal device includes, but is not limited to, those shown in tables 4 to 8.
S803, the cloud server determines the optimal decoding configuration of the terminal equipment.
In the embodiment of the application, after receiving the terminal rendering capability information, the cloud server determines the optimal rendering collaborative configuration of the current terminal equipment, and sends the collaborative task to the terminal equipment.
After receiving the terminal decoding capability information, the cloud server combines the game type and the user network condition to determine the optimal decoding protocol, decoding resolution, video frame rate, layered decoding mode and other encoding and decoding configuration of the current terminal equipment, and video coding reference frame number, SVC enabling and other encoding and decoding strategies.
Optionally, the cloud server may further perform the following step S804.
S804, the cloud server sends the optimal decoding configuration to the terminal equipment through the client.
And the cloud server sends the determined optimal decoding configuration of the terminal equipment to the terminal equipment so that the terminal equipment decodes the video stream after receiving the optimal decoding configuration.
It should be understood that fig. 4-8 are only examples of the present application and should not be construed as limiting the present application.
The preferred embodiments of the present application have been described in detail above with reference to the accompanying drawings, but the present application is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solutions of the present application within the scope of the technical concept of the present application, and all the simple modifications belong to the protection scope of the present application. For example, the specific features described in the above embodiments may be combined in any suitable manner, and in order to avoid unnecessary repetition, various possible combinations are not described in detail. As another example, any combination of the various embodiments of the present application may be made without departing from the spirit of the present application, which should also be considered as disclosed herein.
Method embodiments of the present application are described in detail above in connection with fig. 4-8, and apparatus embodiments of the present application are described in detail below.
Fig. 9 is a schematic structural diagram of a layered coding apparatus according to an embodiment of the present application, where the apparatus is applied to a cloud server, and the apparatus 10 includes:
an obtaining unit 11, configured to obtain decoding capability information of a terminal device, where the decoding capability information includes a decoding chip type included in the terminal device;
the network detection unit 12 is configured to detect a network according to the type of the decoding chip, and obtain fluctuation information of the network when it is determined that the decoding chip supporting the layered decoding mode exists in the terminal device;
and the encoding unit 13 is configured to determine parameters corresponding to a layered coding mode according to the fluctuation information of the network and the decoding capability information of the terminal device, and encode the target video by using the parameters corresponding to the layered coding mode.
In some embodiments, the encoding unit 13 is further configured to determine, according to the fluctuation information of the network, whether to encode the target video using the hierarchical encoding mode; and when the target video is determined to be encoded by using the layered coding mode, determining parameters corresponding to the layered coding mode according to the fluctuation information of the network and the decoding capability information of the terminal equipment.
In some embodiments, the fluctuation information of the network includes at least one of a number of fluctuation of the network detected within a current detection window, a fluctuation variance, a key frame application number, a maximum fluctuation value, and a minimum fluctuation value.
In some embodiments, if the fluctuation information includes at least one of a fluctuation frequency, a fluctuation variance, a fluctuation maximum value, and a fluctuation minimum value of the network, the encoding unit 13 is specifically configured to determine to encode the target video in the layered coding manner if at least one of the fluctuation frequency, the fluctuation variance, the maximum fluctuation value, and the minimum fluctuation value of the network is greater than or equal to a corresponding preset threshold; and if the fluctuation times, the fluctuation variance, the maximum fluctuation value and the minimum fluctuation value of the network are smaller than the corresponding preset threshold values, determining that the layered coding mode is not adopted to code the target video.
In some embodiments, if the fluctuation information includes the number of key frame applications of the network, the encoding unit 13 is specifically configured to determine to encode the target video in the layered encoding manner if a difference between the number of key frame applications detected in the current detection window and the number of key frame applications detected in a previous detection window is greater than or equal to a preset value; and if the difference value between the number of the key frame applications detected in the current detection window and the number of the key frame applications detected in the last detection window is smaller than a preset value, determining that the hierarchical coding mode is not adopted to code the target video.
In some embodiments, the decoding capability information of the terminal device further includes a maximum number of decoding reference frames supported by the terminal device, the parameter corresponding to the layered coding manner includes a number of layers, and the coding unit 13 is specifically configured to determine, according to a minimum fluctuation value of the network and a stable value of the network, an initial number of layers corresponding to the layered coding manner; and determining the final layering quantity corresponding to the layering coding mode according to the initial layering quantity and the maximum decoding reference frame quantity.
In some embodiments, the encoding unit 13 is specifically configured to determine that the initial number of layers corresponding to the layered coding mode is N, where N is a positive integer greater than 1, if the minimum fluctuation value of the network is greater than or equal to the ratio of the stable value to N and is smaller than the ratio of the stable value to N-1.
Optionally, if N is greater than a preset value, the encoding unit 13 is further configured to set a current encoding frame rate to a preset encoding frame rate, where the preset encoding frame rate is smaller than the current encoding frame rate.
In some embodiments, the encoding unit 13 is specifically configured to determine the initial number of layers as the final number of layers if the maximum number of decoded reference frames is greater than or equal to the initial number of layers; and if the maximum decoding reference frame number is smaller than the initial layering number, determining the maximum decoding reference frame number as the final layering number.
In some embodiments, the parameters corresponding to the layered coding manner include a number of coded reference frames, and the coding unit 13 is specifically configured to determine the number of coded reference frames according to the final layered number and the maximum number of decoded reference frames.
In some embodiments, the encoding unit 13 is specifically configured to determine, if the layered coding mode is on, that the number of encoded reference frames is a positive integer less than or equal to a ratio between the maximum number of decoded reference frames and the final layered number; and if the layered coding mode is not started, determining that the number of the coding reference frames is a positive integer smaller than or equal to the ratio between the maximum decoding reference frame number and the final layered number, or determining that the number of the coding reference frames is a positive integer smaller than the maximum decoding reference frame number.
In some embodiments, the maximum number of decoding reference frames supported by the terminal device is the maximum number of decoding reference frames corresponding to a decoding chip that supports the layered decoding manner and has optimal decoding performance in the terminal device.
In some embodiments, the obtaining unit 11 is specifically configured to obtain the decoding capability information of the terminal device by accessing a hardware interface of the terminal device; or, the obtaining unit 11 is specifically configured to send a first decoding capability request to the terminal device, and receive first response information sent by the terminal device according to the first decoding capability request, where the first decoding capability request is used to request a decoding parameter of the terminal device, and the first response information includes decoding capability information of the terminal device.
In some embodiments, if the decoding capability information includes the maximum number of decoding reference frames supported by the terminal device, the obtaining unit 11 is specifically configured to send a second decoding capability request to the terminal device, where the second decoding capability request includes video code streams with different numbers of reference frames, and the second decoding capability request is used to request a decoding result of the terminal device with respect to the video code streams with different numbers of reference frames; receiving second response information sent by the terminal equipment, wherein the second response information comprises decoding results of video code streams of different reference frame numbers of the terminal equipment; and determining the maximum decoding reference frame number supported by the terminal equipment according to the decoding result.
It should be understood that apparatus embodiments and method embodiments may correspond with each other and that similar descriptions may refer to the method embodiments. To avoid repetition, no further description is provided here. Specifically, the apparatus 10 shown in fig. 9 may perform the above-described method embodiments, and the foregoing and other operations and/or functions of each module in the apparatus 10 are respectively for implementing the above-described method embodiments shown in fig. 4 and 5, and are not repeated herein for brevity.
The apparatus of the embodiments of the present application are described above in terms of functional modules in conjunction with the accompanying drawings. It should be understood that the functional module may be implemented in hardware, or may be implemented by instructions in software, or may be implemented by a combination of hardware and software modules. Specifically, each step of the method embodiments in the embodiments of the present application may be implemented by an integrated logic circuit of hardware in a processor and/or an instruction in software form, and the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented as a hardware decoding processor or implemented by a combination of hardware and software modules in the decoding processor. Alternatively, the software modules may be located in a well-established storage medium in the art such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, and the like. The storage medium is located in a memory, and the processor reads information in the memory, and in combination with hardware, performs the steps in the above method embodiments.
Fig. 10 is a schematic block diagram of an electronic device provided in an embodiment of the present application, where the electronic device may be a terminal device and/or a server as described above.
As shown in fig. 10, the electronic device 40 may include:
a memory 41 and a processor 42, the memory 41 being adapted to store a computer program and to transfer the program code to the processor 42. In other words, the processor 42 may call and run a computer program from the memory 41 to implement the methods in the embodiments of the present application.
For example, the processor 42 may be used to perform the method embodiments described above in accordance with instructions in the computer program.
In some embodiments of the present application, the processor 42 may include, but is not limited to:
a general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like.
In some embodiments of the present application, the memory 41 includes, but is not limited to:
volatile memory and/or nonvolatile memory. The nonvolatile Memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable EPROM (EEPROM), or a flash Memory. The volatile memory may be random access memory (Random Access Memory, RAM) which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (Double Data Rate SDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), and Direct memory bus RAM (DR RAM).
In some embodiments of the present application, the computer program may be partitioned into one or more modules that are stored in the memory 41 and executed by the processor 42 to perform the methods provided herein. The one or more modules may be a series of computer program instruction segments capable of performing particular functions, used to describe the execution of the computer program in the electronic device.
As shown in fig. 10, the electronic device 40 may further include:
a transceiver 43; the transceiver 43 may be connected to the processor 42 or the memory 41.
The processor 42 may control the transceiver 43 to communicate with other devices, and in particular, may transmit information or data to other devices or receive information or data transmitted by other devices. The transceiver 43 may include a transmitter and a receiver. The transceiver 43 may further include antennas, the number of which may be one or more.
It will be appreciated that the various components in the electronic device 40 are connected by a bus system that includes, in addition to a data bus, a power bus, a control bus and a status signal bus.
The present application also provides a computer storage medium having stored thereon a computer program which, when executed by a computer, enables the computer to perform the method of the above-described method embodiments. Alternatively, embodiments of the present application also provide a computer program product comprising instructions which, when executed by a computer, cause the computer to perform the method of the method embodiments described above.
When implemented in software, the above embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by a wired (e.g., coaxial cable, fiber optic, digital subscriber line (digital subscriber line, DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) connection. The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital video disc (digital video disc, DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), or the like.
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be additional divisions when actually implemented, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules, which may be in electrical, mechanical, or other forms.
The modules illustrated as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. For example, functional modules in the embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (17)

1. A layered coding method, comprising:
acquiring decoding capability information of terminal equipment, wherein the decoding capability information comprises decoding chip types included in the terminal equipment;
Detecting a network to obtain fluctuation information of the network when determining that a decoding chip supporting a layered decoding mode exists in the terminal equipment according to the decoding chip type;
and determining parameters corresponding to the layered coding mode according to the fluctuation information of the network and the decoding capability information of the terminal equipment, and coding the target video by using the parameters corresponding to the layered coding mode.
2. The method according to claim 1, wherein before determining parameters corresponding to the layered coding scheme according to fluctuation information of the network and decoding capability information of the terminal device, the method further comprises:
and determining whether to encode the target video by using the layered coding mode according to the fluctuation information of the network.
3. The method of claim 2, wherein the fluctuation information of the network includes at least one of a number of fluctuation, a fluctuation variance, a number of key frame applications, a maximum fluctuation value, and a minimum fluctuation value of the network detected within a current detection window.
4. The method according to claim 3, wherein if the fluctuation information includes at least one of a fluctuation number, a fluctuation variance, a fluctuation maximum value, and a fluctuation minimum value of the network, the determining whether to encode the target video in a layered coding manner according to the fluctuation information of the network includes:
If at least one of the fluctuation times, the fluctuation variance, the maximum fluctuation value and the minimum fluctuation value of the network is larger than or equal to a corresponding preset threshold value, determining to code the target video by adopting the layered coding mode;
and if the fluctuation times, the fluctuation variance, the maximum fluctuation value and the minimum fluctuation value of the network are smaller than the corresponding preset threshold values, determining that the layered coding mode is not adopted to code the target video.
5. The method according to claim 3, wherein if the fluctuation information includes a key frame application number of the network, determining whether to encode the target video in a layered coding manner according to the fluctuation information of the network includes:
if the difference value between the number of the key frame applications detected in the current detection window and the number of the key frame applications detected in the last detection window is larger than or equal to a preset value, determining to encode the target video by adopting the layered encoding mode;
and if the difference value between the number of the key frame applications detected in the current detection window and the number of the key frame applications detected in the last detection window is smaller than a preset value, determining that the hierarchical coding mode is not adopted to code the target video.
6. The method according to any one of claims 1-5, wherein the decoding capability information of the terminal device further includes a maximum number of decoding reference frames supported by the terminal device, the parameters corresponding to the layered coding scheme include a number of layers, and determining the parameters corresponding to the layered coding scheme according to the fluctuation information of the network and the decoding capability information of the terminal device includes:
determining the initial layering quantity corresponding to the layering coding mode according to the minimum fluctuation value of the network and the stable value of the network;
and determining the final layering quantity corresponding to the layering coding mode according to the initial layering quantity and the maximum decoding reference frame quantity.
7. The method of claim 6, wherein determining the initial number of layers corresponding to the layered coding scheme according to the minimum fluctuation value of the network and the stable value of the network comprises:
if the minimum fluctuation value of the network is larger than or equal to the ratio of the stable value to N and smaller than the ratio of the stable value to N-1, determining the initial layering quantity corresponding to the layering coding mode as N, wherein N is a positive integer larger than 1.
8. The method of claim 7, wherein if N is greater than a preset value, the method further comprises:
setting a current coding frame rate to a preset coding frame rate, wherein the preset coding frame rate is smaller than the current coding frame rate.
9. The method according to claim 6, wherein determining a final number of layers corresponding to the layered coding scheme according to the initial number of layers and the maximum number of decoded reference frames comprises:
if the maximum decoding reference frame number is greater than or equal to the initial layering number, determining the initial layering number as the final layering number;
and if the maximum decoding reference frame number is smaller than the initial layering number, determining the maximum decoding reference frame number as the final layering number.
10. The method according to claim 6, wherein the parameters corresponding to the layered coding scheme include a number of coded reference frames, and the determining the parameters corresponding to the layered coding scheme according to the fluctuation information of the network and the decoding capability information of the terminal device includes:
and determining the number of coding reference frames according to the final layering number and the maximum decoding reference frame number.
11. The method of claim 10, wherein said determining the number of encoded reference frames based on said final number of layers and said maximum number of decoded reference frames comprises:
if the layered coding mode is started, determining the number of the coding reference frames as a positive integer which is smaller than or equal to the ratio between the maximum decoding reference frame number and the final layered number;
and if the layered coding mode is not started, determining that the number of the coding reference frames is a positive integer smaller than or equal to the ratio between the maximum decoding reference frame number and the final layered number, or determining that the number of the coding reference frames is a positive integer smaller than the maximum decoding reference frame number.
12. The method of claim 6, wherein the maximum number of decoded reference frames supported by the terminal device is a maximum number of decoded reference frames corresponding to a decoding chip in the terminal device that supports the layered decoding mode and has an optimal decoding performance.
13. The method according to any one of claims 1-5, wherein the obtaining decoding capability information of the terminal device comprises:
obtaining decoding capability information of the terminal equipment by accessing a hardware interface of the terminal equipment; or,
And sending a first decoding capability request to the terminal equipment, and receiving first response information sent by the terminal equipment according to the first decoding capability request, wherein the first decoding capability request is used for requesting decoding parameters of the terminal equipment, and the first response information comprises decoding capability information of the terminal equipment.
14. The method according to any of claims 1-5, wherein the obtaining the decoding capability information of the terminal device if the decoding capability information includes a maximum number of decoding reference frames supported by the terminal device comprises:
transmitting a second decoding capability request to the terminal device, wherein the second decoding capability request comprises video code streams with different reference frame numbers, and the second decoding capability request is used for requesting decoding results of the terminal device on the video code streams with different reference frame numbers;
receiving second response information sent by the terminal equipment, wherein the second response information comprises decoding results of video code streams of different reference frame numbers of the terminal equipment;
and determining the maximum decoding reference frame number supported by the terminal equipment according to the decoding result.
15. A layered coding apparatus, comprising:
an obtaining unit, configured to obtain decoding capability information of a terminal device, where the decoding capability information includes a decoding chip type included in the terminal device;
the network detection unit is used for detecting a network to obtain fluctuation information of the network when determining that a decoding chip supporting a layered decoding mode exists in the terminal equipment according to the decoding chip type;
and the coding unit is used for determining parameters corresponding to the layered coding mode according to the fluctuation information of the network and the decoding capability information of the terminal equipment, and coding the target video by using the parameters corresponding to the layered coding mode.
16. An electronic device, comprising:
a processor and a memory for storing a computer program, the processor being for invoking and running the computer program stored in the memory to perform the method of any of claims 1 to 14.
17. A computer storage medium comprising computer program instructions that cause a computer to perform the method of any of claims 1 to 14.
CN202210103030.XA 2022-01-27 2022-01-27 Hierarchical coding method, hierarchical coding device, hierarchical coding equipment and storage medium Pending CN116567256A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210103030.XA CN116567256A (en) 2022-01-27 2022-01-27 Hierarchical coding method, hierarchical coding device, hierarchical coding equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210103030.XA CN116567256A (en) 2022-01-27 2022-01-27 Hierarchical coding method, hierarchical coding device, hierarchical coding equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116567256A true CN116567256A (en) 2023-08-08

Family

ID=87488481

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210103030.XA Pending CN116567256A (en) 2022-01-27 2022-01-27 Hierarchical coding method, hierarchical coding device, hierarchical coding equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116567256A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117579820A (en) * 2024-01-15 2024-02-20 腾讯科技(深圳)有限公司 Encoding parameter determining method, encoding parameter determining device, electronic equipment and storage medium
CN117579820B (en) * 2024-01-15 2024-05-10 腾讯科技(深圳)有限公司 Encoding parameter determining method, encoding parameter determining device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
KR102574560B1 (en) Linear model prediction mode with sample access for video coding
KR102305988B1 (en) Color index coding for palette-based video coding
RU2678490C2 (en) Determining palette size, palette entries and filtering of palette coded blocks in video coding
US11638003B2 (en) Video coding and decoding methods and devices using a library picture bitstream
EP3198873A1 (en) Parsing dependency reduction for palette index coding
KR20210125088A (en) Encoders, decoders and corresponding methods harmonizing matrix-based intra prediction and quadratic transform core selection
US11496754B2 (en) Video encoder, video decoder, and corresponding method of predicting random access pictures
CN112673640A (en) Encoder, decoder and corresponding methods using palette coding
KR20210139446A (en) Method and apparatus for intra-smoothing
CN113132728B (en) Coding method and coder
US11516470B2 (en) Video coder and corresponding method
KR20220156069A (en) Encoders, decoders and corresponding methods
CN113938679B (en) Image type determination method, device, equipment and storage medium
CN114205582B (en) Loop filtering method, device and equipment for video coding and decoding
CN113973210B (en) Media file packaging method, device, equipment and storage medium
CN115866297A (en) Video processing method, device, equipment and storage medium
CN114762339B (en) Image or video coding based on transform skip and palette coding related high level syntax elements
CN116567256A (en) Hierarchical coding method, hierarchical coding device, hierarchical coding equipment and storage medium
CN116567244A (en) Determination method, device, equipment and storage medium of decoding configuration parameters
CN116760976B (en) Affine prediction decision method, affine prediction decision device, affine prediction decision equipment and affine prediction decision storage medium
CN112640460A (en) Apparatus and method for boundary partitioning
WO2022179394A1 (en) Image block prediction sample determining method, and encoding and decoding devices
WO2022155922A1 (en) Video coding method and system, video decoding method and system, video coder and video decoder
WO2022217447A1 (en) Video encoding and decoding method and system, and video codec
WO2022174475A1 (en) Video encoding method and system, video decoding method and system, video encoder, and video decoder

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination