WO2022116165A1 - Video encoding method, video decoding method, encoder, decoder, and AI accelerator - Google Patents

Video encoding method, video decoding method, encoder, decoder, and AI accelerator

Info

Publication number
WO2022116165A1
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
network model
information
parameter set
encoding
Prior art date
Application number
PCT/CN2020/133979
Other languages
English (en)
French (fr)
Inventor
Yan Zhou (周焰)
Xiaozhen Zheng (郑萧桢)
Original Assignee
SZ DJI Technology Co., Ltd. (深圳市大疆创新科技有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SZ DJI Technology Co., Ltd. (深圳市大疆创新科技有限公司)
Priority to PCT/CN2020/133979 (WO2022116165A1)
Priority to CN202080081315.7A (CN114868390A)
Publication of WO2022116165A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10: using adaptive coding
    • H04N19/102: characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/117: Filters, e.g. for pre-processing or post-processing
    • H04N19/13: Adaptive entropy coding, e.g. adaptive variable length coding [AVLC] or context adaptive binary arithmetic coding [CABAC]
    • H04N19/70: characterised by syntax aspects related to video coding, e.g. related to compression standards
    • H04N19/80: Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation
    • H04N19/82: involving filtering within a prediction loop

Definitions

  • The present disclosure relates to the field of information technology, and in particular to a video encoding method, a neural network-based video encoding method, a video decoding method, a neural network-based video decoding method, a video encoder, a video decoder, an AI accelerator for video encoding, and an AI accelerator for video decoding.
  • Neural network-based technology is increasingly used in various fields. Since it usually achieves better results than traditional techniques, some neural network-based coding tools have also been introduced into video coding technology. However, most existing video coding techniques that use neural networks are implemented on top of a specific coding standard; this tight coupling with a specific standard narrows the scope of application of neural network-based video coding techniques.
  • The present disclosure provides a video encoding method, a neural network-based video encoding method, a video decoding method, a neural network-based video decoding method, a video encoder, a video decoder, an AI accelerator for video encoding, and an AI accelerator for video decoding, which enable video coding technology using neural networks to be used either independently of, or compatibly with, existing coding standards, overcoming the defects of the related art.
  • In a first aspect, an embodiment of the present disclosure provides a video encoding method, the method including: performing encoding processing on video data, the encoding processing including encoding processing using a neural network model; and generating, based on the encoded video data, a code stream carrying syntax elements, the syntax elements containing information characterizing a parameter set of the neural network model.
  • embodiments of the present disclosure provide a neural network-based video encoding method, the method comprising: encoding video data by using a neural network model; sending a parameter set of the neural network model to a video encoder, so that the video encoder generates a code stream carrying syntax elements based on the encoded video data, and the syntax elements include information representing a parameter set of the neural network model.
  • In another aspect, an embodiment of the present disclosure provides a video decoding method, the method including: parsing a received video code stream to obtain syntax elements of the video code stream, the syntax elements including information representing a parameter set of a neural network model; and, according to the syntax elements, decoding the video code stream using the neural network model corresponding to the parameter set.
  • In another aspect, an embodiment of the present disclosure provides a neural network-based video decoding method, the method including: acquiring syntax elements obtained after a video decoder parses a video code stream, the syntax elements including information representing a parameter set of a neural network model; and, according to the syntax elements, decoding the video code stream using the neural network model corresponding to the parameter set.
  • In another aspect, an embodiment of the present disclosure provides a video encoder, the encoder including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the following method is implemented: encoding processing is performed on video data, the encoding processing including encoding processing using a neural network model; and, based on the encoded video data, a code stream carrying syntax elements is generated, the syntax elements including information representing the parameter set of the neural network model.
  • In another aspect, an embodiment of the present disclosure provides a video decoder, the video decoder including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the following method is implemented: the received video code stream is parsed to obtain the syntax elements of the video code stream, the syntax elements containing information representing the parameter set of a neural network model; and the neural network model corresponding to the parameter set is used to decode the video code stream.
  • In another aspect, embodiments of the present disclosure provide an AI accelerator for video encoding, the AI accelerator including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the following method is implemented: encoding the video data; and sending the parameter set of the neural network to the video encoder, so that the video encoder generates, based on the encoded video data, a code stream carrying syntax elements, the syntax elements containing information characterizing the parameter set of the neural network.
  • In another aspect, embodiments of the present disclosure provide an AI accelerator for video decoding, the AI accelerator including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the following method is implemented: acquiring the syntax elements obtained after the video decoder parses the video code stream, the syntax elements including information representing the parameter set of the neural network; and decoding the video code stream using the corresponding neural network model.
  • In a further aspect, an embodiment of the present disclosure provides a machine-readable storage medium storing computer instructions which, when executed, implement the method of any one of the first to fourth aspects of the present disclosure.
  • In the embodiments of the present disclosure, video data is encoded based on a neural network, and the generated code stream carries syntax elements containing information representing the parameter set of the neural network model. Because of this, the syntax elements can exist in the code stream independently of existing video coding standards, enabling neural network-based video coding technology to be implemented independently; the syntax elements can also be placed within various existing coding standards, so as to be compatible with existing video coding standards. With the method of the embodiments of the present disclosure, the coupling between neural network-based intelligent video coding technology and any specific video coding standard can be reduced, and the applicable scope of neural network-based video coding technology can be expanded.
  • FIG. 1 is a schematic diagram of a conventional hybrid video coding framework according to an embodiment of the present disclosure.
  • FIG. 2 is a schematic diagram of a conventional hybrid video decoding framework according to an embodiment of the present disclosure.
  • FIG. 3 is a flowchart of a video encoding method according to an embodiment of the present disclosure.
  • FIG. 4 is a schematic diagram of a designated position of a reserved field in a code stream according to an embodiment of the present disclosure.
  • FIG. 5 is a schematic diagram of a hierarchical code stream structure of a video coding standard according to an embodiment of the present disclosure.
  • FIG. 6 is a schematic structural diagram of a code stream generated by a video encoding process according to an embodiment of the present disclosure.
  • FIG. 7 is a partial structural schematic diagram of a code stream produced by a video encoding process according to an embodiment of the present disclosure.
  • FIG. 8 is a schematic diagram of a video coding framework including a neural network-based loop filtering technology according to an embodiment of the present disclosure.
  • FIG. 9 is a schematic diagram of a video coding framework including two neural network-based technologies according to an embodiment of the present disclosure.
  • FIG. 10 is a schematic structural diagram of a syntax element located in a reserved field according to an embodiment of the present disclosure.
  • FIG. 11 is a flowchart of a video coding method based on a neural network according to an embodiment of the present disclosure.
  • FIG. 12 is a flowchart of a video decoding method according to an embodiment of the present disclosure.
  • FIG. 13 is a flowchart of a video decoding method based on a neural network according to an embodiment of the present disclosure.
  • FIG. 14 is a schematic structural diagram of an encoder according to an embodiment of the present disclosure.
  • Although the terms first, second, third, etc. may be used in this disclosure to describe various pieces of information, such information should not be limited by these terms. These terms are only used to distinguish pieces of information of the same type from each other.
  • For example, the first information may also be referred to as the second information, and similarly, the second information may also be referred to as the first information, without departing from the scope of the present disclosure.
  • The word "if" as used herein can be interpreted as "at the time of", "when", or "in response to determining".
  • Due to its large amount of data, unencoded video occupies a large amount of storage space when stored and consumes substantial communication resources when transmitted. Therefore, for unencoded video data, video encoding technology is usually used to reduce the storage space of the video data and the communication resources consumed during transmission.
  • Video coding technology uses some data processing methods to compress and encode the video data to be coded to form a code stream, which is stored or sent to the decoding end.
  • the code stream stored or received by the decoding end is reconstructed by decoding technology to obtain reconstructed video data.
  • As shown in FIG. 1, a conventional hybrid video coding framework includes predictive coding 101, transform coding 102, quantization 103, and entropy coding 104.
  • Predictive coding 101 removes redundant information from the video data to be coded by exploiting the spatial correlation of intra-frame pixels and the temporal correlation of inter-frame pixels.
  • Intra-frame prediction is used to remove redundant information of video data in the spatial domain
  • inter-frame prediction is used to remove redundant information of video data in the temporal domain.
  • The intra prediction method generates a prediction block based on the pixels of adjacent blocks around the block to be coded; the inter prediction method obtains the prediction block by searching the reference frame for the image block that best matches the current block to be coded.
  • The corresponding pixel values of the block to be encoded and its prediction block are subtracted to obtain a residual, and the residuals of all blocks to be encoded are combined to obtain the residual data of the image frame to be encoded.
  • the transform coding 102 transforms the residual data of the image frame to be coded from the spatial domain to the frequency domain, and obtains transform coefficients, so as to remove the correlation of the spatial signal and improve the coding efficiency.
  • Quantization 103 is the process of reducing the precision of data representation.
  • the quantized transform coefficients are obtained by quantization, which can reduce the amount of data to be encoded and further improve the compression efficiency.
  • Entropy coding 104 performs rate compression by exploiting the information entropy of the source. Entropy coding is applied to the quantized transform coefficients and to the prediction mode information generated during predictive coding (including intra prediction mode, motion vector information, reference frames, and other information), which removes the statistical redundancy that still exists after prediction, transform, and quantization, yielding the code stream.
  • As shown in FIG. 2, the decoding framework corresponding to the above traditional hybrid video coding framework includes entropy decoding 201, inverse quantization 202, inverse transform 203, and prediction reconstruction 204.
  • Video decoding is the inverse process of video encoding, which will not be described in detail here.
  • the present disclosure provides a video encoding method to overcome the defects of the related art.
  • As shown in FIG. 3, a flowchart of the video encoding method provided by the present disclosure is given; the method includes:
  • Step 301: Encode the video data, where the encoding processing includes encoding processing using a neural network model.
  • Step 302: Based on the encoded video data, generate a code stream carrying syntax elements, where the syntax elements include information representing a parameter set of the neural network model.
  • In some embodiments, the encoding processing of the video data can be realized using an end-to-end neural network, without relying on the traditional hybrid video encoding framework. That is, the user can select a specific deep learning framework, construct a data set from video data, and train a neural network model; after the video data to be encoded is input into the trained neural network model, a code stream meeting certain requirements can be generated. The obtained code stream can be decoded, and parameters such as the compression rate and the degree of distortion of the video data reconstructed from the code stream can meet the user's needs.
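  • As a non-normative illustration of such an end-to-end approach, the following Python/PyTorch sketch trains a toy frame autoencoder with a rate-distortion style loss. The network shape, loss weighting, and rate proxy are hypothetical placeholders chosen for brevity, not the model of this disclosure:

        import torch
        import torch.nn as nn

        class TinyVideoAutoencoder(nn.Module):
            def __init__(self):
                super().__init__()
                # Analysis transform: frame -> compact latent representation
                self.encoder = nn.Sequential(
                    nn.Conv2d(3, 64, 5, stride=2, padding=2), nn.ReLU(),
                    nn.Conv2d(64, 32, 5, stride=2, padding=2),
                )
                # Synthesis transform: latent -> reconstructed frame
                self.decoder = nn.Sequential(
                    nn.ConvTranspose2d(32, 64, 5, stride=2, padding=2, output_padding=1), nn.ReLU(),
                    nn.ConvTranspose2d(64, 3, 5, stride=2, padding=2, output_padding=1),
                )

            def forward(self, x):
                latent = self.encoder(x)
                return self.decoder(latent), latent

        model = TinyVideoAutoencoder()
        frame = torch.rand(1, 3, 256, 256)  # stand-in for one video frame
        recon, latent = model(frame)
        # Distortion plus a crude rate proxy on the latent (placeholder weighting).
        loss = nn.functional.mse_loss(recon, frame) + 1e-4 * latent.abs().mean()
        loss.backward()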
  • the traditional hybrid video coding framework may be the coding framework shown in FIG. 1 , in which some of the steps may be replaced by neural network-based techniques.
  • For example, predictive coding in the traditional hybrid video framework may be replaced with neural network-based predictive coding, transform coding with neural network-based transform coding, and quantization with neural network-based quantization; neural network-based technology can also be added to the traditional hybrid coding framework, for example, a neural network-based coding technique added after entropy coding, which is not limited in this disclosure.
  • In some embodiments, the traditional hybrid video coding framework can also be improved; in the improved hybrid video coding framework, some or all of the steps can be replaced by neural network-based coding techniques, or a neural network-based coding technique can be added to the improved hybrid coding framework, which is not limited in the present disclosure.
  • The video encoding framework used in the encoding process may also take other forms, which is not limited in the present disclosure.
  • After the video data to be encoded is encoded using an encoding technique that includes the neural network model, the encoded code stream is obtained.
  • the generated code stream carries syntax elements, and the syntax elements contain information representing a parameter set of the neural network model. Based on the information representing the parameter set of the neural network model, the encoded code stream can be decoded by using the neural network model to obtain reconstructed video data.
  • The syntax elements can exist in the code stream independently of existing video coding standards, enabling neural network-based video coding technology to be implemented independently; the syntax elements can also be placed within various existing coding standards, so as to be compatible with existing video coding standards.
  • the coupling between the neural network-based intelligent video coding technology and a specific video coding standard can be reduced, and the applicable scope of the neural network-based video coding technology can be expanded.
  • the syntax element including the parameter set information representing the neural network model is located in a specified position of the code stream, and the specified position is a reserved field of the code stream.
  • As shown in FIG. 4, a schematic diagram of a code stream 401 with a reserved field after encoding processing is given; the area of the code stream 401 other than the reserved field may be entirely video data, or may be video data plus other information fields, which is not limited in the present disclosure.
  • The code stream 401 may also have two or more reserved fields, in which the syntax elements containing the parameter set information of the neural network model are carried separately; this is not limited in the present disclosure.
  • the position of the reserved field in the code stream may be determined based on a predefined independent video coding standard.
  • The predefined independent video coding standard can be used on its own, independently of other video coding standards such as H.264, H.265, VVC, AVS, AVS+, AVS2, AVS3, or AV1 formulated by organizations such as ITU-T and ISO/IEC.
  • the parameter set of the neural network model can be directly placed in the reserved field of the generated code stream.
  • In this way, the specific position of the reserved field in the encoded code stream can be determined based on the predefinition, the parameter set of the neural network model can be extracted from the reserved field, and the encoded code stream can then be decoded.
  • the pre-defined independent video coding standards can also be used by other video coding standards.
  • the pre-defined independent video coding standard defines the field "010100011110" as the start position of the reserved field, and the field "101011100001" as the end position of the reserved field.
  • other video coding standards can also refer to this definition to determine the specific position of the reserved field in the encoded code stream.
  • The independent video coding standard may define the specific position of the reserved field in the code stream in other forms, which can then be referenced by other video coding standards; the present disclosure does not limit this.
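  • A minimal sketch of how a decoder-side tool might locate such a reserved field between the predefined start and end markers quoted above; for readability the code stream is modeled as a string of bits, whereas a real implementation would scan bytes:

        START = "010100011110"  # predefined start marker quoted above
        END = "101011100001"    # predefined end marker quoted above

        def extract_reserved_field(bitstream: str) -> str:
            begin = bitstream.index(START) + len(START)
            end = bitstream.index(END, begin)
            return bitstream[begin:end]  # syntax elements carried in the field

        stream = "1101" + START + "0011001110" + END + "0101"
        assert extract_reserved_field(stream) == "0011001110"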
  • the code stream is generated based on a specified coding standard, and the reserved field is a specific field of the specified coding standard.
  • For example, the specified coding standard may include video coding standards such as H.264, H.265, VVC, AVS, AVS+, AVS2, AVS3, or AV1.
  • the specified coding standard may have a certain code stream structure. As shown in FIG. 5, a specific code stream structure is provided.
  • The code stream structure is hierarchical, including a GOP layer, a picture layer, a slice layer, a macroblock layer, and a block layer. Each layer of data further includes header information and video data information. Therefore, the specific field may be the header information of the specified coding standard.
  • the specified coding standard may also have other code stream structures, and the specific field may be a field predefined by the specified coding standard.
  • For example, the specific field may be a video parameter set (VPS), and/or a sequence parameter set (SPS), and/or a neural network parameter set (NNPS), and/or a picture parameter set (PPS), and/or a slice header, and/or supplemental enhancement information (SEI), and/or extension data of a syntax element parameter set, and/or user data, and/or open bitstream units (OBU), and/or a sequence header, GOP header, picture header, slice header, macroblock information, etc.
  • VPS: Video Parameter Set
  • SPS: Sequence Parameter Set
  • NNPS: Neural Network Parameter Set
  • PPS: Picture Parameter Set
  • SEI: Supplemental Enhancement Information
  • OBU: Open Bitstream Units
  • the given specified coding standard is only an exemplary illustration, not exhaustive, and the specified coding standard may also be other coding standards, which is not limited in the present disclosure.
  • the specific fields are also only illustrative and not exhaustive, and may also be other specific fields of a specified coding standard, which are not limited in the present disclosure.
  • In some embodiments, the specific position of the reserved field in the encoded code stream may also be determined in the following manner: before the video data is encoded to generate the code stream carrying the syntax elements, the encoding end may negotiate, through wired or wireless communication, with the peer device that is to acquire the code stream, to determine the specific location of the reserved field.
  • the peer device may be a memory, a decoder, etc., which is not limited in the present disclosure.
  • In this way, neural network-based video coding technology can be implemented independently; the syntax elements can also be located in a specific field of various existing coding standards, or placed by reference to a predefined independent video coding standard, so as to be compatible with existing video coding standards.
  • the coupling between the neural network-based intelligent video coding technology and a specific video coding standard can be reduced, and the applicable scope of the neural network-based video coding technology can be expanded.
  • the reserved field is located in a packet header of a data packet of the code stream.
  • the reserved field may be located in the packet header of the data packet of the code stream, and the code stream structure is shown in FIG. 6 .
  • For example, a specific character or character group can be used as the end mark of the syntax element; alternatively, a specified byte length can be set, with the syntax element placed within that byte length and padded with zeros when it does not fill the length.
  • In other embodiments, the reserved field may also be located in the middle or at the end of the data packet of the code stream.
  • When the reserved field is located in the middle or at the end of the data packet, its start and end can be marked by a specific character or character group; alternatively, a byte offset can be stored directly in the packet header, indicating that the reserved field storing the syntax elements containing the parameter set information of the neural network model begins at a certain byte.
  • other manners are also possible, which are not limited in the present disclosure.
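  • The fixed-length variant described above can be sketched as follows; the 64-byte field length is an illustrative assumption, and a real design would likely carry an explicit length instead of relying on zero-stripping (which would corrupt payloads that legitimately end in zero bytes):

        FIELD_LEN = 64  # preset reserved-field length in bytes (hypothetical)

        def pack_header_field(syntax_element: bytes) -> bytes:
            if len(syntax_element) > FIELD_LEN:
                raise ValueError("syntax element exceeds reserved field")
            return syntax_element.ljust(FIELD_LEN, b"\x00")  # pad with zeros

        def unpack_header_field(field: bytes) -> bytes:
            return field.rstrip(b"\x00")  # strip the zero padding again

        packed = pack_header_field(b"\x01\x02\x03")
        assert len(packed) == FIELD_LEN
        assert unpack_header_field(packed) == b"\x01\x02\x03"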
  • the reserved field is located in the packet header of the data packet of the code stream, and the packet header of the data packet may refer to the header information syntax parameter set of the specified coding standard.
  • Depending on the specified coding standard, the reserved field is generally located in the extension data of the VPS, SPS, NNPS, PPS, or slice header, and can also be located in the SEI; the reserved field can likewise be located in the extension and user data syntax; for the AV1 standard, the reserved field may be located in the OBU or in its extension data.
  • The reserved field may also be located at other header positions of the specified coding standard, which is not limited in the present disclosure.
  • Since the syntax elements carried in the code stream can exist in a reserved field in the packet header of the code stream, independently of existing video coding standards, neural network-based video coding technology can be implemented independently.
  • the syntax element may also be located in the header information syntax parameter sets of multiple existing coding standards, so as to be compatible with the existing video coding standards.
  • the neural network model that can be used for encoding processing is usually determined by a parameter set consisting of multiple parameters.
  • the parameter set of the neural network model includes at least one or more of parameters such as input parameters, number of layers, weight parameters, hyperparameters, the number of nodes in each layer, and activation functions.
  • Different neural network models have parameter sets with different contents.
  • The present disclosure does not specifically limit which parameters, or how many, the parameter set of the neural network model includes.
  • Those skilled in the art should understand that the above parameters, such as input parameters, number of layers, weight parameters, hyperparameters, number of nodes in each layer, and activation functions, are only illustrative, not exhaustive; the parameter set of the neural network model may also include other parameters that determine the model, which is not limited by the present disclosure.
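  • As an illustrative, hypothetical container mirroring the parameters listed above (field names are assumptions, not a normative format):

        from dataclasses import dataclass, field
        from typing import Dict, List

        @dataclass
        class NeuralNetworkParameterSet:
            input_shape: List[int]        # input parameters
            num_layers: int               # number of layers
            nodes_per_layer: List[int]    # number of nodes in each layer
            activation: str               # activation function, e.g. "relu"
            weights: List[float]          # weight parameters (flattened)
            hyperparameters: Dict[str, float] = field(default_factory=dict)

        params = NeuralNetworkParameterSet(
            input_shape=[1, 3, 64, 64], num_layers=2,
            nodes_per_layer=[128, 64], activation="relu",
            weights=[], hyperparameters={"learning_rate": 1e-3},
        )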
  • NNEF: Neural Network Exchange Format
  • ONNX: Open Neural Network Exchange
  • the video encoding method includes converting the neural network model into a common format.
  • For example, the common format may be NNEF or ONNX.
  • The common format may also be another general format, so that the neural network model can be used across different deep learning frameworks, which is not limited in the present disclosure.
  • Accordingly, the information representing the parameter set of the neural network model contained in the syntax elements may be the information obtained after the parameter set is converted into the common format; that is, from this information, the parameter set of the neural network model used in the encoding process can be recovered under different deep learning frameworks by converting it back.
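  • For example, with PyTorch the conversion to the ONNX common format can be done with the standard torch.onnx.export API; the model and input shape below are placeholders:

        import torch

        # Placeholder model standing in for the trained coding tool.
        model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3, padding=1))
        dummy_input = torch.rand(1, 3, 64, 64)
        # After export, the parameter set is framework-neutral and can be
        # reloaded by any runtime that understands ONNX.
        torch.onnx.export(model, dummy_input, "nn_filter.onnx")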
  • the syntax element carried in the code stream generated by the encoding process further includes a format conversion enable flag, which is used to indicate whether to convert the neural network model into a common format during the encoding process.
  • “1" may be used to indicate that format conversion is used, and "0” may be used to indicate that format conversion is not used.
  • other representations may also be used to indicate whether to convert the neural network model into a common format during the encoding process.
  • NNR: Neural Network Representation
  • AITISA: Artificial Intelligence Industry Technology Innovation Strategic Alliance
  • the information representing the parameter set of the neural network model may be compressed.
  • the compression is a compression process performed by a compression technique in an NNR-based compression framework or AITISA's compression framework.
  • the compression may also be implemented by other compression techniques for the neural network model, which is not limited in the present disclosure.
  • the information representing the parameter set of the neural network model is information corresponding to compressing the parameter set of the neural network model, so as to save storage resources and reduce bandwidth resources occupied by video data transmission.
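  • The NNR and AITISA compression frameworks themselves are not sketched here; as a stand-in, the following uses zlib purely to show the shape of the operation, i.e., the serialized parameter set is compressed before being written into the syntax element:

        import zlib

        serialized = b"\x00\x01" * 512  # stand-in for a serialized parameter set
        compressed = zlib.compress(serialized, level=9)
        payload = compressed            # this is what the syntax element carries
        assert zlib.decompress(payload) == serialized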
  • the syntax element further includes a compression enable flag, which is used to indicate whether to compress the parameter set of the neural network model during the encoding process.
  • the compression enable flag "1" may be used to indicate that the compression technology is used, and "0" may be used to indicate that the compression technology is not used.
  • other representations may also be used to indicate whether the parameter set of the neural network model is compressed during the encoding process.
  • In some embodiments, the neural network model can first be converted into a common format, and the parameter set of the converted model can then be compressed, so that the neural network model is interoperable and usable across different deep learning frameworks while saving storage resources and reducing the bandwidth occupied by video data transmission.
  • the information representing the parameter set of the neural network model is information corresponding to converting the neural network model into a general format and compressing the parameter set of the converted neural network model.
  • the syntax element further includes a format conversion enable flag and a compression enable flag.
  • the syntax element includes three sub-syntax elements, which are the format conversion enable flag, the compression enable flag, and the information representing the parameter set of the neural network model.
  • The format conversion enable flag indicates that the neural network model used in generating the code stream was converted into the common format; the compression enable flag indicates that the parameter set of the converted model was compressed; and the information representing the parameter set of the neural network model is that parameter set after conversion and compression.
  • The three sub-syntax elements may be located in the same reserved field or in multiple reserved fields, for example, in three reserved fields respectively.
  • the specific location of the reserved field may be located in the code stream location defined by a pre-defined independent coding standard, as described above, or located in a specific field of other standards, etc., which will not be repeated here.
  • The correspondence between the reserved fields and the sub-syntax elements can be determined based on a predefined independent coding standard, confirmed by additional definitions in other standards, or determined in other standards by reference to a predefined independent coding standard, which is not limited in this disclosure.
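  • One possible, assumed and non-normative, byte layout for the three sub-syntax elements, with 1-byte flags and a 4-byte length prefix:

        import struct

        def build_syntax_element(payload: bytes, converted: bool, compressed: bool) -> bytes:
            # 1-byte format-conversion flag, 1-byte compression flag,
            # 4-byte big-endian payload length, then the parameter-set payload.
            return struct.pack(">BBI", converted, compressed, len(payload)) + payload

        def parse_syntax_element(blob: bytes):
            converted, compressed, length = struct.unpack(">BBI", blob[:6])
            return bool(converted), bool(compressed), blob[6:6 + length]

        elem = build_syntax_element(b"params...", converted=True, compressed=True)
        assert parse_syntax_element(elem) == (True, True, b"params...")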
  • the syntax element further includes an enabling flag for encoding processing using the neural network model, for determining whether the encoding processing uses the neural network model.
  • As shown in FIG. 8, a video coding framework including a neural network-based in-loop filtering technique is presented.
  • The process is as follows: the image frames of the video data to be processed are partitioned into blocks, and predictive coding 101 (including intra-frame prediction and inter-frame prediction) is performed; the predicted data then undergoes transform coding 102 and quantization 103 to obtain quantized transform coefficients; meanwhile, inverse quantization 202 and inverse transform 203 are applied to the quantized data, prediction reconstruction is performed based on the data obtained from predictive coding, and in-loop filtering 801 is applied to improve the image quality of the reconstructed video data; finally, entropy coding 104 is performed based on the quantized transform coefficients, the prediction mode information, and related information of the prediction reconstruction process to obtain the encoded code stream.
  • The in-loop filter 801 usually includes traditional in-loop filtering techniques such as the deblocking filter (DF), sample adaptive offset (SAO), and adaptive loop filter (ALF).
  • DF: deblocking filter
  • SAO: sample adaptive offset
  • ALF: adaptive loop filter
  • NNF: neural network filter
  • The position of the NNF within the in-loop filter 801 in FIG. 8 is only an exemplary illustration; the NNF can also be located at other positions in the in-loop filter 801, for example, before the DF, after the ALF, or between the SAO and the ALF of the in-loop filter 801, and the in-loop filter 801 may also include multiple NNFs, which is not limited in the present disclosure.
  • the NNF technology can be used by default.
  • the syntax elements carried in the code stream obtained through the encoding process may also include an enable flag for encoding processing using the neural network model, which is used to determine whether the encoding processing uses the neural network model.
  • That is, the syntax elements carried by the code stream include not only the information representing the parameter set of the neural network model, but also an enable flag for encoding processing using the neural network model; e.g., "1" indicates that the neural network model is used in the encoding process, and "0" indicates that it is not.
  • Other representations are also possible to indicate whether neural network based techniques are used in the encoding process.
  • In this way, the peer device receiving the generated code stream can quickly determine from the enable flag whether the encoding process used the neural network model, thereby accelerating decoding of the generated code stream.
  • Alternatively, the syntax elements carried by the code stream may omit the enable flag for encoding processing using the neural network model, in which case the peer device that obtains the encoded code stream assumes by default that the code stream was generated using a neural network-based technique.
  • In some embodiments, the improved traditional video coding framework may include at least one of the steps of predictive coding, transform coding, quantization, entropy coding, prediction reconstruction, and in-loop filtering, and may also include other encoding processing steps. Therefore, the neural network model used in the encoding process may be located within a specific step, or between two specific steps, and so on.
  • the syntax element carried by the code stream further includes processing timing information of the neural network model in the encoding process, where the processing timing information is used to indicate that the neural network model is in the encoding process. specific location.
  • For example, suppose the improved traditional video coding framework includes four steps A, B, C, and D, and the video data to be coded passes through steps A, B, C, and D in sequence to obtain the code stream.
  • the syntax elements carried by the code stream may also include processing timing information of the neural network model in the encoding process.
  • If, for example, a neural network-based in-loop filtering technique is inserted between steps A and B, the syntax elements carried by the code stream may include processing timing information used to indicate that the neural network-based in-loop filtering technique sits between steps A and B of the encoding process.
  • The processing timing information can take the form of a specific character; for example, "0001" indicates that the neural network-based technique is located before step A of the encoding process; "0010" indicates that it is located within step A; "0011" indicates that it is located between steps A and B; and so on.
  • The processing timing information may also be represented in other forms, which is not limited in the present disclosure.
  • In some embodiments, the processing timing information includes at least one of the following: information indicating use in predictive coding; in transform coding; in quantization; in entropy coding; before predictive coding; between predictive coding and transform coding; between transform coding and quantization; between quantization and entropy coding; or after entropy coding; and so on.
  • The specific form of the processing timing information can be a specific character, for example: "000001" indicates that the neural network-based technique is used in predictive coding, "000010" indicates that it is used in transform coding, and "001000" indicates that it is used between predictive coding and transform coding, and so on.
  • The specific form of the processing timing information may also be other representations, which are not limited in the present disclosure.
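  • A decoder-side sketch mapping the example timing codes quoted above to pipeline positions; the table simply mirrors the codes given in the text:

        TIMING_CODES = {
            "000001": "in predictive coding",
            "000010": "in transform coding",
            "001000": "between predictive coding and transform coding",
        }

        def describe_timing(code: str) -> str:
            return TIMING_CODES.get(code, "unknown position")

        assert describe_timing("000010") == "in transform coding"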
  • The above processing timing information is only exemplary, not exhaustive; the processing timing information may also include other content to indicate the specific position of the neural network model within an encoding process that is based on the traditional video coding framework and uses the neural network model.
  • In this way, the peer device that receives the generated code stream can determine, from the timing information, at which specific position the neural network-based technique is used in the encoding process, and can then correctly decode the generated code stream based on this information.
  • For example, the neural network-based in-loop filtering technique can be added to the in-loop filtering 801 shown in FIG. 8 to improve the filtering effect on the reconstructed data; accordingly, the processing timing information is information indicating a position after prediction reconstruction.
  • For example, a context probability estimation method based on neural network technology can replace the traditional rule-based context probability prediction model to improve the efficiency of entropy coding; accordingly, the processing timing information is information indicating use in entropy coding.
  • For example, when a neural network-based technique is used for inter prediction, the processing timing information is information indicating the inter prediction stage of predictive coding.
  • For example, a neural network-based intra-frame prediction technique can be compared against the traditional intra-frame prediction technique to decide the optimal intra prediction method; accordingly, the processing timing information is information indicating the intra prediction stage of predictive coding.
  • a neural network-based super-resolution reconstruction technique can be added to obtain reconstructed video data with higher quality; the processing timing information is information indicating after entropy coding.
  • the encoding processing of the video data to be encoded may also include other neural network-based technologies.
  • the processing timing information may also be processing timing information of other contents, which is not limited in the present disclosure.
  • In the encoding process, multiple neural network models may be included; they may be the specific neural network models described above, or other neural network models.
  • When the encoding process uses multiple neural network models to encode the video data to be encoded, the enable flags included in the syntax elements for encoding processing using the neural network models may comprise an overall enable flag together with enable flags in one-to-one correspondence with the individual neural network models, which is not limited in the present disclosure.
  • the syntax elements included in the code stream may further include identification information of the neural network models. Through the identification information, a specific neural network model can be uniquely determined.
  • the identification information may be an index value.
  • For example, the index value "00000001" indicates the neural network model used by the neural network-based in-loop filtering technique; the index value "00000010" indicates the model used by the neural network-based intra-frame prediction technique; and the index value "00000011" indicates the model used by the neural network-based image super-resolution technique.
  • The identification information can also take other forms, as long as it uniquely determines the neural network model corresponding to a specific neural network-based technique, so that the peer device that obtains the code stream can uniquely determine, based on the identification information, which specific neural network model was used during encoding.
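  • A sketch mirroring the example index values above, letting a peer device resolve which neural network model a syntax element refers to:

        MODEL_INDEX = {
            "00000001": "neural-network-based in-loop filtering model",
            "00000010": "neural-network-based intra-frame prediction model",
            "00000011": "neural-network-based image super-resolution model",
        }
        assert MODEL_INDEX["00000011"] == "neural-network-based image super-resolution model"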
  • In some embodiments, the syntax elements carried by the code stream may also include deep learning framework information of the neural network model used for encoding, indicating the framework used by the neural network model.
  • For example, TensorFlow is an open-source software library that covers almost all machine learning development needs, but its APIs are relatively low-level and its learning cost is high; PyTorch is a deep learning framework that makes experimentation fast and easy, but it is highly encapsulated and less flexible; and so on. Therefore, the deep learning framework used for the neural network in the encoding process may be a specific framework chosen according to factors such as the application scenario and processor resources.
  • The peer device that obtains the code stream may, for various reasons, not use the same deep learning framework as the encoding process, or may have several optional frameworks available. Therefore, the syntax elements carried by the generated code stream may also include the deep learning framework information of the neural network model used in encoding, so that the peer device can confirm, according to the framework information, whether it is able to decode the stream. Further, if decoding cannot be performed because the corresponding framework is absent, decoding failure information may be generated, including the reason for the failure.
  • The deep learning framework information may indicate a specific deep learning framework as characters, or may take other representation forms, which are not limited in the present disclosure.
  • the encoding process further includes: determining a neural network framework and a video encoder; and performing training based on the neural network and the video encoder to obtain the neural network model.
  • the neural network model for encoding processing is a trained neural network model.
  • the deep learning framework and video encoder used by the neural network model can be preselected, and sample data can be constructed. Using the constructed sample data, training is performed based on the selected deep learning framework and video encoder, and when preset conditions are met, the trained neural network model is obtained for encoding the video data to be encoded.
  • For example, the deep learning framework may be TensorFlow, PyTorch, Caffe2, Microsoft Cognitive Toolkit, Apache MXNet, etc., or an AI hardware accelerator. Of course, other types of deep learning frameworks are also possible.
  • the video encoder may be a video coding standard reference software platform such as VTM, HM, and JM.
  • The preset condition may be, for example, minimization of the loss function or convergence.
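  • A minimal sketch of this training procedure using PyTorch as the chosen framework; the dataset, model, and convergence threshold are hypothetical stand-ins (in practice a reference encoder such as VTM would supply the reconstructed/original block pairs):

        import torch

        model = torch.nn.Sequential(torch.nn.Conv2d(3, 3, 3, padding=1))
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
        prev_loss = float("inf")
        for epoch in range(100):
            x = torch.rand(4, 3, 64, 64)   # stand-in for reconstructed blocks
            target = x                     # stand-in for the original blocks
            loss = torch.nn.functional.mse_loss(model(x), target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if abs(prev_loss - loss.item()) < 1e-6:  # preset convergence condition
                break
            prev_loss = loss.item()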
  • the video coding method described in the present disclosure will be described with reference to FIG. 9 by taking the coding process based on the traditional hybrid video coding framework including two neural network-based technologies as an example.
  • the two neural network-based technologies are respectively the neural network-based in-loop filtering technology in the encoding process shown in FIG. 9 and the neural network-based super-resolution reconstruction technology after data reconstruction.
  • the encoding process includes predictive encoding 101 , transform encoding 102 , quantization 103 , and predictive reconstruction shown in FIG. 8 .
  • the neural network-based in-loop filtering technique replaces the traditional in-loop filtering technique in the hybrid video coding technique.
  • the neural network-based super-resolution reconstruction technique is used to implement up-sampling to obtain reconstructed video data with higher image quality.
  • The sampled video data is encoded through the encoding process; the neural network-based in-loop filtering technique filters the reconstructed data, and the neural network-based super-resolution reconstruction technique then up-samples the filtered data to obtain the reconstructed video data.
  • the generated code stream carries syntax elements.
  • As shown in FIG. 10, the syntax elements may include multiple sub-syntax elements: the identification information of the neural network model, the enable flag for encoding processing using the neural network model, the format conversion enable flag, the compression enable flag, the processing timing information, and the information obtained after format conversion and compression of the parameter set of the neural network.
  • two sets of information can be included, which respectively indicate the relevant information of the neural network-based in-loop filtering technology and the neural network-based super-resolution reconstruction technology according to the order of the information.
  • For example, the first identification information is "00000001", indicating that the corresponding neural network model is the one used by the neural network-based in-loop filtering technique; the second identification information is "00000011", indicating that the corresponding model is the one used by the neural network-based image super-resolution technique.
  • The correspondence between the information in the other sub-syntax elements and the neural network model it indicates follows the same ordering as the identification information of the neural network model, and will not be repeated here.
  • the multiple sub-syntax elements may be located in the same reserved field of the codestream, or may be located in multiple reserved fields of the codestream. The reserved fields are as described above, and are not repeated here.
  • The syntax elements shown in FIG. 10 can also be organized as syntax parameter units, where each syntax parameter unit stores the following information of one neural network model: the identification information of the model, the enable flag for encoding processing using the model, the format conversion enable flag, the compression enable flag, the processing timing information, and the format-converted and compressed information of the model's parameter set.
  • In this example, two such syntax parameter units are included in the code stream.
  • the two syntax parameter units may be located in the same reserved field in sequence, or may be located in different reserved fields. The reserved fields are as described above, and are not repeated here.
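  • A sketch of one assumed layout for such a syntax parameter unit, with the two units of this example placed back to back in one reserved field; all field widths and the timing codes are illustrative assumptions:

        import struct

        def build_unit(model_id: int, enabled: bool, converted: bool,
                       compressed: bool, timing: int, payload: bytes) -> bytes:
            # model ID, enable flag, format-conversion flag, compression flag,
            # timing code, 4-byte payload length, then the payload itself.
            return struct.pack(">BBBBBI", model_id, enabled, converted,
                               compressed, timing, len(payload)) + payload

        reserved_field = (
            build_unit(0x01, True, True, True, 0b1000, b"<loop-filter params>") +
            build_unit(0x03, True, True, True, 0b1100, b"<super-res params>")
        )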
  • In the embodiments of the present disclosure, the syntax elements carried in the code stream can exist in the code stream independently of existing video coding standards, enabling neural network-based video coding technology to be implemented independently; the syntax elements can also be located within various existing coding standards and thus be compatible with them.
  • the coupling between the neural network-based intelligent video coding technology and a specific video coding standard can be reduced, and the applicable scope of the neural network-based video coding technology can be expanded.
  • In some embodiments, in addition to the information representing the parameter set of the neural network model, the syntax elements may also include: an enable flag for encoding processing using the neural network model, indicating whether the encoding uses a neural network-based technique, so that the peer device that obtains the code stream can decode it quickly; a format conversion enable flag, indicating whether the neural network model underwent format conversion during encoding, to enable common use across different deep learning frameworks; a compression enable flag, indicating whether the parameters of the neural network model were compressed during encoding, to save storage space and communication resources; processing timing information, indicating the specific position of the neural network-based technique in the encoding process; and identification information of the neural network model, to uniquely indicate a single model when multiple neural network models are used in the encoding process; and so on.
  • The above syntax elements are merely illustrative and not exhaustive. Based on the syntax elements, the relevant information of the neural network model used in the encoding process can be characterized, so that the peer device that receives the code stream can decode it accurately and quickly, obtaining a high-quality decoding result.
  • The present disclosure also provides a neural network-based video coding method; as shown in FIG. 11, the method includes:
  • Step 1101: Encode the video data using a neural network model.
  • Step 1102: Send the parameter set of the neural network model to the video encoder, so that the video encoder generates a code stream carrying syntax elements based on the encoded video data, the syntax elements including information representing the parameter set of the neural network model.
  • the neural network model used for encoding the video data may be determined by a parameter set composed of multiple parameters.
  • the parameter set of the neural network model includes one or more parameters such as input parameters, number of layers, weight parameters, hyperparameters, the number of nodes in each layer, and activation functions.
  • Different neural network models have parameter sets with different contents.
  • The present disclosure does not specifically limit which parameters, or how many, the parameter set of the neural network model includes.
  • Those skilled in the art should understand that the above parameters, such as input parameters, number of layers, weight parameters, hyperparameters, number of nodes in each layer, and activation functions, are only illustrative, not exhaustive; the parameter set of the neural network model may also include other parameters that determine the model, which is not limited by the present disclosure.
  • In some embodiments, the encoding processing performed on the video data using the neural network model in step 1101 may be implemented using an end-to-end neural network, without relying on a traditional hybrid video encoding framework. That is, the user can select a specific deep learning framework, construct a data set from video data, and train a neural network model; after the video data to be encoded is input into the trained neural network model, a code stream meeting certain requirements can be generated. The obtained code stream can be decoded, and parameters such as the compression rate and the degree of distortion of the video data reconstructed from the code stream can meet the user's needs.
  • the encoding process performed on the video data using the neural network model in step 1101 may be implemented in combination with the traditional hybrid encoding framework as shown in FIG. 1 .
  • the encoding processing of the video data using the neural network model includes at least one of the following steps:
  • a neural network-based intra-frame prediction technique is performed
  • a neural network-based image super-resolution technique is performed for inter-frame motion estimation
  • a neural network based contextual probability estimation technique is performed.
  • The specific neural network models that can be used in the above encoding processing are merely illustrative, not exhaustive.
  • the encoding processing of the video data to be encoded may also include other neural network-based technologies.
  • the specific position of the neural network model in the traditional coding framework can also be adaptively determined according to the actual situation, which is not limited in the present disclosure.
  • the encoding processing of the video data using the neural network model in step 1101 can also be implemented in combination with other improved hybrid encoding frameworks; correspondingly, the specific neural network model used and its specific position in the hybrid encoding framework can be determined adaptively according to the actual situation, which is not limited in the present disclosure.
  • the related content of the syntax element carried in the code stream described in this method may be the same as the related content of the syntax element in the aforementioned video coding method.
  • the syntax element carried by the generated code stream may be located at a specified position of the code stream, and the specified position is a reserved field of the code stream.
  • the reserved field where the syntax element is located may be in the packet header of a data packet of the code stream.
  • the information representing the parameter set of the neural network model included in the syntax element is the corresponding information after the parameter set of the neural network model is converted into a general format.
  • the syntax element further includes a format conversion enable flag for instructing to convert the neural network model to the general format.
  • the information representing the parameter set of the neural network model included in the syntax element is information corresponding to the compressed parameter set of the neural network model.
  • the syntax element further includes a compression enable flag for instructing to compress the parameter set of the neural network model.
  • the information representing the parameter set of the neural network model contained in the syntax element is corresponding information after converting the parameter set of the neural network model into a common format and compressing it.
  • the syntax element further includes an enable flag for encoding processing using a neural network model, which is used to determine whether the encoding processing uses the neural network.
  • the syntax element further includes processing timing information of the neural network model in the encoding process, where the processing timing information is used to indicate a specific position of the neural network model in the encoding process.
  • the processing timing information contained in the syntax element includes at least any of the following: information indicating use in predictive encoding, in transform encoding, in quantization, or in entropy encoding; and information indicating use before predictive encoding, between predictive encoding and transform encoding, between transform encoding and quantization, between quantization and entropy encoding, or after entropy encoding.
  • the syntax element further includes identification information of the neural network model.
  • the syntax element further includes frame information of the neural network model, the frame information being used to indicate a frame used by the neural network model.
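  • to make the above concrete, the following sketch packs such syntax elements into a reserved field; the byte layout (a flag byte, a model identifier, a timing code, and a length-prefixed parameter-set payload) is a hypothetical layout for illustration only, not a format defined by the disclosure:

```python
# A hedged sketch of serializing syntax elements into a reserved field.
import struct

def pack_reserved_field(nn_enabled: bool, fmt_converted: bool, compressed: bool,
                        model_id: int, timing_code: int, payload: bytes) -> bytes:
    # Bit 0: NN encoding enable; bit 1: format conversion enable; bit 2: compression enable.
    flags = (nn_enabled << 0) | (fmt_converted << 1) | (compressed << 2)
    header = struct.pack(">BBBI", flags, model_id, timing_code, len(payload))
    return header + payload

field_bytes = pack_reserved_field(
    nn_enabled=True, fmt_converted=True, compressed=True,
    model_id=0x01,        # e.g. an in-loop filter model (hypothetical id)
    timing_code=0b0011,   # hypothetical code for "after prediction reconstruction"
    payload=b"...converted and compressed parameter set...",
)
print(len(field_bytes))
```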
  • the code stream is generated based on a specified coding standard, and the reserved field where the syntax element is located may be in the video parameter set, and/or sequence parameter set, and/or neural network model parameter set, and/or picture parameter set, and/or slice header, and/or supplemental enhancement information, and/or extension data of a syntax element parameter set, and/or user data, and/or open bitstream unit, and/or sequence header, and/or picture header, and/or group-of-pictures header, and/or macroblock information of the code stream.
  • for specific coding standards, including video coding standards such as H.264, H.265, VVC, AVS, AVS+, AVS2, AVS3, or AV1, the reserved field can be a specific field of these standards: the Video Parameter Set (VPS), and/or Sequence Parameter Set (SPS), and/or Neural Network Parameter Set (NNPS), and/or Picture Parameter Set (PPS), and/or Slice Header, and/or Supplemental Enhancement Information (SEI), and/or extension data of a syntax element parameter set, and/or user data, and/or Open Bitstream Units (OBU), sequence header, group-of-pictures header, picture header, slice header, macroblock information, and so on.
  • the specified coding standards given here are only exemplary, not exhaustive; the specified coding standard may also be another coding standard, which is not limited in the present disclosure.
  • the specific fields are likewise only illustrative, not exhaustive; they may also be other specific fields of a specified coding standard, which is not limited in the present disclosure.
  • the method for encoding video data using a neural network model further comprises: converting the neural network model into a common format.
  • by converting the neural network model into a common format, the neural network framework can be mapped onto an inference engine.
  • these common formats provide interfaces for converting neural network models generated by commonly used deep learning frameworks, so as to realize the interoperability and common use of neural network models across different deep learning frameworks.
  • the common format includes NNEF (Neural Network Exchange Format) or ONNX (Open Neural Network Exchange).
  • the common format can also be another general format that allows the neural network model to be used across different deep learning frameworks, which is not limited in the present disclosure.
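  • as a hedged example of such a conversion, the sketch below exports a stand-in PyTorch model to ONNX using the exporter bundled with PyTorch; the network and the file name are placeholders:

```python
# A minimal sketch of converting a trained model to the ONNX common format.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(32, 3, 3, padding=1))   # stand-in for a codec-side network
dummy = torch.rand(1, 3, 64, 64)                        # example input defining the graph shape

torch.onnx.export(model, dummy, "nn_filter.onnx",
                  input_names=["recon"], output_names=["filtered"])
```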
  • the neural network-based video coding method further includes: compressing the information characterizing the parameter set of the neural network.
  • the compression is a compression process performed by a compression technique in an NNR-based compression framework or AITISA's compression framework.
  • a compression technique in an NNR-based compression framework or AITISA's compression framework.
  • the compression may also be implemented by other compression techniques for the neural network model, which is not limited in the present disclosure.
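  • a minimal sketch of this compression step follows; zlib's DEFLATE is used here only as a simple stand-in for a dedicated tool from an NNR-style framework, and the serialized content is a placeholder:

```python
# A sketch of compressing serialized parameter-set information.
import pickle
import zlib

raw = pickle.dumps({"nodes_per_layer": [32, 8], "weights": [b"\x00" * 1024]})
compressed = zlib.compress(raw, level=9)   # DEFLATE as a stand-in compressor
print(f"{len(raw)} bytes -> {len(compressed)} bytes")
```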
  • the neural network model is determined by: determining a neural network framework and a video encoder; and performing training based on the neural network framework and the video encoder to obtain the neural network model.
  • the neural network model for encoding processing is a trained neural network model.
  • the deep learning framework and video encoder used by the neural network model can be preselected, and sample data can be constructed. Using the constructed sample data, training is performed based on the selected deep learning framework and video encoder, and when preset conditions are met, the trained neural network model is obtained for encoding the video data to be encoded.
  • the deep learning framework of the neural network model may be TensorFlow, Pytorch, Caffe2, Microsoft Cognitive Toolkit, Apache MXNet, etc., and may also be an AI hardware accelerator.
  • the video encoder may be a video coding standard reference software platform such as VTM, HM, and JM.
  • the preset condition may be that the loss function is minimized or has converged, and so on.
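  • the following is a hedged sketch of such a training procedure, assuming PyTorch; the synthetic data, toy network, and convergence test are placeholders, and in practice the distortion target would come from the selected reference encoder (e.g. VTM/HM/JM) rather than being synthesized:

```python
# A sketch of training a codec-side network until the loss converges.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(32, 3, 3, padding=1))
opt = torch.optim.Adam(net.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

prev_loss = float("inf")
for step in range(1000):
    degraded = torch.rand(8, 3, 64, 64)      # stand-in for encoder reconstructions
    target = degraded.clamp(0.1, 0.9)        # stand-in for the original frames
    loss = loss_fn(net(degraded), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if abs(prev_loss - loss.item()) < 1e-6:  # toy convergence criterion
        break
    prev_loss = loss.item()
```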
  • since the code stream carries syntax elements containing information characterizing the parameter set of the neural network model, the syntax elements can exist in the code stream independently of existing video coding standards, so that neural-network-based video coding can be implemented independently; the syntax elements can also be located within various existing coding standards, and are thus compatible with existing video coding standards.
  • the coupling between the neural network-based intelligent video coding technology and a specific video coding standard can be reduced, and the applicable scope of the neural network-based video coding technology can be expanded.
  • the present disclosure also provides a video decoding method, which includes:
  • Step 1201: parsing the received video code stream to obtain the syntax elements of the video code stream, where the syntax elements contain information characterizing the parameter set of a neural network model;
  • Step 1202: according to the syntax elements, decoding the video code stream using the neural network model corresponding to the information characterizing the parameter set.
  • the decoding process for the video data can be implemented by using an end-to-end neural network without relying on the traditional hybrid video decoding framework.
  • the decoding process performed on the video data may also be implemented based on a traditional hybrid video decoding framework.
  • the conventional hybrid video decoding framework may be the decoding framework shown in FIG. 2 .
  • the related content of the syntax elements is the same as the related content of the syntax elements in the video encoding method described above.
  • the syntax element may be located in a specified position of the code stream, and the specified position is a reserved field of the code stream.
  • the reserved field in which the syntax element is located may be located in the packet header of the data packet of the code stream.
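  • a decoder-side sketch of parsing such a reserved field is shown below; it mirrors the hypothetical byte layout assumed in the earlier packing example and is likewise not a defined bitstream format:

```python
# A sketch of recovering syntax elements from a reserved field
# (flag byte, model id, timing code, length-prefixed payload).
import struct

def parse_reserved_field(data: bytes):
    flags, model_id, timing_code, length = struct.unpack_from(">BBBI", data, 0)
    payload = data[7:7 + length]
    return {
        "nn_enabled": bool(flags & 0x01),
        "fmt_converted": bool(flags & 0x02),
        "compressed": bool(flags & 0x04),
        "model_id": model_id,
        "timing_code": timing_code,
        "parameter_set_info": payload,
    }
```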
  • the parameter set of the neural network model includes one or more parameters such as input parameters, number of layers, weight parameters, hyperparameters, the number of nodes in each layer, and activation functions.
  • the above parameters, such as input parameters, number of layers, weight parameters, hyperparameters, the number of nodes in each layer, and activation functions, are only illustrative, not exhaustive; the parameter set of the neural network model may also include other parameters for determining the neural network model, which is not limited by the present disclosure.
  • the information representing the parameter set of the neural network model included in the syntax element is the corresponding information after the parameter set of the neural network model is converted into a general format.
  • when decoding, the decoding end can by default convert the parameter set of the neural network model in the common format into a neural network model that the decoding end can use for decoding.
  • the syntax element further includes a format conversion enable flag for instructing to convert the neural network model to the general format. Therefore, the decoding end can determine whether to convert the parameter set information of the neural network model in the decoding process based on the format conversion enable flag.
  • the syntax element may also include information characterizing the framework of the neural network model.
  • the framework of the neural network model may include: TensorFlow, PyTorch, Caffe2, or an AI hardware accelerator; other types of deep learning frameworks may also be used, which is not limited in this application.
  • the information representing the parameter set of the neural network model included in the syntax element is information corresponding to the compressed parameter set of the neural network model.
  • the decoding end can decompress the parameter set of the neural network model by default for decoding.
  • the syntax element further includes a compression enable flag for instructing to compress the parameter set of the neural network model. Therefore, the decoding end can determine whether to decompress the parameter set information of the neural network model correspondingly in the decoding process based on the compression enable flag.
  • the information representing the parameter set of the neural network model contained in the syntax element is corresponding information after converting the parameter set of the neural network model into a common format and compressing it.
  • the decoding end can convert and decompress the parameter set of the neural network model by default for decoding.
  • the syntax element further includes an enable flag for encoding processing using a neural network model, which is used to determine whether the encoding processing uses the neural network. Therefore, when decoding, the decoding end can confirm, based on this enable flag, whether the decoding process uses the corresponding neural network.
  • the syntax element further includes processing timing information of the neural network model in the encoding process, where the processing timing information is used to indicate a specific position of the neural network model in the encoding process. Therefore, when decoding, the decoding end can determine, based on the processing timing information, which specific positions in the decoding process are to be decoded using the neural network-based technology.
  • the processing timing information contained in the syntax element includes at least any of the following: information indicating use in predictive encoding, in transform encoding, in quantization, or in entropy encoding; and information indicating use before predictive encoding, between predictive encoding and transform encoding, between transform encoding and quantization, between quantization and entropy encoding, or after entropy encoding. Therefore, the decoding end can determine, based on the above processing timing information, at which specific positions in the decoding process the neural-network-based technology should be used for decoding.
  • the syntax element further includes identification information of the neural network model.
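  • as an illustration of how a decoder might act on these flags, the timing information, and the model identification, the following sketch maps hypothetical timing codes to decoding tools; the codes and tool names are assumptions, not values defined by the disclosure:

```python
# A sketch of dispatching neural-network decoding tools from parsed syntax.
def build_decode_pipeline(syntax):
    if not syntax["nn_enabled"]:
        return []                              # fall back to conventional tools only
    tool_by_timing = {
        0b0011: "nn_in_loop_filter",           # after prediction reconstruction
        0b0001: "nn_intra_prediction",         # intra stage of prediction reconstruction
        0b0100: "nn_context_probability",      # in entropy decoding
    }
    tool = tool_by_timing.get(syntax["timing_code"], "unknown_nn_tool")
    return [(tool, syntax["model_id"])]

print(build_decode_pipeline({"nn_enabled": True, "timing_code": 0b0011, "model_id": 1}))
```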
  • the code stream is generated based on a specified coding standard, and the reserved field where the syntax element is located may be in the video parameter set, and/or sequence parameter set, and/or neural network model parameter set, and/or picture parameter set, and/or slice header, and/or supplemental enhancement information, and/or extension data of a syntax element parameter set, and/or user data, and/or open bitstream unit, and/or sequence header, and/or picture header, and/or group-of-pictures header, and/or macroblock information of the code stream.
  • for specific coding standards, including video coding standards such as H.264, H.265, VVC, AVS, AVS+, AVS2, AVS3, or AV1, the reserved field can be a specific field of these standards: the Video Parameter Set (VPS), and/or Sequence Parameter Set (SPS), and/or Picture Parameter Set (PPS), and/or Neural Network Parameter Set (NNPS), and/or Slice Header, and/or Supplemental Enhancement Information (SEI), and/or extension data of a syntax element parameter set, and/or user data, and/or Open Bitstream Units (OBU), sequence header, group-of-pictures header, picture header, slice header, macroblock information, and so on.
  • the specified coding standards given here are only exemplary, not exhaustive; the specified coding standard may also be another coding standard, which is not limited in the present disclosure.
  • the specific fields are likewise only illustrative, not exhaustive; they may also be other specific fields of a specified coding standard, which is not limited in the present disclosure.
  • the encoding processing performed by the encoding end on the video data further includes: converting the neural network model into a common format.
  • the video decoding method further includes: when the format conversion enable flag indicates that the neural network model was converted into the common format, converting the neural network model in the common format into a neural network model of a specified framework, for application in decoding the code stream.
  • the common format includes NNEF (Neural Network Exchange Format) or ONNX (Open Neural Network Exchange).
  • the common format may also be another general format that allows the neural network model to be used across different deep learning frameworks, which is not limited in the present disclosure.
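  • a sketch of instantiating the signaled model at the decoding end is given below, assuming the onnxruntime package is available and that the ONNX file was recovered from the parameter-set information carried in the code stream; the file and input names match the earlier export sketch and are placeholders:

```python
# A sketch of loading an ONNX model at the decoding end and running it.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("nn_filter.onnx")
recon = np.random.rand(1, 3, 64, 64).astype(np.float32)   # a reconstructed frame
(filtered,) = session.run(None, {"recon": recon})
print(filtered.shape)
```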
  • in some cases, the encoding end will have compressed the information characterizing the parameter set of the neural network.
  • the video decoding method further includes: when the compression enable flag indicates that the parameter set of the neural network model was compressed, decompressing the compressed parameter set of the neural network model.
  • the decompression is performed based on a decompression technology corresponding to the compression technology in the NNR compression framework or in AITISA's compression framework.
  • the decompression may also be implemented by other decompression technologies for the neural network model, which is not limited in the present disclosure.
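  • the decompression mirror of the earlier compression sketch might look as follows, again with zlib standing in for the corresponding NNR- or AITISA-style decompression tool:

```python
# A sketch of recovering the parameter set from its compressed form.
import pickle
import zlib

def recover_parameter_set(compressed: bytes):
    return pickle.loads(zlib.decompress(compressed))
```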
  • the encoding end may encode the video data based on the neural network technology.
  • the video decoding method may further include at least one of the following steps:
  • the decoding processing using the neural network model is a neural network-based in-loop filtering technique, and the processing timing information is information indicating use after prediction reconstruction;
  • the decoding processing using the neural network model is a neural network-based intra-frame prediction technique, and the processing timing information is information indicating the intra-frame prediction stage of prediction reconstruction;
  • the decoding processing using the neural network model is a neural network-based image super-resolution technique for performing inter-frame motion estimation, and the processing timing information is information indicating the inter-frame prediction stage of prediction reconstruction;
  • the decoding processing using the neural network model is a neural network-based image super-resolution technique for obtaining a reconstructed image, and the processing timing information is information indicating use after prediction reconstruction;
  • the decoding processing using the neural network model is a neural network-based context probability estimation technique, and the processing timing information is information indicating use in entropy decoding; and so on.
  • the specific neural network model that can be used in the above-mentioned decoding process is only an exemplary description, not an enumeration, and the decoding process of the code stream may also include other neural network-based technologies.
  • the processing timing information may also be processing timing information of other contents, which is not limited in the present disclosure.
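  • as one concrete illustration of the first item in the list above, the sketch below applies a toy neural-network-based in-loop filter after prediction reconstruction; the residual-learning structure is a common design choice assumed here, not one mandated by the disclosure:

```python
# A toy sketch of a neural-network-based in-loop filter.
import torch
import torch.nn as nn

class ToyLoopFilter(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(32, 3, 3, padding=1))

    def forward(self, recon):
        return recon + self.body(recon)    # predict a correction residual

recon_frame = torch.rand(1, 3, 64, 64)     # frame after prediction reconstruction
filtered = ToyLoopFilter()(recon_frame)
print(filtered.shape)
```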
  • the decoding process is implemented based on a video decoder.
  • the decoder includes a memory, a processor, and a computer program stored on the memory and executable on the processor.
  • since the code stream carries syntax elements containing information characterizing the parameter set of the neural network model, neural-network-based video decoding can be implemented independently; the syntax elements can also be located in specific fields of multiple existing coding standards, or reference a pre-defined independent video coding standard, so as to be compatible with existing video coding standards.
  • the coupling between the neural network-based intelligent video decoding technology and a specific video coding standard can be reduced, and the applicable scope of the neural network-based video decoding technology can be expanded.
  • the present disclosure also provides a video decoding method based on a neural network, the method comprising:
  • Step 1301: acquiring the syntax elements obtained after the video decoder parses the video code stream, where the syntax elements contain information characterizing the parameter set of a neural network model;
  • Step 1302: according to the syntax elements, decoding the video code stream using the neural network model corresponding to the parameter set.
  • the neural network model used to encode the video data may be determined by a parameter set composed of multiple parameters.
  • the parameter set of the neural network model includes one or more parameters such as input parameters, number of layers, weight parameters, hyperparameters, the number of nodes in each layer, and activation functions.
  • the above parameters, such as input parameters, number of layers, weight parameters, hyperparameters, the number of nodes in each layer, and activation functions, are only illustrative, not exhaustive; the parameter set of the neural network model may also include other parameters for determining the neural network model, which is not limited by the present disclosure.
  • the decoding process of the video code stream using the neural network model corresponding to the parameter set in step 1302 may be implemented in combination with the traditional hybrid coding framework as shown in FIG. 2 .
  • using the neural network model to decode the video data may include at least one of the following steps:
  • after prediction reconstruction, a neural network-based in-loop filtering technique is performed;
  • in the intra-frame prediction stage of prediction reconstruction, a neural network-based intra-frame prediction technique is performed;
  • in the inter-frame prediction stage of prediction reconstruction, a neural network-based image super-resolution technique is performed for inter-frame motion estimation;
  • in entropy decoding, a neural network-based context probability estimation technique is performed.
  • the specific neural network model that can be used in the above-mentioned decoding process is only an exemplary description, not an enumeration, and the decoding process of the video stream may also include other neural network-based technologies.
  • the specific position of the neural network model in the traditional decoding framework can also be adaptively determined according to the actual situation of the encoding end, which is not limited in the present disclosure.
  • the decoding processing of the video code stream using the neural network model in step 1302 can also be implemented in combination with other improved hybrid decoding frameworks; correspondingly, the specific neural network model used and its specific position in the hybrid decoding framework can be determined adaptively according to the actual situation of the encoding end, which is not limited in the present disclosure.
  • the related content of the syntax element carried in the code stream described in this method may be the same as the related content of the syntax element in the aforementioned video coding method.
  • the syntax element carried by the generated code stream may be located at a specified position of the code stream, and the specified position is a reserved field of the code stream.
  • the reserved field where the syntax element is located may be in the packet header of a data packet of the code stream.
  • the information representing the parameter set of the neural network model included in the syntax element is the corresponding information after the parameter set of the neural network model is converted into a general format.
  • when decoding, the decoding end can by default convert the parameter set of the neural network model in the common format into a neural network model that the decoding end can use for decoding.
  • the syntax element further includes a format conversion enable flag for instructing to convert the neural network model to the general format. Therefore, the decoding end can determine whether to convert the parameter set information of the neural network model in the decoding process based on the format conversion enable flag.
  • the syntax element may also include information characterizing the framework of the neural network model.
  • the framework of the neural network model may include: TensorFlow, PyTorch, Caffe2, or an AI hardware accelerator; other types of deep learning frameworks may also be used, which is not limited in this application.
  • the information representing the parameter set of the neural network model included in the syntax element is information corresponding to the compressed parameter set of the neural network model.
  • the decoding end can decompress the parameter set of the neural network model by default for decoding.
  • the syntax element further includes a compression enable flag for instructing to compress the parameter set of the neural network model. Therefore, the decoding end can determine whether to decompress the parameter set information of the neural network model correspondingly in the decoding process based on the compression enable flag.
  • the information representing the parameter set of the neural network model contained in the syntax element is corresponding information after converting the parameter set of the neural network model into a common format and compressing it.
  • the decoding end can convert and decompress the parameter set of the neural network model by default for decoding.
  • the syntax element further includes an enable flag for encoding processing using a neural network model, which is used to determine whether the encoding processing uses the neural network. Therefore, when decoding, the decoding end can confirm, based on this enable flag, whether the decoding process uses the corresponding neural network.
  • the syntax element further includes processing timing information of the neural network model in the encoding process, where the processing timing information is used to indicate a specific position of the neural network model in the encoding process. Therefore, when decoding, the decoding end can determine, based on the processing timing information, which specific positions in the decoding process are to be decoded using the neural network-based technology.
  • the processing timing information contained in the syntax element includes at least any of the following: information indicating use in predictive encoding, in transform encoding, in quantization, or in entropy encoding; and information indicating use before predictive encoding, between predictive encoding and transform encoding, between transform encoding and quantization, between quantization and entropy encoding, or after entropy encoding. Therefore, the decoding end can determine, based on the above processing timing information, at which specific positions in the decoding process the neural-network-based technology should be used for decoding.
  • the syntax element further includes identification information of the neural network model.
  • the code stream is generated based on a specified coding standard, and the reserved field where the syntax element is located may be in the video parameter set, and/or sequence parameter set, and/or neural network model parameter set, and/or picture parameter set, and/or slice header, and/or supplemental enhancement information, and/or extension data of a syntax element parameter set, and/or user data, and/or open bitstream unit, and/or sequence header, and/or picture header, and/or group-of-pictures header, and/or macroblock information of the code stream.
  • for specific coding standards, including video coding standards such as H.264, H.265, VVC, AVS, AVS+, AVS2, AVS3, or AV1, the reserved field can be a specific field of these standards: the Video Parameter Set (VPS), and/or Sequence Parameter Set (SPS), and/or Neural Network Parameter Set (NNPS), and/or Picture Parameter Set (PPS), and/or Slice Header, and/or Supplemental Enhancement Information (SEI), and/or extension data of a syntax element parameter set, and/or user data, and/or Open Bitstream Units (OBU), sequence header, group-of-pictures header, picture header, slice header, macroblock information, and so on.
  • the specified coding standards given here are only exemplary, not exhaustive; the specified coding standard may also be another coding standard, which is not limited in the present disclosure.
  • the specific fields are likewise only illustrative, not exhaustive; they may also be other specific fields of a specified coding standard, which is not limited in the present disclosure.
  • the encoding processing performed by the encoding end on the video data further includes: converting the neural network model into a common format.
  • the video decoding method further includes: when the format conversion enable flag indicates that the neural network model was converted into the common format, converting the neural network model in the common format into a neural network model of a specified framework, for application in decoding the code stream.
  • the common format includes NNEF (Neural Network Exchange Format) or ONNX (Open Neural Network Exchange).
  • the common format may also be another general format that allows the neural network model to be used across different deep learning frameworks, which is not limited in the present disclosure.
  • in some cases, the encoding end will have compressed the information characterizing the parameter set of the neural network.
  • the video decoding method further includes: when the compression enable flag indicates that the parameter set of the neural network model was compressed, decompressing the compressed parameter set of the neural network model.
  • the decompression is performed based on a decompression technology corresponding to the compression technology in the NNR compression framework or in AITISA's compression framework.
  • the decompression may also be implemented by other decompression technologies for the neural network model, which is not limited in the present disclosure.
  • the encoding end may encode the video data based on the neural network technology.
  • the video decoding method may further include at least one of the following steps:
  • the decoding processing using the neural network model is a neural network-based in-loop filtering technique, and the processing timing information is information indicating use after prediction reconstruction;
  • the decoding processing using the neural network model is a neural network-based intra-frame prediction technique, and the processing timing information is information indicating the intra-frame prediction stage of prediction reconstruction;
  • the decoding processing using the neural network model is a neural network-based image super-resolution technique for performing inter-frame motion estimation, and the processing timing information is information indicating the inter-frame prediction stage of prediction reconstruction;
  • the decoding processing using the neural network model is a neural network-based image super-resolution technique for obtaining a reconstructed image, and the processing timing information is information indicating use after prediction reconstruction;
  • the decoding processing using the neural network model is a neural network-based context probability estimation technique, and the processing timing information is information indicating use in entropy decoding; and so on.
  • the specific neural network model that can be used in the above-mentioned decoding process is only an exemplary description, not an enumeration, and the decoding process of the code stream may also include other neural network-based technologies.
  • the processing timing information may also be processing timing information of other contents, which is not limited in the present disclosure.
  • the decoding process is implemented based on a video decoder.
  • the decoder includes a memory, a processor, and a computer program stored on the memory and executable on the processor.
  • since the code stream carries syntax elements containing information characterizing the parameter set of the neural network model, neural-network-based video decoding can be implemented independently; the syntax elements can also be located in specific fields of multiple existing coding standards, or reference a pre-defined independent video coding standard, so as to be compatible with existing video coding standards.
  • the coupling between the neural network-based intelligent video decoding technology and a specific video coding standard can be reduced, and the applicable scope of the neural network-based video decoding technology can be expanded.
  • the present disclosure also provides a video encoder, a schematic diagram of which is shown in FIG. 14. The encoder includes: a memory 1401, a processor 1402, and a computer program stored in the memory and executable on the processor, where the processor implements the following method when executing the program:
  • performing encoding processing on video data, the encoding processing including encoding processing using a neural network model;
  • generating, based on the encoded video data, a code stream carrying syntax elements, the syntax elements containing information characterizing the parameter set of the neural network model.
  • the present disclosure also provides a video decoder, a schematic diagram of which can also be shown in FIG. 14. The decoder includes: a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the following method when executing the program:
  • parsing the received video code stream to obtain its syntax elements, the syntax elements containing information characterizing the parameter set of a neural network model; and, according to the syntax elements, decoding the video code stream using the neural network model corresponding to the parameter set.
  • the present disclosure also provides an AI accelerator for video encoding, a schematic structural diagram of which can also be shown in FIG. 14. The AI accelerator includes: a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • the processor implements the following method when executing the program:
  • performing encoding processing on video data; and sending the parameter set of the neural network to the video encoder, so that the video encoder generates, based on the encoded video data, a code stream carrying syntax elements, the syntax elements containing information characterizing the parameter set of the neural network.
  • the AI accelerator may also be used to implement the various video encoding embodiments described in the foregoing disclosure.
  • the present disclosure also provides an AI accelerator for video decoding, a schematic diagram of which can also be shown in FIG. 14. The AI accelerator includes: a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • the processor implements the following method when executing the program:
  • acquiring the syntax elements obtained after the video decoder parses the video code stream, the syntax elements containing information characterizing the parameter set of a neural network; and, according to the syntax elements, decoding the video code stream using the neural network model corresponding to the parameter set.
  • Embodiments of the present disclosure also provide a computer-readable storage medium, including instructions, which, when executed on a computer, cause the computer to perform the method described in any of the foregoing embodiments.
  • computer-readable media include persistent and non-persistent, removable and non-removable media, and can implement information storage by any method or technology.
  • Information may be computer readable instructions, data structures, modules of programs, or other data.
  • examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic tape cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
  • computer-readable media does not include transitory computer-readable media, such as modulated data signals and carrier waves.
  • a typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular phone, camera phone, smart phone, personal digital assistant, media player, navigation device, e-mail device, game console, tablet, wearable device, or a combination of any of these devices.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The present disclosure provides a video encoding method, the method comprising: performing encoding processing on video data, the encoding processing including encoding processing using a neural network model; and generating, based on the encoded video data, a code stream carrying syntax elements, the syntax elements containing information characterizing a parameter set of the neural network model.

此外,如图11所示,本公开还提供了一种基于神经网络的视频编码方法,所述方法包括:
步骤1101:利用神经网络模型对视频数据进行编码处理;
步骤1102:将所述神经网络模型的参数集发送给视频编码器,以使所述视频编码器基于编码处理后的视频数据,生成携带有语法元素的码流,所述语法元素包含表征所述神经网络模型的参数集的信息。
步骤1101中,用于对视频数据进行编码处理的所述神经网络模型,可以由多个参数所构成的参数集确定。在一些实施例中,所述神经网络模型的参数集包括输入参数、层数、权重参数、超参数、每层的节点数量以及激活函数等参数中的一个及以 上。
当编码处理所利用的神经网络模型不同时,所述神经网络模型的参数集包含不同的内容。本公开对所述神经网络模型的参数集所包含的具体参数集参数数量不做具体限制。此外,本领域技术人员应当理解,以上输入参数、层数、权重参数、超参数、每层的节点数量以及激活函数等参数,仅仅为示例性说明,并非穷举,所述神经网络模型的参数集还可以包括其他用于确定神经网络模型的参数,本公开不做限制。
在一些实施例中,步骤1101中的利用神经网络模型对视频数据进行的编码处理,可以不依赖于传统的混合视频编码框架,利用端到端的神经网络实现。即,用户可以选择特定的深度学习框架,使用视频数据构建数据集,对神经网络模型进行训练,以实现能够利用训练完成的神经网络模型,将待编码的视频数据输入至该训练完成的神经网络模型后,能够生成符合一定要求的码流。获得的码流能够被解码,且基于码流重建的视频数据的压缩率、失真度等相关参数能够满足用户的需求。
在一些实施例中,步骤1101中的利用神经网络模型对视频数据进行的编码处理,可以结合如图1所示的传统的混合编码框架实现。所述利用神经网络模型对视频数据进行编码处理,包括以下至少一个步骤:
在预测编码的帧内预测阶段,执行基于神经网络的帧内预测技术;
在预测编码的帧间预测阶段,执行基于神经网络的图像超分辨率技术,用于进行帧间运动估计;
在预测重建之后,执行基于神经网络的环内滤波技术;
在熵编码之后,执行基于神经网络的图像超分辨率技术,用于获取重建图像;
在熵编码中,执行基于神经网络的上下文概率估计技术。
本领域技术人员应当理解,上述编码处理中,可以利用的具体的神经网络模型,仅仅为示例性说明,并非枚举,对待编码视频数据的编码处理,还可以包括其他基于神经网络的技术,相应地,所述神经网络模型在传统编码框架中的具体位置,也可以根据实际情况适应性地确定,本公开对此不做限制。
当然,步骤1101中的利用神经网络模型对视频数据进行的编码处理,还可以结合其他改进的混合编码框架实现,相应的,编码处理过程中,所采用的具体的神经网络模型和在混合编码框架中的具体位置,可以根据实际情况适应性地确定,本公开 对此不做限制。
本方法所述的码流所携带的语法元素的相关内容,可以如前文所述的一种视频编码方法中的语法元素的相关内容相同。
在一些实施例中,所述生成的码流所携带的语法元素,可以位于所述码流的指定位置,所述指定位置为所述码流的预留字段。
在一些实施例中,所述码流位于的预留字段,可以位于所述码流的数据包的包头。
在一些实施例中,所述语法元素包含的表征所述神经网络模型的参数集的信息,为将所述神经网络模型的参数集转换为通用格式后对应的信息。
在一些实施例中,所述语法元素还包含格式转换使能标识,用于指示将所述神经网络模型转换为所述通用格式。
在一些实施例中,所述语法元素所包含的表征所述神经网络模型的参数集的信息,为将所述神经网络模型的参数集压缩后对应的信息。
在一些实施例中,所述语法元素还包含压缩使能标识,用于指示对神经网络模型的参数集进行压缩。
在一些实施例中,所述语法元素所包含的表征所述神经网络模型的参数集的信息,为将所述神经网络模型的参数集转换为通用格式并进行压缩后对应的信息。
在一些实施例中,所述语法元素还包括利用神经网络模型进行编码处理的使能标识,用于确定所述编码处理是否使用所述神经网络。
在一些实施例中,所述语法元素还包括所述神经网络模型在所述编码处理中的处理时序信息,所述处理时序信息用于指示所述神经网络模型在编码处理过程中的具体位置。
在一些实施例中,当利用神经网络模型对视频数据进的编码处理,是基于传统的混合视频编码框架或者改进的混合视频编码框架,则所述语法元素所包含的处理时序信息至少包括以下任一信息:指示在预测编码中的信息、指示在变换编码中的信息、指示在量化中的信息、指示在熵编码中的信息、指示在预测编码之前的信息、指示在预测编码和变换编码之间的信息、指示在变换编码和量化之间的信息、指示在量化和熵编码之间的信息、指示在熵编码之后的信息。
在一些实施例中,所述语法元素还包括所述神经网络模型的标识信息。
在一些实施例中,所述语法元素还包括所述神经网络模型的框架信息,所述框架信息用于指示所述神经网络模型所使用的框架。
在一些实施例中,所述码流基于指定的编码标准生成,所述语法元素所位于的预留字段,位于所述码流的视频参数集,和\或序列参数集,和\或神经网络模型参数集,和\或图像参数集,和\或条带头,和\或辅助增强信息,和\或语法元素参数集的扩展数据,和\或用户数据,和\或打开比特流单元,和\或序列头,和\或图像头,和\或图像组头,和\或条带头,和\或宏块信息。
针对具体的编码标准,包括H.264或H.265或VCC或AVS或AVS+或AVS2或AVS3或AV1等等视频编码标准,所述预留字段可以是这些视频编码标准的特定字段,可以是视频参数集(Video Parameter Set,VPS),和\或序列参数集(Sequence Parameter Set,SPS),和\或神经网络模型参数集(Neural Network Parameter Set,NNPS),和\或图像参数集(Picture Parameter Set,PPS),和\或条带头(Slice Header),和\或辅助增强信息(Supplemental Enhancement Information,SEI),和\或语法元素参数集的扩展数据,和\或用户数据,和\或打开比特流单元(Open Bitstream Units,OBU)、序列头、图像组头、图像头、条带头、宏块信息等等。
当然,本领域技术人员应当理解,所给出的指定的编码标准,仅为示例性说明,并非穷举,所述指定的编码标准,还可以是其他编码标准本公开对此不做限制。所述特定字段,也仅为示例性说明,并非穷举,还可以是指定的编码标准的其他特定字段,本公开对此也不做限制。
上述实施例的详细内容,在前文所述的一种视频编码方法中已经详细介绍,这里不再赘述。
为了使编码处理所利用的神经网络模型能够在不同平台以及框架下得到部署和应用,在一些实施例中,所述利用神经网络模型对视频数据进行编码处理的方法还包括:将所述神经网络模型转换为通用格式。
通过将神经网络模型转换为通用格式,能够将神经网络框架映射到推理引擎上。这些通用格式提供接口进行常用的深度学习框架生成的神经网络模型的转换,以实现神经网络模型在不同深度学习框架之间的交互和通用。
在一些实施例中,所述通用格式包括NNFF或ONNX。当然,所述通用格式还 可以是其他通用格式,以实现将所述神经网络模型在不同的深度学习框架之间通用,本公开对此不做限制。
To reduce the complexity of the information characterizing the parameter set of the neural network model, and to save storage resources and transmission bandwidth, in some embodiments the neural-network-based video encoding method further includes: compressing the information characterizing the parameter set of the neural network.
In some embodiments, the compression is performed with a compression technique from the NNR compression framework or the AITISA compression framework. Of course, those skilled in the art should understand that the compression may also be implemented with other compression techniques for neural network models, and the present disclosure imposes no limitation on this.
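The NNR and AITISA toolchains themselves are not reproduced here; as a stand-in that only illustrates the idea of shrinking the parameter-set information, the sketch below applies uniform 8-bit quantization to float32 weights followed by general-purpose zlib compression.

    import zlib
    import numpy as np

    def compress_weights(weights: np.ndarray):
        """Toy compression: uniform 8-bit quantization followed by zlib."""
        w_min, w_max = float(weights.min()), float(weights.max())
        scale = (w_max - w_min) / 255.0 or 1.0
        q = np.round((weights - w_min) / scale).astype(np.uint8)
        return zlib.compress(q.tobytes()), (w_min, scale, weights.shape)

    def decompress_weights(blob: bytes, meta):
        w_min, scale, shape = meta
        q = np.frombuffer(zlib.decompress(blob), dtype=np.uint8).reshape(shape)
        return q.astype(np.float32) * scale + w_min

    blob, meta = compress_weights(np.random.randn(64, 64).astype(np.float32))
    restored = decompress_weights(blob, meta)  # lossy, but close to the original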
In some embodiments, the neural network model is determined in the following way: determining a neural network framework and a video encoder; and training based on the neural network framework and the video encoder to obtain the neural network model.
In step 1101, the neural network model used for the encoding process is a trained neural network model. To determine this neural network model, the deep learning framework used by the neural network model and the video encoder may be selected in advance, and sample data may be constructed. Training is performed with the constructed sample data based on the selected deep learning framework and video encoder, and when a preset condition is met, the trained neural network model is obtained and used to encode the video data to be encoded.
In some embodiments, the deep learning framework of the neural network model may be TensorFlow, PyTorch, Caffe2, Microsoft Cognitive Toolkit, Apache MXNet, and so on, or an AI hardware accelerator; of course, it may also be another type of deep learning framework. The video encoder may be a reference software platform of a video coding standard, such as VTM, HM, or JM. The preset condition may be, for example, that the loss function is minimized or converges. Those skilled in the art should understand that the above examples are merely illustrative, and the present disclosure imposes no specific limitation on the specific type of the deep learning framework, the video encoder, or the preset condition.
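A minimal training setup of the kind described, with PyTorch as the chosen framework and a plain MSE loss standing in for a full rate-distortion objective, might look like the following sketch; the random tensors stand in for pairs of reconstructed and original frames that a reference encoder such as VTM, HM, or JM would produce while constructing the dataset.

    import torch
    import torch.nn as nn

    # Stand-in dataset: (reconstructed_frame, original_frame) pairs.
    recon = torch.randn(8, 3, 64, 64)
    orig = recon + 0.05 * torch.randn_like(recon)

    model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(16, 3, 3, padding=1))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()

    for epoch in range(100):
        optimizer.zero_grad()
        loss = loss_fn(model(recon), orig)   # distortion of the filtered frame
        loss.backward()
        optimizer.step()
        if loss.item() < 1e-4:               # preset condition: loss has converged
            break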
It can be seen from the above embodiments that, because the bitstream carries syntax elements containing information characterizing the parameter set of the neural network model, the syntax elements can exist in the bitstream without relying on existing video coding standards, so that neural-network-based video encoding techniques can be implemented independently; the syntax elements can also be located within multiple existing coding standards, and thus be compatible with existing video coding standards. The methods of the embodiments of the present disclosure can reduce the coupling between neural-network-based intelligent video encoding techniques and specific video coding standards, and broaden the applicability of neural-network-based video encoding techniques.
Corresponding to the video encoding method provided by the present disclosure, as shown in FIG. 12, the present disclosure further provides a video decoding method, the method comprising:
Step 1201: parsing a received video bitstream to obtain syntax elements of the video bitstream, the syntax elements containing information characterizing a parameter set of a neural network model;
Step 1202: according to the syntax elements, decoding the video bitstream with the neural network model corresponding to the information characterizing the parameter set.
Corresponding to the encoding process, the decoding of the video data may be implemented with an end-to-end neural network, without relying on the traditional hybrid video decoding framework.
Of course, those skilled in the art should understand that the decoding of the video data may also be implemented based on the traditional hybrid video decoding framework. In some embodiments, the traditional hybrid video decoding framework may be the decoding framework shown in FIG. 2.
Of course, those skilled in the art should understand that the traditional hybrid video decoding framework may also be improved into other forms, and the present disclosure imposes no limitation on the video decoding framework used for decoding.
Corresponding to the syntax elements carried in the bitstream obtained by the corresponding encoding process, the content of the syntax elements is the same as the content of the syntax elements in the video encoding method described above.
In some embodiments, the syntax elements may be located at a specified position in the bitstream, the specified position being a reserved field of the bitstream.
In some embodiments, the reserved field in which the syntax elements are located may be in the packet header of a data packet of the bitstream.
In some embodiments, the parameter set of the neural network model includes one or more of the following parameters: input parameters, the number of layers, weight parameters, hyperparameters, the number of nodes in each layer, and the activation function.
Those skilled in the art should understand that the above parameters are merely illustrative and not exhaustive; the parameter set of the neural network model may also include other parameters used to determine the neural network model, and the present disclosure imposes no limitation on this.
In some embodiments, the information characterizing the parameter set of the neural network model contained in the syntax elements is the information corresponding to the parameter set of the neural network model after conversion to a generic format. When decoding, the decoder may by default convert the parameter set of the neural network model in the generic format into a neural network model that the decoder can use for decoding.
In some embodiments, the syntax elements further contain a format conversion enable flag, used to indicate that the neural network model is converted to the generic format. The decoder can therefore determine, based on this format conversion enable flag, whether to convert the parameter set information of the neural network model during decoding.
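On the decoder side, when the format conversion enable flag is set, a generic-format model can be instantiated with an inference runtime; the sketch below uses onnxruntime and assumes the model file produced in the export sketch above.

    import numpy as np
    import onnxruntime as ort

    def load_generic_model(path: str) -> ort.InferenceSession:
        """Instantiate a generic-format (ONNX) model for use during decoding."""
        return ort.InferenceSession(path)

    session = load_generic_model("nn_filter.onnx")
    frame = np.random.randn(1, 3, 64, 64).astype(np.float32)
    (filtered,) = session.run(None, {"frame": frame})  # input name from the export step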
In some embodiments, the syntax elements may further include information characterizing the framework of the neural network model.
In some embodiments, the framework of the neural network model may include: TensorFlow, PyTorch, Caffe2, or an AI hardware accelerator. Of course, it may also be another type of deep learning framework, and the present application imposes no limitation on this.
In some embodiments, the information characterizing the parameter set of the neural network model contained in the syntax elements is the information corresponding to the parameter set of the neural network model after compression. When decoding, the decoder may by default decompress the parameter set of the neural network model for use in decoding.
In some embodiments, the syntax elements further contain a compression enable flag, used to indicate that the parameter set of the neural network model is compressed. The decoder can therefore determine, based on this compression enable flag, whether to decompress the parameter set information of the neural network model accordingly during decoding.
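Decoder-side handling of the compression enable flag might look like the following minimal sketch; zlib again stands in for an NNR or AITISA decompressor, and the flag and payload names are assumptions.

    import zlib

    def recover_param_set(payload: bytes, compression_flag: bool) -> bytes:
        """Undo parameter-set compression only when the enable flag says it was applied."""
        return zlib.decompress(payload) if compression_flag else payload

    raw = zlib.compress(b'{"num_layers": 4}')
    param_set_bytes = recover_param_set(raw, compression_flag=True)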
In some embodiments, the information characterizing the parameter set of the neural network model contained in the syntax elements is the information corresponding to the parameter set of the neural network model after conversion to a generic format and compression. When decoding, the decoder may by default convert and decompress the parameter set of the neural network model for use in decoding.
In some embodiments, the syntax elements further include an enable flag for encoding with a neural network model, used to determine whether the encoding process uses the neural network. Therefore, when decoding, the decoder can confirm, based on this enable flag, whether the decoding process should use the corresponding neural network.
In some embodiments, the syntax elements further include processing timing information of the neural network model in the encoding process, the processing timing information indicating the specific position of the neural network model in the encoding process. Therefore, when decoding, the decoder can determine, based on this processing timing information, at which specific positions in the decoding process to perform the corresponding decoding with neural-network-based techniques.
In some embodiments, when the encoding of the video data with the neural network model is based on a traditional hybrid video coding framework or an improved hybrid video coding framework, the processing timing information contained in the syntax elements includes at least any one of the following: information indicating a position in predictive coding, in transform coding, in quantization, in entropy coding, before predictive coding, between predictive coding and transform coding, between transform coding and quantization, between quantization and entropy coding, or after entropy coding. The decoder can therefore determine, based on this processing timing information, at which specific positions in the decoding process to perform the corresponding decoding with neural-network-based techniques.
In some embodiments, the syntax elements further include identification information of the neural network model.
In some embodiments, the bitstream is generated based on a specified coding standard, and the reserved field in which the syntax elements are located is in the video parameter set, and/or sequence parameter set, and/or neural network parameter set, and/or picture parameter set, and/or slice header, and/or supplemental enhancement information, and/or extension data of a syntax element parameter set, and/or user data, and/or open bitstream units, and/or sequence header, and/or picture header, and/or group-of-pictures header, and/or slice header, and/or macroblock information of the bitstream.
For specific coding standards, including video coding standards such as H.264, H.265, VVC, AVS, AVS+, AVS2, AVS3, or AV1, the reserved field may be a specific field of these video coding standards, which may be the video parameter set (VPS), and/or sequence parameter set (SPS), and/or picture parameter set (PPS), and/or neural network parameter set (NNPS), and/or slice header, and/or supplemental enhancement information (SEI), and/or extension data of a syntax element parameter set, and/or user data, and/or open bitstream units (OBU), sequence header, group-of-pictures header, picture header, slice header, macroblock information, and so on.
Of course, those skilled in the art should understand that the specified coding standards given here are merely illustrative and not exhaustive; the specified coding standard may also be another coding standard, and the present disclosure imposes no limitation on this. The specific fields are likewise merely illustrative and not exhaustive; they may also be other specific fields of the specified coding standard, and the present disclosure imposes no limitation on this either.
The details of the above embodiments have been described in the video encoding method above and are not repeated here.
In some embodiments, to enable the neural network model used in the encoding process to be deployed and applied on different platforms and frameworks, the encoding of the video data at the encoder further includes: converting the neural network model to a generic format. Correspondingly, the video decoding method further includes: when the format conversion enable flag indicates that the neural network model is converted to the generic format, converting the neural network model in the generic format into a neural network model of a specified framework, to be applied to the decoding of the bitstream.
In some embodiments, the generic format includes NNFF or ONNX. Of course, the generic format may also be another generic format that makes the neural network model usable across different deep learning frameworks, and the present disclosure imposes no limitation on this.
To reduce the complexity of the information characterizing the parameter set of the neural network model, and to save storage resources and transmission bandwidth, in some embodiments the encoder compresses the information characterizing the parameter set of the neural network. Correspondingly, the video decoding method further includes: when the compression enable flag indicates that the parameter set of the neural network model is compressed, decompressing the compressed parameter set of the neural network model.
In some embodiments, the decompression is performed with a decompression technique corresponding to a compression technique of the NNR compression framework or the AITISA compression framework. Of course, those skilled in the art should understand that the decompression may also be implemented with other decompression techniques for neural network models, and the present disclosure imposes no limitation on this.
In some embodiments, the encoder may encode the video data with neural-network-based techniques. Correspondingly, the video decoding method may further include at least one of the following steps (a dispatch sketch follows the list):
the decoding with the neural network model is a neural-network-based in-loop filtering technique, and the processing timing information indicates a position after prediction and reconstruction;
the decoding with the neural network model is a neural-network-based intra prediction technique, and the processing timing information indicates a position in the intra prediction of prediction and reconstruction;
the decoding with the neural network model is a neural-network-based image super-resolution technique for inter-frame motion estimation, and the processing timing information indicates a position in the inter prediction stage of prediction and reconstruction;
the decoding with the neural network model is a neural-network-based image super-resolution technique for obtaining reconstructed pictures, and the processing timing information indicates a position after prediction and reconstruction;
the decoding with the neural network model is a neural-network-based context probability estimation technique, and the processing timing information indicates a position in entropy decoding; and so on.
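A sketch of how a decoder might dispatch on the processing timing information is given below; the stage keys mirror the list above and are assumptions made for illustration.

    from typing import Callable, Dict

    def nn_in_loop_filter(frame): return frame        # placeholder NN decoding tools
    def nn_intra_prediction(frame): return frame
    def nn_super_resolution(frame): return frame
    def nn_context_model(symbols): return symbols

    # Map the signalled timing information onto the matching decode-side hook.
    TIMING_DISPATCH: Dict[str, Callable] = {
        "after_reconstruction": nn_in_loop_filter,
        "intra_prediction": nn_intra_prediction,
        "inter_prediction": nn_super_resolution,
        "in_entropy_decoding": nn_context_model,
    }

    def apply_nn_stage(timing_info: str, data):
        hook = TIMING_DISPATCH.get(timing_info)
        return hook(data) if hook else data   # fall through when no NN stage is signalled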
Those skilled in the art should understand that the specific neural network models that can be used in the above decoding process are merely illustrative and not exhaustive; the decoding of the bitstream may also include other neural-network-based techniques, and accordingly the processing timing information may also be processing timing information of other content; the present disclosure imposes no limitation on this.
In some embodiments, the decoding is implemented based on a video decoder. The decoder includes: a memory, a processor, and a computer program stored in the memory and executable on the processor.
It can be seen that, because the syntax elements carried in the bitstream can exist in a reserved field of the bitstream without relying on existing video coding standards, neural-network-based video decoding techniques can be implemented independently; the syntax elements can also be located within specific fields of multiple existing coding standards, or reference a predefined independent video coding standard, and thus be compatible with existing video coding standards. The methods of the embodiments of the present disclosure can reduce the coupling between neural-network-based intelligent video decoding techniques and specific video coding standards, and broaden the applicability of neural-network-based video decoding techniques.
In addition, as shown in FIG. 13, the present disclosure further provides a neural-network-based video decoding method, the method comprising:
Step 1301: obtaining syntax elements obtained by a video decoder after parsing a video bitstream, the syntax elements containing information characterizing a parameter set of a neural network model;
Step 1302: according to the syntax elements, decoding the video bitstream with the neural network model corresponding to the parameter set.
In step 1301, the neural network model used to decode the video bitstream may be determined by a parameter set consisting of multiple parameters. In some embodiments, the parameter set of the neural network model includes one or more of the following parameters: input parameters, the number of layers, weight parameters, hyperparameters, the number of nodes in each layer, and the activation function.
Those skilled in the art should understand that the above parameters are merely illustrative and not exhaustive; the parameter set of the neural network model may also include other parameters used to determine the neural network model, and the present disclosure imposes no limitation on this.
In some embodiments, the decoding of the video bitstream with the neural network model corresponding to the parameter set in step 1302 may be implemented in combination with the traditional hybrid decoding framework shown in FIG. 2. Decoding video data with a neural network model may include at least one of the following steps:
in the intra prediction stage of prediction and reconstruction, performing a neural-network-based intra prediction technique;
in the inter prediction stage of prediction and reconstruction, performing a neural-network-based image super-resolution technique for inter-frame motion estimation;
after prediction and reconstruction, performing a neural-network-based in-loop filtering technique;
after prediction and reconstruction, performing a neural-network-based image super-resolution technique for obtaining reconstructed pictures;
in entropy decoding, performing a neural-network-based context probability estimation technique.
Those skilled in the art should understand that the specific neural network models that can be used in the above decoding process are merely illustrative and not exhaustive; the decoding of the video bitstream may also include other neural-network-based techniques, and accordingly the specific position of the neural network model in the traditional decoding framework may be determined adaptively according to the actual situation at the encoder; the present disclosure imposes no limitation on this.
Of course, the decoding of the video bitstream with the neural network model in step 1302 may also be implemented in combination with other improved hybrid decoding frameworks. Accordingly, the specific neural network model adopted in the decoding process, and its specific position in the hybrid decoding framework, may be determined adaptively according to the actual situation at the encoder, and the present disclosure imposes no limitation on this.
The content of the syntax elements carried in the bitstream of this method may be the same as the content of the syntax elements in the video encoding method described above.
In some embodiments, the syntax elements carried in the generated bitstream may be located at a specified position in the bitstream, the specified position being a reserved field of the bitstream.
In some embodiments, the reserved field in which the syntax elements are located may be in the packet header of a data packet of the bitstream.
In some embodiments, the information characterizing the parameter set of the neural network model contained in the syntax elements is the information corresponding to the parameter set of the neural network model after conversion to a generic format. When decoding, the decoder may by default convert the parameter set of the neural network model in the generic format into a neural network model that the decoder can use for decoding.
In some embodiments, the syntax elements further contain a format conversion enable flag, used to indicate that the neural network model is converted to the generic format. The decoder can therefore determine, based on this format conversion enable flag, whether to convert the parameter set information of the neural network model during decoding.
In some embodiments, the syntax elements may further include information characterizing the framework of the neural network model.
In some embodiments, the framework of the neural network model may include: TensorFlow, PyTorch, Caffe2, or an AI hardware accelerator. Of course, it may also be another type of deep learning framework, and the present application imposes no limitation on this.
In some embodiments, the information characterizing the parameter set of the neural network model contained in the syntax elements is the information corresponding to the parameter set of the neural network model after compression. When decoding, the decoder may by default decompress the parameter set of the neural network model for use in decoding.
In some embodiments, the syntax elements further contain a compression enable flag, used to indicate that the parameter set of the neural network model is compressed. The decoder can therefore determine, based on this compression enable flag, whether to decompress the parameter set information of the neural network model accordingly during decoding.
In some embodiments, the information characterizing the parameter set of the neural network model contained in the syntax elements is the information corresponding to the parameter set of the neural network model after conversion to a generic format and compression. When decoding, the decoder may by default convert and decompress the parameter set of the neural network model for use in decoding.
In some embodiments, the syntax elements further include an enable flag for encoding with a neural network model, used to determine whether the encoding process uses the neural network. Therefore, when decoding, the decoder can confirm, based on this enable flag, whether the decoding process should use the corresponding neural network.
In some embodiments, the syntax elements further include processing timing information of the neural network model in the encoding process, the processing timing information indicating the specific position of the neural network model in the encoding process. Therefore, when decoding, the decoder can determine, based on this processing timing information, at which specific positions in the decoding process to perform the corresponding decoding with neural-network-based techniques.
In some embodiments, when the encoding of the video data with the neural network model is based on a traditional hybrid video coding framework or an improved hybrid video coding framework, the processing timing information contained in the syntax elements includes at least any one of the following: information indicating a position in predictive coding, in transform coding, in quantization, in entropy coding, before predictive coding, between predictive coding and transform coding, between transform coding and quantization, between quantization and entropy coding, or after entropy coding. The decoder can therefore determine, based on this processing timing information, at which specific positions in the decoding process to perform the corresponding decoding with neural-network-based techniques.
In some embodiments, the syntax elements further include identification information of the neural network model.
In some embodiments, the bitstream is generated based on a specified coding standard, and the reserved field in which the syntax elements are located is in the video parameter set, and/or sequence parameter set, and/or neural network parameter set, and/or picture parameter set, and/or slice header, and/or supplemental enhancement information, and/or extension data of a syntax element parameter set, and/or user data, and/or open bitstream units, and/or sequence header, and/or picture header, and/or group-of-pictures header, and/or slice header, and/or macroblock information of the bitstream.
For specific coding standards, including video coding standards such as H.264, H.265, VVC, AVS, AVS+, AVS2, AVS3, or AV1, the reserved field may be a specific field of these video coding standards, which may be the video parameter set (VPS), and/or sequence parameter set (SPS), and/or neural network parameter set (NNPS), and/or picture parameter set (PPS), and/or slice header, and/or supplemental enhancement information (SEI), and/or extension data of a syntax element parameter set, and/or user data, and/or open bitstream units (OBU), sequence header, group-of-pictures header, picture header, slice header, macroblock information, and so on.
Of course, those skilled in the art should understand that the specified coding standards given here are merely illustrative and not exhaustive; the specified coding standard may also be another coding standard, and the present disclosure imposes no limitation on this. The specific fields are likewise merely illustrative and not exhaustive; they may also be other specific fields of the specified coding standard, and the present disclosure imposes no limitation on this either.
The details of the above embodiments have been described in the video encoding method above and are not repeated here.
In some embodiments, to enable the neural network model used in the encoding process to be deployed and applied on different platforms and frameworks, the encoding of the video data at the encoder further includes: converting the neural network model to a generic format. Correspondingly, the video decoding method further includes: when the format conversion enable flag indicates that the neural network model is converted to the generic format, converting the neural network model in the generic format into a neural network model of a specified framework, to be applied to the decoding of the bitstream.
In some embodiments, the generic format includes NNFF or ONNX. Of course, the generic format may also be another generic format that makes the neural network model usable across different deep learning frameworks, and the present disclosure imposes no limitation on this.
To reduce the complexity of the information characterizing the parameter set of the neural network model, and to save storage resources and transmission bandwidth, in some embodiments the encoder compresses the information characterizing the parameter set of the neural network. Correspondingly, the video decoding method further includes: when the compression enable flag indicates that the parameter set of the neural network model is compressed, decompressing the compressed parameter set of the neural network model.
In some embodiments, the decompression is performed with a decompression technique corresponding to a compression technique of the NNR compression framework or the AITISA compression framework. Of course, those skilled in the art should understand that the decompression may also be implemented with other decompression techniques for neural network models, and the present disclosure imposes no limitation on this.
In some embodiments, the encoder may encode the video data with neural-network-based techniques. Correspondingly, the video decoding method may further include at least one of the following steps:
the decoding with the neural network model is a neural-network-based in-loop filtering technique, and the processing timing information indicates a position after prediction and reconstruction;
the decoding with the neural network model is a neural-network-based intra prediction technique, and the processing timing information indicates a position in the intra prediction of prediction and reconstruction;
the decoding with the neural network model is a neural-network-based image super-resolution technique for inter-frame motion estimation, and the processing timing information indicates a position in the inter prediction stage of prediction and reconstruction;
the decoding with the neural network model is a neural-network-based image super-resolution technique for obtaining reconstructed pictures, and the processing timing information indicates a position after prediction and reconstruction;
the decoding with the neural network model is a neural-network-based context probability estimation technique, and the processing timing information indicates a position in entropy decoding; and so on.
Those skilled in the art should understand that the specific neural network models that can be used in the above decoding process are merely illustrative and not exhaustive; the decoding of the bitstream may also include other neural-network-based techniques, and accordingly the processing timing information may also be processing timing information of other content; the present disclosure imposes no limitation on this.
In some embodiments, the decoding is implemented based on a video decoder. The decoder includes: a memory, a processor, and a computer program stored in the memory and executable on the processor.
It can be seen that, because the syntax elements carried in the bitstream can exist in a reserved field of the bitstream without relying on existing video coding standards, neural-network-based video decoding techniques can be implemented independently; the syntax elements can also be located within specific fields of multiple existing coding standards, or reference a predefined independent video coding standard, and thus be compatible with existing video coding standards. The methods of the embodiments of the present disclosure can reduce the coupling between neural-network-based intelligent video decoding techniques and specific video coding standards, and broaden the applicability of neural-network-based video decoding techniques.
Correspondingly, the present disclosure further provides a video encoder, whose schematic structure is shown in FIG. 14. The encoder includes: a memory 1401, a processor 1402, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following method when executing the program:
encoding video data, the encoding including encoding with a neural network model;
generating, based on the encoded video data, a bitstream carrying syntax elements, the syntax elements containing information characterizing the parameter set of the neural network model.
The specific implementation of the video encoding is as described above and is not repeated here. Those skilled in the art should understand that the encoder can also be used to implement each of the video encoding embodiments described above in the present disclosure.
Correspondingly, the present disclosure further provides a video decoder, whose schematic structure may also be as shown in FIG. 14. The decoder includes: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following method when executing the program:
parsing a received video bitstream to obtain syntax elements of the video bitstream, the syntax elements containing information characterizing a parameter set of a neural network model;
according to the syntax elements, decoding the video bitstream with the neural network model corresponding to the parameter set.
The specific implementation of the video decoding is as described above and is not repeated here. Those skilled in the art should understand that the decoder can also be used to implement each of the video decoding embodiments described above in the present disclosure.
Correspondingly, the present disclosure further provides an AI accelerator for video encoding, whose schematic structure may also be as shown in FIG. 14. The AI accelerator includes: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following method when executing the program:
encoding video data;
sending the parameter set of the neural network to a video encoder, so that the video encoder generates, based on the encoded video data, a bitstream carrying syntax elements, the syntax elements containing information characterizing the parameter set of the neural network.
The specific implementation of the method is as described above and is not repeated here. Those skilled in the art should understand that the AI accelerator can also be used to implement each of the video encoding embodiments described above in the present disclosure.
Correspondingly, the present disclosure further provides an AI accelerator for video decoding, whose schematic structure may also be as shown in FIG. 14. The AI accelerator includes: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following method when executing the program:
obtaining syntax elements obtained by a video decoder after parsing a video bitstream, the syntax elements containing information characterizing a parameter set of a neural network;
according to the syntax elements, decoding the video bitstream with the neural network model corresponding to the parameter set.
The specific implementation of the method is as described above and is not repeated here. Those skilled in the art should understand that the AI accelerator can also be used to implement each of the video decoding embodiments described above in the present disclosure.
Embodiments of the present disclosure further provide a computer-readable storage medium including instructions that, when run on a computer, cause the computer to perform the method of any of the foregoing embodiments.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
From the description of the above embodiments, those skilled in the art can clearly understand that the embodiments of this specification can be implemented by means of software plus the necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the embodiments of this specification, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of this specification or in certain parts of the embodiments.
The systems, apparatuses, modules, or units described in the above embodiments may be implemented by computer chips or entities, or by products having certain functions. A typical implementation device is a computer, which may take the form of a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an e-mail device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
The various technical features in the above embodiments can be combined arbitrarily, as long as there is no conflict or contradiction between the combined features; due to space limitations, they are not described one by one, and therefore any combination of the various technical features in the above embodiments also falls within the scope of the present disclosure.

Claims (93)

  1. A video encoding method, characterized in that the method comprises:
    encoding video data, the encoding comprising encoding with a neural network model;
    generating, based on the encoded video data, a bitstream carrying syntax elements, the syntax elements containing information characterizing a parameter set of the neural network model.
  2. The method according to claim 1, characterized in that the syntax elements are located at a specified position in the bitstream, the specified position being a reserved field of the bitstream.
  3. The method according to claim 2, characterized in that the reserved field is located in the packet header of a data packet of the bitstream.
  4. The method according to claim 1, characterized in that the parameter set of the neural network model comprises at least one of the following parameters: input parameters of the neural network, the number of layers, weight parameters, hyperparameters, the number of nodes in each layer, and the activation function.
  5. The method according to claim 1, characterized in that the method further comprises: converting the neural network model to a generic format.
  6. The method according to claim 5, characterized in that the information characterizing the parameter set of the neural network model is the information corresponding to the parameter set of the neural network model after conversion to the generic format.
  7. The method according to claim 5, characterized in that the syntax elements further contain a format conversion enable flag, used to indicate that the neural network model is converted to the generic format.
  8. The method according to claim 1, characterized in that the method further comprises: compressing the information characterizing the parameter set of the neural network model.
  9. The method according to claim 8, characterized in that the information characterizing the parameter set of the neural network model is the information corresponding to the parameter set of the neural network model after compression.
  10. The method according to claim 8, characterized in that the syntax elements further contain a compression enable flag, used to indicate that the parameter set of the neural network model is compressed.
  11. The method according to claim 1, characterized in that the information characterizing the parameter set of the neural network model is the information corresponding to the parameter set of the neural network model after conversion to a generic format and compression.
  12. The method according to any one of claims 5 to 7 and 11, characterized in that the generic format comprises NNFF or ONNX.
  13. The method according to any one of claims 8 to 11, characterized in that the compression is performed with a compression technique from the NNR compression framework or the AITISA compression framework.
  14. The method according to claim 1, characterized in that the syntax elements further comprise an enable flag for encoding with a neural network model, used to determine whether the encoding uses the neural network model.
  15. The method according to claim 1, characterized in that the syntax elements further comprise processing timing information of the neural network model in the encoding, the processing timing information indicating the specific position of the neural network model in the encoding process.
  16. The method according to claim 15, characterized in that the processing timing information comprises at least any one of the following:
    information indicating a position in predictive coding, information indicating a position in transform coding, information indicating a position in quantization, information indicating a position in entropy coding, information indicating a position before predictive coding, information indicating a position between predictive coding and transform coding, information indicating a position between transform coding and quantization, information indicating a position between quantization and entropy coding, and information indicating a position after entropy coding.
  17. The method according to claim 16, characterized in that the encoding with the neural network model is a neural-network-based in-loop filtering technique, and the processing timing information indicates a position after prediction and reconstruction;
    and/or,
    the encoding with the neural network model is a neural-network-based intra prediction technique, and the processing timing information indicates a position in the intra prediction stage of predictive coding;
    and/or,
    the encoding with the neural network model is a neural-network-based image super-resolution technique for inter-frame motion estimation, and the processing timing information indicates a position in the inter prediction stage of predictive coding;
    and/or,
    the encoding with the neural network model is a neural-network-based image super-resolution technique for obtaining reconstructed pictures, and the processing timing information indicates a position after entropy coding;
    and/or,
    the encoding with the neural network model is a neural-network-based context probability estimation technique, and the processing timing information indicates a position in entropy coding.
  18. The method according to claim 1, characterized in that the syntax elements further comprise identification information of the neural network model.
  19. The method according to claim 1, characterized in that the syntax elements further comprise framework information of the neural network model, the framework information indicating the framework used by the neural network model.
  20. The method according to claim 19, characterized in that the framework of the neural network model comprises: TensorFlow, PyTorch, Caffe2, or an AI hardware accelerator.
  21. The method according to claim 1, characterized in that the encoding further comprises:
    determining a neural network framework and a video encoder;
    training based on the neural network framework and the video encoder to obtain the neural network model.
  22. The method according to claim 2, characterized in that the bitstream is generated based on a specified coding standard, and the reserved field is located in the video parameter set, and/or sequence parameter set, and/or neural network parameter set, and/or picture parameter set, and/or slice header, and/or supplemental enhancement information, and/or extension data of a syntax element parameter set, and/or user data, and/or open bitstream units, and/or sequence header, and/or picture header, and/or group-of-pictures header, and/or slice header, and/or macroblock information of the bitstream.
  23. A neural-network-based video encoding method, characterized in that the method comprises:
    encoding video data with a neural network model;
    sending the parameter set of the neural network model to a video encoder, so that the video encoder generates, based on the encoded video data, a bitstream carrying syntax elements, the syntax elements containing information characterizing the parameter set of the neural network model.
  24. The method according to claim 23, characterized in that the parameter set of the neural network model comprises at least one of the following parameters: input parameters of the neural network, the number of layers, weight parameters, hyperparameters, the number of nodes in each layer, and the activation function.
  25. The method according to claim 23, characterized in that the encoding of video data with a neural network model comprises:
    in the intra prediction stage of predictive coding, performing a neural-network-based intra prediction technique;
    and/or,
    in the inter prediction stage of predictive coding, performing a neural-network-based image super-resolution technique for inter-frame motion estimation;
    and/or,
    after prediction and reconstruction, performing a neural-network-based in-loop filtering technique;
    and/or,
    after entropy coding, performing a neural-network-based image super-resolution technique for obtaining reconstructed pictures;
    and/or,
    in entropy coding, performing a neural-network-based context probability estimation technique.
  26. The method according to claim 23, characterized in that the syntax elements are located at a specified position in the bitstream, the specified position being a reserved field of the bitstream.
  27. The method according to claim 26, characterized in that the reserved field is located in the packet header of a data packet of the bitstream.
  28. The method according to claim 23, characterized in that the method further comprises: converting the neural network model to a generic format.
  29. The method according to claim 28, characterized in that the information characterizing the parameter set of the neural network model is the information corresponding to the parameter set of the neural network model after conversion to the generic format.
  30. The method according to claim 28, characterized in that the syntax elements further contain a format conversion enable flag, used to indicate that the neural network model is converted to the generic format.
  31. The method according to claim 23, characterized in that the method further comprises: compressing the information characterizing the parameter set of the neural network model.
  32. The method according to claim 31, characterized in that the information characterizing the parameter set of the neural network model is the information corresponding to the parameter set of the neural network model after compression.
  33. The method according to claim 31, characterized in that the syntax elements further contain a compression enable flag, used to indicate that the parameter set of the neural network model is compressed.
  34. The method according to claim 23, characterized in that the information characterizing the parameter set of the neural network model is the information corresponding to the parameter set of the neural network model after conversion to a generic format and compression.
  35. The method according to any one of claims 29, 30, and 34, characterized in that the generic format comprises NNFF or ONNX.
  36. The method according to any one of claims 31 to 34, characterized in that the compression is performed with a compression technique from the NNR compression framework or the AITISA compression framework.
  37. The method according to claim 23, characterized in that the syntax elements further comprise an enable flag for encoding with a neural network model, used to determine whether the encoding uses the neural network.
  38. The method according to claim 23, characterized in that the syntax elements further comprise processing timing information of the neural network model in the encoding, the processing timing information indicating the specific position of the neural network model in the encoding process.
  39. The method according to claim 38, characterized in that the processing timing information comprises at least any one of the following:
    information indicating a position in predictive coding, information indicating a position in transform coding, information indicating a position in quantization, information indicating a position in entropy coding, information indicating a position before predictive coding, information indicating a position between predictive coding and transform coding, information indicating a position between transform coding and quantization, information indicating a position between quantization and entropy coding, and information indicating a position after entropy coding.
  40. The method according to claim 23, characterized in that the syntax elements further comprise identification information of the neural network model.
  41. The method according to claim 23, characterized in that the syntax elements further comprise framework information of the neural network model, the framework information indicating the framework used by the neural network model.
  42. The method according to claim 41, characterized in that the framework of the neural network model comprises: TensorFlow, PyTorch, Caffe2, or an AI hardware accelerator.
  43. The method according to claim 26, characterized in that the neural network model is determined in the following way:
    determining a neural network framework and a video encoder;
    training based on the neural network framework and the video encoder to obtain the neural network model.
  44. The method according to claim 26, characterized in that the bitstream is generated based on a specified coding standard, and the reserved field is located in the video parameter set, and/or sequence parameter set, and/or neural network parameter set, and/or picture parameter set, and/or slice header, and/or supplemental enhancement information, and/or extension data of a syntax element parameter set, and/or user data, and/or open bitstream units, and/or sequence header, and/or picture header, and/or group-of-pictures header, and/or slice header, and/or macroblock information of the bitstream.
  45. A video decoding method, characterized in that the method comprises:
    parsing a received video bitstream to obtain syntax elements of the video bitstream, the syntax elements containing information characterizing a parameter set of a neural network model;
    according to the syntax elements, decoding the video bitstream with the neural network model corresponding to the parameter set.
  46. The method according to claim 45, characterized in that the syntax elements are located at a specified position in the bitstream, the specified position being a reserved field of the bitstream.
  47. The method according to claim 46, characterized in that the reserved field is located in the packet header of a data packet of the bitstream.
  48. The method according to claim 45, characterized in that the parameter set of the neural network model comprises at least one of the following parameters: input parameters of the neural network, the number of layers, weight parameters, hyperparameters, the number of nodes in each layer, and the activation function.
  49. The method according to claim 45, characterized in that the information characterizing the parameter set of the neural network model is the information corresponding to the parameter set of the neural network model after conversion to a generic format.
  50. The method according to claim 49, characterized in that the syntax elements further contain a format conversion enable flag, used to indicate that the neural network model is converted to the generic format.
  51. The method according to claim 50, characterized in that the method further comprises:
    when the format conversion enable flag indicates that the neural network model is converted to the generic format, converting the neural network model in the generic format into a neural network model of a specified framework.
  52. The method according to claim 51, characterized in that the syntax elements further comprise information characterizing the framework of the neural network model.
  53. The method according to claim 51, characterized in that the framework of the neural network model comprises: TensorFlow, PyTorch, Caffe2, or an AI hardware accelerator.
  54. The method according to claim 45, characterized in that the information characterizing the parameter set of the neural network model is the information corresponding to the parameter set of the neural network model after compression.
  55. The method according to claim 54, characterized in that the syntax elements further contain a compression enable flag, used to indicate that the parameter set of the neural network model is compressed.
  56. The method according to claim 55, characterized in that the method further comprises:
    when the compression enable flag indicates that the parameter set of the neural network model is compressed, decompressing the compressed parameter set of the neural network model.
  57. The method according to claim 45, characterized in that the information characterizing the parameter set of the neural network model is the information corresponding to the parameter set of the neural network model after conversion to a generic format and compression.
  58. The method according to any one of claims 49 to 51 and 57, characterized in that the generic format comprises NNFF or ONNX.
  59. The method according to any one of claims 52 to 57, characterized in that the decompression is performed with a decompression technique corresponding to a compression technique of the NNR compression framework or the AITISA compression framework.
  60. The method according to claim 45, characterized in that the syntax elements further comprise an enable flag for encoding with a neural network model, used to determine whether the encoding uses the neural network.
  61. The method according to claim 45, characterized in that the syntax elements further comprise processing timing information of the neural network model in the encoding, the processing timing information indicating the specific position of the neural network model in the encoding process.
  62. The method according to claim 61, characterized in that the processing timing information comprises at least any one of the following:
    information indicating a position in predictive coding, information indicating a position in transform coding, information indicating a position in quantization, information indicating a position in entropy coding, information indicating a position before predictive coding, information indicating a position between predictive coding and transform coding, information indicating a position between transform coding and quantization, information indicating a position between quantization and entropy coding, and information indicating a position after entropy coding.
  63. The method according to claim 45, characterized in that the syntax elements further comprise identification information of the neural network model.
  64. The method according to claim 62, characterized in that the decoding with the neural network model is a neural-network-based in-loop filtering technique, and the processing timing information indicates a position after prediction and reconstruction;
    and/or,
    the decoding with the neural network model is a neural-network-based intra prediction technique, and the processing timing information indicates a position in the intra prediction of prediction and reconstruction;
    and/or,
    the decoding with the neural network model is a neural-network-based image super-resolution technique for inter-frame motion estimation, and the processing timing information indicates a position in the inter prediction stage of prediction and reconstruction;
    and/or,
    the decoding with the neural network model is a neural-network-based image super-resolution technique for obtaining reconstructed pictures, and the processing timing information indicates a position after prediction and reconstruction;
    and/or,
    the decoding with the neural network model is a neural-network-based context probability estimation technique, and the processing timing information indicates a position in entropy decoding.
  65. The method according to claim 45, characterized in that the decoding is implemented based on a video decoder.
  66. The method according to claim 46, characterized in that the bitstream is generated based on a specified coding standard, and the reserved field is located in the video parameter set, and/or sequence parameter set, and/or neural network parameter set, and/or picture parameter set, and/or slice header, and/or supplemental enhancement information, and/or extension data of a syntax element parameter set, and/or user data, and/or open bitstream units, and/or sequence header, and/or picture header, and/or group-of-pictures header, and/or slice header, and/or macroblock information of the bitstream.
  67. A neural-network-based video decoding method, characterized in that the method comprises:
    obtaining syntax elements obtained by a video decoder after parsing a video bitstream, the syntax elements containing information characterizing a parameter set of a neural network model;
    according to the syntax elements, decoding the video bitstream with the neural network model corresponding to the parameter set.
  68. The method according to claim 67, characterized in that the parameter set of the neural network model comprises at least one of the following parameters: input parameters of the neural network, the number of layers, weight parameters, hyperparameters, the number of nodes in each layer, and the activation function.
  69. The method according to claim 67, characterized in that the decoding of video data with the neural network model comprises:
    in the intra prediction stage of prediction and reconstruction, performing a neural-network-based intra prediction technique;
    and/or,
    in the inter prediction stage of prediction and reconstruction, performing a neural-network-based image super-resolution technique for inter-frame motion estimation;
    and/or,
    after prediction and reconstruction, performing a neural-network-based in-loop filtering technique;
    and/or,
    after prediction and reconstruction, performing a neural-network-based image super-resolution technique for obtaining reconstructed pictures;
    and/or,
    in entropy decoding, performing a neural-network-based context probability estimation technique.
  70. The method according to claim 67, characterized in that the syntax elements are located at a specified position in the bitstream, the specified position being a reserved field of the bitstream.
  71. The method according to claim 70, characterized in that the reserved field is located in the packet header of a data packet of the bitstream.
  72. The method according to claim 67, characterized in that the information characterizing the parameter set of the neural network model is the information corresponding to the parameter set of the neural network model after conversion to a generic format.
  73. The method according to claim 72, characterized in that the syntax elements further contain a format conversion enable flag, used to indicate that the neural network model is converted to the generic format.
  74. The method according to claim 73, characterized in that the method further comprises:
    when the format conversion enable flag indicates that the neural network model is converted to the generic format, converting the neural network model in the generic format into a neural network model of a specified framework.
  75. The method according to claim 74, characterized in that the syntax elements further comprise information characterizing the framework of the neural network model.
  76. The method according to claim 75, characterized in that the framework of the neural network model comprises: TensorFlow, PyTorch, Caffe2, or an AI hardware accelerator.
  77. The method according to claim 67, characterized in that the information characterizing the parameter set of the neural network model is the information corresponding to the parameter set of the neural network model after compression.
  78. The method according to claim 77, characterized in that the syntax elements further contain a compression enable flag, used to indicate that the parameter set of the neural network model is compressed.
  79. The method according to claim 78, characterized in that the method further comprises:
    when the compression enable flag indicates that the parameter set of the neural network model is compressed, decompressing the compressed parameter set of the neural network model.
  80. The method according to claim 67, characterized in that the information characterizing the parameter set of the neural network model is the information corresponding to the parameter set of the neural network model after conversion to a generic format and compression.
  81. The method according to any one of claims 72 to 74 and 80, characterized in that the generic format comprises NNFF or ONNX.
  82. The method according to any one of claims 77 to 80, characterized in that the decompression is performed with a decompression technique corresponding to a compression technique of the NNR compression framework or the AITISA compression framework.
  83. The method according to claim 67, characterized in that the syntax elements further comprise an enable flag for encoding with a neural network model, used to determine whether the encoding uses the neural network.
  84. The method according to claim 67, characterized in that the syntax elements further comprise processing timing information of the neural network model in the encoding, the processing timing information indicating the specific position of the neural network model in the encoding process.
  85. The method according to claim 84, characterized in that the processing timing information comprises at least any one of the following:
    information indicating a position in predictive coding, information indicating a position in transform coding, information indicating a position in quantization, information indicating a position in entropy coding, information indicating a position before predictive coding, information indicating a position between predictive coding and transform coding, information indicating a position between transform coding and quantization, information indicating a position between quantization and entropy coding, and information indicating a position after entropy coding.
  86. The method according to claim 67, characterized in that the syntax elements further comprise identification information of the neural network model.
  87. The method according to claim 67, characterized in that the decoding is implemented based on a video decoder.
  88. The method according to claim 70, characterized in that the bitstream is generated based on a specified coding standard, and the reserved field is located in the video parameter set, and/or sequence parameter set, and/or neural network parameter set, and/or picture parameter set, and/or slice header, and/or supplemental enhancement information, and/or extension data of a syntax element parameter set, and/or user data, and/or open bitstream units, and/or sequence header, and/or picture header, and/or group-of-pictures header, and/or slice header, and/or macroblock information of the bitstream.
  89. A video encoder, characterized in that the encoder comprises: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following method when executing the program:
    encoding video data, the encoding comprising encoding with a neural network model;
    generating, based on the encoded video data, a bitstream carrying syntax elements, the syntax elements containing information characterizing a parameter set of the neural network model.
  90. A video decoder, characterized in that the decoder comprises: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following method when executing the program:
    parsing a received video bitstream to obtain syntax elements of the video bitstream, the syntax elements containing information characterizing a parameter set of a neural network model;
    according to the syntax elements, decoding the video bitstream with the neural network model corresponding to the parameter set.
  91. An AI accelerator for video encoding, characterized in that the AI accelerator comprises: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following method when executing the program:
    encoding video data;
    sending the parameter set of the neural network to a video encoder, so that the video encoder generates, based on the encoded video data, a bitstream carrying syntax elements, the syntax elements containing information characterizing the parameter set of the neural network.
  92. An AI accelerator for video decoding, characterized in that the AI accelerator comprises: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following method when executing the program:
    obtaining syntax elements obtained by a video decoder after parsing a video bitstream, the syntax elements containing information characterizing a parameter set of a neural network;
    according to the syntax elements, decoding the video bitstream with the neural network model corresponding to the parameter set.
  93. A machine-readable storage medium, characterized in that several computer instructions are stored on the machine-readable storage medium, and the computer instructions, when executed, perform the method according to any one of claims 1, 23, 45, and 67.
PCT/CN2020/133979 2020-12-04 2020-12-04 Video encoding method, decoding method, encoder, decoder, and AI accelerator WO2022116165A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2020/133979 WO2022116165A1 (zh) 2020-12-04 2020-12-04 Video encoding method, decoding method, encoder, decoder, and AI accelerator
CN202080081315.7A CN114868390A (zh) 2020-12-04 2020-12-04 Video encoding method, decoding method, encoder, decoder, and AI accelerator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/133979 WO2022116165A1 (zh) 2020-12-04 2020-12-04 Video encoding method, decoding method, encoder, decoder, and AI accelerator

Publications (1)

Publication Number Publication Date
WO2022116165A1 true WO2022116165A1 (zh) 2022-06-09

Family

ID=81852780

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/133979 WO2022116165A1 (zh) 2020-12-04 2020-12-04 Video encoding method, decoding method, encoder, decoder, and AI accelerator

Country Status (2)

Country Link
CN (1) CN114868390A (zh)
WO (1) WO2022116165A1 (zh)


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3562162A1 (en) * 2018-04-27 2019-10-30 InterDigital VC Holdings, Inc. Method and apparatus for video encoding and decoding based on neural network implementation of cabac
US10499081B1 (en) * 2018-06-19 2019-12-03 Sony Interactive Entertainment Inc. Neural network powered codec
CN110401834B (zh) * 2019-08-06 2021-07-27 杭州微帧信息科技有限公司 An adaptive video encoding method based on deep learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190230354A1 (en) * 2016-06-24 2019-07-25 Korea Advanced Institute Of Science And Technology Encoding and decoding methods and devices including cnn-based in-loop filter
CN111937392A (zh) * 2018-04-17 2020-11-13 联发科技股份有限公司 Neural network method and apparatus for video encoding and decoding
CN112019843A (zh) * 2019-05-30 2020-12-01 富士通株式会社 Encoding and decoding programs, encoding and decoding devices, and encoding and decoding methods
CN110648278A (zh) * 2019-09-10 2020-01-03 网宿科技股份有限公司 Image super-resolution processing method, system, and device
CN111064958A (zh) * 2019-12-28 2020-04-24 复旦大学 A low-complexity neural network filtering algorithm for B-frames and P-frames

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
H. YIN (INTEL), R. YANG (INTEL), X. FANG, S. MA, Y. YU (INTEL): "AHG9 : Adaptive convolutional neural network loop filter", 13. JVET MEETING; 20190109 - 20190118; MARRAKECH; (THE JOINT VIDEO EXPLORATION TEAM OF ISO/IEC JTC1/SC29/WG11 AND ITU-T SG.16 ), 5 January 2019 (2019-01-05), Marrakech MA, pages 1 - 2, XP030200692 *
Y. LI, S. LIU (TENCENT), K. KAWAMURA (KDDI): "CE10: Summary Report on Neural Network based Filter for Video Coding", 15. JVET MEETING; 20190703 - 20190712; GOTHENBURG; (THE JOINT VIDEO EXPLORATION TEAM OF ISO/IEC JTC1/SC29/WG11 AND ITU-T SG.16 ), 9 July 2019 (2019-07-09), Gothenburg SE, pages 1 - 10, XP030218520 *
Y. LI, S. LIU (TENCENT), K. KAWAMURA (KDDI): "Description of Core Experiment 10 (CE10): Neural Network based Filter for Video Coding", 14. JVET MEETING; 20190319 - 20190327; GENEVA; (THE JOINT VIDEO EXPLORATION TEAM OF ISO/IEC JTC1/SC29/WG11 AND ITU-T SG.16 ), 18 April 2019 (2019-04-18), Geneva CH, pages 1 - 11, XP030205178 *
Y. LI, S. LIU (TENCENT), K. KAWAMURA (KDDI): "Description of Core Experiment 13 (CE13): Neural Network based Filter for Video Coding", 13. JVET MEETING; 20190109 - 20190118; MARRAKECH; (THE JOINT VIDEO EXPLORATION TEAM OF ISO/IEC JTC1/SC29/WG11 AND ITU-T SG.16 ), 11 February 2019 (2019-02-11), Marrakech MA, pages 1 - 10, XP030202577 *
Y.-L. HSIAO, T.-D. CHUANG, C.-Y. CHEN, C.-W. HSU, Y.-W. HUANG, S.-M. LEI (MEDIATEK): "AHG9: Convolution neural network loop filter", 11. JVET MEETING; 20180711 - 20180718; LJUBLJANA; (THE JOINT VIDEO EXPLORATION TEAM OF ISO/IEC JTC1/SC29/WG11 AND ITU-T SG.16 ), 3 July 2018 (2018-07-03), Ljubljana SI, pages 1 - 4, XP030198827 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116781910A (zh) * 2023-07-03 2023-09-19 江苏汇智达信息科技有限公司 Information conversion system based on neural network algorithm

Also Published As

Publication number Publication date
CN114868390A (zh) 2022-08-05

Similar Documents

Publication Publication Date Title
US11523124B2 (en) Coded-block-flag coding and derivation
CN108696761B (zh) Picture file processing method, device, and system
KR101365441B1 (ko) Video encoding apparatus and method, and video decoding apparatus and method
US11451827B2 (en) Non-transform coding
US20220046260A1 (en) Video decoding method and apparatus, video encoding method and apparatus, device, and storage medium
CN108353175B (zh) Method and apparatus for processing video signals using coefficient-induced prediction
US11477465B2 (en) Colour component prediction method, encoder, decoder, and storage medium
FR2888424A1 (fr) Device and method for encoding and decoding video data and data stream
CN109151503B (zh) Picture file processing method and device
US20220046253A1 (en) Video encoding and decoding methods and apparatuses, device, and storage medium
WO2022116165A1 (zh) Video encoding method, decoding method, encoder, decoder, and AI accelerator
US20240080487A1 (en) Method, apparatus for processing media data, computer device and storage medium
TWI797560B (zh) Cross-layer reference constraints
WO2024078066A1 (zh) Video decoding method, video encoding method, apparatus, storage medium, and device
KR20140119220A (ko) Apparatus and method for providing video recompression
CN111147858B (zh) Video decoding method, apparatus, device, and storage medium
WO2024061136A1 (en) Method, apparatus, and medium for video processing
WO2023004590A1 (zh) Video decoding and encoding method, device, and storage medium
US20240205413A1 (en) Picture encoding method and apparatus, picture decoding method and apparatus, electronic device and storage medium
WO2023148084A1 (en) A method and an apparatus for encoding/decoding attributes of a 3d object
CN116033170A (zh) Video decoding method, video coding and decoding system, and video decoding apparatus
CN113163212A (zh) Video decoding method and apparatus, video encoding method and apparatus, medium, and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20964011

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20964011

Country of ref document: EP

Kind code of ref document: A1