WO2022100173A1 - Method and apparatus for video frame compression and video frame decompression - Google Patents

Method and apparatus for video frame compression and video frame decompression Download PDF

Info

Publication number
WO2022100173A1
Authority
WO
WIPO (PCT)
Prior art keywords
video frame
current video
neural network
frame
feature
Prior art date
Application number
PCT/CN2021/112077
Other languages
English (en)
French (fr)
Inventor
师一博
王晶
葛运英
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority to EP21890702.0A (published as EP4231644A4)
Priority to JP2023528362A (published as JP2023549210A)
Priority to CN202180076647.0A (published as CN116918329A)
Publication of WO2022100173A1
Priority to US18/316,750 (published as US20230281881A1)

Classifications

    • H ELECTRICITY
      • H04 ELECTRIC COMMUNICATION TECHNIQUE
        • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
          • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
            • H04N19/60 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
              • H04N19/61 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding in combination with predictive coding
            • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
              • H04N19/102 Methods or arrangements using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
                • H04N19/103 Selection of coding mode or of prediction mode
                • H04N19/124 Quantisation
              • H04N19/134 Methods or arrangements using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
                • H04N19/136 Incoming video signal characteristics or properties
            • H04N19/90 Methods or arrangements using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
              • H04N19/91 Entropy coding, e.g. variable length coding [VLC] or arithmetic coding
    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
          • G06T9/00 Image coding
            • G06T9/002 Image coding using neural networks
        • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N3/00 Computing arrangements based on biological models
            • G06N3/02 Neural networks
              • G06N3/04 Architecture, e.g. interconnection topology
                • G06N3/044 Recurrent networks, e.g. Hopfield networks
                • G06N3/045 Combinations of networks
                • G06N3/047 Probabilistic or stochastic networks
                • G06N3/048 Activation functions
              • G06N3/08 Learning methods
                • G06N3/084 Backpropagation, e.g. using gradient descent
                • G06N3/088 Non-supervised learning, e.g. competitive learning

Definitions

  • The present application relates to the field of artificial intelligence, and in particular, to a method and apparatus for compressing and decompressing video frames.
  • Artificial intelligence is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that responds in a way similar to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning, and decision-making.
  • Video frame compression based on deep-learning neural networks is a common application of artificial intelligence.
  • In one such approach, the encoder calculates, through a neural network, the optical flow of the current video frame relative to the reference frame of the current video frame, generating the optical flow of the original current video frame relative to the reference frame, and then compresses and encodes this optical flow to obtain the compressed optical flow. The reference frame of the current video frame and the current video frame both belong to the current video sequence, and the reference frame of the current video frame is the video frame that needs to be referenced when the current video frame is compressed and encoded. The encoder then decompresses the compressed optical flow to obtain the decompressed optical flow, generates the predicted current video frame according to the decompressed optical flow and the reference frame, calculates through the neural network the residual between the original current video frame and the predicted current video frame, and compresses and encodes this residual. The compressed optical flow and the compressed residual are sent to the decoder, so that the decoder can obtain the decompressed current video frame through the neural network according to the decompressed reference frame, the decompressed optical flow, and the decompressed residual.
  • The present application provides a video frame compression method, a video frame decompression method, and corresponding devices. With them, the quality of the reconstructed frame of the current video frame does not depend on the quality of the reconstructed frame of the reference frame of the current video frame, thereby avoiding the accumulation of errors between frames and improving the quality of the reconstructed frames of the video frames; in addition, the advantages of the first neural network and the second neural network are combined, so that the quality of the reconstructed frames is improved while the amount of data that needs to be transmitted is minimized.
  • In a first aspect, the present application provides a video frame compression method that applies artificial intelligence technology to the field of video frame encoding and decoding.
  • The method may include: an encoder determines a target neural network from a plurality of neural networks according to a network selection strategy, the plurality of neural networks including a first neural network and a second neural network; the encoder then compresses and encodes the current video frame through the target neural network to obtain the compression information corresponding to the current video frame.
  • If the compression information is obtained through the first neural network, it includes the first compression information of the first feature of the current video frame; the reference frame of the current video frame is used in the compression process of the first feature but not in the generation process of the first feature. That is, the first feature of the current video frame can be obtained from the current video frame alone, and the reference frame of the current video frame is not required to generate it. If the compression information is obtained through the second neural network, it includes the second compression information of the second feature of the current video frame, and the reference frame of the current video frame is used in the generation process of the second feature.
  • The current video frame is an original video frame included in the current video sequence; the reference frame of the current video frame may or may not be an original video frame of the current video sequence. Specifically, the reference frame of the current video frame may be a video frame obtained by transform-coding the original reference frame through an encoding network and then performing inverse-transform decoding through a decoding network; alternatively, the reference frame of the current video frame is the original reference frame held by the encoder.
  • In the embodiments of the present application, when the compression information is obtained through the first neural network, the compression information carries the first compression information of the first feature of the current video frame, and the reference frame of the current video frame is used only in the compression process of the first feature, not in its generation process. The decoder therefore does not need the reference frame of the current video frame after performing the decompression operation according to the first compression information to obtain the first feature: the reconstructed frame of the current video frame can be obtained from the first feature alone. Consequently, when the compression information is obtained through the first neural network, the quality of the reconstructed frame of the current video frame does not depend on the quality of the reconstructed frame of the reference frame of the current video frame, which avoids the accumulation of errors between frames and improves the quality of the reconstructed frames. In addition, since the second feature of the current video frame is generated according to the reference frame of the current video frame, the amount of data corresponding to the second compression information of the second feature is smaller than the amount of data corresponding to the first compression information of the first feature. The encoder can therefore use the first neural network and the second neural network to process different video frames in the current video sequence, combining the advantages of both networks to improve the quality of the reconstructed frames while minimizing the amount of data to be transmitted.
  • In a possible implementation of the first aspect, the first neural network includes an encoding network and an entropy encoding layer, where the first feature of the current video frame is obtained from the current video frame through the encoding network, and the entropy encoding layer performs entropy encoding on the first feature of the current video frame to output the first compression information. Further, the first feature of the current video frame may be obtained by transform-coding the current video frame through the encoding network and then quantizing the result. A minimal sketch of this branch follows.
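  • The following is a minimal sketch of the first neural network's encoding side, written in a PyTorch style. The layer sizes, the rounding quantizer, and the class name are illustrative assumptions, not the application's exact design.

```python
import torch
import torch.nn as nn

class FirstNetworkEncoder(nn.Module):
    """Transform-codes the current frame without using any reference frame."""
    def __init__(self, channels=64):
        super().__init__()
        # Encoding network: downsampling convolutions acting as transform coding.
        self.encode = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=5, stride=2, padding=2),
        )

    def forward(self, current_frame):
        feature = self.encode(current_frame)  # first feature of the current frame
        quantized = torch.round(feature)      # quantization after transform coding
        return quantized                      # handed to the entropy encoding layer
```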
  • In a possible implementation of the first aspect, the second neural network includes a convolutional network and an entropy encoding layer, where the convolutional network includes a plurality of convolutional layers and ReLU activation layers. The residual of the current video frame is obtained through the convolutional network using the reference frame of the current video frame, and entropy encoding is performed on the residual of the current video frame through the entropy encoding layer to output the second compression information.
  • In a possible implementation of the first aspect, the encoder compressing and encoding the current video frame through the target neural network to obtain the compression information corresponding to the current video frame may include: the encoder generates the optical flow of the original current video frame relative to the reference frame of the current video frame and compresses and encodes this optical flow to obtain the compressed optical flow. The encoder may also decompress the compressed optical flow to obtain the decompressed optical flow, generate the predicted current video frame from the decompressed optical flow and the reference frame of the current video frame, and calculate the residual between the original current video frame and the predicted current video frame. The second feature of the current video frame thus includes the optical flow of the original current video frame relative to the reference frame of the current video frame and the residual between the original current video frame and the predicted current video frame. A sketch of this pipeline follows.
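  • The following is a minimal sketch of this inter-frame pipeline. The helpers `flow_net`, `compress`, `decompress`, and `warp` (backward warping of the reference frame by the optical flow) are hypothetical names introduced for illustration, not functions defined by the application.

```python
def encode_inter_frame(current_frame, reference_frame):
    flow = flow_net(current_frame, reference_frame)   # optical flow vs. the reference frame
    flow_bits = compress(flow)                        # compressed optical flow
    decoded_flow = decompress(flow_bits)              # mirror what the decoder will see
    predicted = warp(reference_frame, decoded_flow)   # predicted current video frame
    residual = current_frame - predicted              # residual: the other part of the second feature
    residual_bits = compress(residual)
    return flow_bits, residual_bits                   # together: the second compression information
```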
  • In a possible implementation of the first aspect, the network selection strategy is related to any one or more of the following factors: the position information of the current video frame or the amount of data carried by the current video frame.
  • In a possible implementation of the first aspect, the encoder determining the target neural network from the multiple neural networks according to the network selection strategy includes: the encoder obtains the position information of the current video frame in the current video sequence, where the position information indicates that the current video frame is the Xth frame of the current video sequence; the position information may specifically be expressed as an index number, and the index number may take the form of a character string. The encoder then selects the target neural network from the multiple neural networks according to the position information.
  • In a possible implementation of the first aspect, the encoder determining the target neural network from the multiple neural networks according to the network selection strategy includes: the encoder selects the target neural network from the multiple neural networks according to the attributes of the current video frame, where the attributes of the current video frame reflect the amount of data carried by the current video frame and include any one or a combination of the following: the entropy, contrast, and saturation of the current video frame.
  • In this embodiment, the target neural network is selected from the multiple neural networks according to the position information of the current video frame in the current video sequence; alternatively, the target neural network may be selected from the multiple neural networks according to at least one attribute of the current video frame. The target neural network can then be used to generate the compression information of the current video frame. This provides several simple, easy-to-operate implementations and improves the implementation flexibility of the scheme.
  • In a possible implementation of the first aspect, the method may further include: the encoder generates and sends at least one piece of indication information in one-to-one correspondence with one or more pieces of compression information. Each piece of indication information indicates which of the first neural network and the second neural network is the target neural network through which the corresponding piece of compression information was obtained. The decoder can thus obtain the indication information corresponding to each piece of compression information and know which of the first neural network and the second neural network should perform the decompression operation for each video frame in the current video sequence, which shortens the time the decoder needs to decode the compression information, that is, improves the efficiency of video frame transmission for the encoder and decoder as a whole.
  • In a possible implementation of the first aspect, the encoder compressing and encoding the current video frame through the target neural network to obtain the compression information corresponding to the current video frame may include: the encoder obtains the first feature of the current video frame from the current video frame through an encoding network, and predicts the feature of the current video frame according to the reference frame of the current video frame to generate the predicted feature of the current video frame; the predicted feature is a prediction of the first feature of the current video frame and has the same data shape as the first feature. Through the entropy encoding layer, the encoder generates the probability distribution of the first feature of the current video frame from the predicted feature; the probability distribution includes the mean and the variance of the first feature of the current video frame. Through the entropy encoding layer, the encoder then performs entropy encoding on the first feature of the current video frame according to its probability distribution to obtain the first compression information.
  • In this embodiment, the encoder generates the probability distribution of the first feature from the predicted feature of the current video frame and compresses the first feature according to this distribution to obtain the first compression information. The higher the similarity between the predicted feature and the first feature, the greater the compression rate of the first feature and the smaller the resulting first compression information. Because the predicted feature is obtained by predicting the feature of the current video frame from its reference frame, the similarity between the predicted feature and the first feature of the current video frame is improved and the size of the first compression information is reduced; that is, the quality of the reconstructed frame obtained by the decoder is preserved while the amount of data transmitted between the encoder and the decoder is reduced. A sketch of this conditional entropy model follows.
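  • The following is a minimal sketch of conditioning the entropy model on the predicted feature, assuming a Gaussian entropy model with the predicted feature as the mean; `variance_net` and the bit-estimation shortcut are illustrative assumptions.

```python
import torch

def estimate_bits(first_feature, predicted_feature, variance_net):
    mean = predicted_feature                     # predicted feature serves as the mean
    scale = variance_net(predicted_feature)      # per-element standard deviation
    dist = torch.distributions.Normal(mean, scale)
    # Probability mass of each quantized value over a unit-width bin around it.
    p = dist.cdf(first_feature + 0.5) - dist.cdf(first_feature - 0.5)
    bits = -torch.log2(p.clamp_min(1e-9)).sum()  # better prediction -> fewer bits
    return bits
```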
  • In a possible implementation of the first aspect, the first neural network and the second neural network are both neural networks that have undergone training, and the model parameters of the first neural network are updated according to the first loss function of the first neural network. The first loss function includes a loss term for the similarity between the first training video frame and the first training reconstructed frame, and a loss term for the data size of the compression information of the first training video frame, where the first training reconstructed frame is the reconstructed frame of the first training video frame. The training objective of the first loss function includes increasing the similarity between the first training video frame and the first training reconstructed frame, and also includes reducing the size of the first compression information of the first training video frame.
  • The model parameters of the second neural network are updated according to the second loss function of the second neural network. The second loss function includes a loss term for the similarity between the second training video frame and the second training reconstructed frame, and a loss term for the data size of the compression information of the second training video frame, where the second training reconstructed frame is the reconstructed frame of the second training video frame and the reference frame of the second training video frame is a video frame processed by the first neural network. The training objective of the second loss function includes increasing the similarity between the second training video frame and the second training reconstructed frame, and also includes reducing the size of the second compression information of the second training video frame. Both objectives can be written in rate-distortion form, as sketched below.
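  • As a sketch, both loss functions take the usual rate-distortion form below, where $x_i$ is the training video frame, $\hat{x}_i$ its training reconstructed frame, $c_i$ its compression information, $d$ a distortion (dissimilarity) measure such as mean squared error, $R$ the data size of the compression information, and $\lambda$ a trade-off weight; these symbols are illustrative and not fixed by the application.

```latex
\mathcal{L}_i = d\!\left(x_i, \hat{x}_i\right) + \lambda \, R\!\left(c_i\right), \qquad i \in \{1, 2\}
```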
  • In this embodiment, since the reference frame used by the second neural network in the execution stage may be one processed by the first neural network, using reference frames processed by the first neural network when training the second neural network keeps the training stage consistent with the execution stage, thereby improving accuracy in the execution stage.
  • In a second aspect, the embodiments of the present application provide a video frame compression method that applies artificial intelligence technology to the field of video frame encoding and decoding.
  • The method may include: the encoder compresses and encodes the current video frame through the first neural network to obtain the first compression information of the first feature of the current video frame, where the reference frame of the current video frame is used in the compression process of the first feature; through the first neural network, the encoder also generates a first video frame, which is a reconstructed frame of the current video frame. The encoder compresses and encodes the current video frame through the second neural network to obtain the second compression information of the second feature of the current video frame, where the reference frame of the current video frame is used in the generation process of the second feature; through the second neural network, the encoder also generates a second video frame, which is a reconstructed frame of the current video frame. The encoder then determines the compression information corresponding to the current video frame according to the first compression information, the first video frame, the second compression information, and the second video frame: either the determined compression information is obtained through the first neural network and is the first compression information, or it is obtained through the second neural network and is the second compression information.
  • In this embodiment, the finally required compression information is selected from the first compression information and the second compression information, so that the performance of the compression information corresponding to the entire current video sequence can be improved as much as possible.
  • In a possible implementation of the second aspect, the encoder may adopt the same selection method of the target compression information for each video frame in the current video sequence. Specifically, the encoder calculates a first score value corresponding to the first compression information (that is, corresponding to the first neural network) according to the first compression information and the first video frame, and calculates a second score value corresponding to the second compression information (that is, corresponding to the second neural network) according to the second compression information and the second video frame. The encoder selects the lower of the first score value and the second score value, determines the compression information corresponding to the lower score value as the compression information of the current video frame, and thereby determines the neural network corresponding to the lower score value as the target neural network.
  • In this embodiment, the encoder first performs the compression operation on the current video frame through both the first neural network and the second neural network, obtains the first score value corresponding to the first compression information and the second score value corresponding to the second compression information, and keeps the lower of the two. The score values of all video frames in the entire current video sequence are thus kept as low as possible, which improves the performance of the compression information corresponding to the entire current video sequence. A sketch of this per-frame selection follows.
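  • The following is a minimal sketch of the per-frame selection rule, assuming a rate-distortion style score (coded size plus weighted distortion); `num_bits`, `mse`, and the weight `lam` are hypothetical names introduced for illustration.

```python
def choose_compression(current_frame, info1, recon1, info2, recon2, lam=0.01):
    score1 = num_bits(info1) + lam * mse(current_frame, recon1)  # first neural network
    score2 = num_bits(info2) + lam * mse(current_frame, recon2)  # second neural network
    # Keep the compression information whose score value is lower.
    return (info1, "first") if score1 <= score2 else (info2, "second")
```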
  • In a possible implementation of the second aspect, the encoder may take one cycle as the calculation unit: from a plurality of first score values, it generates the values of the coefficients and offset of a first fitting formula corresponding to the first score values, and from a plurality of second score values, it generates the values of the coefficients and offset of a second fitting formula corresponding to the second score values. The encoder then determines the compression information of the current video frame from the first compression information and the second compression information according to the first fitting formula and the second fitting formula, where the optimization objective is to minimize the average of the total score values within one cycle, that is, to minimize the total score value within one cycle.
  • In this embodiment, the technicians found during research the rule by which the first score value and the second score value change within a single cycle, and took minimizing the average of the total score values within a cycle as the optimization objective. That is, when determining the target compression information corresponding to each current video frame, not only the score value of the current video frame but also the average score value over the whole cycle is considered, which further reduces the score values associated with all video frames in the entire current video sequence and can further improve the performance of the compression information corresponding to the entire current video sequence.
  • In a possible implementation, the encoder may also perform the steps performed by the encoder in each possible implementation manner of the first aspect.
  • In a third aspect, the embodiments of the present application provide a video frame compression method that applies artificial intelligence technology to the field of video frame encoding and decoding.
  • The method may include: the encoder compresses and encodes a third video frame through the first neural network to obtain the first compression information corresponding to the third video frame, where the first compression information includes the compression information of the first feature of the third video frame and the reference frame of the third video frame is used in the compression process of the first feature of the third video frame; the encoder compresses and encodes a fourth video frame through the second neural network to obtain the second compression information corresponding to the fourth video frame, where the second compression information includes the compression information of the second feature of the fourth video frame and the reference frame of the fourth video frame is used in the generation process of the second feature of the fourth video frame.
  • In a possible implementation, the encoder may also perform the steps performed by the encoder in each possible implementation manner of the first aspect.
  • In a fourth aspect, an embodiment of the present application provides a video frame decompression method that applies artificial intelligence technology to the field of video frame encoding and decoding.
  • The method may include: the decoder obtains the compression information of the current video frame and performs a decompression operation through the target neural network according to the compression information of the current video frame, so as to obtain the reconstructed frame of the current video frame. The target neural network is a neural network selected from a plurality of neural networks, and the plurality of neural networks includes a third neural network and a fourth neural network. If the decompression is performed through the third neural network, the compression information includes the first compression information of the first feature of the current video frame; the reference frame of the current video frame is used in the decompression process of the first compression information to obtain the first feature of the current video frame, and the first feature is used in the generation process of the reconstructed frame of the current video frame. If the decompression is performed through the fourth neural network, the compression information includes the second compression information of the second feature of the current video frame; the second compression information is used by the decoder to perform the decompression operation to obtain the second feature of the current video frame, and the reference frame of the current video frame together with the second feature is used in the generation process of the reconstructed frame of the current video frame. The reconstructed frame of the current video frame and the reference frame of the current video frame are included in the current video sequence.
  • In a possible implementation of the fourth aspect, the third neural network includes an entropy decoding layer and a decoding network, where the entropy decoding layer uses the reference frame of the current video frame to perform entropy decoding of the first compression information of the current video frame, and the decoding network generates the reconstructed frame of the current video frame from the first feature of the current video frame.
  • In a possible implementation of the fourth aspect, the decoder performing the decompression operation through the target neural network according to the compression information of the current video frame to obtain the reconstructed frame of the current video frame may include: the decoder generates the probability distribution of the first feature according to the predicted feature of the current video frame, where the predicted feature is obtained by predicting the first feature according to the reference frame of the current video frame; the decoder performs entropy decoding on the compression information according to the probability distribution of the first feature to obtain the first feature, and performs inverse-transform decoding on the first feature to obtain the reconstructed frame of the current video frame.
  • In a possible implementation of the fourth aspect, the fourth neural network includes an entropy decoding layer and a convolutional network, where entropy decoding is performed on the second compression information through the entropy decoding layer, and the reconstructed frame of the current video frame is generated through the convolutional network using the reference frame of the current video frame and the second feature of the current video frame.
  • In a possible implementation of the fourth aspect, the decoder performing the decompression operation through the target neural network according to the compression information of the current video frame to obtain the reconstructed frame of the current video frame may include: the decoder decompresses the second compression information to obtain the second feature of the current video frame, that is, the optical flow of the original current video frame relative to the reference frame of the current video frame and the residual between the original current video frame and the predicted current video frame; the decoder predicts the current video frame according to this optical flow and the reference frame of the current video frame to obtain the predicted current video frame, and then generates the reconstructed frame of the current video frame from the predicted current video frame and the residual. A minimal decoder-side sketch follows.
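  • The following is a minimal decoder-side sketch of this fourth path, mirroring the encoder-side pipeline above; `decompress` and `warp` are the same hypothetical helpers assumed there.

```python
def decode_inter_frame(flow_bits, residual_bits, reference_frame):
    decoded_flow = decompress(flow_bits)              # optical flow vs. the reference frame
    residual = decompress(residual_bits)              # residual of the current frame
    predicted = warp(reference_frame, decoded_flow)   # predicted current video frame
    return predicted + residual                       # reconstructed current video frame
```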
  • In a possible implementation of the fourth aspect, the method may further include: the decoder obtains at least one piece of indication information in one-to-one correspondence with at least one piece of compression information, and determines, according to the at least one piece of indication information and the compression information of the current video frame, the target neural network corresponding to the current video frame from the plurality of neural networks including the third neural network and the fourth neural network.
  • In a fifth aspect, the embodiments of the present application provide a video frame decompression method that applies artificial intelligence technology to the field of video frame encoding and decoding.
  • The decoder decompresses the first compression information of a third video frame through the third neural network to obtain the reconstructed frame of the third video frame, where the first compression information includes the compression information of the first feature of the third video frame, the reference frame of the third video frame is used in the decompression process of the first compression information to obtain the first feature of the third video frame, and the first feature of the third video frame is used in the generation process of the reconstructed frame of the third video frame. The decoder decompresses the second compression information of a fourth video frame through the fourth neural network to obtain the decompressed fourth video frame, where the second compression information includes the compression information of the second feature of the fourth video frame and is used by the decoder to perform the decompression operation to obtain the second feature of the fourth video frame, and the reference frame of the fourth video frame together with the second feature of the fourth video frame is used in the generation process of the reconstructed frame of the fourth video frame.
  • In a possible implementation, the decoder may also perform the steps performed by the decoder in each possible implementation manner of the fourth aspect.
  • An embodiment of the present application provides an encoder that includes a processing circuit configured to execute the method described in any one of the first aspect, the second aspect, the third aspect, the fourth aspect, or the fifth aspect.
  • An embodiment of the present application provides a decoder that includes a processing circuit configured to execute the method described in any one of the first aspect, the second aspect, the third aspect, the fourth aspect, or the fifth aspect.
  • An embodiment of the present application provides a computer program product which, when run on a computer, causes the computer to execute the method of any one of the first aspect, the second aspect, the third aspect, the fourth aspect, or the fifth aspect.
  • Embodiments of the present application provide an encoder, which may include one or more processors and a non-transitory computer-readable storage medium coupled to the processors and storing program instructions that, when executed by the processors, cause the encoder to implement the video frame compression method described in the first aspect, the second aspect, or the third aspect.
  • Embodiments of the present application provide a decoder, which may include one or more processors and a non-transitory computer-readable storage medium coupled to the processors and storing program instructions that, when executed by the processors, cause the decoder to implement the video frame decompression method described in the fourth aspect or the fifth aspect.
  • An embodiment of the present application provides a non-transitory computer-readable storage medium that includes program code which, when executed on a computer, causes the computer to execute the method of any one of the first aspect, the second aspect, the third aspect, the fourth aspect, or the fifth aspect.
  • An embodiment of the present application provides a circuit system that includes a processing circuit configured to execute the method of any one of the first aspect, the second aspect, the third aspect, the fourth aspect, or the fifth aspect.
  • An embodiment of the present application provides a chip system that includes a processor for implementing the functions involved in the above aspects, for example, sending or processing the data and/or information involved in the above methods.
  • the chip system further includes a memory for storing necessary program instructions and data of the server or the communication device.
  • the chip system may be composed of chips, or may include chips and other discrete devices.
  • FIG. 1a is a schematic structural diagram of an artificial intelligence main body framework provided by an embodiment of the present application;
  • FIG. 1b is an application scenario diagram of the video frame compression and decompression methods provided by an embodiment of the present application;
  • FIG. 1c is another application scenario diagram of the video frame compression and decompression methods provided by an embodiment of the present application;
  • FIG. 2 is a schematic diagram of a principle of a video frame compression method provided by an embodiment of the present application;
  • FIG. 3 is a schematic flowchart of a video frame compression method provided by an embodiment of the present application;
  • FIG. 4 is a schematic diagram of the correspondence between the position of the current video frame and the adopted target neural network in the video frame compression method provided by an embodiment of the present application;
  • FIG. 5a is a schematic structural diagram of a first neural network provided by an embodiment of the present application;
  • FIG. 5b is a schematic structural diagram of a second neural network provided by an embodiment of the present application;
  • FIG. 5c is a schematic diagram comparing the first feature and the second feature in the video frame compression method provided by an embodiment of the present application;
  • FIG. 6 is a schematic diagram of another principle of a video frame compression method provided by an embodiment of the present application;
  • FIG. 7a is another schematic flowchart of a video frame compression method provided by an embodiment of the present application;
  • FIG. 7b is a schematic diagram of a first score value and a second score value in a video frame compression method provided by an embodiment of the present application;
  • FIG. 7c is a schematic diagram of calculating the coefficients and offsets of the first fitting formula and of the second fitting formula in the video frame compression method provided by an embodiment of the present application;
  • FIG. 8 is another schematic flowchart of a video frame compression method provided by an embodiment of the present application;
  • FIG. 9 is a schematic diagram of a video frame compression method provided by an embodiment of the present application;
  • FIG. 10a is a schematic flowchart of a video frame decompression method provided by an embodiment of the present application;
  • FIG. 10b is another schematic flowchart of a video frame decompression method provided by an embodiment of the present application;
  • FIG. 11 is another schematic flowchart of a video frame decompression method provided by an embodiment of the present application;
  • FIG. 12 is a schematic flowchart of a training method for a video frame compression and decompression system provided by an embodiment of the present application;
  • FIG. 13 is a system architecture diagram of a video encoding and decoding system provided by an embodiment of the present application;
  • FIG. 14 is another system architecture diagram of a video encoding and decoding system provided by an embodiment of the present application;
  • FIG. 15 is a schematic diagram of a video decoding device provided by an embodiment of the present application;
  • FIG. 16 is a simplified block diagram of an apparatus provided by an embodiment of the present application.
  • FIG. 1a shows a schematic structural diagram of the main framework of artificial intelligence.
  • The above artificial intelligence framework is explained below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
  • The "intelligent information chain" reflects a series of processes from data acquisition to processing; for example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data goes through the refinement of "data - information - knowledge - wisdom".
  • The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (technologies for providing and processing information) up to the industrial ecology of the system.
  • The infrastructure provides computing power support for the artificial intelligence system, enables communication with the outside world, and is supported by the basic platform. The infrastructure communicates with the outside through sensors. Computing power is provided by smart chips; as examples, smart chips include hardware acceleration chips such as the central processing unit (CPU), the neural-network processing unit (NPU), the graphics processing unit (GPU), the application-specific integrated circuit (ASIC), and the field-programmable gate array (FPGA). The basic platform includes the platform guarantee and support related to distributed computing frameworks and networks, which may include cloud storage and computing, interconnection networks, and so on. For example, the sensors communicate with the outside to obtain data, and these data are provided to the smart chips in the distributed computing system provided by the basic platform for calculation.
  • The data at the layer above the infrastructure represents the data sources in the field of artificial intelligence.
  • the data involves graphics, images, voice, and text, as well as IoT data from traditional devices, including business data from existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, etc.
  • machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, etc. on data.
  • Reasoning refers to the process of simulating human's intelligent reasoning method in a computer or intelligent system, using formalized information to carry out machine thinking and solving problems according to the reasoning control strategy, and the typical function is search and matching.
  • Decision-making refers to the process of making decisions after intelligent information is reasoned, usually providing functions such as classification, sorting, and prediction.
  • Based on the results of data processing, some general capabilities can be formed, such as algorithms or general systems for translation, text analysis, computer vision processing, speech recognition, image recognition, and so on.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they are the encapsulation of the overall artificial intelligence solution, the productization of intelligent information decision-making, and the realization of practical applications. The application areas mainly include intelligent terminals, intelligent manufacturing, intelligent transportation, smart home, smart healthcare, smart security, autonomous driving, smart city, and so on.
  • FIG. 1b is an application scenario diagram of the video frame compression and decompression method provided by the embodiment of the present application.
  • The album of a client can store videos, and there is a demand for sending the videos in the album to a cloud server for storage. Before sending a video, the client (that is, the encoder) needs to use AI technology to compress and encode the video frames in the video; correspondingly, the cloud server (that is, the decoder) can use AI technology to decompress the received information to obtain the reconstructed frames of the video frames.
  • As another example, a surveillance device needs to send the collected video to a management center. The surveillance device (that is, the encoder) needs to compress the video frames in the video before sending it to the management center; correspondingly, the management center (that is, the decoder) needs to decompress the received information to obtain the video frames.
  • FIG. 1 c is another application scenario diagram of the video frame compression and decompression method provided by the embodiment of the present application.
  • The anchor uses a client to collect video, and the client needs to send the collected video to a server, which distributes the video to viewing users. Before sending the video, the client of the anchor (that is, the encoder) needs to use AI technology to compress and encode the video frames in the video; correspondingly, the client of a viewing user (that is, the decoder) needs to decompress the received information to obtain the reconstructed frames of the video frames. The example in FIG. 1c is only for the convenience of understanding this solution and is not intended to limit it.
  • Because the AI technology used in the embodiments of the present application is a neural network, the embodiments of the present application include both the inference stage and the training stage of the aforementioned neural network, and the processes of the two stages differ; the inference stage and the training stage are described separately below.
  • In the inference stage, the operation of compression encoding is performed by the encoder, and the operation of decompression is performed by the decoder; the operations of the encoder and the decoder are described separately below.
  • Since multiple neural networks are configured in the encoder, the encoder generates the target compression information corresponding to the current video frame as follows. In one implementation manner, the encoder may first determine a target neural network from the multiple neural networks according to a network selection strategy and then generate the target compression information of the current video frame through the target neural network. In another implementation manner, the encoder may separately generate multiple pieces of compression information of the current video frame through the multiple neural networks and determine the target compression information corresponding to the current video frame according to the multiple pieces of generated compression information. Since the implementation processes of the foregoing two implementation manners differ, they are described separately below.
  • The encoder first selects the target neural network from the multiple neural networks.
  • In this implementation manner, the encoder first uses a network selection strategy to select, from the multiple neural networks, a target neural network for processing the current video frame. That is, for any video frame in the current video sequence (the current video frame in FIG. 2), the encoder selects a target neural network from the multiple neural networks according to the network selection strategy, compresses and encodes the current video frame through the target neural network, and obtains the target compression information corresponding to the current video frame.
  • FIG. 3 is a schematic flowchart of a video frame compression method provided by an embodiment of the present application.
  • the video frame compression method provided by the embodiment of the present application may include:
  • 301. The encoder determines a target neural network from multiple neural networks according to a network selection strategy.
  • In the embodiments of the present application, the encoder is configured with multiple neural networks, which include at least a first neural network, a second neural network, or other neural networks for performing compression operations; the first neural network, the second neural network, and the other types of neural networks are all neural networks that have undergone a training operation. The encoder can determine the target neural network from the multiple neural networks according to the network selection strategy and compress and encode the current video frame through the target neural network. The target compression information refers to the compression information that the encoder finally decides to send to the decoder; that is, the target compression information is generated by the target neural network among the multiple neural networks.
  • video coding generally refers to the processing of image sequences that form a video or a video sequence.
  • In the field of video coding, the terms "picture", "video frame", or "image" may be used as synonyms.
  • Video encoding is performed on the source side and typically involves processing (eg, compressing) the original video frame to reduce the amount of data required to represent the video frame (and thus store and/or transmit more efficiently).
  • Video decoding is performed on the destination side and typically involves inverse processing relative to the encoder to reconstruct video frames.
  • the encoding part and the decoding part are also collectively referred to as codec (encoding and decoding, CODEC).
  • the network selection strategy is related to any one or more of the following factors: the position information of the current video frame or the amount of data carried by the current video frame.
  • In one implementation manner, step 301 may include: the encoder obtains the position information of the current video frame in the current video sequence, where the position information indicates that the current video frame is the Xth frame of the current video sequence; the encoder then selects, according to the network selection strategy, the target neural network corresponding to the position information from the multiple neural networks including the first neural network and the second neural network. The position information of the current video frame in the current video sequence may specifically be represented as an index number, and the index number may take the form of a character string; for example, the index number of the current video frame may be 00000223, 00000368, or another string, which is not exhaustively enumerated here.
  • Further, the network selection strategy may be to alternately select the first neural network or the second neural network according to a certain rule. That is, after the encoder uses the first neural network to compress and encode n video frames of the current video sequence, it uses the second neural network to compress and encode the next m video frames; or, after the encoder uses the second neural network to compress and encode m video frames of the current video sequence, it uses the first neural network to compress and encode the next n video frames. The values of n and m are both integers greater than or equal to 1, and the values of n and m may be the same or different. For example, if the values of n and m are both 1, the network selection strategy may be to use the first neural network to compress and encode the odd-numbered frames in the current video sequence and the second neural network to compress and encode the even-numbered frames; alternatively, the network selection strategy may be to use the second neural network for the odd-numbered frames and the first neural network for the even-numbered frames. As another example, if the value of n is 1 and the value of m is 3, the network selection strategy may be that after one video frame in the current video sequence is compressed and encoded by the first neural network, the second neural network compresses and encodes the next three consecutive video frames, and so on; the possibilities are not exhaustively enumerated here.
  • FIG. 4 is a schematic diagram of the correspondence between the position of the current video frame and the adopted target neural network in the video frame compression method provided by the embodiment of the present application.
  • As shown in FIG. 4, the encoder uses the first neural network to compress and encode the t-th video frame, then uses the second neural network to compress and encode the (t+1)-th, (t+2)-th and (t+3)-th video frames, and uses the first neural network again to compress and encode the (t+4)-th video frame. In other words, each time the first neural network performs one compression encoding on a current video frame, the second neural network then performs compression encoding on three current video frames. It should be understood that the example in FIG. 4 is only intended to facilitate understanding of this solution and is not intended to limit it.
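  • As an illustration only (the function name, 0-based indexing and the "first"/"second" return labels are assumptions, not part of this application), the position-based alternation described above can be sketched as follows:

```python
def select_network_by_position(frame_index: int, n: int = 1, m: int = 3) -> str:
    """Alternate the two networks by position in the current video
    sequence: n frames with the first neural network, then m frames
    with the second neural network."""
    return "first" if frame_index % (n + m) < n else "second"

# With n = 1 and m = 3 this reproduces the pattern of FIG. 4: frame t uses
# the first network, frames t+1..t+3 use the second, frame t+4 the first again.
assert [select_network_by_position(i) for i in range(5)] == [
    "first", "second", "second", "second", "first"]
```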
  • step 301 may include: the encoder acquires attributes of the current video frame and selects a target neural network from the first neural network and the second neural network, where the attributes of the current video frame are used to indicate the amount of data carried by the current video frame and include any one or a combination of the following: entropy, contrast, saturation, and other types of attributes of the current video frame, which are not exhaustively listed here.
  • the higher the entropy of the current video frame, the greater the amount of data carried by the current video frame and the greater the probability that the second neural network is selected as the target neural network; the lower the entropy of the current video frame, the smaller the probability that the second neural network is selected as the target neural network.
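  • As an illustration of the entropy attribute, the sketch below estimates the Shannon entropy of an 8-bit grayscale frame from its histogram; the function and its exact form are assumptions for illustration, not details from this application:

```python
import numpy as np

def frame_entropy(gray_frame: np.ndarray) -> float:
    """Shannon entropy of an 8-bit grayscale frame in bits per pixel;
    a rough estimate of the amount of data the frame carries."""
    hist = np.bincount(gray_frame.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]                      # ignore empty histogram bins
    return float(-(p * np.log2(p)).sum())
```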
  • the target neural network is selected from the multiple neural networks according to the position information of the current video frame in the current video sequence; or, the target neural network may be selected from the multiple neural networks according to at least one attribute of the current video frame, and the target neural network is then used to generate the compression information of the current video frame. This provides a variety of simple and easy-to-operate implementation schemes and improves the implementation flexibility of the solution.
  • the encoder can arbitrarily select one neural network from the first neural network and the second neural network as the target neural network, so as to use the target neural network to generate target compression information of the current video frame.
  • the encoder can configure a first selection probability for the first neural network and a second selection probability for the second neural network, where the value of the second selection probability is greater than or equal to the first selection probability, and then perform the selection of the target neural network according to the first selection probability and the second selection probability.
  • as an example, the value of the first selection probability may be 0.2 and the value of the second selection probability 0.8; or the value of the first selection probability may be 0.3 and the value of the second selection probability 0.7, etc. The values of the first selection probability and the second selection probability are not exhaustively enumerated here.
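  • A minimal sketch of such a probabilistic selection (the helper name is illustrative; the 0.2/0.8 default is taken from the example above):

```python
import random

def select_network_by_probability(p_first: float = 0.2) -> str:
    """Randomly pick the target neural network; the second selection
    probability is 1 - p_first and is configured to be at least as
    large as the first selection probability."""
    return "first" if random.random() < p_first else "second"
```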
  • the encoder compresses and encodes the current video frame through the target neural network, so as to obtain target compression information corresponding to the current video frame.
  • the target neural network may be the first neural network, the second neural network, or another network for compressing video frames. If the target compression information is obtained through the first neural network, the target compression information includes the first compression information of the first feature of the current video frame; the reference frame of the current video frame is used in the compression process of the first feature of the current video frame, but is not used in the generation process of the first feature of the current video frame.
  • the reference frame of the current video frame and the current video frame are both derived from the current video sequence; the current video frame is the original video frame included in the current video sequence.
  • the reference frame of the current video frame may be an original video frame in the current video sequence, and the sorting position of the reference frame in the current video sequence may be located before or after the current video frame; that is, when the current video sequence is played, the reference frame may appear earlier or later than the current video frame.
  • alternatively, the reference frame of the current video frame may not be an original video frame in the current video sequence; the sorting position of the original reference frame corresponding to the reference frame of the current video frame in the current video sequence may be located before or after the current video frame. In this case, the reference frame of the current video frame may be a video frame obtained after the encoder performs transform coding on the original reference frame and then inverse-transform decodes it, or the video frame obtained after the original reference frame is compressed and then decompressed.
  • the aforementioned compression operation may be implemented by a first neural network, or may be implemented by a second neural network.
  • the first neural network may at least include an encoding (Encoding) network and an entropy encoding layer, where the first feature of the current video frame is obtained from the current video frame through the encoding network, and the compression of the first feature of the current video frame is performed by the entropy encoding layer using the reference frame of the current video frame, which outputs the first compression information corresponding to the current video frame.
  • FIG. 5a is a schematic structural diagram of the first neural network provided by the embodiment of the present application.
  • As shown in FIG. 5a, the current video frame is encoded through the encoding network and quantized to obtain the first feature of the current video frame; the entropy coding layer then uses the reference frame of the current video frame to compress the first feature of the current video frame and outputs the first compression information corresponding to the current video frame (that is, an example of the target compression information corresponding to the current video frame).
  • the example in FIG. 5a is only to facilitate understanding of the present solution, and is not intended to limit the present solution.
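  • For illustration, an encoding network of this shape could be sketched as follows in PyTorch; the class name, layer counts, channel widths and rounding-based quantization are assumptions, not details from this application:

```python
import torch
import torch.nn as nn

class FirstEncodingNetwork(nn.Module):
    """Multi-layer convolutional encoding network: maps the current video
    frame to its first feature without using any reference frame (the
    reference frame is only used later, by the entropy coding layer)."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=5, stride=2, padding=2),
        )

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        feature = self.layers(frame)   # transform coding
        return torch.round(feature)    # quantization (Q in FIG. 5a)

first_feature = FirstEncodingNetwork()(torch.rand(1, 3, 256, 256))  # current frame only
```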
  • The following is directed to the process in which the encoder generates the first compression information corresponding to the current video frame through the first neural network. The encoder transform-encodes the current video frame through the first encoding network (Encoding Network) and then quantizes the result to obtain the first feature of the current video frame; that is, the first feature can be obtained based on the current video frame alone, and the reference frame of the current video frame does not need to be used in the generation process of the first feature.
  • the first encoding network may specifically be represented as a multi-layer convolutional network.
  • the first feature includes the features of M pixels and can be expressed as an L-dimensional tensor, for example a one-dimensional tensor, a two-dimensional tensor, a three-dimensional tensor or a higher-dimensional tensor; no exhaustive list is made here.
  • the encoder predicts the features of the current video frame according to the N reference frames of the current video frame to generate the first predicted feature of the current video frame, and generates the probability distribution of the first feature of the current video frame according to the first predicted feature of the current video frame.
  • the encoder performs entropy coding on the first feature of the current video frame according to the probability distribution of the first feature of the current video frame to obtain the first compression information.
  • the first prediction feature of the current video frame is the prediction result of the first feature of the current video frame; it also includes the features of M pixels and may specifically be represented as a tensor. The data shape of the first prediction feature of the current video frame is the same as the data shape of the first feature of the current video frame; that the shapes are the same means that the first prediction feature and the first feature are both L-dimensional tensors and that the first dimension in the L dimensions of the first prediction feature has the same size as the second dimension in the L dimensions of the first feature, where L is an integer greater than or equal to 1, the first dimension is any one of the L dimensions of the first prediction feature, and the second dimension is the dimension at the same position as the first dimension in the L dimensions of the first feature.
  • the probability distribution of the first feature of the current video frame includes the mean value of the first feature of the current video frame and the variance of the first feature of the current video frame; both the mean value of the first feature and the variance of the first feature can be expressed as L-dimensional tensors. The data shape of the mean value of the first feature is the same as the data shape of the first feature, and the data shape of the variance of the first feature is also the same as the data shape of the first feature, so that the mean of the first feature includes one value corresponding to each of the M pixels, and the variance of the first feature includes one value corresponding to each of the M pixels.
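  • For illustration, a per-element Gaussian entropy model is one common way such a mean/variance distribution is consumed; the sketch below estimates the number of bits needed to entropy-encode the quantized first feature (the integer bin width of 1 and the use of the standard deviation, i.e. the square root of the variance, are modelling assumptions):

```python
import torch

def estimate_bits(first_feature, mean, scale):
    """Estimate the bits needed to entropy-encode the quantized first
    feature under a per-element Gaussian; mean comes from the first
    prediction feature and scale is the standard deviation. All three
    tensors share the same data shape."""
    gauss = torch.distributions.Normal(mean, scale)
    # Probability mass of each integer bin [x - 0.5, x + 0.5].
    pmf = gauss.cdf(first_feature + 0.5) - gauss.cdf(first_feature - 0.5)
    return -torch.log2(pmf.clamp_min(1e-9)).sum()
# The closer the predicted mean is to the first feature, the larger the
# probability mass and the fewer bits the entropy coder needs.
```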
  • The following describes the process in which the encoder predicts the features of the current video frame according to the N reference frames of the current video frame to generate the first prediction feature of the current video frame. The features of the first video frame are predicted based on N second video frames so as to generate the first predicted feature of the first video frame, and the probability distribution of the first feature of the first video frame is generated according to the first predicted feature of the first video frame; correspondingly, the features of the current video frame are predicted to generate the first prediction feature of the current video frame, and the probability distribution of the first feature of the current video frame is generated according to the first prediction feature of the current video frame.
  • since the encoder generates the probability distribution of the first feature of the current video frame according to the first prediction feature corresponding to the current video frame, and then compresses and encodes the first feature of the current video frame according to that probability distribution to obtain the first compression information of the current video frame, the higher the similarity between the first prediction feature and the first feature, the higher the compression rate of the first feature and the smaller the finally obtained first compression information of the current video frame. The first prediction feature of the current video frame is obtained by predicting the features of the current video frame according to the N reference frames of the current video frame, which improves the similarity between the first prediction feature and the first feature of the current video frame and thus reduces the size of the compressed first compression information; that is, the scheme can not only ensure the quality of the reconstructed frame obtained by the decoder but also reduce the amount of data transmitted between the encoder and the decoder.
  • if the target compression information is obtained through the second neural network, the target compression information includes the second compression information of the second feature of the current video frame, and the reference frame of the current video frame is used in the generation process of the second feature of the current video frame.
  • the second neural network includes a convolutional network and an entropy coding layer, and the convolutional network includes a plurality of convolutional layers and an activation (ReLU) layer; the generation process of the second feature of the current video frame is performed through the convolutional network using the reference frame of the current video frame, the second feature of the current video frame is then compressed by the entropy coding layer, and the second compression information corresponding to the current video frame is output.
  • the encoder may perform compression encoding on the aforementioned optical flow to obtain the compressed optical flow.
  • the second feature of the current video frame may only include the optical flow of the original current video frame relative to the reference frame of the current video frame.
  • the encoder can also generate the predicted current video frame according to the optical flow of the original current video frame relative to the reference frame of the current video frame and the reference frame of the current video frame; the encoder then calculates the residual between the original current video frame and the predicted current video frame, performs compression encoding on the optical flow of the original current video frame relative to the reference frame of the current video frame and on the residual between the original current video frame and the predicted current video frame, and outputs the second compression information corresponding to the current video frame. In this case, the second feature of the current video frame includes the optical flow of the original current video frame relative to the reference frame of the current video frame and the residual between the original current video frame and the predicted current video frame.
  • the encoder can directly perform a compression operation on the second feature of the current video frame to obtain the second compression information corresponding to the current video frame.
  • the aforementioned compression operation may be implemented by a neural network or a non-neural network manner.
  • the aforementioned compression coding manner may be entropy coding.
  • FIG. 5b is a schematic structural diagram of the second neural network provided by the embodiment of the present application.
  • the encoder inputs the current video frame and the reference frame of the current video frame to the convolutional network, and performs optical flow estimation through the convolutional network to obtain the optical flow of the current video frame relative to the reference frame of the current video frame.
  • the encoder generates the reconstructed frame of the current video frame through the convolutional network according to the optical flow of the current video frame relative to the reference frame of the current video frame and the reference frame of the current video frame, and obtains the residual between the reconstructed frame of the current video frame and the current video frame.
  • the encoder can compress, through the entropy coding layer, the optical flow of the current video frame relative to the reference frame of the current video frame and the residual between the reconstructed frame of the current video frame and the current video frame, and output the second compression information of the current video frame.
  • the example in FIG. 5b is only for the convenience of understanding this solution, and is not used to limit this solution.
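  • A rough sketch of this second-feature generation in PyTorch (flow_net stands for an assumed convolutional optical-flow estimation network; the grid_sample-based warping is one common implementation choice, not necessarily the one used in this application):

```python
import torch
import torch.nn.functional as F

def second_feature_sketch(current, reference, flow_net):
    """Build the second feature: the optical flow of the current frame
    relative to its reference frame, plus the residual between the
    original frame and the frame predicted from that flow."""
    flow = flow_net(torch.cat([current, reference], dim=1))  # optical flow estimation

    # Warp the reference frame with the flow to predict the current frame.
    n, _, h, w = reference.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).float().unsqueeze(0) + flow.permute(0, 2, 3, 1)
    grid[..., 0] = 2.0 * grid[..., 0] / (w - 1) - 1.0   # normalize x to [-1, 1]
    grid[..., 1] = 2.0 * grid[..., 1] / (h - 1) - 1.0   # normalize y to [-1, 1]
    predicted = F.grid_sample(reference, grid, align_corners=True)

    residual = current - predicted
    return flow, residual  # both parts are then entropy-encoded
```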
  • FIG. 5c is a schematic diagram of a comparison of the first feature and the second feature in the video frame compression method provided by the embodiment of the present application.
  • Figure 5c includes two sub-diagrams (a) and (b): sub-diagram (a) of Figure 5c is a schematic diagram of generating the first feature of the current video frame, and sub-diagram (b) of Figure 5c is a schematic diagram of generating the second feature of the current video frame.
  • the current video frame is input to the encoding network, and after transform encoding is performed by the encoding network, quantization (Q) is performed to obtain the first feature of the current video frame.
  • the content in the dotted box in sub-diagram (b) of FIG. 5c represents the second feature of the current video frame. As sub-diagram (b) of FIG. 5c shows, the second feature of the current video frame includes not only the optical flow of the original current video frame relative to the reference frame of the current video frame but also the residual between the original current video frame and the predicted current video frame; the process of generating the second feature of the current video frame has been described in detail above and is not repeated one by one here.
  • the encoder can also be configured with other neural networks for compressing and encoding video frames (hereinafter referred to as the “fifth neural network” for convenience of description), but the encoder is configured with at least the first neural network and the second neural network; the detailed process of using the first neural network and the second neural network to perform compression encoding will be described in subsequent embodiments and is not introduced here for the time being.
  • the fifth neural network can be a neural network that directly compresses the current video frame; that is, the encoder can input the current video frame into the fifth neural network and directly compress the current video frame through the fifth neural network to obtain the third compression information corresponding to the current video frame output by the fifth neural network.
  • the fifth neural network may specifically use a convolutional neural network.
  • the encoder generates indication information corresponding to the target compression information, where the indication information is used to indicate that the target compression information is obtained through the target neural network in the first neural network and the second neural network.
  • the encoder may further generate at least one piece of indication information corresponding to the target compression information of at least one current video frame, where the aforementioned at least one piece of indication information is used to indicate whether each piece of target compression information is obtained through the first neural network or through the second neural network; that is, one piece of indication information indicates which of the first neural network and the second neural network one piece of target compression information is obtained through.
  • the multiple indication information corresponding to the target compression information of multiple video frames in the current video sequence may specifically be expressed as a character string or other forms.
  • as an example, the multiple pieces of indication information corresponding to the target compression information of multiple video frames in the current video sequence may specifically be 0010110101, where one character in the aforementioned character string represents one piece of indication information: when a piece of indication information is 0, the current video frame corresponding to that indication information is compressed by the first neural network, and when a piece of indication information is 1, the current video frame corresponding to that indication information is compressed by the second neural network.
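  • For illustration, decoding such an indication string could look like this (the function name and list output are assumptions):

```python
def parse_indication_string(bits: str):
    """'0': the frame was compressed by the first neural network;
    '1': the frame was compressed by the second neural network."""
    return ["first" if b == "0" else "second" for b in bits]

# For the example string above, frames 1 and 2 used the first network and
# frame 3 used the second network.
assert parse_indication_string("0010110101")[:3] == ["first", "first", "second"]
```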
  • the encoder may generate one piece of indication information corresponding to the target compression information of a current video frame after obtaining the target compression information of that current video frame; that is, the encoder may execute steps 301 to 302 and step 303 alternately.
  • the encoder may also generate a preset number of pieces of indication information corresponding to a preset number of current video frames after generating the target compression information of the preset number of current video frames through steps 301 to 302.
  • the preset number is an integer greater than 1. As an example, it may be 3, 4, 5, 6, or other values, which are not limited here.
  • the encoder can also generate multiple target compression information corresponding to the entire current video sequence through steps 301 and 302, and then generate multiple indication information corresponding to the entire current video sequence through step 303.
  • the implementation method is not limited here.
  • the encoder sends target compression information of the current video frame.
  • the encoder may send target compression information of at least one current video frame in the current video sequence to the decoder based on the constraints of a file transfer protocol (FTP).
  • in one implementation, the encoder can directly send the at least one piece of target compression information to the decoder; in another implementation, the encoder can also send the at least one piece of target compression information to an intermediate device such as a server or a management center, and the intermediate device sends it to the decoder.
  • if the encoder generates the first prediction feature of the current video frame according to the foregoing method, then while sending the first compression information of the current video frame to the decoder, the encoder may send one or two of the first inter-frame side information, the second inter-frame side information, the first intra-frame side information and the second intra-frame side information corresponding to the current video frame; correspondingly, the decoder can receive one or two of the first inter-frame side information, the second inter-frame side information, the first intra-frame side information and the second intra-frame side information corresponding to the current video frame. The specific type of information to be sent is determined in combination with which type of information is required in the process of decompressing the first compression information of the current video frame.
  • the encoder sends indication information corresponding to the target compression information of the current video frame.
  • step 305 is an optional step: if step 303 is not performed, step 305 is not performed, and if step 303 is performed, step 305 is performed. If step 305 is performed, step 305 and step 304 may be performed simultaneously; that is, the encoder sends to the decoder, based on the constraints of the FTP protocol (file transfer protocol), the target compression information of at least one current video frame in the current video sequence and at least one piece of indication information in one-to-one correspondence with the target compression information of the at least one current video frame. Alternatively, step 304 and step 305 may be executed separately; the embodiment of the present application does not limit the execution order of step 304 and step 305.
  • the decoder can obtain the multiple pieces of indication information corresponding to the multiple pieces of target compression information, so that the decoder knows which of the first neural network and the second neural network should be used to perform the decompression operation for each video frame in the current video sequence; this helps reduce the time the decoder spends decoding the compressed information, that is, it improves the overall efficiency of video frame transmission between the encoder and the decoder.
  • when the compression information is obtained through the first neural network, the compression information carries the compression information of the first feature of the current video frame, and the reference frame of the current video frame is used only in the compression process of the first feature of the current video frame, not in the generation process of the first feature. Consequently, after the decoder performs the decompression operation according to the first compression information to obtain the first feature of the current video frame, it can obtain the reconstructed frame of the current video frame without using the reference frame of the current video frame; thus, when the compression information is obtained through the first neural network, the quality of the reconstructed frame of the current video frame does not depend on the quality of the reconstructed frame of the reference frame, and the accumulation of errors between frames is avoided, improving the quality of the reconstructed frame of the video frame. In addition, since the second feature of the current video frame is generated according to the reference frame of the current video frame, the data volume corresponding to the second compression information of the second feature is smaller than the data volume corresponding to the first compression information of the first feature. The encoder can therefore use the first neural network and the second neural network to process different video frames in the current video sequence, combining the advantages of the first neural network and the second neural network to improve the quality of the reconstructed frames while reducing the amount of data to be transmitted as much as possible.
  • In another embodiment, the target compression information is determined after the current video frame has been compressed and encoded: the encoder first compresses and encodes the current video frame through a plurality of different neural networks, and then determines the target compression information corresponding to the current video frame.
  • FIG. 6 is a schematic diagram of another principle of a video frame compression method provided by an embodiment of the present application.
  • As shown in FIG. 6, the encoder compresses and encodes the current video frame through the first neural network to obtain the first compression information of the first feature of the current video frame (that is, r_p in FIG. 6), and generates a reconstructed frame of the current video frame according to the first compression information (that is, d_p in FIG. 6). The current video frame is also compressed and encoded by the second neural network to obtain the second compression information of the second feature of the current video frame (that is, r_r in FIG. 6), and a reconstructed frame of the current video frame is generated according to the second compression information (that is, d_r in FIG. 6). The encoder then determines the target compression information corresponding to the current video frame from the first compression information and the second compression information according to r_p, d_p, r_r, d_r and the network selection strategy. It should be understood that the example in FIG. 6 is only intended to facilitate understanding of this solution and is not used to limit it.
  • FIG. 7a is another schematic flowchart of the video frame compression method provided by the embodiment of the present application.
  • the video frame compression method provided by the embodiment of the present application may include:
  • the encoder compresses and encodes the current video frame through the first neural network to obtain the first compression information of the first feature of the current video frame; the reference frame of the current video frame is used in the compression process of the first feature of the current video frame.
  • the encoder may compress and encode the current video frame through the first neural network in the multiple neural networks, so as to obtain the first compression information of the first feature of the current video frame.
  • the meaning of the first feature of the current video frame, the meaning of the first compression information of the first feature of the current video frame, and the specific implementation of step 701 can all be found in the description of the corresponding embodiment of FIG. 3 and are not repeated here.
  • the encoder generates a first video frame through the first neural network, where the first video frame is a reconstructed frame of the current video frame.
  • the encoder may further perform decompression processing through the first neural network to generate the first video frame, where the first video frame is a reconstructed frame of the current video frame.
  • the first compression information includes the compression information of the first feature of the current video frame; the reference frame of the current video frame is used in the decompression process of the first compression information to obtain the first feature of the current video frame, and the first feature is used in the generation process of the reconstructed frame of the current video frame. That is, after decompressing the first compression information, the encoder can obtain the reconstructed frame of the current video frame without using the reference frame of the current video frame.
  • the first neural network may also include an entropy decoding layer and a decoding (Decoding) network, where the decompression process of the first compression information of the current video frame is performed by the entropy decoding layer using the reference frame of the current video frame, and the reconstructed frame of the current video frame is generated from the first feature of the current video frame through the decoding network.
  • the encoder can predict the features of the current video frame according to the reconstructed frames of the N reference frames of the current video frame through the entropy decoding layer so as to obtain the first prediction feature of the current video frame, and can generate the probability distribution of the first feature of the current video frame according to the first prediction feature of the current video frame through the entropy decoding layer.
  • the encoder performs entropy decoding on the first compressed information of the current video frame according to the probability distribution of the first feature of the current video frame through the entropy decoding layer, and obtains the first feature of the current video frame.
  • the encoder then performs inverse transform decoding on the first feature of the current video frame through the first decoding network to obtain a reconstructed frame of the current video frame.
  • the first decoding network corresponds to the first encoding network, and the first decoding network can also be expressed as a multi-layer convolutional network.
  • the specific implementation in which the encoder generates the first prediction feature of the current video frame according to the reconstructed frames of the N reference frames of the current video frame is similar to that in which the encoder generates the first prediction feature according to the N reference frames themselves; likewise, the specific implementation in which the encoder generates the probability distribution of the first feature of the current video frame according to the first prediction feature is similar to the one described above. For the specific implementation of the foregoing steps, reference may be made to the description of step 302 in the corresponding embodiment of FIG. 3, which is not repeated here.
  • the encoder compresses and encodes the current video frame through the second neural network to obtain the second compression information of the second feature of the current video frame; the reference frame of the current video frame is used in the generation process of the second feature of the current video frame.
  • the encoder may compress and encode the current video frame through the second neural network in the plurality of neural networks, so as to obtain the second compression information of the second feature of the current video frame.
  • the meaning of the second feature of the current video frame, the meaning of the second compression information of the second feature of the current video frame, and the specific implementation of step 703 can all be found in the description of the corresponding embodiment of FIG. 3 and are not repeated here.
  • the encoder generates a second video frame through the second neural network, where the second video frame is a reconstructed frame of the current video frame.
  • the encoder may further perform decompression processing through the second neural network to generate the second video frame, where the second video frame is a reconstructed frame of the current video frame.
  • the second neural network may also include an entropy decoding layer and a convolutional network: entropy decoding is performed on the second compression information through the entropy decoding layer, and the generation process of the reconstructed frame of the current video frame is performed through the convolutional network using the reference frame of the current video frame and the second feature of the current video frame.
  • specifically, the encoder can perform entropy decoding on the second compression information through the entropy decoding layer to obtain the second feature of the current video frame, that is, to obtain the optical flow of the original current video frame relative to the reference frame of the current video frame; optionally, the second feature of the current video frame further includes the residual between the original current video frame and the predicted current video frame. The encoder predicts the current video frame according to the optical flow of the original current video frame relative to the reference frame of the current video frame and the reference frame of the current video frame to obtain the predicted current video frame, and then generates the second video frame (that is, a reconstructed frame of the current video frame) from the predicted current video frame and the residual between the original current video frame and the predicted current video frame.
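  • A minimal decoder-side sketch of this step (warp stands for any motion-compensation function, for example the grid_sample-based warping sketched earlier):

```python
def reconstruct_from_second_feature(reference, flow, residual, warp):
    """Warp the reference frame with the decoded optical flow to obtain
    the predicted current video frame, then add the decoded residual to
    obtain the second video frame (the reconstructed current frame)."""
    predicted = warp(reference, flow)
    return predicted + residual
```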
  • the encoder determines the target compression information corresponding to the current video frame according to the first compression information, the first video frame, the second compression information and the second video frame, where either the determined target compression information is obtained through the first neural network and is the first compression information, or the determined target compression information is obtained through the second neural network and is the second compression information.
  • the encoder may calculate the first score value corresponding to the first compression information (that is, the first score value corresponding to the first neural network) according to the first compression information and the first video frame, and calculate the second score value corresponding to the second compression information (that is, the second score value corresponding to the second neural network) according to the second compression information and the second video frame; the encoder then determines the target compression information corresponding to the current video frame according to the first score value and the second score value.
  • if the first score value is smaller, the target compression information is the first compression information obtained through the first neural network and the target neural network is the first neural network; if the second score value is smaller, the target compression information is the second compression information obtained through the second neural network and the target neural network is the second neural network.
  • the first score value is used to reflect the performance of performing the compression operation on the current video frame by using the first neural network
  • the second score value is used to reflect the performance of the compression operation performed on the current video frame by using the second neural network.
  • The calculation process of the first score value is described first: after obtaining the first compression information and the first video frame, the encoder can obtain the data amount of the first compression information, calculate the first compression rate of the first compression information relative to the current video frame, and calculate the image quality of the first video frame; the first score value is then generated according to the first compression rate of the first compression information relative to the current video frame and the image quality of the first video frame.
  • the larger the data volume of the first compression information, the larger the value of the first score value; the smaller the data volume of the first compression information, the smaller the value of the first score value.
  • the first compression ratio of the first compression information relative to the current video frame may refer to a ratio between the data amount of the first compression information and the data amount of the current video frame.
  • the encoder can calculate the structural similarity (structural similarity index, SSIM) between the current video frame and the first video frame to indicate the image quality of the first video frame. It should be noted that the encoder can also measure the image quality of the first video frame by other indicators; as an example, the “structural similarity” indicator can be replaced by the multiscale structural similarity index (MS-SSIM), the peak signal to noise ratio (PSNR) or other indicators, which are not exhaustively listed here.
  • in one implementation, the encoder may perform a weighted summation of the first compression rate and the image quality of the first video frame to generate the first score value corresponding to the first neural network. It should be noted that, after obtaining the first compression rate and the image quality of the first video frame, the encoder may also obtain the first score value in other ways, for example by multiplying the first compression rate and the image quality; how the first score value is obtained from the first compression rate and the image quality of the first video frame can be flexibly determined in combination with the actual application scenario and is not exhaustively listed here.
  • correspondingly, the encoder can calculate the data amount of the second compression information and the image quality of the second video frame, and then generate the second score value according to the data amount of the second compression information and the image quality of the second video frame; the generation method of the second score value is similar to that of the first score value and can be referred to the above description, which is not repeated here. As for the process of determining the target compression information corresponding to the current video frame according to the first score value and the second score value: after calculating the first score value corresponding to the first compression information and the second score value corresponding to the second compression information, the encoder selects the target score value with the smaller value from the first score value and the second score value, and determines the compression information corresponding to the target score value as the target compression information. The encoder performs the foregoing operations on each video frame in the video sequence to obtain the target compression information corresponding to each video frame.
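  • As a sketch of the score computation and selection just described (the equal weighting, the use of SSIM as the quality indicator and the byte counts are illustrative assumptions, not values from this application):

```python
from skimage.metrics import structural_similarity as ssim

def score_value(info_bytes, frame_bytes, current_frame, reconstructed_frame,
                weight=0.5):
    """Weighted sum of the compression rate and a distortion term, so a
    smaller score means both a smaller bitstream and a better image."""
    compression_rate = info_bytes / frame_bytes
    distortion = 1.0 - ssim(current_frame, reconstructed_frame, data_range=255)
    return weight * compression_rate + (1.0 - weight) * distortion

def choose_target(first_score, second_score):
    """Keep the compression information whose score value is smaller."""
    return "first" if first_score <= second_score else "second"
```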
  • FIG. 7b is a schematic diagram of the first score value and the second score value in the video frame compression method provided by the embodiment of the present application.
  • the abscissa of Fig. 7b represents the position information of a video frame in the current video sequence
  • the ordinate of Fig. 7b represents the score value corresponding to each video frame
  • A1 represents the polyline corresponding to the first score values obtained in the process of compressing multiple video frames in the current video sequence, and A2 represents the polyline corresponding to the second score values in the same process. A3 indicates the first score value and the second score value obtained when the first neural network and the second neural network are respectively used to compress video frame 1. It can be seen from FIG. 7b that the score value obtained by using the first neural network to compress video frame 1 is lower, so the encoder uses the first neural network to process video frame 1; after the first neural network is used to process video frame 1, the first score value and the second score value corresponding to video frame 2 (that is, the next video frame after video frame 1 in the current video sequence) both drop significantly. That is, every time a video frame is compressed by the first neural network, a new cycle is triggered.
  • within a cycle, the value of the first score value increases linearly and the value of the second score value also increases linearly, and the growth rate of the second score value is higher than the growth rate of the first score value.
  • accordingly, the multiple first score values in one cycle can be fitted into the following formula:

    first score value(t) = l_pi + k_pi × t (formula 1)

  • where l_pi represents the starting point of the straight line corresponding to the multiple first score values in one cycle, that is, the offset of the first fitting formula corresponding to the multiple first score values; k_pi represents the slope of that straight line, that is, the coefficient of the first fitting formula; and t represents the number of video frames in the interval between the current video frame and the first video frame in the cycle. As an example, the value of t corresponding to the second video frame in a period is 1.
  • similarly, the multiple second score values in one cycle can be fitted into the following formula:

    second score value(t) = l_pr + k_pr × t (formula 2)

  • where l_pr represents the starting point of the straight line corresponding to the multiple second score values in one cycle, that is, the offset of the second fitting formula corresponding to the multiple second score values; and k_pr represents the slope of that straight line, that is, the coefficient of the second fitting formula.
  • the total score value corresponding to one cycle can be fitted into the following formula:

    loss = l_pr + (l_pr + k_pr) + … + (l_pr + (T-2) × k_pr) + (l_pi + (T-1) × k_pi) (formula 3)

  • where loss represents the sum of all score values in one cycle and T represents the total number of video frames in one cycle. Since the first T-1 video frames in a cycle are compressed by the second neural network and the last video frame is compressed by the first neural network, l_pr + (l_pr + k_pr) + … + (l_pr + (T-2) × k_pr) represents the sum of the second score values corresponding to all video frames compressed by the second neural network in one cycle, and l_pi + (T-1) × k_pi represents the first score value corresponding to the last video frame in each cycle.
  • the encoder can use one cycle as the calculation unit, and the goal is to minimize the average value of the total score values in each cycle.
  • This is shown in the form of a formula:

    min_T (loss / T) (formula 4)

  • where the meanings of T and loss can be referred to the above description of formula (3) and are not repeated here. Substituting formula (3) into formula (4), the following formula can be obtained:

    min_T [ (T-1) × l_pr + ((T-1) × (T-2) / 2) × k_pr + l_pi + (T-1) × k_pi ] / T (formula 5)
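  • The optimal cycle length under formulas (1) to (5) can be found numerically; a minimal sketch, assuming an endpoint-based line fit and a search cap t_max (both assumptions for illustration):

```python
def fit_line(scores):
    """Fit score(t) = l + k * t to samples taken at t = 0, 1, ...;
    exact for two samples (l = scores[0], k = scores[1] - scores[0]),
    a crude endpoint fit for more (least squares could be used instead)."""
    l = scores[0]
    k = (scores[-1] - scores[0]) / (len(scores) - 1) if len(scores) > 1 else 0.0
    return l, k

def best_cycle_length(l_pi, k_pi, l_pr, k_pr, t_max=100):
    """Evaluate formula (5) for T = 1..t_max and return the cycle length
    T that minimizes the average total score value loss / T of formula (4)."""
    def avg(T):
        loss = ((T - 1) * l_pr + (T - 1) * (T - 2) / 2 * k_pr
                + l_pi + (T - 1) * k_pi)
        return loss / T
    return min(range(1, t_max + 1), key=avg)
```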
  • specifically, the encoder first obtains the two first score values corresponding to the first two current video frames in a cycle, and obtains the two second score values corresponding to the first two current video frames in a cycle; for the generation method of the first score value and the second score value corresponding to one current video frame, reference may be made to the above description, which is not repeated here.
  • the encoder generates, according to the two first score values corresponding to the first two current video frames in a cycle, the values of the coefficient and offset of the first fitting formula corresponding to the multiple first score values in a cycle, that is, the values of l_pi and k_pi; the encoder likewise generates, according to the two second score values corresponding to the first two current video frames in one cycle, the values of the coefficient and offset of the second fitting formula corresponding to the multiple second score values in one cycle, that is, the values of l_pr and k_pr.
  • after obtaining the values of the coefficient and offset of the first fitting formula and the values of the coefficient and offset of the second fitting formula, the encoder performs the process of determining the target compression information of the current video frame.
  • when t is equal to 0, the encoder determines the second compression information corresponding to the first video frame in a period as the target compression information of the current video frame (that is, the first video frame in a period); in other words, the target neural network corresponding to the first video frame in a cycle is the second neural network, and the encoder continues to process the situation when t is equal to 1.
  • when t is equal to 1, that is, after the encoder obtains the two first score values corresponding to the first two current video frames in a cycle and the two second score values corresponding to the first two current video frames in a cycle, the value of T can be calculated based on formula (5). If T < 3, the encoder determines the first compression information corresponding to the second video frame in a period as the target compression information of the current video frame (that is, the second video frame in a period); that is, the target neural network corresponding to the second video frame in one cycle is the first neural network, and the encoder is triggered to enter the next cycle. If T ≥ 3, the encoder determines the second compression information corresponding to the second video frame in a period as the target compression information of the current video frame; that is, the target neural network corresponding to the second video frame in a cycle is the second neural network, and the encoder continues to process the situation when t is equal to 2.
  • when t is equal to 2, the encoder obtains the first score value and the second score value corresponding to the third video frame in a period (that is, an example of the current video frame); the method of generating the first score value and the second score value corresponding to one current video frame is not repeated here.
  • the encoder recalculates the values of the coefficient and offset of the first fitting formula (that is, the recalculated values of l_pi and k_pi) according to the three first score values corresponding to the first three video frames in a cycle, and recalculates the values of the coefficient and offset of the second fitting formula (that is, the recalculated values of l_pr and k_pr) according to the three second score values; the value of T is then recalculated according to the recalculated coefficients and offsets of the first fitting formula and the second fitting formula.
  • if the recalculated value of T is less than 4, the encoder may determine the first compression information corresponding to the third video frame in a period as the target compression information of the current video frame (that is, the third video frame in a period); that is, the target neural network corresponding to the third video frame in a cycle is the first neural network, and the encoder is triggered to enter the next cycle.
  • if the recalculated value of T is greater than or equal to 4, the encoder may determine the second compression information corresponding to the third video frame in a period as the target compression information of the current video frame (that is, the third video frame in a period); that is, the target neural network corresponding to the third video frame in a cycle is the second neural network, and the encoder continues to process the situation when t is equal to 3.
  • when t is equal to 3 or more, the processing method of the encoder is similar to the processing method when t is equal to 2 and is not repeated here.
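  • The walkthrough above can be condensed into the following loop (a sketch reusing fit_line and best_cycle_length from the previous example; get_scores(t) is an assumed callback that compresses frame t with both networks and returns its first and second score values):

```python
def run_cycle(get_scores, t_max=100):
    """One cycle of the walkthrough above: frame t = 0 is compressed by
    the second neural network; from t = 1 on, both lines are refitted to
    the scores seen so far, T is recomputed from formula (5), and the
    cycle is closed with the first neural network once T < t + 2
    (T < 3 at t = 1, T < 4 at t = 2, and so on)."""
    choices, first_seen, second_seen = [], [], []
    for t in range(t_max + 1):
        s_first, s_second = get_scores(t)   # score both networks for frame t
        first_seen.append(s_first)
        second_seen.append(s_second)
        if t > 0:
            l_pi, k_pi = fit_line(first_seen)
            l_pr, k_pr = fit_line(second_seen)
            if best_cycle_length(l_pi, k_pi, l_pr, k_pr, t_max) < t + 2:
                choices.append("first")     # last video frame of the cycle
                break
        choices.append("second")
    return choices
```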
  • in another implementation, when t is equal to 0, the encoder determines the second compression information corresponding to the first video frame in a period as the target compression information of the current video frame (that is, the first video frame in a period); in other words, the target neural network corresponding to the first video frame in a cycle is the second neural network, and the encoder continues to process the situation when t is equal to 1.
  • when t is equal to 1, the encoder can obtain the two first score values corresponding to the first two current video frames in a cycle and the two second score values corresponding to the first two current video frames in a cycle, and generate the values of the coefficient and offset of the first fitting formula (that is, the values of l_pi and k_pi) and the values of the coefficient and offset of the second fitting formula (that is, the values of l_pr and k_pr).
  • the encoder then calculates a first average value and a second average value from the fitting formulas (in the same way as the updated averages described below for the case where t is equal to 2). If the first average value is less than or equal to the second average value, the encoder determines the first compression information corresponding to the second video frame in the period as the target compression information of the current video frame; that is, the target neural network corresponding to the second video frame in the period is the first neural network, and the encoder is triggered to enter a new cycle. If the first average value is greater than the second average value, the encoder determines the second compression information corresponding to the second video frame in the period as the target compression information of the current video frame; that is, the target neural network corresponding to the second video frame in the period is the second neural network, and the encoder continues to process the case where t is equal to 2.
  • when t is equal to 2, the encoder can obtain the first score value and the second score value corresponding to the third video frame in a cycle; how to generate the first score value and the second score value corresponding to one current video frame is not described again here.
  • the encoder recalculates the values of the coefficient and offset of the first fitting formula (that is, the recalculated values of l_pi and k_pi) according to the three first score values corresponding to the first three video frames in a cycle, and then calculates an updated first average value and an updated second average value. The updated first average value is the average value of the total score value of the whole cycle obtained if the third video frame in the cycle (that is, an example of the current video frame) is compressed by the first neural network; the updated second average value is the average value of the total score value of the whole cycle obtained if the second neural network is used to compress the third video frame in the period and the first neural network is used to compress the fourth video frame in the period.
  • if the updated first average value is less than or equal to the updated second average value, the encoder determines the first compression information corresponding to the third video frame in the period as the target compression information of the current video frame; that is, the target neural network corresponding to the third video frame in the period is the first neural network, and the encoder is triggered to enter a new cycle. If the updated first average value is greater than the updated second average value, the encoder determines the second compression information corresponding to the third video frame in the period as the target compression information of the current video frame; that is, the target neural network corresponding to the third video frame in the period is the second neural network, and the encoder continues to process the case where t is equal to 3.
  • when t is equal to 3 or more, the processing method of the encoder is similar to the processing method when t is equal to 2 and is not repeated here.
  • in the embodiment of the present application, the technician found during research the change rule of the first score value and the second score value within a single cycle, and takes the minimization of the average value of the total score value in a cycle as the optimization goal; that is, when determining the target compression information corresponding to each current video frame, not only the score value of the current video frame but also the average value of the score values over the whole cycle is considered, so as to further reduce the score values corresponding to all video frames in the entire current video sequence and further improve the performance of the compression information corresponding to the entire current video sequence. In addition, two different implementation modes are provided, which improves the implementation flexibility of this solution.
  • in another implementation, the encoder also uses one cycle as the calculation unit, with the goal of minimizing the average value of the total score values in each cycle.
  • when t is equal to 0 and when t is equal to 1, the processing method of the encoder is similar to that in the foregoing implementation and is not repeated here.
  • when t is equal to 2, the encoder may determine the second compression information corresponding to the third video frame in a period as the target compression information of the current video frame (that is, the third video frame in a period); that is, the target neural network corresponding to the third video frame in a cycle is the second neural network, and the encoder continues to process the situation when t is equal to 3.
  • when t is equal to 3 or more, the processing method of the encoder is similar to the processing method when t is equal to 2 and is not repeated here.
  • FIG. 7c is a schematic diagram of calculating the values of the coefficient and offset of the first fitting formula and of the coefficient and offset of the second fitting formula in the video frame compression method provided by the embodiment of the present application.
  • in FIG. 7c, the region between the two vertical dashed lines represents the processing of the video frames in one cycle: one cycle includes compressing and encoding multiple video frames through the second neural network and compressing and encoding the last video frame in the cycle through the first neural network.
  • the encoder first obtains the two first score values corresponding to the first two current video frames in a cycle (that is, the first video frame and the second video frame) and the two second score values corresponding to the first two video frames in a cycle, and then generates the values of the coefficient and offset of the first fitting formula (that is, the values of l_pi and k_pi) and the values of the coefficient and offset of the second fitting formula (that is, the values of l_pr and k_pr).
  • the coefficients and biases of the first fitting formula are calculated and obtained only according to the two first score values and the two second score values corresponding to the first two video frames in a cycle.
  • the value of the offset, the coefficient of the second fitting formula and the value of the offset and then take the lowest average value of the total score value in the entire cycle as the optimization goal, and obtain the optimal number of video frames in the current cycle.
  • the values of the coefficients and offsets of the first fitting formula and the values of the coefficients and offsets of the second fitting formula save the calculation time of the parameters of the first fitting formula and the second fitting formula, so that Improves the efficiency of generating compression information for the current video sequence.
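• As an illustration of the calculation just described, the sketch below fits both formulas from the first two frames' scores and then searches for the cycle length that minimizes the average total score. The linear form score(t) = k·t + l is an assumed reading of the fitting formulas with coefficients k_pi, k_pr and offsets l_pi, l_pr; the patent does not fix the exact functional form in this passage:

    def fit_line(s0, s1):
        # Two points (t = 0, s0) and (t = 1, s1) determine slope k and offset l.
        return s1 - s0, s0

    def best_cycle_length(first_scores, second_scores, max_len=32):
        # first_scores / second_scores: the two score values of the first two
        # frames under the first and second neural networks respectively.
        k_pi, l_pi = fit_line(*first_scores)
        k_pr, l_pr = fit_line(*second_scores)
        best_len, best_avg = 1, float("inf")
        for T in range(1, max_len + 1):
            # frames 0 .. T-2 use the second network, frame T-1 the first
            total = sum(k_pr * t + l_pr for t in range(T - 1))
            total += k_pi * (T - 1) + l_pi
            avg = total / T
            if avg < best_avg:
                best_len, best_avg = T, avg
        return best_len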
  • the encoder also uses one cycle as a calculation unit, and the goal is to minimize the average value of the total score values in each cycle.
• For the cases where t is equal to 0 and 1, the processing method of the encoder is similar to that described above, and is not repeated here.
• After obtaining the first score values corresponding to the three video frames (that is, examples of the current video frame), only the values of the coefficients and offset of the second fitting formula are recalculated, and the values of the coefficients and offset of the first fitting formula are not recalculated.
• The encoder may determine the first compression information corresponding to the third video frame in a cycle as the target compression information of the current video frame (that is, the third video frame in a cycle); in other words, the target neural network corresponding to the third video frame in a cycle is the first neural network, and the encoder is triggered to enter the next cycle.
• Alternatively, the encoder may determine the second compression information corresponding to the third video frame in a cycle as the target compression information of the current video frame (that is, the third video frame in a cycle); in other words, the target neural network corresponding to the third video frame in a cycle is the second neural network, and the encoder continues to process the case where t is equal to 3.
  • the processing method of the encoder is similar to the processing method when t is equal to 2, which is not repeated here.
• In this way, the compression information that finally needs to be sent is selected: the network selection strategy determines the target neural network from the first neural network and the second neural network, and the target neural network is then used to generate the target compression information, which can improve the performance of the compression information corresponding to the entire current video sequence as much as possible.
• The encoder generates indication information corresponding to the target compression information, where the indication information is used to indicate which of the first neural network and the second neural network is the target neural network through which the target compression information is obtained.
  • the encoder sends target compression information of the current video frame.
  • the encoder sends indication information corresponding to the target compression information of the current video frame.
  • steps 706 and 708 are mandatory steps.
• For the specific implementation of steps 706 to 708, reference may be made to the description of steps 303 to 305 in the embodiment corresponding to FIG. 3, which is not repeated here. It should be noted that this embodiment of the present application does not limit the execution order of steps 707 and 708: steps 707 and 708 may be executed simultaneously, step 707 may be executed first and then step 708, or step 708 may be executed first and then step 707.
  • the final compression information is selected from the first compression information and the second compression information.
  • the performance of the compressed information corresponding to the entire current video sequence can be improved as much as possible.
  • FIG. 8 is another schematic flowchart of a video frame compression method provided by an embodiment of the present application.
  • the video frame compression method provided by the embodiment of the present application may include:
• The encoder compresses and encodes the third video frame through the first neural network to obtain first compression information corresponding to the third video frame, where the first compression information includes compression information of the first feature of the third video frame, and the reference frame of the third video frame is used in the compression process of the first feature of the third video frame.
• When the encoder processes the third video frame in the current video sequence, the encoder determines that the target compression information of the third video frame is the first compression information corresponding to the third video frame generated by the first neural network.
  • the third video frame is a video frame in the current video sequence, and the concept of the third video frame is similar to that of the current video frame.
• The encoder compresses and encodes the fourth video frame through the second neural network to obtain second compression information corresponding to the fourth video frame, where the second compression information includes compression information of the second feature of the fourth video frame, the reference frame of the fourth video frame is used in the process of generating the second feature of the fourth video frame, and the third video frame and the fourth video frame are different video frames in the same video sequence.
• When the encoder processes the fourth video frame in the current video sequence, the encoder determines that the target compression information of the fourth video frame is the second compression information corresponding to the fourth video frame generated by the second neural network.
  • the fourth video frame is a video frame in the current video sequence, the concept of the fourth video frame is similar to that of the current video frame, and the third video frame and the fourth video frame are different video frames in the same current video sequence.
• For the meaning of the second feature of the fourth video frame, refer to the description of the meaning of the "second feature of the current video frame" in the embodiment corresponding to FIG. 3, and likewise for the meaning of the "reference frame of the fourth video frame". For the specific implementation by which the encoder generates the second compression information corresponding to the fourth video frame, and the specific implementation by which the encoder determines the compression information of the fourth video frame that finally needs to be sent to the decoder, refer to the description in the embodiment corresponding to FIG. 3; details are not repeated here. The interleaving of the two networks over a sequence is pictured in the sketch below.
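• The following sketch shows how steps 801 and 802 can alternate over a sequence; compress_first, compress_second, reconstruct, and is_refresh_frame are hypothetical helpers standing in for the first neural network, the second neural network, the decoder-side reconstruction, and the network selection strategy:

    def encode_sequence(frames, is_refresh_frame):
        compressed = []
        reference = None
        for i, frame in enumerate(frames):
            if reference is None or is_refresh_frame(i):
                # step 801: the first neural network uses the reference only
                # inside entropy coding of the frame's own first feature
                info = compress_first(frame, reference)
            else:
                # step 802: the second neural network uses the reference to
                # generate the (smaller) second feature
                info = compress_second(frame, reference)
            compressed.append(info)
            # keep the decoder-side reconstruction as the next reference
            reference = reconstruct(info, reference)
        return compressed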
• Step 801 may be performed first and then step 802, or step 802 may be performed first and then step 801, which needs to be determined in combination with the actual application scenario and is not limited here.
  • the encoder generates indication information, where the indication information is used to indicate that the first compressed information is obtained through the first neural network and the second compressed information is obtained through the second neural network.
• The encoder may generate indication information in one-to-one correspondence with the one or more pieces of target compression information, where the target compression information is embodied as the first compression information or the second compression information.
• The encoder may first perform steps 801 and 802 multiple times, and then generate, through step 803, indication information in one-to-one correspondence with the target compression information of each video frame in the entire current video sequence.
• Alternatively, the encoder may perform step 803 once each time step 801 or step 802 is performed.
• The encoder may also perform step 803 once after performing step 801 and/or step 802 a preset number of times, where the preset number of times is an integer greater than 1, for example, 3, 4, 5, 6, or another value, which is not limited here.
• Step 803 may be a mandatory step; however, if in step 801 or 802 the encoder obtains the target compression information of the current video frame (that is, the third video frame or the fourth video frame) in the manner shown in the embodiment corresponding to FIG. 3, then step 803 is an optional step.
• For the specific implementation of step 803, reference may be made to the description of step 303 in the embodiment corresponding to FIG. 3, which is not repeated here.
  • the encoder sends target compression information corresponding to the current video frame, where the target compression information is the first compression information or the second compression information.
• After generating at least one piece of first compression information in one-to-one correspondence with at least one third video frame, and/or at least one piece of second compression information in one-to-one correspondence with at least one fourth video frame, the encoder may send at least one piece of target compression information in one-to-one correspondence with at least one current video frame (that is, the third video frame and/or the fourth video frame), where the target compression information is the first compression information and/or the second compression information.
  • FIG. 9 is a schematic diagram of a video frame compression method provided by an embodiment of the present application.
• As shown in FIG. 9, the encoder uses the first neural network to compress and encode some video frames in the current video sequence, uses the second neural network to compress and encode another part of the video frames in the current video sequence, and then sends the target compression information corresponding to all the current video frames in the same sequence, where the target compression information is the first compression information or the second compression information. It should be understood that the example in FIG. 9 is merely for ease of understanding this solution and is not intended to limit it.
  • the encoder sends indication information corresponding to the current video frame.
  • step 805 is an optional step. If step 803 is not performed, step 805 is not performed, and if step 803 is performed, step 805 is performed. If step 805 is performed, step 805 and step 804 may be performed simultaneously.
• For the specific implementation of step 805, reference may be made to the description of step 305 in the embodiment corresponding to FIG. 3 above, which is not repeated here.
• In this embodiment of the present application, the first compression information carries the compression information of the first feature of the current video frame; the reference frame is used only in the compression process of the first feature of the current video frame and not in the generation process of the first feature of the current video frame, so that when the decoder performs a decompression operation according to the first compression information to obtain the first feature of the current video frame, the reconstructed frame of the current video frame can be obtained without the reference frame of the current video frame. Therefore, when the target compression information is obtained through the first neural network, the quality of the reconstructed frame of the current video frame does not depend on the quality of the reconstructed frame of the reference frame of the current video frame, thereby avoiding the accumulation of errors between frames and improving the quality of the reconstructed frames of video frames. When the fourth video frame is compressed and encoded by the second neural network, because the second feature of the fourth video frame is generated according to the reference frame of the fourth video frame, the amount of data corresponding to the second compression information is smaller than that of the first compression information. The first neural network and the second neural network are therefore used to process different video frames in the current video sequence in order to integrate the advantages of both networks, reducing the amount of data to be transmitted as much as possible while improving the quality of the reconstructed frames of video frames.
• FIG. 10a is a schematic flowchart of a method for decompressing a video frame provided by an embodiment of the present application. The video frame decompression method provided by this embodiment of the present application may include:
  • a decoder receives target compression information corresponding to at least one current video frame.
• In this embodiment of the present application, the encoder may send, to the decoder, at least one piece of target compression information in one-to-one correspondence with at least one current video frame in the current video sequence; correspondingly, the decoder may receive the at least one piece of target compression information in one-to-one correspondence with the at least one current video frame in the current video sequence.
• In one implementation, the decoder may directly receive the target compression information corresponding to at least one current video frame from the encoder; in another implementation, the decoder may receive the target compression information corresponding to at least one current video frame from an intermediate device such as a server or a management center.
  • the decoder receives indication information corresponding to the target compression information.
• The decoder receives at least one piece of indication information in one-to-one correspondence with at least one piece of target compression information.
• For the meaning of the indication information, reference may be made to the description in the embodiment corresponding to FIG. 3, which is not repeated here.
  • step 1002 is an optional step. If step 1002 is executed, the embodiment of the present application does not limit the execution order of steps 1001 and 1002, and steps 1001 and 1002 may be executed simultaneously.
  • the decoder selects a target neural network corresponding to the current video frame from multiple neural networks, where the multiple neural networks include a third neural network and a fourth neural network.
• In this embodiment, after obtaining at least one piece of target compression information corresponding to at least one current video frame, the decoder needs to select a target neural network from multiple neural networks to perform a decompression operation and obtain the reconstructed frame of each current video frame.
  • the plurality of neural networks include a third neural network and a fourth neural network, and both the third neural network and the fourth neural network are neural networks for performing decompression operations.
• The third neural network corresponds to the first neural network; that is, if the target compression information of a current video frame is the first compression information of the current video frame obtained through the first neural network, the decoder needs to perform a decompression operation on the first compression information of the current video frame through the third neural network to obtain the reconstructed frame of the current video frame.
• The fourth neural network corresponds to the second neural network; that is, if the target compression information of a current video frame is the second compression information of the current video frame obtained through the second neural network, the decoder needs to perform a decompression operation on the second compression information of the current video frame through the fourth neural network to obtain the reconstructed frame of the current video frame.
• If step 1002 is performed, the decoder can directly determine, according to the plurality of pieces of indication information in one-to-one correspondence with the plurality of pieces of target compression information, through which of the first neural network and the second neural network each piece of target compression information was obtained, and thereby select the corresponding target neural network, as in the sketch below.
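• A minimal sketch of this dispatch, assuming the indication information is carried as a simple flag alongside each piece of target compression information (the field names "indication" and "payload" are illustrative assumptions):

    def decompress_frame(target_info, reference_frame, third_net, fourth_net):
        # target_info["indication"] says which encoder-side network produced
        # the payload, so the decoder picks the matching decode network.
        if target_info["indication"] == "first":
            # first compression information -> third neural network
            return third_net(target_info["payload"], reference_frame)
        # second compression information -> fourth neural network
        return fourth_net(target_info["payload"], reference_frame)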
  • FIG. 10b is another schematic flowchart of a method for decompressing a video frame provided by an embodiment of the present application.
• After the decoder obtains the target compression information corresponding to the current video frame and the indication information corresponding to the target compression information, it can, according to the indication information, determine the target neural network from the third neural network and the fourth neural network, and use the target neural network to decompress the target compression information corresponding to the current video frame to obtain the reconstructed frame of the current video frame. The example in FIG. 10b is merely for ease of understanding this solution and is not intended to limit it.
• In another implementation, the decoder may obtain position information, in one-to-one correspondence with each piece of target compression information, of the current video frame in the current video sequence, where the position information is used to indicate that the current video frame corresponding to each piece of target compression information is the Xth frame in the current video sequence; the decoder then selects, from the third neural network and the fourth neural network according to a preset rule, the target neural network corresponding to the position information.
• The preset rule may be to alternately select the third neural network or the fourth neural network according to a certain pattern; that is, after the decoder uses the third neural network to decompress n video frames of the current video sequence, it uses the fourth neural network to decompress the next m video frames; or, after the decoder uses the fourth neural network to decompress m video frames, it uses the third neural network to decompress the next n video frames.
  • the values of n and m may both be integers greater than or equal to 1, and the values of n and m may be the same or different.
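• The alternation rule can be written down directly; in the sketch below, the default values of n and m are illustrative (the patent only requires integers greater than or equal to 1):

    def pick_decoder_network(position, n=1, m=3):
        # Maps a frame's position in the current video sequence to the
        # decoding network under the alternating preset rule.
        cycle_pos = position % (n + m)
        return "third" if cycle_pos < n else "fourth"

    # With n=1, m=3 the pattern over positions 0..7 is:
    # third, fourth, fourth, fourth, third, fourth, fourth, fourth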
• The specific implementation in which the decoder, according to the preset rule, selects the target neural network corresponding to the position information from the multiple neural networks including the third neural network and the fourth neural network is similar to the specific implementation in which the encoder, according to the network selection strategy, selects the target neural network corresponding to the position information from the multiple neural networks including the first neural network and the second neural network. The difference is that the "first neural network" in the embodiment corresponding to FIG. 3 is replaced with the "third neural network" in this embodiment, and the "second neural network" in the embodiment corresponding to FIG. 3 is replaced with the "fourth neural network" in this embodiment; for the rest, reference may be made directly to the description in the embodiment corresponding to FIG. 3, which is not repeated here.
• The decoder performs a decompression operation through the target neural network according to the target compression information to obtain the reconstructed frame of the current video frame. If the target neural network is the third neural network, the target compression information includes the first compression information of the first feature of the current video frame, the reference frame of the current video frame is used in the decompression process of the first compression information to obtain the first feature of the current video frame, and the first feature of the current video frame is used in the generation process of the reconstructed frame of the current video frame. If the target neural network is the fourth neural network, the target compression information includes the second compression information of the second feature of the current video frame, the second compression information is used by the decoder to perform a decompression operation to obtain the second feature of the current video frame, and the reference frame of the current video frame and the second feature of the current video frame are used in the generation process of the reconstructed frame of the current video frame.
• If the target compression information includes the first compression information of the first feature of the current video frame, the third neural network includes an entropy decoding layer and a decoding network: the entropy decoding layer performs the entropy decoding process of the first compression information of the current video frame using the reference frame of the current video frame, and the decoding network generates the reconstructed frame of the current video frame using the first feature of the current video frame.
• For the specific implementation of the decoder performing step 1004, reference may be made to the description of step 702 in the embodiment corresponding to FIG. 7a. The difference is that in step 702, the encoder performs decompression processing on the first compression information corresponding to the current video frame through the first neural network to obtain the reconstructed frame of the current video frame, whereas in step 1004 the decoder performs decompression processing through the third neural network according to the first compression information corresponding to the current video frame to obtain the reconstructed frame of the current video frame.
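• The decoding-network half of the third neural network can be sketched as follows (PyTorch, with illustrative layer sizes); the entropy decoding layer is not shown, since it only uses the reference frame as context to recover the first feature, after which reconstruction needs no reference frame at all:

    import torch
    import torch.nn as nn

    class ThirdDecodeNetwork(nn.Module):
        # Sketch of the decoding network: it maps the entropy-decoded first
        # feature back to pixel space without touching the reference frame,
        # which is what prevents inter-frame error accumulation.
        def __init__(self, feat_ch=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.ConvTranspose2d(feat_ch, 64, 4, stride=2, padding=1),
                nn.ReLU(inplace=True),
                nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
            )

        def forward(self, first_feature):
            return self.net(first_feature)

    # usage: recon = ThirdDecodeNetwork()(torch.randn(1, 128, 16, 16))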
• If the target compression information includes the second compression information of the second feature of the current video frame, the fourth neural network includes an entropy decoding layer and a convolutional network: entropy decoding is performed on the second compression information through the entropy decoding layer, and the generation process of the reconstructed frame of the current video frame is performed through the convolutional network using the reference frame of the current video frame and the second feature of the current video frame.
• For the specific implementation of the decoder performing step 1004 in this case, reference may be made to the description of step 704 in the embodiment corresponding to FIG. 7a. The difference is that in step 704, the encoder performs decompression processing on the second compression information corresponding to the current video frame through the second neural network to obtain the reconstructed frame of the current video frame, whereas in step 1004 the decoder performs decompression processing through the fourth neural network according to the second compression information corresponding to the current video frame to obtain the reconstructed frame of the current video frame.
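• A companion sketch for the fourth neural network, under the same assumptions: here the entropy-decoded second feature and the reference frame are both inputs to the convolutional network, so the reconstruction quality depends on the reference frame:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FourthDecodeNetwork(nn.Module):
        # Sketch of the convolutional network: the second feature is fused
        # with the reference frame to produce the reconstructed frame.
        def __init__(self, feat_ch=64):
            super().__init__()
            self.fuse = nn.Sequential(
                nn.Conv2d(feat_ch + 3, 64, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(64, 3, 3, padding=1),
            )

        def forward(self, second_feature, reference_frame):
            # match the feature map to the reference frame's spatial size
            feat = F.interpolate(second_feature, size=reference_frame.shape[-2:])
            return self.fuse(torch.cat([feat, reference_frame], dim=1))

    # usage:
    # recon = FourthDecodeNetwork()(torch.randn(1, 64, 16, 16),
    #                               torch.randn(1, 3, 64, 64))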
• FIG. 11 is a schematic flowchart of another method for decompressing a video frame provided by an embodiment of the present application. The video frame decompression method may include:
  • the decoder receives target compression information corresponding to the current video frame, where the target compression information is the first compression information or the second compression information.
  • the decoder receives indication information corresponding to the current video frame, where the indication information is used to instruct the first compressed information to be decompressed by the third neural network and the second compressed information to be decompressed by the fourth neural network.
• For the specific implementation of steps 1101 and 1102, reference may be made to the description of steps 1001 and 1002 in the embodiment corresponding to FIG. 10a, and details are not described here.
  • the decoder decompresses the first compressed information of the third video frame through the third neural network to obtain a reconstructed frame of the third video frame.
• For the specific implementation in which the decoder selects the third neural network from the multiple neural networks to decompress the first compression information of the third video frame (that is, "selects, from the multiple neural networks, the third neural network corresponding to the first compression information of the third video frame"), reference may be made to the description of step 1003 in the embodiment corresponding to FIG. 10a, which is not repeated here.
• The third neural network includes an entropy decoding layer and a decoding network: the entropy decoding layer uses the reference frame of the current video frame to perform the entropy decoding process of the first compression information of the current video frame, and the decoding network uses the first feature of the current video frame to generate the reconstructed frame of the current video frame.
• The first compression information includes compression information of the first feature of the third video frame; the reference frame of the third video frame is used in the decompression process of the first compression information to obtain the first feature of the third video frame; the first feature of the third video frame is used in the generation process of the reconstructed frame of the third video frame; and both the reconstructed frame of the third video frame and the reference frame of the third video frame are included in the current video sequence. That is, after decompressing the first compression information, the decoder can obtain the reconstructed frame of the third video frame without using the reference frame of the third video frame.
• The meaning of "the first feature of the third video frame" can be understood with reference to the above-described meaning of "the first feature of the current video frame", and the meaning of "the reference frame of the third video frame" can be understood with reference to the above-described meaning of "the reference frame of the current video frame"; details are not repeated here.
  • the reconstructed frame of the third video frame refers to a video frame corresponding to the third video frame obtained by performing a decompression operation using the first compression information.
  • the decoder decompresses the second compression information of the fourth video frame through the fourth neural network, so as to obtain a reconstructed frame of the fourth video frame.
• For the specific implementation in which the decoder selects the fourth neural network from the multiple neural networks to decompress the second compression information of the fourth video frame (that is, "selects, from the multiple neural networks, the fourth neural network corresponding to the second compression information of the fourth video frame"), reference may be made to the description of step 1003 in the embodiment corresponding to FIG. 10a, which is not repeated here.
• The fourth neural network includes an entropy decoding layer and a convolutional network: entropy decoding is performed on the second compression information through the entropy decoding layer, and the generation process of the reconstructed frame of the current video frame is performed through the convolutional network using the reference frame of the current video frame and the second feature of the current video frame. For the specific implementation in which the decoder decompresses the second compression information of the fourth video frame through the fourth neural network, reference may be made to the description of step 704 in the embodiment corresponding to FIG. 7a; details are not repeated here.
• The second compression information includes compression information of the second feature of the fourth video frame; the second compression information is used by the decoder to perform a decompression operation to obtain the second feature of the fourth video frame; the reference frame of the fourth video frame and the second feature of the fourth video frame are used in the process of generating the reconstructed frame of the fourth video frame; and both the reconstructed frame of the fourth video frame and the reference frame of the fourth video frame are included in the current video sequence.
• The meaning of "the second feature of the fourth video frame" can be understood with reference to the above-described meaning of "the second feature of the current video frame", and the meaning of "the reference frame of the fourth video frame" can be understood with reference to the above-described meaning of "the reference frame of the current video frame"; details are not repeated here.
  • the reconstructed frame of the fourth video frame refers to a video frame corresponding to the fourth video frame obtained by performing a decompression operation using the second compression information.
  • FIG. 12 is a schematic flowchart of a training method for a video frame compression and decompression system provided by an embodiment of the present application.
  • the video frame compression and decompression system training method provided by the embodiment of the present application may include:
  • the training device compresses and encodes the first training video frame by using the first neural network, so as to obtain first compression information corresponding to the first training video frame.
  • a training data set is pre-stored in the training device, and the training data set includes a plurality of first training video frames.
• For the specific implementation of step 1201, reference may be made to the description of step 801 in the embodiment corresponding to FIG. 8, which is not repeated here.
• The difference is that in step 1201, the training device does not need to perform the step of selecting the target neural network from the first neural network and the second neural network; in other words, in step 1201, the training device does not need to perform the step of selecting the target compression information from the first compression information and the second compression information.
  • the training device decompresses the first compression information of the first training video frame by using a third neural network to obtain a first training reconstruction frame.
• For the specific implementation of the training device performing step 1202, reference may be made to the description of step 1103 in the embodiment corresponding to FIG. 11, which is not repeated here. The difference is that, first, the "third video frame" in step 1103 is replaced with the "first training video frame" in this embodiment; second, in step 1202, the training device does not need to perform the step of selecting the target neural network from the third neural network and the fourth neural network.
  • the training device trains the first neural network and the third neural network according to the first training video frame, the first training reconstruction frame, the first compression information, and the first loss function until the preset conditions are met.
• Specifically, the training device may, according to the first training video frame, the first training reconstruction frame, and the first compression information corresponding to the first training video frame, use the first loss function to iteratively train the first neural network and the third neural network until the convergence condition of the first loss function is satisfied.
• The first loss function includes a loss term for the similarity between the first training video frame and the first training reconstruction frame and a loss term for the data size of the first compression information of the first training video frame, where the first training reconstruction frame is the reconstructed frame of the first training video frame.
• The training objective of the first loss function includes increasing the similarity between the first training video frame and the first training reconstruction frame, and further includes reducing the data size of the first compression information of the first training video frame.
• The first neural network refers to a neural network used in the process of compressing and encoding video frames; the third neural network refers to a neural network that performs decompression operations based on compressed information.
• Specifically, the training device can calculate the function value of the first loss function according to the first training video frame, the first training reconstruction frame, and the first compression information corresponding to the first training video frame, generate a gradient value according to the function value of the first loss function, and then update the weight parameters of the first neural network and the third neural network through back-propagation, so as to complete one iteration of training of the first neural network and the third neural network.
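• The first loss function can be sketched as a standard rate-distortion objective. Using MSE as the similarity term, a mean estimated bit count as the data-size term, and the trade-off weight lambda_rd are all assumptions here, since this passage does not fix those details:

    import torch
    import torch.nn.functional as F

    def first_loss(training_frame, training_reconstruction, estimated_bits,
                   lambda_rd=0.01):
        # similarity loss term: penalizes differences between the first
        # training video frame and the first training reconstruction frame
        distortion = F.mse_loss(training_reconstruction, training_frame)
        # data-size loss term: penalizes large first compression information
        rate = estimated_bits.mean()
        return distortion + lambda_rd * rate

    # one training iteration (optimizer over the first and third networks):
    # loss = first_loss(frame, recon, bits)
    # loss.backward(); optimizer.step(); optimizer.zero_grad()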
• The training device compresses and encodes the second training video frame through the second neural network according to the reference frame of the second training video frame, to obtain second compression information corresponding to the second training video frame, where the reference frame of the second training video frame is a video frame processed by the trained first neural network.
• For the specific implementation of the training device performing step 1204, reference may be made to the description of step 802 in the embodiment corresponding to FIG. 8, which is not repeated here.
• The difference is that, first, the "fourth video frame" in step 802 is replaced with the "second training video frame" in this embodiment; second, in step 1204, the training device does not need to perform the step of selecting the target neural network from the first neural network and the second neural network; in other words, in step 1204, the training device does not need to perform the step of selecting the target compression information from the first compression information and the second compression information.
• The reference frame of the second training video frame may be an original video frame in the training data set, or may be a video frame processed by the mature first neural network (that is, the first neural network on which the training operation has been performed).
• In one implementation, the training device can input the original reference frame of the second training video frame into the first encoding network of the mature first neural network (that is, the first neural network on which the training operation has been performed) to perform an encoding operation and obtain an encoding result, and then input the aforementioned encoding result into the first decoding network of the mature third neural network (that is, the third neural network on which the training operation has been performed) to perform a decoding operation on the encoding result and obtain the processed reference frame of the second training video frame. Further, the training device inputs the processed reference frame of the second training video frame and the second training video frame into the second neural network, so as to generate the second compression information corresponding to the second training video frame through the second neural network.
• In another implementation, the training device may input the original reference frame of the second training video frame into the mature first neural network, so as to generate, through the mature first neural network, the first compression information corresponding to the original reference frame of the second training video frame, and use the mature third neural network to perform a decompression operation according to that first compression information to obtain the processed reference frame of the second training video frame. The training device then inputs the processed reference frame of the second training video frame and the second training video frame into the second neural network, so as to generate the second compression information corresponding to the second training video frame through the second neural network.
• Since, in the execution stage, the reference frame used by the second neural network may have been processed by the first neural network, using a reference frame processed by the first neural network to perform the training operation on the second neural network helps maintain consistency between the training stage and the execution stage, so as to improve the accuracy of the execution stage.
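• The reference-frame preparation of step 1204 can be sketched as follows, where trained_first_network and trained_third_network stand for the frozen, already-trained networks, and the compress/decompress method names are placeholders rather than a fixed API:

    import torch

    with torch.no_grad():  # the mature networks are not updated here
        info = trained_first_network.compress(raw_reference_frame)
        processed_reference = trained_third_network.decompress(info)

    # the second network is then trained on the processed reference, matching
    # what it will actually see in the execution (inference) stage
    second_info = second_network.compress(training_frame, processed_reference)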
  • the training device decompresses the second compression information of the second training video frame through the fourth neural network to obtain the second training reconstruction frame.
• For the specific implementation of the training device performing step 1205, reference may be made to the description of step 1104 in the embodiment corresponding to FIG. 11, which is not repeated here. The difference is that, first, the "fourth video frame" in step 1104 is replaced with the "second training video frame" in this embodiment; second, in step 1205, the training device does not need to perform the step of selecting the target neural network from the third neural network and the fourth neural network.
  • the training device trains the second neural network and the fourth neural network according to the second training video frame, the second training reconstruction frame, the second compression information, and the second loss function until the preset conditions are met.
• Specifically, the training device may, according to the second training video frame, the second training reconstruction frame, and the second compression information corresponding to the second training video frame, use the second loss function to iteratively train the second neural network and the fourth neural network until the convergence condition of the second loss function is satisfied.
• The second loss function includes a loss term for the similarity between the second training video frame and the second training reconstruction frame and a loss term for the data size of the second compression information of the second training video frame, where the second training reconstruction frame is the reconstructed frame of the second training video frame.
• The training objective of the second loss function includes increasing the similarity between the second training video frame and the second training reconstruction frame, and further includes reducing the data size of the second compression information of the second training video frame.
  • the second neural network refers to a neural network used in the process of compressing and encoding video frames; the fourth neural network refers to a neural network that performs decompression operations based on compressed information.
• Specifically, the training device can calculate the function value of the second loss function according to the second training video frame, the second training reconstruction frame, and the second compression information corresponding to the second training video frame, generate a gradient value according to the function value of the second loss function, and then update the weight parameters of the second neural network and the fourth neural network through back-propagation, so as to complete one iteration of training of the second neural network and the fourth neural network.
• An independent neural network module refers to a neural network module with independent functions; for example, the first encoding network in the first neural network is an independent neural network module, and the first decoding network in the second neural network is an independent neural network module.
• The trained first neural network and the trained third neural network can first be used to initialize the parameters of the second neural network and the fourth neural network; that is, the parameters of the above-described identical neural network modules in the trained first neural network and the trained third neural network are assigned to the corresponding modules in the second neural network and the fourth neural network. During the training process of the second neural network and the fourth neural network, the parameters of these identical neural network modules are kept unchanged, and only the parameters of the remaining neural network modules in the second neural network and the fourth neural network are adjusted, which reduces the total duration of the training process of the second neural network and the fourth neural network and improves their training efficiency.
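• A minimal sketch of this initialization, assuming the shared module is exposed as an attribute named encoding_net on both networks (the attribute names are illustrative assumptions):

    import torch

    # copy the trained weights of the functionally identical module
    second_network.encoding_net.load_state_dict(
        trained_first_network.encoding_net.state_dict())

    # freeze the shared module so only the remaining modules are trained
    for p in second_network.encoding_net.parameters():
        p.requires_grad = False

    optimizer = torch.optim.Adam(
        (p for p in second_network.parameters() if p.requires_grad), lr=1e-4)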
• Taking the second neural network performing the compression operation on a video frame as an example, the experimental data is shown in Table 1 below.
• FIG. 13 is a system architecture diagram of the video encoding and decoding system provided by an embodiment of the present application, showing an exemplary schematic block diagram of the video codec system 10; the video encoder 20 (or encoder 20 for short) and the video decoder 30 (or decoder 30 for short) in the video codec system 10 represent examples of devices that may perform the techniques of this application.
  • the video codec system 10 includes a source device 12 for supplying encoded image data 21 such as encoded images to a destination device 14 for decoding the encoded image data 21.
• The source device 12 includes an encoder 20 and may optionally include an image source 16, a preprocessor (or preprocessing unit) 18 such as an image preprocessor, and a communication interface (or communication unit) 22.
• The image source 16 may include or be any type of image capture device for capturing a real-world image or the like, and/or any type of image generation device, such as a computer graphics processor for generating computer-animated images, or any type of device for acquiring and/or providing real-world images or computer-generated images (for example, screen content, virtual reality (VR) images, and/or any combination thereof, such as augmented reality (AR) images).
  • the image source may be any type of memory or storage that stores any of the above-mentioned images.
  • the image (or image data 17 ) may also be referred to as the original image (or original image data) 17 .
  • the preprocessor 18 is used to receive the (raw) image data 17 and preprocess the image data 17 to obtain a preprocessed image (or preprocessed image data) 19 .
  • the preprocessing performed by the preprocessor 18 may include trimming, color format conversion (eg, from RGB to YCbCr), toning, or denoising. It is understood that the preprocessing unit 18 may be an optional component.
  • a video encoder (or encoder) 20 is used to receive preprocessed image data 19 and provide encoded image data 21 .
• The communication interface 22 in the source device 12 may be used to receive the encoded image data 21 and send the encoded image data 21 (or any other processed version thereof) over the communication channel 13 to another device, such as the destination device 14, or to any other device, for storage or direct reconstruction.
  • the destination device 14 includes a decoder 30 and may additionally, alternatively, include a communication interface (or communication unit) 28 , a post-processor (or post-processing unit) 32 and a display device 34 .
• The communication interface 28 in the destination device 14 is configured to receive the encoded image data 21 (or any other processed version thereof) directly from the source device 12 or from any other source device, such as a storage device (for example, an encoded-image-data storage device), and supply the encoded image data 21 to the decoder 30.
• The communication interface 22 and the communication interface 28 may be used to send or receive the encoded image data (or encoded data) 21 through a direct communication link between the source device 12 and the destination device 14, such as a direct wired or wireless connection, or through any type of network, such as a wired network, a wireless network, or any combination thereof, any type of private network and public network, or any combination thereof.
• The communication interface 22 may be used to encapsulate the encoded image data 21 into a suitable format such as a message, and/or to process the encoded image data using any type of transfer encoding or processing for transmission over a communication link or communication network.
  • the communication interface 28 corresponds to the communication interface 22 and may be used, for example, to receive transmission data and process the transmission data using any type of corresponding transmission decoding or processing and/or decapsulation to obtain encoded image data 21 .
• Both the communication interface 22 and the communication interface 28 may be configured as a one-way communication interface, as indicated by the arrow of the communication channel 13 in FIG. 13 pointing from the source device 12 to the destination device 14, or as a two-way communication interface, and may be used to send and receive messages and the like to establish a connection, and to acknowledge and exchange any other information related to a communication link and/or data transfer, such as the transfer of encoded image data.
• The video decoder (or decoder) 30 is configured to receive the encoded image data 21 and provide decoded image data 31. The decoded image data may also be referred to as reconstructed image data, a reconstructed frame of a video frame, a video frame, or by other names, and refers to the image data obtained by performing a decompression operation based on the encoded image data 21.
  • the post-processor 32 is configured to perform post-processing on the decoded image data 31 such as the decoded image to obtain post-processed image data 33 such as the post-processed image.
• The post-processing performed by the post-processor 32 may include, for example, color format conversion (for example, from YCbCr to RGB), toning, trimming, or resampling, or any other processing for generating the decoded image data 31 for display by the display device 34.
  • a display device 34 is used to receive post-processed image data 33 to display the image to a user or viewer or the like.
  • Display device 34 may be or include any type of display for representing the reconstructed image, eg, an integrated or external display screen or display.
• The display screen may include a liquid crystal display (LCD), an organic light emitting diode (OLED) display, a plasma display, a projector, a micro-LED display, a liquid crystal on silicon (LCoS) display, a digital light processor (DLP), or any other type of display.
• In some embodiments, the video encoding and decoding system 10 further includes a training engine 25, and the training engine 25 is used to train the neural networks in the encoder 20 or the decoder 30, that is, the first neural network, the second neural network, the third neural network, and the fourth neural network.
  • the training data may be stored in a database (not shown), and the training engine 25 trains the neural network based on the training data. It should be noted that the embodiments of the present application do not limit the source of the training data, for example, the training data may be obtained from the cloud or other places to perform neural network training.
• The neural network trained by the training engine 25 can be applied to the video codec system 10 and the video codec system 40, for example, applied to the source device 12 (such as the encoder 20) or the destination device 14 (such as the decoder 30 shown in FIG. 13).
  • the training engine 25 can train the above-mentioned neural network in the cloud, and then the video encoding and decoding system 10 downloads and uses the neural network from the cloud.
• Although FIG. 13 shows the source device 12 and the destination device 14 as independent devices, a device embodiment may also include both the source device 12 and the destination device 14, or the functionality of both, that is, the source device 12 or corresponding functionality and the destination device 14 or corresponding functionality at the same time. In these embodiments, the source device 12 or corresponding functionality and the destination device 14 or corresponding functionality may be implemented using the same hardware and/or software, by separate hardware and/or software, or by any combination thereof.
• As will be apparent to the skilled person based on this description, the existence and (exact) division of the different units or functions in the source device 12 and/or the destination device 14 shown in FIG. 13 may vary depending on the actual device and application.
  • FIG. 14 is another system architecture diagram of a video encoding and decoding system provided by an embodiment of the present application.
• The encoder 20 (for example, the video encoder 20) or the decoder 30 (for example, the video decoder 30), or both, may be implemented by a processing circuit as shown in FIG. 14, such as one or more microprocessors, digital signal processors (DSP), application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), discrete logic, hardware, special-purpose processors for video encoding, or any combination thereof.
• The encoder 20 may be implemented by the processing circuit 46 to include the various modules discussed with reference to the encoder 20 of FIG. 14 and/or any other encoder system or subsystem described herein.
  • Decoder 30 may be implemented by processing circuit 46 to include the various modules discussed with reference to decoder 30 of FIG. 15 and/or any other decoder system or subsystem described herein.
• The processing circuit 46 may be used to perform the various operations discussed below. As shown in FIG. 16, if parts of the techniques are implemented in software, a device may store the software instructions in a suitable non-transitory computer-readable storage medium and execute the instructions in hardware using one or more processors, thereby implementing the techniques of this application.
• The video encoder 20 and the video decoder 30 may be integrated in a single device as part of a combined codec (encoder/decoder, CODEC), as shown in FIG. 14.
• The source device 12 and the destination device 14 may each include any of a variety of devices, including any type of handheld or stationary device, such as a notebook or laptop computer, a mobile phone, a smartphone, a tablet or tablet computer, a camera, a desktop computer, a set-top box, a television, a display device, a digital media player, a video game console, a video streaming device (for example, a content service server or a content distribution server), a broadcast receiving device, or a broadcast transmitting device, and may use no operating system or any type of operating system.
  • source device 12 and destination device 14 may be equipped with components for wireless communication.
  • source device 12 and destination device 14 may be wireless communication devices.
• The video codec system 10 shown in FIG. 13 is merely exemplary, and the techniques provided in this application may be applicable to video coding settings (for example, video encoding or video decoding) that do not necessarily include any data communication between an encoding device and a decoding device.
  • data is retrieved from local storage, sent over a network, and so on.
  • the video encoding device may encode and store the data in memory, and/or the video decoding device may retrieve and decode the data from the memory.
  • encoding and decoding are performed by devices that do not communicate with each other but merely encode data to and/or retrieve and decode data from memory.
• The video codec system 40 may include an imaging device 41, a video encoder 20, a video decoder 30 (and/or a video codec implemented by the processing circuit 46), an antenna 42, one or more processors 43, one or more memory storages 44, and/or a display device 45.
  • the imaging device 41, the antenna 42, the processing circuit 46, the video encoder 20, the video decoder 30, the processor 43, the memory storage 44 and/or the display device 45 can communicate with each other.
  • video codec system 40 may include only video encoder 20 or only video decoder 30 .
  • antenna 42 may be used to transmit or receive an encoded bitstream of video data.
  • display device 45 may be used to present video data.
  • Processing circuitry 46 may include application-specific integrated circuit (ASIC) logic, graphics processors, general purpose processors, and the like.
  • the video codec system 40 may also include an optional processor 43, which may similarly include application-specific integrated circuit (ASIC) logic, a graphics processor, a general-purpose processor, and the like.
• The memory storage 44 may be any type of memory, such as a volatile memory (for example, a static random access memory (SRAM) or a dynamic random access memory (DRAM)) or a non-volatile memory (for example, a flash memory).
  • memory storage 44 may be implemented by cache memory.
  • processing circuitry 46 may include memory (eg, cache memory, etc.) for implementing image buffers, and the like.
• In some instances, the video encoder 20 implemented by a logic circuit may include an image buffer (for example, implemented by the processing circuit 46 or the memory storage 44) and a graphics processing unit (for example, implemented by the processing circuit 46).
  • the graphics processing unit may be communicatively coupled to the image buffer.
• The graphics processing unit may include the video encoder 20 implemented by the processing circuit 46, to implement the various modules discussed with reference to the video encoder 20 of FIG. 14 and/or any other encoder system or subsystem described herein.
  • Logic circuits may be used to perform the various operations discussed herein.
• In some instances, the video decoder 30 may be implemented by the processing circuit 46 in a similar manner to implement the various modules discussed with reference to the video decoder 30 of FIG. 14 and/or any other decoder system or subsystem described herein.
• In some instances, the video decoder 30 implemented by a logic circuit may include an image buffer (for example, implemented by the processing circuit 46 or the memory storage 44) and a graphics processing unit (for example, implemented by the processing circuit 46).
  • the graphics processing unit may be communicatively coupled to the image buffer.
  • the graphics processing unit may include video decoder 30 implemented by processing circuitry 46 .
  • antenna 42 may be used to receive an encoded bitstream of video data.
• The encoded bitstream may include data, indicators, index values, mode selection data, and the like related to encoded video frames as discussed herein, such as data related to coding partitions (for example, transform coefficients or quantized transform coefficients, (as discussed) optional indicators, and/or data defining the coding partitions).
  • Video codec system 40 may also include video decoder 30 coupled to antenna 42 for decoding the encoded bitstream.
  • Display device 45 is used to present video frames.
  • video decoder 30 may be used to perform the opposite process.
  • video decoder 30 may be operable to receive and parse such syntax elements, decoding the associated video data accordingly.
  • video encoder 20 may entropy encode the syntax elements into an encoded video bitstream. In such instances, video decoder 30 may parse such syntax elements and decode related video data accordingly.
  • the codec process described in this application exists in most video codecs, such as the codecs corresponding to H.263, H.264, MPEG-2, MPEG-4, VP8, VP9, and AI-based end-to-end image coding.
  • FIG. 15 is a schematic diagram of a video coding apparatus 400 provided by an embodiment of the present application.
  • Video coding apparatus 400 is suitable for implementing the disclosed embodiments described herein.
  • the video coding apparatus 400 may be a decoder, such as the video decoder 30 in FIG. 14 , or an encoder, such as the video encoder 20 in FIG. 14 .
  • the video coding apparatus 400 includes: an ingress port 410 (or input port 410) and a receiver unit (receiver unit, Rx) 420 for receiving data; a processor, logic unit or central processing unit (central processing unit, CPU) 430 for processing data (for example, the processor 430 here may be a neural network processor 430); a transmitter unit (transmitter unit, Tx) 440 and an egress port 450 (or output port 450) for transmitting data; and a memory 460 for storing data.
  • the video coding apparatus 400 may also include optical-to-electrical (OE) components and electrical-to-optical (EO) components coupled to the ingress port 410, the receiving unit 420, the transmitting unit 440, and the egress port 450, serving as the egress or ingress for optical or electrical signals.
  • the processor 430 is implemented by hardware and software.
  • Processor 430 may be implemented as one or more processor chips, cores (eg, multi-core processors), FPGAs, ASICs, and DSPs.
  • the processor 430 communicates with the ingress port 410 , the receiving unit 420 , the sending unit 440 , the egress port 450 and the memory 460 .
  • the processor 430 includes a decoding module 470 (eg, a neural network NN based decoding module 470).
  • the decoding module 470 implements the embodiments disclosed above. For example, the decoding module 470 performs, processes, prepares or provides various encoding operations.
  • decoding module 470 is implemented as instructions stored in memory 460 and executed by processor 430 .
  • Memory 460 includes one or more magnetic disks, tape drives, and solid-state drives, and may serve as an overflow data storage device for storing programs when such programs are selected for execution, and for storing instructions and data read during program execution.
  • Memory 460 may be volatile and/or non-volatile, and may be read-only memory (ROM), random access memory (RAM), ternary content-addressable memory (TCAM) and/or static random-access memory (SRAM).
  • FIG. 16 is a simplified block diagram of an apparatus 500 provided by an exemplary embodiment.
  • the apparatus 500 can be used as either or both of the source device 12 and the destination device 14 in FIG. 13 .
  • the processor 502 in the apparatus 500 may be a central processing unit.
  • the processor 502 may be any other type of device or devices, existing or to be developed in the future, capable of manipulating or processing information.
  • although the disclosed implementations may be implemented using a single processor, such as processor 502 as shown, greater speed and efficiency can be achieved by using more than one processor.
  • the memory 504 in the apparatus 500 may be a read only memory (ROM) device or a random access memory (RAM) device. Any other suitable type of storage device may be used as memory 504 .
  • Memory 504 may include code and data 506 accessed by processor 502 via bus 512 .
  • the memory 504 may also include an operating system 508 and application programs 510 including at least one program that allows the processor 502 to perform the methods described herein.
  • applications 510 may include applications 1 through N, and also include video coding applications that perform the methods described herein.
  • Apparatus 500 may also include one or more output devices, such as display 518 .
  • display 518 may be a touch-sensitive display that combines a display with touch-sensitive elements that may be used to sense touch input.
  • Display 518 may be coupled to processor 502 through bus 512 .
  • although bus 512 in apparatus 500 is described herein as a single bus, bus 512 may include multiple buses. Additionally, secondary storage may be directly coupled to the other components of apparatus 500 or accessed through a network, and may include a single integrated unit, such as one memory card, or multiple units, such as multiple memory cards. Accordingly, apparatus 500 may have a wide variety of configurations.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

Embodiments of the present application disclose a video frame compression and video frame decompression method and apparatus. The method includes: determining a target neural network from multiple neural networks according to a network selection policy, and generating, through the target neural network, compression information corresponding to a current video frame. If the compression information is obtained through the first neural network, the compression information includes first compression information of a first feature of the current video frame, and a reference frame of the current video frame is used in the compression process of the first feature of the current video frame; if the compression information is obtained through the second neural network, the compression information includes second compression information of a second feature of the current video frame, and the reference frame of the current video frame is used in the generation process of the second feature of the current video frame. When the compression information is obtained through the first neural network, errors are prevented from accumulating from frame to frame, improving the quality of reconstructed video frames; the advantages of the first neural network and the second neural network are combined to minimize the amount of data that needs to be transmitted.

Description

一种视频帧的压缩和视频帧的解压缩方法及装置
本申请要求于2020年11月13日提交中国专利局、申请号为202011271217.8、发明名称为“一种压缩编码、解压缩方法以及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人工智能领域,尤其涉及一种视频帧的压缩和视频帧的解压缩方法及装置。
背景技术
人工智能(Artificial Intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个分支,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式作出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。
目前，基于深度学习（deep learning）的神经网络对视频帧进行压缩是人工智能常见的一个应用方式。具体的，编码器通过神经网络计算原始的当前视频帧相对于当前视频帧的参考帧的光流，将前述光流进行压缩编码，得到压缩后的光流，当前视频帧的参考帧和当前视频帧均归属于当前视频序列，当前视频帧的参考帧为对当前视频帧进行压缩编码时需要参考的视频帧。对该压缩后的光流进行解压缩得到解压缩后的光流，根据解压缩后的光流和参考帧，生成预测的当前视频帧，通过神经网络计算原始的当前视频帧与预测的当前视频帧之间的残差，对前述残差进行压缩编码。将压缩后的光流和压缩后的残差发送给解码器，从而解码器可以根据解压缩后的参考帧、解压缩后的光流以及解压缩后的残差，通过神经网络得到解压缩后的当前视频帧。
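To make the pipeline just described concrete, the following is a minimal Python sketch of the background's flow-plus-residual coding loop. It is a reading aid only: `flow_net`, the two codec objects, and `warp` are hypothetical stand-ins for the optical-flow network, the compression/decompression pairs, and reference-frame warping, none of which are APIs defined by this application.

```python
def compress_inter_frame(cur, ref, flow_net, flow_codec, res_codec, warp):
    """Background flow-plus-residual pipeline (encoder side); all components
    passed in are assumed, illustrative building blocks."""
    flow = flow_net(cur, ref)                    # flow of the original current frame w.r.t. its reference
    flow_bits = flow_codec.compress(flow)        # compressed optical flow
    flow_hat = flow_codec.decompress(flow_bits)  # decoder-side view of the flow
    pred = warp(ref, flow_hat)                   # predicted current frame
    res_bits = res_codec.compress(cur - pred)    # compressed residual
    return flow_bits, res_bits

def decompress_inter_frame(flow_bits, res_bits, ref_hat, flow_codec, res_codec, warp):
    """Decoder side: note that the reconstruction depends on ref_hat, which is
    why errors can accumulate from frame to frame, as discussed next."""
    flow_hat = flow_codec.decompress(flow_bits)
    res_hat = res_codec.decompress(res_bits)
    return warp(ref_hat, flow_hat) + res_hat
```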
由于在上述通过神经网络得到解压缩后的视频帧的过程中,过于依赖解压缩后的参考帧的质量,且误差会逐帧累积,因此,一种提高视频帧的重建帧的质量的方案亟待推出。
发明内容
本申请提供了一种视频帧的压缩和视频帧的解压缩方法及装置,当压缩信息通过第一神经网络得到时,当前视频帧的重建帧的质量不会依赖于当前视频帧的参考帧的重建帧的质量,进而避免了误差在逐帧之间累积,以提高视频帧的重建帧的质量;此外,综合第一神经网络和第二神经网络的优点,以实现在尽量减少需要传输的数据量的基础上,提高视频帧的重建帧的质量。
为解决上述技术问题,本申请提供以下技术方案:
第一方面,本申请提供一种视频帧的压缩方法,可将人工智能技术应用于视频帧编解码领域中。方法可以包括:编码器根据网络选择策略从多个神经网络中确定目标神经网络, 多个神经网络包括第一神经网络和第二神经网络;通过目标神经网络对当前视频帧进行压缩编码,以得到与当前视频帧对应的压缩信息。
其中,若压缩信息通过第一神经网络得到,则压缩信息包括当前视频帧的第一特征的第一压缩信息,当前视频帧的参考帧用于当前视频帧的第一特征的压缩过程,且当前视频帧的参考帧不用于当前视频帧的第一特征的生成过程;也即当前视频帧的第一特征是仅基于当前视频帧就能得到的,当前视频帧的第一特征的生成过程中不需要借助当前视频帧的参考帧。若压缩信息通过第二神经网络得到,则压缩信息包括当前视频帧的第二特征的第二压缩信息,当前视频帧的参考帧用于当前视频帧的第二特征的生成过程。
当前视频帧为包括于当前视频序列中的原始视频帧;当前视频帧的参考帧可以是当前视频序列中的原始视频帧,也可以不是当前视频序列中的原始视频帧。当前视频帧的参考帧可以是通过编码网络对原始的参考帧进行变换编码,再通过解码网络进行逆变换解码后得到的视频帧;或者,当前视频帧的参考帧是编码器对原始的参考帧进行压缩编码并进行解压缩后得到的视频帧。
本实现方式中,由于当压缩信息通过第一神经网络得到时,压缩信息携带的是当前视频帧的第一特征的压缩信息,而当前视频帧的参考帧仅用于当前视频帧的第一特征的压缩过程,不用于当前视频帧的第一特征的生成过程,从而解码器在根据第一压缩信息执行解压缩操作以得到当前视频帧的第一特征后,不需要借助当前视频帧的参考帧就能够得到当前视频帧的重建帧,所以当压缩信息通过第一神经网络得到时,该当前视频帧的重建帧的质量不会依赖于该当前视频帧的参考帧的重建帧的质量,进而避免了误差在逐帧之间累积,以提高视频帧的重建帧的质量;此外,由于当前视频帧的第二特征是根据当前视频帧的参考帧生成的,第二特征的第二压缩信息所对应的数据量比第一特征的第一压缩信息所对应的数据量小,编码器可以利用第一神经网络和第二神经网络,来对当前视频序列中不同的视频帧进行处理,以综合第一神经网络和第二神经网络的优点,以实现在尽量减少需要传输的数据量的基础上,提高视频帧的重建帧的质量。
在第一方面的一种可能实现方式中,第一神经网络包括编码(Encoding)网络和熵编码层,其中,通过编码网络从当前视频帧中获取当前视频帧的第一特征;通过熵编码层对当前视频帧的第一特征进行熵编码,以输出第一压缩信息。进一步地,当前视频帧的第一特征为通过第一编码网络对当前视频帧进行变换编码,在进行变换编码之后会再进行量化后得到的。
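As a reading aid, below is a minimal PyTorch sketch of such an encoding network: a multi-layer convolutional transform applied to the current frame alone, followed by quantization. The layer count, kernel sizes, and channel width are illustrative assumptions, not values specified by this application.

```python
import torch
import torch.nn as nn

class EncodingNetwork(nn.Module):
    """Illustrative analysis transform producing the first feature."""
    def __init__(self, channels=128):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=5, stride=2, padding=2),
        )

    def forward(self, current_frame):
        y = self.transform(current_frame)  # transform coding of the current frame only
        return torch.round(y)              # quantization; no reference frame is involved
```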
在第一方面的一种可能实现方式中,第二神经网络包括卷积网络和熵编码层,卷积网络包括多个卷积层和激励ReLU层,其中,通过卷积网络利用所述当前视频帧的参考帧得到所述当前视频帧的残差,通过熵编码层对当前视频帧的残差进行熵编码处理,以输出第二压缩信息。
在第一方面的一种可能实现方式中,若压缩信息通过第二神经网络得到,则编码器通过目标神经网络对当前视频帧进行压缩编码,以得到与当前视频帧对应的压缩信息,可以包括:编码器生成原始的当前视频帧相对于当前视频帧的参考帧的光流,将前述光流进行压缩编码,得到压缩后的光流,其中,当前视频帧的第二特征包括原始的当前视频帧相对 于当前视频帧的参考帧的光流。
可选地,编码器还可以对该压缩后的光流进行解压缩得到解压缩后的光流,根据解压缩后的光流和当前视频帧的参考帧,生成预测的当前视频帧;编码器计算原始的当前视频帧与预测的当前视频帧之间的残差;其中,当前视频帧的第二特征包括原始的当前视频帧相对于当前视频帧的参考帧的光流和原始的当前视频帧与预测的当前视频帧之间的残差。
在第一方面的一种可能实现方式中,网络选择策略与如下任一种或多种因素相关:当前视频帧的位置信息或当前视频帧所携带的数据量。
在第一方面的一种可能实现方式中,编码器根据网络选择策略从多个神经网络中确定目标神经网络,包括:编码器获取当前视频帧在当前视频序列中的位置信息,其中,位置信息用于指示当前视频帧为当前视频序列的第X帧,当前视频帧在当前视频序列中的位置信息具体可以表现为索引号,该索引号具体可以表现为字符串的形式。编码器根据位置信息,从多个神经网络中选取目标神经网络。或者,编码器根据网络选择策略从多个神经网络中确定目标神经网络,包括:编码器根据当前视频帧的属性,从多个神经网络中选取目标神经网络,其中,当前视频帧的属性用于反映当前视频帧所携带的数据量,当前视频帧的属性包括以下中的任一种或多种的组合:当前视频帧的熵、对比度和饱和度。
本实现方式中,根据当前视频帧在当前视频序列中的位置信息,从多个神经网络中选取目标神经网络;或者,可以根据当前视频的至少一种属性,从多个神经网络中选取目标神经网络,进而能够利用目标神经网络生成当前视频帧的压缩信息,提供了多种简单、易操作的实现方案,提高了本方案的实现灵活性。
在第一方面的一种可能实现方式中,方法还可以包括:编码器生成并发送与一个或多个压缩信息一一对应的至少一个指示信息。其中,每个指示信息用于指示一个压缩信息通过第一神经网络和第二神经网络中的目标神经网络得到,也即该一个指示信息用于指示一个压缩信息是通过第一神经网络和第二神经网络中的哪一个神经网络得到的。
本实现方式中，解码器能够获取到与多个压缩信息对应的多个指示信息，从而解码器能够得知当前视频序列中的每个视频帧是采用第一神经网络和第二神经网络中的哪个神经网络来执行解压缩操作，有利于缩短解码器对压缩信息进行解码的时间，也即有利于提高整个编码器和解码器进行视频帧传输的效率。
在第一方面的一种可能实现方式中，若压缩信息通过第一神经网络得到，则编码器通过目标神经网络对当前视频帧进行压缩编码，以得到与当前视频帧对应的压缩信息，可以包括：编码器通过编码网络从所述当前视频帧中获取当前视频帧的第一特征；通过熵编码层根据当前视频帧的参考帧，对当前视频帧的特征进行预测，以生成当前视频帧的预测特征；其中，当前视频帧的预测特征为当前视频帧的第一特征的预测结果，当前视频帧的预测特征和当前视频帧的第一特征的数据形状相同。编码器通过熵编码层根据当前视频帧的预测特征，生成当前视频帧的第一特征的概率分布；当前视频帧的第一特征的概率分布包括当前视频帧的第一特征的均值和当前视频帧的第一特征的方差。编码器通过熵编码层根据当前视频帧的第一特征的概率分布，对当前视频帧的第一特征进行熵编码，得到第一压缩信息。
本实现方式中,由于编码器为根据当前视频帧的预测特征生成当前视频帧的第一特征的概率分布,进而根据当前视频帧的第一特征的概率分布,对当前视频帧的第一特征进行压缩编码,从而得到当前视频帧的第一压缩信息,由于当前视频帧的预测特征与第一特征之间的相似度越高,对第一特征的压缩率就会越大,最后得到的第一压缩信息就会越小,而当前视频帧的预测特征为根据当前视频帧的参考帧,对当前视频帧的特征进行预测得到的,以提高当前视频帧的预测特征与当前视频帧的第一特征之间的相似度,从而能够降低压缩后的第一压缩信息的大小,也即不仅能够保证解码器获得的重建帧的质量,也能减少编码器与解码器传输的数据量的大小。
在第一方面的一种可能实现方式中,第一神经网络和第二神经网络均为执行过训练操作的神经网络,第一神经网络的模型参数是根据第一神经网络的第一损失函数进行更新的。其中,第一损失函数包括第一训练视频帧和第一训练重建帧之间的相似度的损失项和第一训练视频帧的压缩信息的数据大小的损失项,第一训练重建帧为第一训练视频帧的重建帧;第一损失函数的训练目标包括拉近第一训练视频帧和第一训练重建帧之间的相似度,还包括减小第一训练视频帧的第一压缩信息的大小。在根据一个或多个第二训练视频帧、第二训练视频帧的参考帧和第二损失函数对第二神经网络进行训练的过程中,第二损失函数包括第二训练视频帧和第二训练重建帧之间的相似度的损失项和第二训练视频帧的压缩信息的数据大小的损失项。其中,第二训练重建帧为第二训练视频帧的重建帧,第二训练视频帧的参考帧为经过第一神经网络处理过的视频帧;第二损失函数的训练目标包括拉近第二训练视频帧和第二训练重建帧之间的相似度,还包括减小第二训练视频帧的第二压缩信息的大小。
本实现方式中,由于在执行阶段,第二神经网络所采用的参考帧可能是经过第一神经网络处理过的,则采用由第一神经网络处理过的参考帧来对第二神经网络执行训练操作,有利于保持训练阶段和执行阶段的一致性,以提高执行阶段的准确率。
第二方面,本申请实施例提供了一种视频帧的压缩方法,可将人工智能技术应用于视频帧编解码领域中。编码器通过第一神经网络对当前视频帧进行压缩编码,以得到当前视频帧的第一特征的第一压缩信息,当前视频帧的参考帧用于当前视频帧的第一特征的压缩过程;通过第一神经网络生成第一视频帧,第一视频帧为当前视频帧的重建帧。
编码器通过第二神经网络对当前视频帧进行压缩编码,以得到当前视频帧的第二特征的第二压缩信息,当前视频帧的参考帧用于当前视频帧的第二特征的生成过程;通过第二神经网络生成第二视频帧,第二视频帧为当前视频帧的重建帧。
编码器根据第一压缩信息、第一视频帧、第二压缩信息和第二视频帧,确定与当前视频帧对应的压缩信息,其中,确定的压缩信息是通过第一神经网络得到的,确定的压缩信息为第一压缩信息;或者,确定的压缩信息是通过第二神经网络得到的,确定的压缩信息为第二压缩信息。
本实现方式中,根据至少一个当前视频帧的第一压缩信息、第一视频帧、当前视频帧的第二压缩信息以及第二视频帧,从第一压缩信息和第二压缩信息中选取最终需要发送的压缩信息;相对于按照网络选择策略从多个神经网络中确定目标神经网络,再利用目标神 经网络生成目标压缩信息的方式,能够尽量提高整个当前视频序列所对应的压缩信息的性能。
在第二方面的一种可能实现方式中，针对当前视频序列中不同的视频帧，编码器可以采用相同的目标压缩信息的选取方式。具体的，编码器根据第一压缩信息和第一视频帧，计算与第一压缩信息对应的第一评分值（也即第一神经网络所对应的第一评分值），根据第二压缩信息和第二视频帧，计算与第二压缩信息对应的第二评分值（也即第二神经网络所对应的第二评分值），编码器从第一评分值和第二评分值中选取取值最低的评分值，并将第一压缩信息和第二压缩信息中与取值最低的评分值对应的压缩信息确定为当前视频帧的压缩信息，也即确定取值最低的评分值所对应的神经网络为目标神经网络。
本实现方式中,对于当前视频序列中的每个视频帧,编码器均先通过第一神经网络和第二神经网络对当前视频帧执行压缩操作,并获取与第一压缩信息对应的第一评分值,以及与第二压缩信息对应的第二评分值,并从中确定取值最低的评分值,能够尽量使得与整个当前视频序列中所有视频帧的评分值较低,以提高整个当前视频序列所对应的压缩信息的性能。
在第二方面的一种可能实现方式中,编码器可以将一个周期作为计算单位,根据与一个周期内的前两个当前视频帧对应的两个第一评分值,生成与一个周期内多个第一评分值对应的第一拟合公式的系数和偏移量的值;并根据与一个周期内的前两个当前视频帧对应的两个第二评分值,生成与一个周期内多个第二评分值对应的第二拟合公式的系数和偏移量的值。编码器根据第一拟合公式和第二拟合公式,从第一压缩信息和第二压缩信息中确定当前视频帧的压缩信息,其中,优化目标为使得一个周期内的总评分值的平均值最小,也即优化目标为使得一个周期内的总评分值的取值最小。
本申请实施例中,技术人员在研究中发现单个周期内的第一评分值和第二评分值的变化规律,并将一个周期内的总评分值的平均值最低做为优化目标,也即在确定与每个当前视频帧对应的目标压缩信息时,不仅要考虑当前视频帧的评分值,还会考虑整个周期内的评分值的平均值,以进一步降低与整个当前视频序列中所有视频帧所对应的评分值,以进一步提高整个当前视频序列所对应的压缩信息的性能。
本申请实施例的第二方面中,编码器还可以执行第一方面的各个可能实现方式中编码器执行的步骤,对于本申请实施例第二方面以及第二方面的各种名词的含义、各种可能实现方式的具体实现步骤,以及每种可能实现方式所带来的有益效果,均可以参考第一方面中各种可能的实现方式中的描述,此处不再一一赘述。
第三方面,本申请实施例提供了一种视频帧的压缩方法,可将人工智能技术应用于视频帧编解码领域中。方法可以包括:编码器通过第一神经网络对第三视频帧进行压缩编码,以得到与第三视频帧对应的第一压缩信息,第一压缩信息包括第三视频帧的第一特征的压缩信息,第三视频帧的参考帧用于第三视频帧的第一特征的压缩过程;编码器通过第二神经网络对第四视频帧进行压缩编码,以得到与第四视频帧对应的第二压缩信息,第二压缩信息包括第四视频帧的第二特征的压缩信息,第四视频帧的参考帧用于第四视频帧的第二特征的生成过程。
本申请实施例的第三方面中,编码器还可以执行第一方面的各个可能实现方式中编码器执行的步骤,对于本申请实施例第三方面以及第三方面的各种名词的含义、各种可能实现方式的具体实现步骤,以及每种可能实现方式所带来的有益效果,均可以参考第一方面中各种可能的实现方式中的描述,此处不再一一赘述。
第四方面，本申请实施例提供了一种视频帧的解压缩方法，可将人工智能技术应用于视频帧编解码领域中。解码器获取当前视频帧的压缩信息，根据当前视频帧的压缩信息，通过目标神经网络执行解压缩操作，以得到当前视频帧的重建帧。其中，目标神经网络为从多个神经网络中选择出的一个神经网络，多个神经网络包括第三神经网络和第四神经网络。若目标神经网络为第三神经网络，则压缩信息包括当前视频帧的第一特征的第一压缩信息，当前视频帧的参考帧用于第一压缩信息的解压缩过程，以得到当前视频帧的第一特征，当前视频帧的第一特征用于当前视频帧的重建帧的生成过程；若目标神经网络为第四神经网络，则压缩信息包括当前视频帧的第二特征的第二压缩信息，第二压缩信息用于供解码器执行解压缩操作以得到当前视频帧的第二特征，当前视频帧的参考帧和当前视频帧的第二特征用于当前视频帧的重建帧的生成过程，当前视频帧的重建帧和当前视频帧的参考帧被包括于当前视频序列。
在第四方面的一种可能实现方式中,第三神经网络包括熵解码层和解码Decoding网络,其中,通过熵解码层利用当前视频帧的参考帧执行当前视频帧的第一压缩信息的熵解码过程,通过解码网络利用当前视频帧的第一特征生成当前视频帧的重建帧。
进一步地,若压缩信息为通过第三神经网络进行解压缩,则解码器根据当前视频帧的压缩信息,通过目标神经网络执行解压缩操作,以得到当前视频帧的重建帧,可以包括:解码器根据当前视频帧的预测特征,生成第一特征的概率分布,其中,当前视频帧的预测特征为根据当前视频帧的参考帧,对第一特征进行预测得到的。解码器根据第一特征的概率分布,对压缩信息进行熵解码,得到第一特征,对第一特征进行逆变换解码以得到当前视频帧的重建帧。
在第四方面的一种可能实现方式中,第四神经网络包括熵解码层和卷积网络,其中,通过熵解码层对第二压缩信息进行熵解码,通过卷积网络利用当前视频帧的参考帧和当前视频帧的第二特征执行当前视频帧的重建帧的生成过程。
进一步地,若压缩信息为通过第四神经网络进行解压缩,则解码器根据当前视频帧的压缩信息,通过目标神经网络执行解压缩操作,以得到当前视频帧的重建帧,可以包括:解码器对第二压缩信息进行解压缩处理,得到第四视频帧的第二特征,也即得到了原始的当前视频帧相对于当前视频帧的参考帧的光流和原始的当前视频帧与预测的当前视频帧之间的残差。编码器根据原始的当前视频帧相对于当前视频帧的参考帧的光流和当前视频帧的参考帧,对当前视频帧进行预测,得到预测的当前视频帧;根据原始的当前视频帧与预测的当前视频帧之间的残差和预测的当前视频帧,生成当前视频帧的重建帧。
在第四方面的一种可能实现方式中,方法还可以包括:解码器获取与至少一个压缩信息一一对应的至少一个指示信息;根据该至少一个指示信息和当前视频帧的压缩信息,从包括第三神经网络和第四神经网络的多个神经网络中确定与当前视频帧对应的目标神经网 络。
对于本申请实施例第四方面以及第四方面的各种名词的含义和每种可能实现方式所带来的有益效果,均可以参考第一方面中各种可能的实现方式中的描述,此处不再一一赘述。
第五方面,本申请实施例提供了一种视频帧的解压缩方法,可将人工智能技术应用于视频帧编解码领域中。解码器通过第三神经网络对第三视频帧的第一压缩信息进行解压缩,以得到第三视频帧的重建帧,第一压缩信息包括第三视频帧的第一特征的压缩信息,第三视频帧的参考帧用于第一压缩信息的解压缩过程,以得到第三视频帧的第一特征,第三视频帧的第一特征用于第三视频帧的重建帧的生成过程。解码器通过第四神经网络对第四视频帧的第二压缩信息进行解压缩,以得到解压缩后的第四视频帧,第二压缩信息包括第四视频帧的第二特征的压缩信息,第二压缩信息用于供解码器执行解压缩操作以得到第四视频帧的第二特征,第四视频帧的参考帧和第四视频帧的第二特征用于第四视频帧的重建帧的生成过程。
本申请实施例的第五方面中,解码器还可以执行第四方面的各个可能实现方式中解码器执行的步骤,对于本申请实施例第五方面以及第五方面的各种名词的含义、各种可能实现方式的具体实现步骤,以及每种可能实现方式所带来的有益效果,均可以参考第四方面中各种可能的实现方式中的描述,此处不再一一赘述。
第六方面,本申请实施例提供了一种编码器,其特征在于,包括处理电路,用于执行上述第一方面、第二方面、第三方面、第四方面或第五方面中任一方面所述的方法。
第七方面,本申请实施例提供了一种解码器,其特征在于,包括处理电路,用于执行上述第一方面、第二方面、第三方面、第四方面或第五方面中任一方面所述的方法。
第八方面,本申请实施例提供了一种计算机程序产品,当所述计算机程序产品在计算机上运行时,使得计算机执行上述第一方面、第二方面、第三方面、第四方面或第五方面中任一方面所述的方法。
第九方面,本申请实施例提供了一种编码器,可以包括一个或多个处理器,非瞬时性计算机可读存储介质,耦合到所述处理器,存储有所述处理器执行的程序指令,其中,所述程序指令在由所述处理器执行时,使得所述编码器实现上述第一方面、第二方面或第三方面所述的视频帧的压缩方法。
第十方面,本申请实施例提供了一种解码器,可以包括一个或多个非瞬时性计算机可读存储介质,耦合到所述处理器,存储有所述处理器执行的程序指令,其中,所述程序指令在由所述处理器执行时,使得所述解码器执行时实现上述第四方面或第五方面所述的视频帧的解压缩方法。
第十一方面,本申请实施例提供了一种非瞬时性计算机可读存储介质,所述非瞬时性计算机可读存储介质包括程序代码,当包括程序代码在计算机上运行时,使得计算机执行上述第一方面、第二方面、第三方面、第四方面或第五方面中任一方面所述的方法。
第十二方面,本申请实施例提供了一种电路系统,所述电路系统包括处理电路,所述处理电路配置为执行上述第一方面、第二方面、第三方面、第四方面或第五方面中任一方面所述的方法。
第十三方面,本申请实施例提供了一种芯片系统,该芯片系统包括处理器,用于实现上述各个方面中所涉及的功能,例如,发送或处理上述方法中所涉及的数据和/或信息。在一种可能的设计中,所述芯片系统还包括存储器,所述存储器,用于保存服务器或通信设备必要的程序指令和数据。该芯片系统,可以由芯片构成,也可以包括芯片和其他分立器件。
附图说明
图1a为本申请实施例提供的人工智能主体框架的一种结构示意图;
图1b为本申请实施例提供的视频帧的压缩及解压缩方法的一种应用场景图;
图1c为本申请实施例提供的视频帧的压缩及解压缩方法的另一种应用场景图;
图2为本申请实施例提供的视频帧的压缩方法的一种原理示意图;
图3为本申请实施例提供的视频帧的压缩方法的一种流程示意图;
图4为本申请实施例提供的视频帧的压缩方法中当前视频帧的位置与采用的目标神经网络之间对应关系的一种示意图;
图5a为本申请实施例提供的第一神经网络的一种结构示意图;
图5b为本申请实施例提供的第二神经网络的一种结构示意图;
图5c为本申请实施例提供的视频帧的压缩方法中第一特征和第二特征的一种对比示意图;
图6为本申请实施例提供的视频帧的压缩方法的另一种原理示意图;
图7a为本申请实施例提供的视频帧的压缩方法的另一种流程示意图;
图7b为本申请实施例提供的视频帧的压缩方法中第一评分值和第二评分值的一个示意图;
图7c为本申请实施例提供的视频帧的压缩方法中计算第一拟合公式的系数和偏移量的值以及第二拟合公式的系数和偏移量的一个示意图;
图8为本申请实施例提供的视频帧的压缩方法的另一种流程示意图;
图9为本申请实施例提供的视频帧的压缩方法的一个示意图;
图10a为本申请实施例提供的视频帧的解压缩方法的一种流程示意图;
图10b为本申请实施例提供的视频帧的解压缩方法的另一种流程示意图;
图11为本申请实施例提供的视频帧的解压缩方法的另一种流程示意图;
图12为本申请实施例提供的视频帧的压缩以及解压缩系统的训练方法一种流程示意图;
图13为本申请实施例提供的视频编解码系统的一种系统架构图;
图14为本申请实施例提供的视频编解码系统的另一种系统架构图;
图15为本申请实施例提供的视频译码设备的一种示意图;
图16为本申请实施例提供的装置的一种简化框图。
具体实施方式
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象，而不必用于描述特定的顺序或先后次序。应该理解这样使用的术语在适当情况下可以互换，这仅仅是描述本申请的实施例中对相同属性的对象在描述时所采用的区分方式。此外，术语“包括”和“具有”以及他们的任何变形，意图在于覆盖不排他的包含，以便包含一系列单元的过程、方法、系统、产品或设备不必限于那些单元，而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它单元。
下面结合附图,对本申请的实施例进行描述。本领域普通技术人员可知,随着技术的发展和新场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。
首先对人工智能系统总体工作流程进行描述，请参见图1a，图1a示出的为人工智能主体框架的一种结构示意图，下面从“智能信息链”（水平轴）和“IT价值链”（垂直轴）两个维度对上述人工智能主题框架进行阐述。其中，“智能信息链”反映从数据的获取到处理的一系列过程。举例来说，可以是智能信息感知、智能信息表示与形成、智能推理、智能决策、智能执行与输出的一般过程。在这个过程中，数据经历了“数据—信息—知识—智慧”的凝练过程。“IT价值链”从人工智能的底层基础设施、信息（提供和处理技术实现）到系统的产业生态过程，反映人工智能为信息技术产业带来的价值。
(1)基础设施
基础设施为人工智能系统提供计算能力支持,实现与外部世界的沟通,并通过基础平台实现支撑。通过传感器与外部沟通;计算能力由智能芯片提供,作为示例,该智能芯片包括中央处理器(central processing unit,CPU)、神经网络处理器(neural-network processing unit,NPU)、图形处理器(graphics processing unit,GPU)、专用集成电路(application specific integrated circuit,ASIC)、现场可编程逻辑门阵列(field programmable gate array,FPGA)等硬件加速芯片;基础平台包括分布式计算框架及网络等相关的平台保障和支持,可以包括云存储和计算、互联互通网络等。举例来说,传感器和外部沟通获取数据,这些数据提供给基础平台提供的分布式计算系统中的智能芯片进行计算。
(2)数据
基础设施的上一层的数据用于表示人工智能领域的数据来源。数据涉及到图形、图像、语音、文本,还涉及到传统设备的物联网数据,包括已有系统的业务数据以及力、位移、液位、温度、湿度等感知数据。
(3)数据处理
数据处理通常包括数据训练,机器学习,深度学习,搜索,推理,决策等方式。
其中,机器学习和深度学习可以对数据进行符号化和形式化的智能信息建模、抽取、预处理、训练等。
推理是指在计算机或智能系统中,模拟人类的智能推理方式,依据推理控制策略,利用形式化的信息进行机器思维和求解问题的过程,典型的功能是搜索与匹配。
决策是指智能信息经过推理后进行决策的过程,通常提供分类、排序、预测等功能。
(4)通用能力
对数据经过上面提到的数据处理后,进一步基于数据处理的结果可以形成一些通用的能力,比如可以是算法或者一个通用系统,例如,翻译,文本的分析,计算机视觉的处理, 语音识别,图像的识别等等。
(5)智能产品及行业应用
智能产品及行业应用指人工智能系统在各领域的产品和应用,是对人工智能整体解决方案的封装,将智能信息决策产品化、实现落地应用,其应用领域主要包括:智能终端、智能制造、智能交通、智能家居、智能医疗、智能安防、自动驾驶、智慧城市等。
本申请实施例主要可以应用于对上述各种应用领域中需要对视频中的视频帧进行编解码的场景中。具体的,为了更直观地理解本方案的应用场景,请参阅图1b,图1b为本申请实施例提供的视频帧的压缩及解压缩方法的一种应用场景图。参阅图1b,例如客户端的相册中可以存储有视频,就会存储将相册中的视频发送至云端服务器的需求,则客户端(也即编码器)可以利用AI技术将视频帧进行压缩编码,得到每个视频帧所对应的压缩信息;将每个视频帧所对应的压缩信息传输给云端服务器,对应的,云端服务器(也即解码器)可以利用AI技术进行解压缩,以得到视频帧的重建帧,应理解,图1b中的示例仅为方便理解本方案,不用于限定本方案。
作为另一示例,例如智慧城市领域中,监控会需要将采集到的视频发送给管理中心,则监控(也即编码器)在将视频发送给管理中心之前,需要对视频中的视频帧进行压缩,对应的,管理中心(也即解码器)需要对视频中的视频帧进行解压缩,以得到视频帧。
作为另一示例,为了更直观地理解本方案的应用场景,请参阅图1c,图1c为本申请实施例提供的视频帧的压缩及解压缩方法的另一种应用场景图。图1c中以本申请实施例应用于直播场景中为例,主播利用客户端进行视频的采集,客户端需要将采集的视频发送给服务器,再由服务器将视频分发给观看用户,则客户端(也即编码器)在将视频发送给服务器之前,需要利用AI技术对视频中的视频帧进行压缩编码,对应的,用户所使用的客户端(也即解码器)需要利用AI技术进行解压缩操作,以得到视频帧的重建帧等等,应理解,图1c中的示例仅为方便理解本方案,不用于限定本方案。
需要说明,此处举例仅为方便对本申请实施例的应用场景进行理解,不对本申请实施例的应用场景进行穷举。
本申请实施例中是利用AI技术(也即神经网络)对视频帧进行压缩编码和解压缩的,则本申请实施例中会包括前述神经网络的推理阶段和前述神经网络的训练阶段,神经网络的推理阶段和训练阶段的流程有所不同,以下分别对推理阶段和训练阶段进行描述。
一、推理阶段
参阅上述描述可知,在本申请实施例提供的压缩编码以及解压缩方法中,由编码器执行压缩编码的操作,由解码器执行解压缩的操作,以下对编码器和解码器的操作分别进行描述。进一步地,由于编码器中配置有多个神经网络,针对编码器生成与当前视频对应的目标压缩信息的过程。在一种实现方式中,编码器可以先根据网络选择策略从多个神经网络中确定目标神经网络,再通过目标神经网络生成当前视频帧的目标压缩信息。在另一种实现方式中,编码器可以分别通过多个神经网络分别生成当前视频帧的多个压缩信息,根据生成的多个压缩信息,确定与当前视频帧对应的目标压缩信息。由于前述两种实现方式的实现流程有所不同,以下将分别进行描述。
(一)、编码器先从多个神经网络中选择目标神经网络
本申请的一些实施例中,编码器是先利用网络选择策略从多个神经网络中选择一个用于处理当前视频帧的目标神经网络,为更直观地理解本方案,请参阅图2,图2为本申请实施例提供的视频帧的压缩方法的一种原理示意图。如图2所示,针对当前视频序列中的任一视频帧(也即图2中的当前视频帧),编码器会根据网络选择策略从多个神经网络中选择一个目标神经网络,并利用目标神经网络对当前视频帧进行压缩编码,得到当前视频帧所对应的目标压缩信息,应理解,图2中的示例仅为方便理解本方案,不用于限定本方案。具体的,参阅图3,图3为本申请实施例提供的视频帧的压缩方法的一种流程示意图,本申请实施例提供的视频帧的压缩方法可以包括:
301、编码器根据网络选择策略从多个神经网络中确定目标神经网络。
本申请实施例中,编码器配置有多个神经网络,该多个神经网络至少包括第一神经网络、第二神经网络或其他用于执行压缩操作的神经网络,第一神经网络、第二神经网络和其他类型的神经网络均为执行过训练操作的神经网络。在编码器对当前视频序列中的一个当前视频帧进行处理的过程中,编码器可以根据网络选择策略从多个神经网络中确定目标神经网络,并通过目标神经网络对当前视频帧进行压缩编码,以得到与当前视频帧对应的目标压缩信息,目标压缩信息指的是编码器最终决定发送给解码器的压缩信息,也即目标压缩信息是多个神经网络中的一个目标神经网络生成的。
需要说明的是,本申请的后续实施例中,仅以多个神经网络中包括第一神经网络和第二神经网络为例进行说明,对于多个神经网络中包括三个或三个以上的神经网络的情况,可以参阅本申请实施例中对于多个神经网络中包括第一神经网络和第二神经网络的描述,本申请实施例中不再一一赘述。
其中,视频编码通常是指处理形成视频或视频序列的图像序列。在视频编码领域,术语“图像(picture)”、“视频帧(frame)”或“图片(image)”可以用作同义词。视频编码在源侧执行,通常包括处理(例如,压缩)原始视频帧以减少表示该视频帧所需的数据量(从而更高效存储和/或传输)。视频解码在目的地侧执行,通常包括相对于编码器作逆处理,以重建视频帧。编码部分和解码部分也合称为编解码(编码和解码,CODEC)。
网络选择策略与如下任一种或多种因素相关:当前视频帧的位置信息或当前视频帧所携带的数据量。
具体的,针对根据网络选择策略从多个神经网络中选择目标神经网络的过程。在一种实现方式中,步骤301可以包括:编码器可以获取当前视频帧在当前视频序列中的位置信息,位置信息用于指示当前视频帧为当前视频序列的第X帧;编码器根据网络选择策略,从包括第一神经网络和第二神经网络的多个神经网络中选取与当前视频序列的位置信息对应的目标神经网络。
其中,当前视频帧在当前视频序列中的位置信息具体可以表现为索引号,该索引号具体可以表现为字符串的形式,作为示例,例如当前视频帧的索引号具体可以为00000223、00000368或其他字符串等等,此处不做穷举。
网络选择策略可以为按照一定的规律交替选择第一神经网络或第二神经网络,也即编 码器在采用第一神经网络对当前视频帧的n个视频帧进行压缩编码,再采用第二神经网络对当前视频帧的m个视频帧进行压缩编码;或者,编码器在采用第二神经网络对当前视频帧的m个视频帧进行压缩编码后,再采用第一神经网络对当前视频帧的n个视频帧进行压缩编码。n和m的取值均可以为大于或等于1的整数,n和m的取值可以相同或不同。
作为示例,例如n和m的取值均为1,则网络选择策略可以为采用第一神经网络对当前视频序列中的奇数帧进行压缩编码,采用第二神经网络对当前视频序列中的偶数帧进行压缩编码;或者,网络选择策略可以为采用第二神经网络对当前视频序列中的奇数帧进行压缩编码,采用第一神经网络对当前视频序列中的偶数帧进行压缩编码。作为另一示例,例如n的取值为1,m的取值为3,网络选择策略可以为每采用第一神经网络对当前视频序列中的一个视频帧进行压缩编码后,就会采用第二神经网络对当前视频序列中连续的三个视频帧进行压缩编码等等,此处不做穷举。
为更直观地理解本方案,请参阅图4,图4为本申请实施例提供的视频帧的压缩方法中当前视频帧的位置与采用的目标神经网络之间对应关系的一种示意图。图3中以n的取值为1,m的取值为3为例进行说明,如图4所示,编码器采用第一神经网络对第t帧视频帧进行压缩编码后,采用第二神经网络分别对第t+1帧、第t+2帧和第t+3帧视频帧进行压缩编码,并再次采用第一神经网络对第t+4帧进行压缩编码,也即每采用第一神经网络对一个当前视频帧执行一次压缩编码后,就会采用第二神经网络对三个当前视频帧执行一次压缩编码,应理解,图4中的示例仅为方便理解本方案,不用于限定本方案。
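A minimal sketch of this position-based alternation is given below, assuming frame indices start at 0 on a frame handled by the first neural network; the function name and its defaults are illustrative.

```python
def select_network(frame_index: int, n: int = 1, m: int = 3) -> str:
    """Pick the network for the frame at this position in the sequence:
    the first n frames of each (n + m)-frame cycle go to the first neural
    network, the next m frames to the second (the FIG. 4 pattern)."""
    return "first" if frame_index % (n + m) < n else "second"

# n = 1, m = 3 reproduces FIG. 4: frame t -> first, t+1..t+3 -> second, t+4 -> first.
```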
在另一种实现方式中,步骤301可以包括:编码器可以获取当前视频帧的属性,从第一神经网络和第二神经网络中选取目标神经网络,其中,当前视频帧的属性用于指示当前视频帧所携带的数据量,当前视频帧的属性包括以下中的任一种或多种的组合:当前视频帧的熵、对比度、饱和度和其他类型的属性等,此处不做穷举。
进一步地,当前视频帧的熵越高,证明当前视频帧所携带的数据量越多,目标神经网络采用第二神经网络的概率越大,当前视频帧的熵越低,目标神经网络采用第二神经网络的概率越小;当前视频帧的对比度越高,证明当前视频帧所携带的数据量越多,目标神经网络采用第二神经网络的概率越大,当前视频帧的对比度越低,目标神经网络采用第二神经网络的概率越小。
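As one concrete example of such an attribute, the sketch below computes the entropy of a frame; treating the frame as an 8-bit grayscale image is an illustrative assumption.

```python
import numpy as np

def frame_entropy(gray_frame: np.ndarray) -> float:
    """Shannon entropy (bits per pixel) of an 8-bit grayscale frame, one
    proxy for how much data the current frame carries."""
    hist, _ = np.histogram(gray_frame, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())
```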
本申请实施例中,根据当前视频帧在当前视频序列中的位置信息,从多个神经网络中选取目标神经网络;或者,可以根据当前视频的至少一种属性,从多个神经网络中选取目标神经网络,进而能够利用目标神经网络生成当前视频帧的压缩信息,提供了多种简单、易操作的实现方案,提高了本方案的实现灵活性。
在另一种实现方式中,编码器可以从第一神经网络和第二神经网络中任意选取一个神经网络作为目标神经网络,以利用目标神经网络生成当前视频帧的目标压缩信息。可选地,编码器可以分别配置第一神经网络的第一选取概率和第二神经网络的第二选取概率,第二选取概率的取值大于或等于第一选取概率,进而根据第一选取概率和第二选取概率执行目标神经网络的选取操作。作为示例,例如第一选取概率的取值为0.2,第二选取概率的取值为0.8;作为另一示例,例如第一选取概率的取值为0.3,第二选取概率的取值为0.7等, 此处不对第一选取概率和第二选取概率的取值进行穷举。
302、编码器通过目标神经网络对当前视频帧进行压缩编码,以得到与当前视频帧对应的目标压缩信息。
本申请实施例中,目标神经网络可以为第一神经网络、第二神经网络或其他用于对视频帧进行压缩的网络等。若目标压缩信息通过第一神经网络得到,则目标压缩信息包括当前视频帧的第一特征的第一压缩信息,当前视频帧的参考帧用于当前视频帧的第一特征的压缩过程,且当前视频帧的参考帧不用于当前视频帧的第一特征的生成过程。
其中,当前视频帧的参考帧和当前视频帧均来源于当前视频序列;当前视频帧为包括于当前视频序列中的原始视频帧。在一种实现方式中,当前视频帧的参考帧可以为当前视频序列中的原始视频帧,该参考帧在当前视频序列中的排序位置可以位于当前视频帧之前,也可以位于当前视频帧之后,也即当播放该当前视频序列时,参考帧出现的时间可以早于当前视频帧,也可以晚于当前视频帧。
在另一种实现方式中,当前视频帧的参考帧可以不是当前视频序列中的原始视频帧,与当前视频帧的参考帧对应的原始的参考帧在当前视频序列中的排序位置可以位于当前视频帧之前,也可以位于当前视频帧之后。当前视频帧的参考帧可以是编码器对原始的参考帧进行变换编码并进行逆变换解码后得到的视频帧;或者,当前视频帧的参考帧是编码器对原始的参考帧进行压缩编码并进行解压缩后得到的视频帧。更进一步地,前述压缩操作可以为通过第一神经网络实现,也可以为通过第二神经网络实现。
参阅专利申请号为CN202011271217.8的申请文件中的描述,第一神经网络至少可以包括编码(Encoding)网络和熵编码层,其中,通过编码网络从当前视频帧中获取当前视频帧的第一特征;通过熵编码层利用当前视频帧的参考帧执行当前视频帧的第一特征的压缩过程,输出当前视频帧对应的第一压缩信息。
为更直观地理解本方案,请参阅图5a,图5a为本申请实施例提供的第一神经网络的一种结构示意图。如图5a所示,将当前视频帧通过编码网络进行编码,并进行量化处理后,得到当前视频帧的第一特征。通过熵编码层利用当前视频帧的参考帧,对当前视频帧的第一特征进行压缩处理,输出当前视频帧对应的第一压缩信息(也即当前视频帧对应的目标压缩信息的一个示例),应理解,图5a中的示例仅为方便理解本方案,不用于限定本方案。
具体的,针对编码器通过第一神经网络生成与当前视频帧对应的第一压缩信息的过程。编码器可以通过第一编码网络(Encoding Network)对当前视频帧进行变换编码,在进行变换编码之后会再进行量化后得到当前视频帧的第一特征,也即当前视频帧的第一特征是仅基于当前视频帧就能得到的,该第一特征的生成过程中不需要借助当前视频帧的参考帧。
进一步地,第一编码网络具体可以表现为一个多层的卷积网络。第一特征中包括M个像素的特征,具体可以表现为L维的张量,作为示例,例如可以为一维的张量(也即向量)、二维的张量(也即矩阵)、三维的张量或更高维的张量等,此处不做穷举。
编码器根据当前视频帧的N个参考帧,对当前视频帧的特征进行预测,以生成当前视频帧的第一预测特征,根据当前视频帧的第一预测特征,生成当前视频帧的第一特征的概率分布。编码器根据当前视频帧的第一特征的概率分布,对当前视频帧的第一特征进行熵 编码,得到第一压缩信息。
其中,当前视频帧的第一预测特征为当前视频帧的第一特征的预测结果,当前视频帧的第一预测特征也包括M个像素的特征,当前视频帧的第一预测特征具体也可以表现为张量,当前视频帧的第一预测特征的数据形状与当前视频帧的第一特征的数据形状相同,第一预测特征和第一特征的形状相同指的是第一预测特征和第一特征均为L维张量,且第一预测特征的L维中的第一维和第一特征的L维中的第二维的尺寸相同,L为大于或等于1的整数,第一维为第一预测特征的L维中的任一维,第二维为第一特征的L维中与第一维相同的维度。
当前视频帧的第一特征的概率分布包括当前视频帧的第一特征的均值和当前视频帧的第一特征的方差。进一步地,第一特征的均值和第一特征的方式均可以表现为L维的张量,第一特征的均值的数据形状与第一特征的数据形状相同,第一特征的方差的形状与第一特征的数据形状相同,从而第一特征的均值中包括与M个像素中每个像素对应的值,第一特征的方差中包括与M个像素中每个像素对应的值。
具体的,针对编码器根据当前视频帧的N个参考帧,对当前视频帧的特征进行预测,以生成当前视频帧的第一预测特征的具体实现方式,以及,编码器根据当前视频帧的第一预测特征,生成当前视频帧的第一特征的概率分布的具体实现方式,均可以参阅专利申请号为CN202011271217.8的申请文件中的描述。
区别在于,专利申请号为CN202011271217.8的申请文件中是基于N个第二视频帧,对第一视频帧的特征进行预测,以生成第一视频帧的第一预测特征,并根据第一视频帧的第一预测特征,生成第一视频帧的第一特征的概率分布。本申请实施例中是基于当前视频帧的N个参考帧,对当前视频帧进行预测,以生成当前视频帧的第一预测特征,并根据当前视频帧的第一预测特征,生成当前视频帧的第一特征的概率分布。也即将专利申请号为CN202011271217.8的“第一视频帧”替换为本申请实施例中的“当前视频帧”,将专利申请号为CN202011271217.8的“第二视频帧”替换为本申请实施例中的“当前视频帧的参考帧”,具体实现方式可以参阅专利申请号为CN202011271217.8的申请文件中的描述,此处不做赘述。
本申请实施例中,由于编码器为根据当前视频帧所对应的第一预测特征生成当前视频帧的第一特征的概率分布,进而根据当前视频帧的第一特征的概率分布,对当前视频帧的第一特征进行压缩编码,从而得到当前视频帧的第一压缩信息,由于第一预测特征与第一特征之间的相似度越高,对第一特征的压缩率就会越大,最后得到的第一压缩信息就会越小,而当前视频帧的第一预测特征为根据当前视频帧的N个参考帧,对当前视频帧的特征进行预测得到的,以提高当前视频帧的第一预测特征与当前视频帧的第一特征之间的相似度,从而能够降低压缩后的第一压缩信息的大小,也即不仅能够保证解码器获得的重建帧的质量,也能减少编码器与解码器传输的数据量的大小。
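The following sketch illustrates why a predicted feature close to the first feature shrinks the first compression information, assuming a Gaussian entropy model built from the predicted mean and variance. The application only states that the probability distribution consists of a mean and a variance; the Gaussian form and the helper below are illustrative assumptions.

```python
import math

def gaussian_bits(y_hat, mean, scale):
    """Ideal entropy-coding cost (in bits) of quantized features y_hat under a
    per-element Gaussian; a real entropy coder is assumed elsewhere."""
    def cdf(x, mu, s):
        return 0.5 * (1.0 + math.erf((x - mu) / (s * math.sqrt(2.0))))

    total = 0.0
    for y, mu, s in zip(y_hat, mean, scale):
        p = cdf(y + 0.5, mu, s) - cdf(y - 0.5, mu, s)  # probability of the quantization bin
        total += -math.log2(max(p, 1e-9))              # ideal code length
    return total

# A well-predicted feature costs far fewer bits than a poorly predicted one:
print(gaussian_bits([3.0, -1.0], mean=[3.1, -0.9], scale=[0.5, 0.5]))  # about 1 bit
print(gaussian_bits([3.0, -1.0], mean=[0.0, 0.0], scale=[0.5, 0.5]))   # tens of bits
```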
若目标压缩信息通过第二神经网络得到,则目标压缩信息包括当前视频帧的第二特征的第二压缩信息,当前视频帧的参考帧用于当前视频帧的第二特征的生成过程。第二神经网络包括卷积网络和熵编码层,卷积网络包括多个卷积层和激励ReLU层,其中,通过卷 积网络利用当前视频帧的参考帧执行当前视频帧的第二特征的生成过程,通过熵编码层对当前视频帧的第二特征进行压缩,输出当前视频帧所对应的第二压缩信息。
本申请实施例中,提供了第一神经网络和第二神经网络的具体网络结构,提高了本方案与具体应用场景的结合度。
具体的,编码器在生成原始的当前视频帧相对于当前视频帧的参考帧之间的光流后,可以将前述光流进行压缩编码,得到压缩后的光流。当前视频帧的第二特征可以仅包括原始的当前视频帧相对于当前视频帧的参考帧的光流。
可选地,编码器还可以根据原始的当前视频帧相对于当前视频帧的参考帧的光流和当前视频帧的参考帧,生成预测的当前视频帧;编码器计算原始的当前视频帧与预测的当前视频帧之间的残差,并对原始的当前视频帧相对于当前视频帧的参考帧之间的光流,和原始的当前视频帧与预测的当前视频帧之间的残差进行压缩编码,输出当前视频帧所对应的第二压缩信息。其中,当前视频帧的第二特征包括原始的当前视频帧相对于当前视频帧的参考帧的光流和原始的当前视频帧与预测的当前视频帧之间的残差。
进一步地,编码器在得到当前视频帧的第二特征之后,由于当前视频帧的第二特征的数据量较小,编码器可以直接对当前视频帧的第二特征执行压缩操作,以得到与当前视频帧对应的第二压缩信息。其中,前述压缩操作可以通过神经网络实现,也可以通过非神经网络的方式实现,作为示例,例如前述压缩编码的方式可以为熵编码。
为更直观地理解本方案,请参阅图5b,图5b为本申请实施例提供的第二神经网络的一种结构示意图。如图5b所示,编码器将当前视频帧和当前视频帧的参考帧输入至卷积网络,通过卷积网络进行光流估计,得到当前视频帧相对于当前视频帧的参考帧的光流。编码器根据当前视频帧相对于当前视频帧的参考帧的光流,和当前视频帧的参考帧,通过卷积网络生成当前视频帧的重建帧;并获取当前视频帧的重建帧和当前视频帧之间的残差。编码器可以通过熵编码层对当前视频帧相对于当前视频帧的参考帧的光流,和,当前视频帧的重建帧和当前视频帧之间的残差进行压缩,输出当前视频帧的第二压缩信息,应理解,图5b中的示例仅为方便理解本方案,不用于限定本方案。
为了更直观地理解第一特征和第二特征的区别,请参阅图5c,图5c为本申请实施例提供的视频帧的压缩方法中第一特征和第二特征的一种对比示意图。图5c包括(a)和(b)两个子示图,图5c的(a)子示意图代表生成当前视频帧的第一特征的一种示意图,图5c的(b)子示意图代表生成当前视频帧的第二特征的一种示意图。先参阅图5c的(a)子示意图,将当前视频帧输入至编码网络,在通过编码网络进行变换编码之后会再进行量化(quantization,Q)后,得到当前视频帧的第一特征。
再参阅图5c的(b)子示意图,图5c的(b)子示意图的虚线框中的内容代表当前视频帧的第二特征,由于图5c的(b)子示意图中已经详细的展示了当前视频帧的第二特征不仅包括原始的当前视频帧相对于当前视频帧的参考帧的光流,还包括原始的当前视频帧与预测的当前视频帧之间的残差,此处不再一一赘述当前视频帧的第二特征的生成过程。通过对比图5c的(a)子示意图和图5c的(b)子示意图可知,当前视频帧的第一特征的生成过程完全不依赖当前视频帧的参考帧,而当前视频帧的第二特征的生成过程需要依赖 当前视频帧的参考帧,应理解,图5c中的示例仅为方便理解第一特征和第二特征的概念,不用于限定本方案。
需要说明的是,编码器中也可以配置其他用于对视频帧进行压缩编码的神经网络(为方便描述,后续称为“第五神经网络”),但编码器至少配置有第一神经网络和第二神经网络,对于采用第一神经网络和第二神经网络进行压缩编码的详细过程,将在后续实施例中进行描述,此处暂时不做介绍。作为示例,例如第五神经网络可以为直接对当前视频帧进行压缩的神经网络,也即编码器可以将当前视频帧输入第五神经网络,通过第五神经网络直接对当前视频帧进行压缩,得到第五神经网络输出的与当前视频帧对应的第三压缩信息。进一步地,第五神经网络具体可以采用卷积神经网络。
303、编码器生成与目标压缩信息对应的指示信息,指示信息用于指示目标压缩信息通过第一神经网络和第二神经网络中的目标神经网络得到。
本申请实施例中,编码器在得到一个或多个当前视频帧的目标压缩信息后,还可以生成与至少一个当前视频帧的目标压缩信息一一对应的至少一个指示信息,前述至少一个指示信息用于指示每个目标压缩信息通过第一神经网络和第二神经网络中的目标神经网络得到,也即该一个指示信息用于指示一个目标压缩信息是通过第一神经网络和第二神经网络中的哪一个神经网络得到的。
其中,与当前视频序列中的多个视频帧的目标压缩信息对应的多个指示信息具体可以表现为字符串或其他形式。作为示例,例如与当前视频序列中的多个视频帧的目标压缩信息对应的多个指示信息具体可以为0010110101,前述字符串中的一个字符代表一个指示信息,当一个指示信息为0时,代表与该指示信息对应的当前视频帧采用第一神经网络进行压缩处理;当一个指示信息为1时,代表与该指示信息对应的当前视频帧采用第二神经网络进行压缩处理。
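A one-line sketch of this indicator convention, following the 0/1 example in the preceding paragraph:

```python
def indicator_string(network_choices) -> str:
    """'0' = frame compressed by the first neural network, '1' = by the second."""
    return "".join("0" if c == "first" else "1" for c in network_choices)

print(indicator_string(["first", "first", "second", "first"]))  # -> "0010"
```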
具体的,在一种实现方式中,编码器可以每获取到一个当前视频帧的目标压缩信息后,就生成与一个当前视频帧的目标压缩信息对应的一个指示信息,也即编码器可以交替执行步骤303和步骤301至302。
在另一种实现方式中,编码器也可以通过步骤301生成预设个数的当前视频帧的目标压缩信息后,再生成与前述预设个数的当前视频帧对应的预设个数的指示信息,该预设个数为大于1的整数,作为示例,例如可以为3、4、5、6或其他数值等,此处不做限定。
在另一种实现方式中,编码器也可以通过步骤301和302生成与整个当前视频序列对应的多个目标压缩信息后,再通过步骤303生成与整个当前视频序列对应的多个指示信息,具体实现方式,此处不做限定。
304、编码器发送当前视频帧的目标压缩信息。
本申请实施例中,编码器可以基于文件传输协议(file transfer protocol,FTP)的约束,向解码器发送当前视频序列中至少一个当前视频帧的目标压缩信息。
具体的,在一些实现方式中,编码器可以直接将至少一个目标压缩信息发送给解码器;在另一种实现方式中,编码器也可以为将至少一个目标压缩信息发送给服务器或管理中心等中间设备,由中间设备发送给解码器。
可选地,若目标压缩信息为通过第一神经网络生成的,则参阅专利申请号为CN202011271217.8的申请文件中的描述,编码器还可以根据生成当前视频帧的第一预测特征的方式,在向解码器发送当前视频帧的第一压缩信息的同时,向解码器发送与当前视频帧对应的第一帧间边信息、第二帧间边信息、第一帧内边信息、第二帧内边信息中的一种或两种信息;对应的,解码器可以接收到与当前视频帧对应的第一帧间边信息、第二帧间边信息、第一帧内边信息、第二帧内边信息中的一种或两种信息。具体发送那种信息需要结合在对当前视频帧的第一压缩信息进行解压缩过程中需要哪种类型的信息来确定。
进一步地,对于第一帧间边信息、第二帧间边信息、第一帧内边信息以及第二帧内边信息的含义和作用,均可以参阅专利申请号为CN202011271217.8的申请文件中的描述,此处不再一一进行赘述。
305、编码器发送与当前视频帧的目标压缩信息对应的指示信息。
本申请实施例中,步骤305为可选步骤,若未执行步骤303,则不执行步骤305,若执行步骤303,则执行步骤305。若执行步骤305,则步骤305可以与步骤304可以同时执行,也即编码器基于FTP协议(也即文件传输协议的简称)的约束,向解码器发送当前视频序列中至少一个当前视频帧的目标压缩信息,以及与前述至少一个当前视频帧的目标压缩信息一一对应的至少一个指示信息。或者,步骤304和步骤305也可以分开执行,本申请实施例不限定步骤304和步骤305的执行顺序。
对应的，解码器能够获取到与多个目标压缩信息对应的多个指示信息，从而解码器能够得知当前视频序列中的每个视频帧是采用第一神经网络和第二神经网络中的哪个神经网络来执行解压缩操作，有利于缩短解码器对压缩信息进行解码的时间，也即有利于提高整个编码器和解码器进行视频帧传输的效率。
本申请实施例中,由于当压缩信息通过第一神经网络得到时,压缩信息携带的是当前视频帧的第一特征的压缩信息,而当前视频帧的参考帧仅用于当前视频帧的第一特征的压缩过程,不用于当前视频帧的第一特征的生成过程,从而解码器在根据第一压缩信息执行解压缩操作以得到当前视频帧的第一特征后,不需要借助当前视频帧的参考帧就能够得到当前视频帧的重建帧,所以当压缩信息通过第一神经网络得到时,该当前视频帧的重建帧的质量不会依赖于该当前视频帧的参考帧的重建帧的质量,进而避免了误差在逐帧之间累积,以提高视频帧的重建帧的质量;此外,由于当前视频帧的第二特征是根据当前视频帧的参考帧生成的,第二特征的第二压缩信息所对应的数据量比第一特征的第一压缩信息所对应的数据量小,编码器可以利用第一神经网络和第二神经网络,来对当前视频序列中不同的视频帧进行处理,以综合第一神经网络和第二神经网络的优点,以实现在尽量减少需要传输的数据量的基础上,提高视频帧的重建帧的质量。
(二)、编码器通过多个神经网络分别进行压缩编码后,再确定目标压缩信息
本申请的一些实施例中，编码器是先通过多个不同的神经网络分别对当前视频帧进行压缩编码，再确定与当前视频帧对应的目标压缩信息，为更直观地理解本方案，请参阅图6，图6为本申请实施例提供的视频帧的压缩方法的另一种原理示意图。图6中仅以多个神经网络包括第一神经网络和第二神经网络为例，编码器通过第一神经网络对当前视频帧进行压缩编码，得到当前视频帧的第一特征的第一压缩信息(也即图6中的r_p)，并根据第一压缩信息，生成当前视频帧的重建帧(也即图6中的d_p)。通过第二神经网络对当前视频帧进行压缩编码，得到当前视频帧的第二特征的第二压缩信息(也即图6中的r_r)，根据第二压缩信息，生成当前视频帧的重建帧(也即图6中的d_r)。编码器根据r_p、d_p、r_r、d_r和网络选择策略，从第一压缩信息和第二压缩信息中确定当前视频帧所对应的目标压缩信息，应理解，图6中的示例仅为方便理解本方案，不用于限定本方案。
具体的,参阅图7a,图7a为本申请实施例提供的视频帧的压缩方法的另一种流程示意图,本申请实施例提供的视频帧的压缩方法可以包括:
701、编码器通过第一神经网络对当前视频帧进行压缩编码,以得到当前视频帧的第一特征的第一压缩信息,当前视频帧的参考帧用于当前视频帧的第一特征的压缩过程。
本申请实施例中,编码器在得到当前视频帧后,可以通过多个神经网络中的第一神经网络对当前视频帧进行压缩编码,以得到当前视频帧的第一特征的第一压缩信息。其中,当前视频帧的第一特征的含义、当前视频帧的第一特征的第一压缩信息的含义以及步骤701的具体实现方式均可以参阅图3对应实施例中的描述,此处不做赘述。
702、编码器通过第一神经网络生成第一视频帧,第一视频帧为当前视频帧的重建帧。
本申请的一些实施例中,编码器在通过第一神经网络生成当前视频帧的第一特征的第一压缩信息后,还可以通过第一神经网络进行解压缩处理,以生成第一视频帧,第一视频帧为当前视频帧的重建帧。
其中,第一压缩信息包括当前视频帧的第一特征的压缩信息,当前视频帧的参考帧用于第一压缩信息的解压缩过程,以得到当前视频帧的第一特征,当前视频帧的第一特征用于当前视频帧的重建帧的生成过程。也即编码器在对第一压缩信息进行解压缩处理后,不再需要借助当前视频帧的参考帧就能得到当前视频帧的重建帧。
第一神经网络还可以包括熵解码层和解码(Decoding)网络,其中,通过熵解码层利用当前视频帧的参考帧执行当前视频帧的第一压缩信息的解压缩过程,通过解码网络利用当前视频帧的第一特征生成当前视频帧的重建帧。
具体的,编码器可以通过熵解码层根据当前视频帧的N个参考帧的重建帧,对当前视频帧的特征进行预测,以得到当前视频帧的第一预测特征,并通过熵解码层根据当前视频帧的第一预测特征,生成当前视频帧的第一特征的概率分布。编码器通过熵解码层根据当前视频帧的第一特征的概率分布,对当前视频帧的第一压缩信息进行熵解码,得到当前视频帧的第一特征。编码器还会通过第一解码(decoder)网络,对当前视频帧的第一特征进行逆变换解码,得到当前视频帧的重建帧。其中,第一解码网络与第一编码网络是对应的,第一解码网络也可以表现为一个多层的卷积网络。
更具体的,编码器根据当前视频帧的N个参考帧的重建帧生成当前视频帧的第一预测特征的具体实现方式,与编码器根据当前视频帧的N个参考帧生成当前视频帧的第一预测特征的具体实现方式类似;编码器根据当前视频帧的第一预测特征,生成当前视频帧的第 一特征的概率分布的具体实现方式,与编码器根据当前视频帧的第一预测特征,生成当前视频帧的第一特征的概率分布的具体实现方式类似;前述步骤的具体实现方式均可以参阅图3对应实施例中对步骤302的描述,此处不做赘述。
703、编码器通过第二神经网络对当前视频帧进行压缩编码,以得到当前视频帧的第二特征的第二压缩信息,当前视频帧的参考帧用于当前视频帧的第二特征的生成过程。
本申请实施例中,编码器在得到当前视频帧后,可以通过多个神经网络中的第二神经网络对当前视频帧进行压缩编码,以得到当前视频帧的第二特征的第二压缩信息。其中,当前视频帧的第二特征的含义、当前视频帧的第二特征的第二压缩信息的含义以及步骤701的具体实现方式均可以参阅图3对应实施例中的描述,此处不做赘述。
704、编码器通过第二神经网络生成第二视频帧,第二视频帧为当前视频帧的重建帧。
本申请的一些实施例中,编码器在通过第二神经网络生成当前视频帧的第二特征的第二压缩信息后,还可以通过第二神经网络进行解压缩处理,以生成第二视频帧,第二视频帧为当前视频帧的重建帧。
其中,第二神经网络还可以包括熵解码层和卷积网络,通过熵解码层对第二压缩信息进行熵解码,通过卷积网络利用当前视频帧的参考帧和当前视频帧的第二特征执行当前视频帧的重建帧的生成过程。
具体的,编码器可以通过熵解码层对第二压缩信息进行熵解码,得到当前视频帧的第二特征,也即得到了原始的当前视频帧相对于当前视频帧的参考帧的光流;可选地,当前视频帧的第二特征还包括原始的当前视频帧与预测的当前视频帧之间的残差。
编码器根据原始的当前视频帧相对于当前视频帧的参考帧的光流和当前视频帧的参考帧,对当前视频帧进行预测,得到预测的当前视频帧;编码器还会根据原始的当前视频帧与预测的当前视频帧之间的残差和预测的当前视频帧,生成第二视频帧(也即当前视频帧的重建帧)。
705、编码器根据第一压缩信息、第一视频帧、第二压缩信息和第二视频帧,确定与当前视频帧对应的目标压缩信息,其中,确定的目标压缩信息是通过第一神经网络得到的,确定的目标压缩信息为第一压缩信息;或者,确定的目标压缩信息是通过第二神经网络得到的,确定的目标压缩信息为第二压缩信息。
本申请实施例中,编码器可以根据第一压缩信息和第一视频帧,计算与第一压缩信息对应的第一评分值(也即第一神经网络所对应的第一评分值),根据第二压缩信息和第二视频帧,计算与第二压缩信息对应的第二评分值(也即第二神经网络所对应的第二评分值),编码器根据第一评分值和第二评分值,确定与当前视频帧对应的目标压缩信息。其中,若确定的目标压缩信息为通过第一神经网络得到的第一压缩信息,则目标神经网络为第一神经网络;或者,若确定的目标压缩信息是通过第二神经网络得到的第二压缩信息,则目标神经网络为第二神经网络。
第一评分值用于反映采用第一神经网络对当前视频帧执行压缩操作的性能,第二评分值用于反映采用第二神经网络对当前视频帧执行压缩操作的性能。进一步地,第一评分值的取值越低,证明通过第一神经网络处理当前视频帧的性能越好,第一评分值的取值越高, 证明通过第一神经网络处理当前视频帧的性能越差;第二评分值的取值越低,证明通过第二神经网络处理当前视频帧的性能越好,第二评分值的取值越高,证明通过第二神经网络处理当前视频帧的性能越差。
针对第一评分值和第二评分值的计算过程。具体的,编码器在得到第一压缩信息和第一视频帧后,可以得到第一压缩信息的数据量,计算第一压缩信息相对于当前视频帧的第一压缩率,并计算第一视频帧的图像质量,进而根据第一压缩信息相对于当前视频帧的第一压缩率和第一视频帧的图像质量,生成第一评分值。其中,第一压缩信息的数据量越大,则第一评分值的取值越大;第一压缩信息的数据量越小,则第一评分值的取值越小。第一视频帧的图像质量越低,第一评分值的取值越大,第一视频帧的图像质量越高,第一评分值的取值越小。
进一步地,第一压缩信息相对于当前视频帧的第一压缩率指的可以为第一压缩信息的数据量与当前视频帧的数据量之间的比值。
编码器可以计算当前视频帧与第一视频帧之前的结构相似性(structural similarity index,SSIM),以根据“结构相似性”这一指标来指示第一视频帧的图像质量,需要说明的是,编码器还可以通过其他指标来衡量第一视频帧的图像质量,作为示例,例如“结构相似性”这一指标还可以被替换为多尺度结构相似性(multiscale structural similarity index,MS-SSIM)、峰值信噪比(peak signal to noise ratio,PSNR)或其他指标等等,此处不做穷举。
编码器在得到第一压缩信息相对于当前视频帧的第一压缩率和第一视频帧的图像质量之后,可以将第一压缩率和第一视频帧的图像质量进行加权求和,以生成与第一神经网络对应的第一评分值。需要说明的是,编码器在得到第一压缩率和第一视频帧的图像质量之后,还可以采用其他方式来得到第一评分值,作为示例,例如将第一压缩率和第一视频帧的图像质量相乘等,具体根据第一压缩率和第一视频帧的图像质量得到第一评分值的方式,可以结合实际应用场景灵活确定,此处不做穷举。
对应地,编码器在得到第二压缩信息和第二视频帧后,可以计算第二压缩信息的数据量以及第二视频帧的图像质量,进而根据第二压缩信息的数据量和第二视频帧的图像质量,生成第二评分值;第二评分值的生成方式与第一评分值的生成方式类似,可参阅上述描述,此处不做赘述。
针对根据第一评分值和第二评分值,确定与当前视频帧对应的目标压缩信息的过程。具体的,在一种实现方式中,编码器在得到计算与第一压缩信息对应的第一评分值,和与第二压缩信息对应的第二评分值后,可以从第一评分值和第二评分值中选择取值较小的目标评分值,并将与目标评分值对应的压缩信息确定为目标压缩信息。编码器针对视频序列中的每个视频帧都执行前述操作,以得到每个视频帧所对应的目标压缩信息。
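Below is a minimal sketch of one plausible form of this per-frame score, assuming the weighted sum described above and using 1 − SSIM as the distortion term so that a lower score is better; the weights and this sign convention are illustrative assumptions, not values fixed by the application.

```python
def score(bits_compressed: int, bits_original: int, ssim: float,
          w_rate: float = 1.0, w_dist: float = 1.0) -> float:
    """Lower is better: rises with the size of the compression information
    and falls as the reconstructed frame's quality (SSIM) improves."""
    rate = bits_compressed / bits_original  # compression rate of this frame
    return w_rate * rate + w_dist * (1.0 - ssim)

first_score = score(90_000, 1_000_000, ssim=0.97)   # first neural network
second_score = score(20_000, 1_000_000, ssim=0.93)  # second neural network
# the compression information with the lower score becomes the target one
```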
在另一种实现方式中,由于技术人员在研究中发现,请参阅图7b,图7b为本申请实施例提供的视频帧的压缩方法中第一评分值和第二评分值的一个示意图。其中,图7b的横坐标代表一个视频帧在当前视频序列中的位置信息,图7b的纵坐标代表与每个视频帧对应的评分值,A1代表在对当前视频序列中的多个视频帧进行压缩处理的过程中第一评分值所 对应的折线,A2代表在对当前视频序列中的多个视频帧进行压缩处理的过程中第二评分值所对应的折线。A3代表与分别采用第一神经网络和第二神经网络对视频帧1进行压缩处理时,所得到的第一评分值和第二评分值,通过图7b可知,采用第一神经网络对视频帧1所得到的评分值更低,因此编码器会采用第一神经网络对视频帧1进行处理,在采用第一神经网络对视频帧1进行处理后,与视频帧2(也即当前视频序列中视频帧1的下一个视频帧)对应的第一评分值和第二评分值均大幅下降;也即每当采用第一神经网络对一个视频帧执行压缩操作之后,就会触发开启一个新的周期。在一个周期内,第一评分值的取值呈线性增长,第二评分值的取值也呈线性增长,且第二评分值的增长率高于第一评分值的增长率。应理解,图7b中的示例仅为方便理解本方案,不用于限定本方案。
为了能更直观地理解本方案，在一个周期内，多个第一评分值可以拟合成如下公式：

l_pi + t·k_pi；(1)

其中，l_pi代表与一个周期内的多个第一评分值对应的直线的起始点，也即与多个第一评分值对应的第一拟合公式的偏移量，k_pi代表与一个周期内的多个第一评分值对应的直线的斜率，也即与多个第一评分值对应的第一拟合公式的系数，t代表一个周期内的任一个当前视频帧与该周期内第一个视频帧之间间隔的视频帧的数量，作为示例，例如一个周期内的第二个视频帧所对应的t的取值为1。

在一个周期内，多个第二评分值可以拟合成如下公式：

l_pr + t·k_pr；(2)

其中，l_pr代表与一个周期内的多个第二评分值对应的直线的起始点，也即与多个第二评分值对应的第二拟合公式的偏移量，k_pr代表与一个周期内的多个第二评分值对应的直线的斜率，也即与多个第二评分值对应的第二拟合公式的系数，t的含义参阅上述对式(1)的描述。

一个周期所对应的总的评分值可以拟合成公式：

loss = l_pr + (l_pr + k_pr) + … + (l_pr + (T−2)·k_pr) + l_pi + (T−1)·k_pi；(3)

其中，loss代表一个周期内所有评分值的和，T代表一个周期内的视频帧的总数量，由于当采用第一神经网络对一个视频帧进行压缩处理时，会触发进入一个新的周期，则一个周期内的前T-1个视频帧是采用第二神经网络进行压缩处理的，最后一个视频帧是采用第一神经网络进行压缩处理的，因此，l_pr + (l_pr + k_pr) + … + (l_pr + (T−2)·k_pr)代表一个周期内采用第二神经网络进行压缩处理的所有视频帧所对应的至少一个第二评分值的和，l_pi + (T−1)·k_pi代表一个周期内与最后一个视频帧对应的第一评分值。

则编码器可以将一个周期作为计算单位，目标为使得每个周期内的总评分值的平均值最小。为更直观地理解本方案，以下通过公式的形式展示：

min_T (loss/T)；(4)

其中，T的含义和loss的含义可参阅上述对式(3)的描述，此处不做赘述，min_T (loss/T)代表目标为一个周期内的总评分值的平均值的取值最小。

将式(3)代入式(4)可以得到如下公式：

loss/T = a·T + b/T + c，其中a = k_pr/2，b = l_pi − l_pr + k_pr − k_pi，c = l_pr + k_pi − (3/2)·k_pr；(5)

其中，技术人员在研究过程中发现l_pi > l_pr，且k_pr > k_pi，则b > 0，且a > 0，因此，当

T = √(b/a) = √(2·(l_pi − l_pr + k_pr − k_pi) / k_pr)

时，每个周期内的总评分值的平均值最小。
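Under the reconstruction of formulas (1) to (5) above, the optimal period length has the closed form just stated. The sketch below fits both lines from the scores of a period's first two frames and evaluates that closed form; the derivation is an interpretation of the stated fits, and the rounding and the minimum period of 2 are illustrative choices.

```python
import math

def fit_line(score_t0: float, score_t1: float):
    """Offset and slope of the line through the scores at t = 0 and t = 1."""
    return score_t0, score_t1 - score_t0

def optimal_period(l_pi, k_pi, l_pr, k_pr) -> int:
    """T minimizing loss/T = a*T + b/T + c, with a = k_pr / 2 and
    b = l_pi - l_pr + k_pr - k_pi (both positive when l_pi > l_pr, k_pr > k_pi)."""
    a = k_pr / 2.0
    b = l_pi - l_pr + k_pr - k_pi
    return max(2, round(math.sqrt(b / a)))

# example: first scores 5.0 -> 5.2 and second scores 3.0 -> 3.6 over t = 0, 1
l_pi, k_pi = fit_line(5.0, 5.2)
l_pr, k_pr = fit_line(3.0, 3.6)
print(optimal_period(l_pi, k_pi, l_pr, k_pr))  # -> 3
```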
结合上述公式推理,具体的,在一种实现方式中,针对与当前视频序列对应的多个周期中的任一个周期,编码器先获取与一个周期内的前两个当前视频帧对应的两个第一评分值,编码器获取与一个周期内的前两个当前视频帧对应的两个第二评分值;对于与一个当前视频帧对应的第一评分值和与一个当前视频帧对应的第二评分值的获取方式可参阅上述描述,此处不做赘述。
编码器根据与一个周期内的前两个当前视频帧对应的两个第一评分值，生成与一个周期内多个第一评分值对应的第一拟合公式的系数和偏移量的值，也即可以生成l_pi和k_pi的值。编码器根据与一个周期内的前两个当前视频帧对应的两个第二评分值，生成与一个周期内多个第二评分值对应的第二拟合公式的系数和偏移量的值，也即可以生成l_pr和k_pr的值。
针对编码器在得到第一拟合公式的系数和偏移量的值和第二拟合公式的系数和偏移量的值之后,确定当前视频帧的目标压缩信息的过程。在一种实现方式中,当t等于0时,编码器将与一个周期内的第一个视频帧对应的第二压缩信息确定为该当前视频帧(也即一个周期内的第一个视频帧)的目标压缩信息,也即与一个周期内的第一个视频帧对应的目标神经网络为第二神经网络,并继续对t等于1时的情况进行处理。
当t等于1时,也即编码器获取到与一个周期内的前两个当前视频帧对应的两个第一评分值,并且获取到与一个周期内的前两个当前视频帧对应的两个第二评分值后,可以基于公式(5),计算T的取值。若T<3,则编码器将与一个周期内的第二个视频帧对应的第一压缩信息确定为该当前视频帧(也即一个周期内的第二个视频帧)的目标压缩信息,也即与一个周期内的第二个视频帧对应的目标神经网络为第一神经网络,并触发进入下一个周期。
若T≥3,则编码器将与一个周期内的第二个视频帧对应的第二压缩信息确定为该当前视频帧(也即一个周期内的第二个视频帧)的目标压缩信息,也即与一个周期内的第二个视频帧对应的目标神经网络为第二神经网络,并继续对t等于2时的情况进行处理。
当t等于2时，编码器获取与一个周期内第三个视频帧(也即当前视频帧的一种示例)对应的第一评分值和第二评分值，具体一个当前视频帧所对应的第一评分值和第二评分值的生成方式，此处不做赘述。编码器根据与一个周期内前三个视频帧对应的三个第一评分值，重新计算第一拟合公式的系数和偏移量的值(也即重新计算的l_pi和k_pi的值)，根据与一个周期内前三个视频帧对应的三个第二评分值，重新计算第二拟合公式的系数和偏移量的值(也即重新计算的l_pr和k_pr的值)，并根据重新计算的第一拟合公式的系数和偏移量的值和重新计算的第二拟合公式的系数和偏移量的值，重新计算T的取值。
若T<t+2,则编码器可以将与一个周期内的第三个视频帧对应的第一压缩信息确定为该当前视频帧(也即一个周期内的第三个视频帧)的目标压缩信息,也即与一个周期内的第三个视频帧对应的目标神经网络为第一神经网络,并触发进入下一个周期。
若T≥t+2,则编码器可以将与一个周期内的第三个视频帧对应的第二压缩信息确定为该当前视频帧(也即一个周期内的第三个视频帧)的目标压缩信息,也即与一个周期内的第三个视频帧对应的目标神经网络为第二神经网络,并继续对t等于3时的情况进行处理。
当t的取值为3、4或更大的数值时,编码器的处理方式与t等于2时的处理方式类似,此处不做赘述。
在另一种实现方式中,当t等于0时,编码器将与一个周期内的第一个视频帧对应的第二压缩信息确定为该当前视频帧(也即一个周期内的第一个视频帧)的目标压缩信息,也即与一个周期内的第一个视频帧对应的目标神经网络为第二神经网络,并继续对t等于1时的情况进行处理。
当t等于1时,编码器可以获取到与一个周期内的前两个当前视频帧对应的两个第一评分值,并且获取到与一个周期内的前两个当前视频帧对应的两个第二评分值后,计算得到第一拟合公式的系数和偏移量的值(也即l pi和k pi的值),以及第二拟合公式的系数和偏 移量的值(也即l pr和k pr的值),基于公式(5),计算采用第一神经网络对该周期内的第二个视频帧(也即当前视频帧的一个示例)进行压缩处理后得到整个周期的总评分值的第一平均值,并计算采用第二神经网络对该周期内的第二个视频帧(也即当前视频帧的一个示例)进行压缩处理,且采用第一神经网络对该周期内的第三个视频帧进行压缩处理后得到的整个周期的总评分值的第二平均值。
若第一平均值大于第二平均值,则编码器确定该周期内的第二个视频帧对应的目标压缩信息为该当前视频帧的第一压缩信息,也即与该周期内的第二个视频帧对应的目标神经网络为第一神经网络,并触发进入新的周期。
若第一平均值等于第二平均值,则编码器可以将该周期内的第二个视频帧对应的第一压缩信息确定为该当前视频帧的目标压缩信息,也即与该周期内的第二个视频帧对应的目标神经网络为第一神经网络,并触发进入新的周期。或者,编码器也可以将与该周期内的第二个视频帧对应的第二压缩信息确定为该当前视频帧的目标压缩信息,也即与该周期内的第二个视频帧对应的目标神经网络为第二神经网络,并继续对t等于2的情况进行处理。
若第一平均值小于第二平均值,则编码器将与该周期内的第二个视频帧对应的第二压缩信息确定为该当前视频帧的目标压缩信息,也即与该周期内的第二个视频帧对应的目标神经网络为第二神经网络,并继续对t等于2的情况进行处理。
当t等于2时,编码器可以获取与一个周期内的第三个视频帧对应的第一评分值,并且获取到与一个周期内的前两个当前视频帧对应的第二评分值,具体一个当前视频帧所对应的第一评分值和第二评分值的生成方式,此处不做赘述。编码器根据与一个周期内前三个视频帧对应的三个第一评分值,重新计算第一拟合公式的系数和偏移量的值(也即重新计算的l pi和k pi的值),根据与一个周期内前三个视频帧对应的三个第二评分值,重新计算第二拟合公式的系数和偏移量的值(也即重新计算的l pr和k pr的值),并根据重新计算的第一拟合公式的系数和偏移量的值和重新计算的第二拟合公式的系数和偏移量的值,计算更新后的第一平均值和更新后的第二平均值。其中,更新后的第一平均值为采用第一神经网络对该周期内的第三个视频帧(也即当前视频帧的一个示例)进行压缩处理后得到整个周期的总评分值的平均值,更新后的第二平均值为采用第二神经网络对该周期内的第三个视频帧(也即当前视频帧的一个示例)进行压缩处理,且采用第一神经网络对该周期内的第四个视频帧进行压缩处理后得到的整个周期的总评分值的平均值。
若更新后的第一平均值大于更新后的第二平均值,则编码器确定该周期内的第三个视频帧对应的目标压缩信息为该当前视频帧的第一压缩信息,也即与该周期内的第三个视频帧对应的目标神经网络为第一神经网络,并触发进入新的周期。
若更新后的第一平均值等于更新后的第二平均值,则编码器可以将该周期内的第三个 视频帧对应的第一压缩信息确定为该当前视频帧的目标压缩信息,也即与该周期内的第三个视频帧对应的目标神经网络为第一神经网络,并触发进入新的周期。或者,编码器也可以将与该周期内的第三个视频帧对应的第二压缩信息确定为该当前视频帧的目标压缩信息,也即与该周期内的第三个视频帧对应的目标神经网络为第二神经网络,并继续对t等于3的情况进行处理。
若更新后的第一平均值小于更新后的第二平均值,则编码器将与该周期内的第三个视频帧对应的第二压缩信息确定为该当前视频帧的目标压缩信息,也即与该周期内的第三个视频帧对应的目标神经网络为第二神经网络,并继续对t等于3的情况进行处理。
当t的取值为3、4或更大的数值时,编码器的处理方式与t等于2时的处理方式类似,此处不做赘述。
本申请实施例中,技术人员在研究中发现单个周期内的第一评分值和第二评分值的变化规律,并将一个周期内的总评分值的平均值最低做为优化目标,也即在确定与每个当前视频帧对应的目标压缩信息时,不仅要考虑当前视频帧的评分值,还会考虑整个周期内的评分值的平均值,以进一步降低与整个当前视频序列中所有视频帧所对应的评分值,以进一步提高整个当前视频序列所对应的压缩信息的性能;此外,提供了两种不同的实现方式,提高了本方案的实现灵活性。
在另一种实现方式中,编码器也是以一个周期作为计算单位,目标为使得每个周期内的总评分值的平均值最小。且对于t等于0和1时的具体实现方式,可参阅B情况中第一个实现方式中的描述,此处不做赘述。
若编码器执行到t=2的情况,编码器不再获取与一个周期内第三个视频帧(也即当前视频帧的一种示例)对应的第一评分值和第二评分值,也不再重新计算第一拟合公式的系数和偏移量的值以及第二拟合公式的系数和偏移量的值,而是直接获取t=1的情况中计算得到的T的取值,若T<t+2,则编码器可以将与一个周期内的第三个视频帧对应的第一压缩信息确定为该当前视频帧(也即一个周期内的第三个视频帧)的目标压缩信息,也即与一个周期内的第三个视频帧对应的目标神经网络为第一神经网络,并触发进入下一个周期。
若T≥t+2,则编码器可以将与一个周期内的第三个视频帧对应的第二压缩信息确定为该当前视频帧(也即一个周期内的第三个视频帧)的目标压缩信息,也即与一个周期内的第三个视频帧对应的目标神经网络为第二神经网络,并继续对t等于3时的情况进行处理。
当t的取值为3、4或更大的数值时,编码器的处理方式与t等于2时的处理方式类似,此处不做赘述。
为更直观地理解本方案,请参阅图7c,图7c为本申请实施例提供的视频帧的压缩方法中计算第一拟合公式的系数和偏移量的值以及第二拟合公式的系数和偏移量的一个示意图。如图7c所示,两个垂直方向的虚线之间代表的是对一个周期内的视频帧进行处理,一个周期内包括通过第二神经网络对多个视频帧进行压缩编码,和通过第一神经网络对周期内最后一个视频帧进行压缩编码。编码器先获取到与一个周期内的前两个当前视频帧(也即第一个视频帧和第二个视频帧)对应的两个第一评分值,并且获取到与一个周期内的前两个当前视频帧对应的两个第二评分值后,计算得到第一拟合公式的系数和偏移量的值(也 即l pi和k pi的值),以及第二拟合公式的系数和偏移量的值(也即l pr和k pr的值),并基于公式(5),计算该周期内T的最优值。编码器执行到t=2的情况,不再获取与一个周期内第三个视频帧对应的第一评分值和第二评分值,也不再重新计算第一拟合公式的系数和偏移量的值以及第二拟合公式的系数和偏移量的值,应理解,图7c中的示例仅为方便理解本方案,不用于限定本方案。
本申请实施例中,在一个周期内,仅根据与一个周期内的前两个视频帧对应的两个第一评分值和两个第二评分值,计算得到第一拟合公式的系数和偏移量的值以及第二拟合公式的系数和偏移量的值,进而以整个周期内的总评分值的平均值最低为优化目标,得到当前周期内最优的视频帧的数量,由于仍然是以整个周期内的总评分值的平均值最低为优化目标,所以仍然能够进一步降低与整个当前视频序列中所有视频帧所对应的评分值;且由于从t=2的情况后,不再更新第一拟合公式的系数和偏移量的值以及第二拟合公式的系数和偏移量的值,也即节省了第一拟合公式和第二拟合公式的参数的计算时长,从而提高了生成当前视频序列的压缩信息的效率。
在另一种实现方式中,编码器也是以一个周期作为计算单位,目标为使得每个周期内的总评分值的平均值最小。且对于t等于0和1时的具体实现方式,可参阅B情况中第一个实现方式中的描述,此处不做赘述。
若编码器执行到t=2的情况,编码器只获取与一个周期内第三个视频帧(也即当前视频帧的一种示例)对应的第二评分值,不再获取与一个周期内第三个视频帧(也即当前视频帧的一种示例)对应的第一评分值;进而只重新计算第二拟合公式的系数和偏移量的值,不再重新计算第一拟合公式的系数和偏移量的值;编码器根据未更新的第一拟合公式和更新后的第二拟合公式,计算t=2的情况中T的取值。若T<t+2,则编码器可以将与一个周期内的第三个视频帧对应的第一压缩信息确定为该当前视频帧(也即一个周期内的第三个视频帧)的目标压缩信息,也即与一个周期内的第三个视频帧对应的目标神经网络为第一神经网络,并触发进入下一个周期。
若T≥t+2,则编码器可以将与一个周期内的第三个视频帧对应的第二压缩信息确定为该当前视频帧(也即一个周期内的第三个视频帧)的目标压缩信息,也即与一个周期内的第三个视频帧对应的目标神经网络为第二神经网络,并继续对t等于3时的情况进行处理。
当t的取值为3、4或更大的数值时,编码器的处理方式与t等于2时的处理方式类似,此处不做赘述。
本申请实施例中,根据至少一个当前视频帧的第一压缩信息、第一视频帧、当前视频帧的第二压缩信息以及第二视频帧,选取最终需要发送的压缩信息;相对于按照预定的网络选择策略从第一神经网络和第二神经网络中确定目标神经网络,再利用目标神经网络生成目标压缩信息的方式,能够尽量提高整个当前视频序列所对应的压缩信息的性能。
706、编码器生成与目标压缩信息对应的指示信息,指示信息用于指示目标压缩信息通过第一神经网络和第二神经网络中的目标神经网络得到。
707、编码器发送当前视频帧的目标压缩信息。
708、编码器发送与当前视频帧的目标压缩信息对应的指示信息。
本申请实施例中,步骤706和708为必选步骤,步骤706至步骤708的具体实现方式可参阅图3对应实施例中对步骤303至305的描述,此处不做赘述。需要说明的是,本申请实施例不限定步骤707和708的执行顺序,可以同时执行步骤707和708,也可以先执行步骤707,再执行步骤708,也可以先执行步骤708,再执行步骤707。
本申请实施例中,根据至少一个当前视频帧的第一压缩信息、第一视频帧、当前视频帧的第二压缩信息以及第二视频帧,从第一压缩信息和第二压缩信息中选取最终需要发送的压缩信息;相对于按照网络选择策略从多个神经网络中确定目标神经网络,再利用目标神经网络生成目标压缩信息的方式,能够尽量提高整个当前视频序列所对应的压缩信息的性能。
本申请实施例中,请参阅图8,图8为本申请实施例提供的视频帧的压缩方法的另一种流程示意图,本申请实施例提供的视频帧的压缩方法可以包括:
801、编码器通过第一神经网络对第三视频帧进行压缩编码,以得到与第三视频帧对应的第一压缩信息,第一压缩信息包括第三视频帧的第一特征的压缩信息,第三视频帧的参考帧用于第三视频帧的第一特征的压缩过程。
本申请实施例中,编码器在处理到当前视频帧中的第三视频帧时,确定第三视频帧的目标压缩信息为由第一神经网络生成的第三视频帧所对应的第一压缩信息。其中,第三视频帧为当前视频序列中的一个视频帧,第三视频帧的概念与当前视频帧的概念类似,第三视频帧的第一特征的含义可参阅上述图3对应实施例中对“当前视频帧的第一特征”的含义的介绍,“第三视频帧的参考帧”的含义、编码器生成第三视频帧对应的第一压缩信息的具体实现方式,以及编码器确定最后需要发送给解码器的第三视频帧的压缩信息的具体实现方式可参阅图3对应实施例中的描述,此处不做赘述。
802、编码器通过第二神经网络对第四视频帧进行压缩编码,以得到与第四视频帧对应的第二压缩信息,第二压缩信息包括第四视频帧的第二特征的压缩信息,第四视频帧的参考帧用于第四视频帧的第二特征的生成过程,第三视频帧和第四视频帧为同一视频序列中不同的视频帧。
本申请实施例中,编码器在处理到当前视频帧中的第四视频帧时,确定第四视频帧的目标压缩信息为由第二神经网络生成的第四视频帧所对应的第二压缩信息。其中,第四视频帧为当前视频序列中的一个视频帧,第四视频帧的概念与当前视频帧的概念类似,第三视频帧和第四视频帧为同一当前视频序列中不同的视频帧。
其中,第四视频帧的第二特征的含义可参阅上述图3对应实施例中对“当前视频帧的第二特征”的含义的介绍,“第四视频帧的参考帧”的含义、编码器生成第四视频帧对应的第二压缩信息的具体实现方式,以及编码器确定最后需要发送给解码器的第四视频帧的压缩信息的具体实现方式可参阅图3对应实施例中的描述,此处不做赘述。
需要说明的是,本申请实施例不限定步骤801和802的具体实现顺序,可以先执行步骤801,再执行步骤802,也可以先执行步骤802,再执行步骤801,具体需要结合实际应 用场景确定,此处不做限定。
803、编码器生成指示信息,指示信息用于指示第一压缩信息通过第一神经网络得到且第二压缩信息通过第二神经网络得到。
本申请实施例中,与图3对应实施例中的步骤303类似,编码器在生成当前视频序列中的一个或多个当前视频帧的目标压缩信息后,可以生成与一个或多个目标压缩信息一一对应的指示信息,其中,目标压缩信息具体表现为第一压缩信息或第二压缩信息,前述目标压缩信息以及指示信息的含义可参阅上述图3对应实施例中步骤303中的描述,此处不做赘述。
具体的,编码器可以先执行步骤801和802多次后,再通过步骤803生成与整个当前视频序列中每个视频帧的目标压缩信息一一对应的指示信息。或者,编码器也可以为在每执行一次步骤801或执行一次步骤802后,就执行一次步骤803。或者,编码器也可以在执行步骤801和/或步骤802达到预设次数后,执行一次步骤803,该预设次数的取值为大于1的整数,作为示例,例如可以为3、4、5、6或其他数值等,此处不做限定。
需要说明的是,若步骤801或802中,编码器为采用图7a对应实施例中示出的方式来确定当前视频帧(也即第三视频帧或第四视频帧)的目标压缩信息,则步骤803为必选步骤。若步骤801或802中,编码器为采用图3对应实施例中示出的方式来获取当前视频帧(也即第三视频帧或第四视频帧)的目标压缩信息,则步骤803为可选步骤。步骤803的具体实现方式可参阅图3对应实施例中步骤303的描述,此处不做赘述。
804、编码器发送与当前视频帧对应的目标压缩信息,目标压缩信息为第一压缩信息或第二压缩信息。
本申请实施例中,编码器在生成与至少一个第三视频帧一一对应的至少一个第一压缩信息,和/或编码器在生成与至少一个第四视频帧一一对应的至少一个第二压缩信息之后,可以基于FTP协议的约束,向解码器发送与至少一个当前视频帧(也即第三视频帧和/或第四视频帧)一一对应的至少一个目标压缩信息(也即第一压缩信息和/或第二压缩信息)。步骤804的具体实现方式可参阅图3对应实施例中步骤304中的描述,此处不做赘述。
为更直观地理解本方案,请参阅图9,图9为本申请实施例提供的视频帧的压缩方法的一个示意图。如图9所示,编码器采用第三神经网络对当前视频序列中的部分视频帧进行压缩编码,采用第四神经网络对当前视频序列中的另一部分视频帧进行压缩编码,进而发送与当前视频序列中所有当前视频帧所对应的目标压缩信息,目标压缩信息为第一压缩信息或第二压缩信息,应理解,图9中的示例仅为方便理解本方案,不用于限定本方案。
805、编码器发送与与当前视频帧对应的指示信息。
本申请实施例中,步骤805为可选步骤,若未执行步骤803,则不执行步骤805,若执行步骤803,则执行步骤805。若执行步骤805,则步骤805可以与步骤804可以同时执行,步骤805的具体实现方式可参阅上述图3对应实施例中对步骤305的描述,此处不做赘述。
本申请实施例中,当通过第一神经网络对当前视频序列中的第三视频帧进行压缩编码时,第一压缩信息携带的是当前视频帧的第一特征的压缩信息,而当前视频帧的参考帧仅用于当前视频帧的第一特征的压缩过程,不用于当前视频帧的第一特征的生成过程,从而 解码器在根据第一压缩信息执行解压缩操作以得到当前视频帧的第一特征后,不需要借助当前视频帧的参考帧就能够得到当前视频帧的重建帧,所以当目标压缩信息通过第一神经网络得到时,该当前视频帧的重建帧的质量不会依赖于该当前视频帧的参考帧的重建帧的质量,进而避免了误差在逐帧之间累积,以提高视频帧的重建帧的质量;当通过第二神经网络对第四视频帧进行压缩编码时,由于第四视频帧的第二特征是根据第四视频帧的参考帧生成的,第二压缩信息所对应的数据量比第一压缩信息所对应的数据量小,同时利用第一神经网络和第二神经网络,来对当前视频序列中不同的视频帧进行处理,以综合第一神经网络和第二神经网络的优点,以实现在尽量减少需要传输的数据量的基础上,提高视频帧的重建帧的质量。
接下来,结合图10a至图12对解码器执行的步骤进行详细描述,图10a为本申请实施例提供的视频帧的解压缩方法的一种流程示意图,本申请实施例提供的视频帧的解压缩方法可以包括:
1001、解码器接收与至少一个当前视频帧对应的目标压缩信息。
本申请实施例中,编码器可以在FTP协议的约束下,向解码器发送与当前视频序列中的至少一个当前视频帧对应的至少一个目标压缩信息;对应的,解码器可以接收与当前视频序列中的至少一个当前视频帧一一对应的至少一个目标压缩信息。
具体的,在一种实现方式中,解码器可以直接从编码器接收与至少一个当前视频帧对应的目标压缩信息;在另一种实现方式中,解码器也可以从服务器或管理中心等中间设备处,接收与至少一个当前视频帧对应的目标压缩信息。
1002、解码器接收与目标压缩信息对应的指示信息。
本申请的一些实施例中,若编码器发送与至少一个目标压缩信息一一对应的至少一个指示信息,对应的,解码器会接收与至少一个目标压缩信息一一对应的至少一个指示信息。对于指示信息的含义可参阅图3对应实施例中的描述,此处不做赘述。
需要说明的是,步骤1002为可选步骤,若执行步骤1002,本申请实施例不限定步骤1001和1002的执行顺序,可以同时执行步骤1001和1002。
1003、解码器从多个神经网络中选择与当前视频帧对应的目标神经网络,多个神经网络包括第三神经网络和第四神经网络。
本申请实施例中,解码器在得到与至少一个当前视频帧一一对应的至少一个目标压缩信息之后,需要通过从多个神经网络中选择出一个目标神经网络来执行解压缩操作,以得到每个当前视频帧的重建帧。其中,多个神经网络包括第三神经网络和第四神经网络,第三神经网络和第四神经网络均为用于执行解压缩操作的神经网络。
进一步地,第三神经网络与第一神经网络对应,也即若一个当前视频帧的目标压缩信息是通过第一神经网络得到的该当前视频帧的第一压缩信息,则解码器需要通过第三神经网络对该当前视频帧的第一压缩信息执行解压缩操作,以得到该当前视频帧的重建帧。
第四神经网络与第二神经网络对应,也即若一个当前视频帧的目标压缩信息是通过第二神经网络得到的该当前视频帧的第二压缩信息,则解码器需要通过第四神经网络对该当前视频帧的第二压缩信息进行解压缩操作,以得到该当前视频帧的重建帧。
需要说明的是,解码器通过第三神经网络或第四神经网络对目标压缩信息执行解压缩操作的具体实现方式,将在后续实施例中进行描述,此处暂时不做赘述。
针对解码器确定目标神经网络的过程。具体的,在一种实现方式中,若执行步骤1002,则解码器可以直接根据与多个目标压缩信息一一对应的多个指示信息,确定与每个目标压缩信息对应的目标神经网络为第一神经网络和第二神经网络中的哪一个神经网络。
为更直观地理解本方案,请参阅图10b,图10b为本申请实施例提供的视频帧的解压缩方法的另一种流程示意图。如图10b所示,解码器获取与当前视频帧对应的目标压缩信息以及与目标压缩信息对应的指示信息后,可以根据与目标压缩信息对应的指示信息,从第三神经网络和第四神经网络中确定目标神经网络,并利用目标神经网络对当前视频帧所对应的目标压缩信息进行解压缩,得到当前视频帧的重建帧,应理解,图10b中的示例仅为方便理解本方案,不用于限定本方案。
在另一种实现方式中,若未执行步骤1002,则解码器可以获取与每个目标压缩信息一一对应的当前视频帧在当前视频序列中的位置信息,位置信息用于指示与每个目标压缩信息一一对应的当前视频帧为当前视频序列中的第X帧;解码器根据预设规则,从第三神经网络和第四神经网络中,选取与当前视频序列的位置信息对应的目标神经网络。
其中,位置信息的含义均可以参阅图3对应实施例中的描述,此处不做赘述。预设规则可以为按照一定的规律交替选择第三神经网络或第四神经网络,也即解码器在采用第三神经网络对当前视频帧的n个视频帧进行压缩编码,再采用第四神经网络对当前视频帧的m个视频帧进行压缩编码;或者,编码器在采用第四神经网络对当前视频帧的m个视频帧进行压缩编码后,再采用第三神经网络对当前视频帧的n个视频帧进行压缩编码。n和m的取值均可以为大于或等于1的整数,n和m的取值可以相同或不同。
解码器根据预设规则,从包括第三神经网络和第四神经网络的多个神经网络中选取与当前视频序列的位置信息对应的目标神经网络的具体实现方式,与编码器根据网络选择策略,从包括第一神经网络和第二神经网络的多个神经网络中选取与当前视频序列的位置信息对应的目标神经网络的具体实现方式类似,区别在于,将图3对应实施例中的“第一神经网络”替换为本实施例中的“第三神经网络”,将图3对应实施例中的“第二神经网络”替换为本实施例中的“第四神经网络”,可直接参阅上述图3对应实施例中的描述,此处不做赘述。
1004、解码器根据目标压缩信息,通过目标神经网络执行解压缩操作,以得到当前视频帧的重建帧,其中,若目标神经网络为第三神经网络,则目标压缩信息包括当前视频帧的第一特征的第一压缩信息,当前视频帧的参考帧用于第一压缩信息的解压缩过程,以得到当前视频帧的第一特征,当前视频帧的第一特征用于当前视频帧的重建帧的生成过程;若目标神经网络为第四神经网络,则目标压缩信息包括当前视频帧的第二特征的第二压缩信息,第二压缩信息用于供解码器执行解压缩操作以得到当前视频帧的第二特征,当前视频帧的参考帧和当前视频帧的第二特征用于当前视频帧的重建帧的生成过程。
本申请实施例中,若目标神经网络为第三神经网络,目标压缩信息包括当前视频帧的第一特征的第一压缩信息,第三神经网络包括熵解码层和解码网络;其中,通过熵解码层 利用当前视频帧的参考帧执行当前视频帧的第一压缩信息的熵解码过程,通过解码网络利用当前视频帧的第一特征生成当前视频帧的重建帧。
具体的,在目标神经网络为第三神经网络这一情况下,解码器执行步骤1004的具体实现方式可参阅图7a对应实施例中步骤702的描述,区别在于,步骤702中是编码器根据当前视频帧所对应的第一压缩信息,通过第一神经网络进行解压缩处理,得到当前视频帧的重建帧;步骤1004中是解码器根据当前视频帧所对应的第一压缩信息,通过第三神经网络进行解压缩处理,得到当前视频帧的重建帧。
若目标神经网络为第四神经网络,则目标压缩信息包括当前视频帧的第二特征的第二压缩信息,第四神经网络包括熵解码层和卷积网络;其中,通过熵解码层对第二压缩信息进行熵解码,通过卷积网络利用当前视频帧的参考帧和当前视频帧的第二特征执行当前视频帧的重建帧的生成过程。
具体的,在目标神经网络为第四神经网络这一情况下,解码器执行步骤1004的具体实现方式可参阅图7a对应实施例中步骤704的描述,区别在于,步骤704中是编码器根据当前视频帧所对应的第二压缩信息,通过第二神经网络进行解压缩处理,得到当前视频帧的重建帧;步骤1004中是解码器根据当前视频帧所对应的第二压缩信息,通过第四神经网络进行解压缩处理,得到当前视频帧的重建帧。
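A minimal sketch of this fourth-network decoder path follows; `entropy_decode` and `warp` are assumed components standing in for the entropy decoding layer and the motion-compensation step of the convolutional network.

```python
def reconstruct_with_fourth_network(second_bits, ref_hat, entropy_decode, warp):
    """Entropy-decode the second feature (optical flow plus residual), predict
    the current frame from the reconstructed reference frame, add the residual."""
    flow_hat, residual_hat = entropy_decode(second_bits)
    predicted = warp(ref_hat, flow_hat)  # prediction from the reference frame
    return predicted + residual_hat      # reconstructed current frame
```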
本申请实施例还提供了一种视频帧的解压缩方法,请参阅图11,图11为本申请实施例提供的视频帧的解压缩方法的另一种流程示意图,本申请实施例提供的视频帧的解压缩方法可以包括:
1101、解码器接收与当前视频帧对应的目标压缩信息,目标压缩信息为第一压缩信息或第二压缩信息。
1102、解码器接收与当前视频帧对应的指示信息,指示信息用于指示第一压缩信息通过第三神经网络进行解压缩且第二压缩信息通过第四神经网络进行解压缩。
本申请实施例中,步骤1101和1102的具体实现方式,可参阅图10a对应实施例中步骤1001和1002的描述,此处不做赘述。
1103、解码器通过第三神经网络对第三视频帧的第一压缩信息进行解压缩,以得到第三视频帧的重建帧。
本申请实施例中,解码器从多个神经网络中选择由第三神经网络对第三视频帧的第一压缩信息进行解压缩,“从多个神经网络中选择与第三视频帧对应的第三神经网络”的具体实现过程可参阅图10a对应实施例中步骤1003中的描述,此处不做赘述。
其中,第三神经网络包括熵解码层和解码网络,通过熵解码层利用当前视频帧的参考帧执行当前视频帧的第一压缩信息的熵解码过程,通过解码网络利用当前视频帧的第一特征生成当前视频帧的重建帧。解码器通过第三神经网络对第三视频帧的第一压缩信息进行解压缩的具体实现方式可参阅图7a对应实施例中对步骤702的描述,此处不做赘述。
第一压缩信息包括第三视频帧的第一特征的压缩信息,第三视频帧的参考帧用于第一压缩信息的解压缩过程,以得到第三视频帧的第一特征,第三视频帧的第一特征用于第三视频帧的重建帧的生成过程,第三视频帧的重建帧和第三视频帧的参考帧均包括于当前视 频序列。也即解码器在对第一压缩信息进行解压缩处理后,不再需要借助第三视频帧的参考帧就能得到第三视频帧的重建帧。
进一步地,“第三视频帧的第一特征”的含义可参阅上述对“当前视频帧的第一特征”的含义进行理解,“第三视频帧的参考帧”的含义可参阅上述对“当前视频帧的参考帧”的含义进行理解,此处均不做赘述。第三视频帧的重建帧指的是利用第一压缩信息执行解压缩操作得到的与第三视频帧对应的视频帧。
1104、解码器通过第四神经网络对第四视频帧的第二压缩信息进行解压缩,以得到第四视频帧的重建帧。
本申请实施例中,解码器从多个神经网络中选择由第四神经网络对第四视频帧的第一压缩信息进行解压缩,“从多个神经网络中选择与第四视频帧对应的第四神经网络”的具体实现过程可参阅图10a对应实施例中步骤1003中的描述,此处不做赘述。
其中,第四神经网络包括熵解码层和卷积网络,通过熵解码层对第二压缩信息进行熵解码,通过卷积网络利用当前视频帧的参考帧和当前视频帧的第二特征执行当前视频帧的重建帧的生成过程。解码器通过第四神经网络对第四视频帧的第二压缩信息进行解压缩的具体实现方式可参阅图7a对应实施例中对步骤704的描述,此处不做赘述。
第二压缩信息包括第四视频帧的第二特征的压缩信息,第二压缩信息用于供解码器执行解压缩操作以得到第四视频帧的第二特征,第四视频帧的参考帧和第四视频帧的第二特征用于第四视频帧的重建帧的生成过程,第四视频帧的重建帧和第四视频帧的参考帧均包括于当前视频序列。
进一步地,“第四视频帧的第二特征”的含义可参阅上述对“当前视频帧的第二特征”的含义进行理解,“第四视频帧的参考帧”的含义可参阅上述对“当前视频帧的参考帧”的含义进行理解,此处均不做赘述。第四视频帧的重建帧指的是利用第二压缩信息执行解压缩操作得到的与第四视频帧对应的视频帧。
二、训练阶段
请参阅图12,图12为本申请实施例提供的视频帧的压缩以及解压缩系统的训练方法一种流程示意图,本申请实施例提供的视频帧的压缩以及解压缩系统的训练方法可以包括:
1201、训练设备通过第一神经网络对第一训练视频帧进行压缩编码,以得到与第一训练视频帧对应的第一压缩信息。
本申请实施例中,训练设备中预先存储有训练数据集合,训练数据集合中包括多个第一训练视频帧,步骤1201的具体实现方式可参阅图8对应实施例中步骤801的描述,此处不做赘述。区别在于,第一,将步骤801中的“第三视频帧”替换为本实施例中的“第一训练视频帧”;第二,步骤1201中训练设备不需要执行从第一神经网络和第二神经网络中选取目标神经网络的步骤,或者说,步骤1201中训练设备不需要执行从第一压缩信息和第二压缩信息中选取目标压缩信息的步骤。
1202、训练设备通过第三神经网络对第一训练视频帧的第一压缩信息进行解压缩,以得到第一训练重建帧。
本申请实施例中,训练设备执行步骤1202的具体实现方式可参阅图11对应实施例中 步骤1103的描述,此处不做赘述。区别在于,第一,将步骤1103中的“第三视频帧”替换为本实施例中的“第一训练视频帧”;第二,步骤1202中训练设备不需要执行从第三神经网络和第四神经网络中选取目标神经网络的步骤。
1203、训练设备根据第一训练视频帧、第一训练重建帧、第一压缩信息和第一损失函数,对第一神经网络和第三神经网络进行训练,直至满足预设条件。
本申请实施例中,训练设备根据第一训练视频帧、第一训练重建帧以及第一训练视频帧所对应的第一压缩信息,可以通过第一损失函数,对第一神经网络和第三神经网络进行迭代训练,直至满足第一损失函数的收敛条件。
其中,第一损失函数包括第一训练视频帧和第一训练重建帧之间的相似度的损失项和第一训练视频帧的第一压缩信息的数据大小的损失项,第一训练重建帧为第一训练视频帧的重建帧。第一损失函数的训练目标包括拉近第一训练视频帧和第一训练重建帧之间的相似度。第一损失函数的训练目标还包括减小第一训练视频帧的第一压缩信息的大小。第一神经网络指的是对视频帧进行压缩编码过程中所采用到的神经网络;第二神经网络指的是基于压缩信息,执行解压缩操作的神经网络。
具体的,训练设备根据第一训练视频帧、第一训练重建帧以及第一训练视频帧所对应的第一压缩信息,可以计算第一损失函数的函数值,并根据第一损失函数的函数值生成梯度值,继而反向更新第一神经网络和第三神经网络的权重参数,以完成对第一神经网络和第三神经网络的一次训练,训练设备通过重复执行步骤1201至1203,以实现对第一神经网络和第三神经网络的迭代训练。
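Below is a minimal PyTorch sketch of one such training iteration under the stated rate-distortion loss; the MSE distortion term, the differentiable rate estimate `bits_fn`, and the trade-off weight `lam` are illustrative assumptions.

```python
import torch

def rd_training_step(frame, encoder, decoder, bits_fn, optimizer, lam=0.01):
    """One iteration: a loss term pulling the reconstruction toward the
    training frame plus a term for the size of its compression information."""
    y = encoder(frame)                             # encoder-side network
    recon = decoder(y)                             # decoder-side network
    distortion = torch.mean((frame - recon) ** 2)  # similarity loss term
    rate = bits_fn(y)                              # estimated compressed size
    loss = distortion + lam * rate
    optimizer.zero_grad()
    loss.backward()                                # gradients drive the weight update
    optimizer.step()
    return loss.item()
```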
1204、训练设备根据第二训练视频帧的参考帧,通过第二神经网络对第二训练视频帧进行压缩编码,得到与第二训练视频帧对应的第二压缩信息,第二训练视频帧的参考帧为训练后的第一神经网络处理过的视频帧。
本申请实施例中,训练设备执行步骤1202的具体实现方式可参阅图8对应实施例中步骤802的描述,此处不做赘述。区别在于,第一,将步骤802中的“第四视频帧”替换为本实施例中的“第二训练视频帧”;第二,步骤1204中训练设备不需要执行从第一神经网络和第二神经网络中选取目标神经网络的步骤,或者说,步骤1204中训练设备不需要执行从第一压缩信息和第二压缩信息中选取目标压缩信息的步骤。
其中,第二训练视频帧的参考帧可以为训练数据集合中的原始的视频帧,也可以为经过成熟的第一神经网络(也即执行过训练操作的第一神经网络)处理过的视频帧。
具体的,在一种实现方式中,由于第一神经网络包括第一编码网络,第三神经网络包括第一解码网络,则训练设备可以将第二训练视频帧的原始的参考帧输入成熟的第一神经网络(也即执行过训练操作的第一神经网络)中的第一编码网络中,以对第二训练视频帧进行编码操作,得到编码结果,将前述编码结果输入成熟的第三神经网络(也即执行过训练操作的第三神经网络)中的第一解码网络中,以对该编码结果执行解码操作,得到第二训练视频帧的处理后的参考帧。进而训练设备将前述第二训练视频帧的处理后的参考帧和第二训练视频帧输入至第二神经网络,以通过第二神经网络生成第二训练视频帧所对应的第二压缩信息。
在另一种实现方式中,训练设备可以将第二训练视频帧的原始的参考帧输入成熟的第一神经网络,以通过成熟的第一神经网络生成第二训练视频帧的原始的参考帧所对应的第一压缩信息,并利用成熟的第三神经网络,根据第二训练视频帧的原始的参考帧所对应的第一压缩信息执行解压缩操作,得到第二训练视频帧的处理后的参考帧。进而训练设备将前述第二训练视频帧的处理后的参考帧和第二训练视频帧输入至第二神经网络,以通过第二神经网络生成第二训练视频帧所对应的第二压缩信息。
本申请实施例中,由于在执行阶段,第二神经网络所采用的参考帧可能是经过第一神经网络处理过的,则采用由第一神经网络处理过的参考帧来对第二神经网络执行训练操作,有利于保持训练阶段和执行阶段的一致性,以提高执行阶段的准确率。
1205、训练设备通过第四神经网络对第二训练视频帧的第二压缩信息进行解压缩,以得到第二训练重建帧。
本申请实施例中,训练设备执行步骤1202的具体实现方式可参阅图11对应实施例中步骤1104的描述,此处不做赘述。区别在于,第一,将步骤1104中的“第四视频帧”替换为本实施例中的“第二训练视频帧”;第二,步骤1205中训练设备不需要执行从第三神经网络和第四神经网络中选取目标神经网络的步骤。
1206、训练设备根据第二训练视频帧、第二训练重建帧、第二压缩信息和第二损失函数,对第二神经网络和第四神经网络进行训练,直至满足预设条件。
本申请实施例中,训练设备根据第二训练视频帧、第二训练重建帧以及第二训练视频帧所对应的第二压缩信息,可以通过第二损失函数,对第二神经网络和第四神经网络进行迭代训练,直至满足第二损失函数的收敛条件。
其中,第二损失函数包括第二训练视频帧和第二训练重建帧之间的相似度的损失项和第二训练视频帧的第二压缩信息的数据大小的损失项,第二训练重建帧为第二训练视频帧的重建帧。第二损失函数的训练目标包括拉近第二训练视频帧和第二训练重建帧之间的相似度。第二损失函数的训练目标还包括减小第二训练视频帧的第二压缩信息的大小。第二神经网络指的是对视频帧进行压缩编码过程中所采用到的神经网络;第四神经网络指的是基于压缩信息,执行解压缩操作的神经网络。
具体的,训练设备根据第二训练视频帧、第二训练重建帧以及第二训练视频帧所对应的第二压缩信息,可以计算第二损失函数的函数值,并根据第二损失函数的函数值生成梯度值,继而反向更新第二神经网络和第四神经网络的权重参数,以完成对第二神经网络和第四神经网络的一次训练,训练设备通过重复执行步骤1204至1206,以实现对第二神经网络和第四神经网络的迭代训练。
由于第一神经网络和第三神经网络均由多个独立的神经网络模块组成,对应的,第二神经网络和第四神经网络也由多个独立的神经网络模块组成。其中,独立的神经网络模块指的是具有独立功能的神经网络模块,作为示例,例如第一神经网络中的第一编码网络就是一个独立的神经网络模块,作为另一示例,例如第二神经网络中的第一解码网络就是一个独立的神经网络模块。
则可选地,若第二神经网络和第四神经网络与第一神经网络和第三神经网络中存在相 同的神经网络模块,则可以先根据训练后的第一神经网络和训练后的第三神经网络初始化第二神经网络和第四神经网络的参数,也即将训练后的第一神经网络和训练后的第三神经网络中的参数赋值给前述相同的神经网络模块,并在第二神经网络和第四神经网络的训练过程中,保持前述相同的神经网络模块的参数不变,对第二神经网络和第四神经网络中的剩余神经网络模块的参数进行调整,以减少第二神经网络和第四神经网络的训练过程的总时长,提高第二神经网络和第四神经网络的训练效率。
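A minimal PyTorch sketch of this initialize-then-freeze scheme follows; matching shared modules by parameter-name prefix is an illustrative assumption about how the networks are organized.

```python
import torch.nn as nn

def init_and_freeze_shared(trained: nn.Module, target: nn.Module, shared_prefixes):
    """Copy the parameters of modules shared with the already-trained network
    into the network to be trained, then keep them fixed during training."""
    trained_state = trained.state_dict()
    target_state = target.state_dict()
    for name in target_state:
        if name in trained_state and any(name.startswith(p) for p in shared_prefixes):
            target_state[name] = trained_state[name]
    target.load_state_dict(target_state)
    for name, param in target.named_parameters():
        if any(name.startswith(p) for p in shared_prefixes):
            param.requires_grad = False  # shared module parameters stay unchanged
```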
本申请实施例中,不仅提供了神经网络的执行过程,还提供了神经网络的训练过程,扩展了本方案的应用场景,提高了本方案的全面性。
为了对本申请实施例所带来的有益效果有更为直观的认识,以下结合附图对本申请实施例所带来的有益效果作进一步的介绍,本实验中以每采用第一神经网络对一个视频帧执行压缩操作后,就采用第二神经网络对一个视频帧执行压缩操作为例,以下通过表1对实验数据进行展示。
表1
（表1的具体实验数据在原文中以图片形式给出：其比较了三组不同分辨率的视频序列下，本申请方案与仅采用第二神经网络的方案所得到的图像质量。）
其中,参阅上述表1可知,在三组不同的分辨率的视频序列中,采用本申请实施例提供的方案对视频序列中的视频帧进行压缩,相对于仅采用第二神经网络对视频序列中的视频帧进行压缩,图像质量均得到了提升。
本实验中以生成第一拟合公式的偏移量和斜率,并生成第二拟合公式的偏移量和斜率,且不断更新第一拟合公式和第二拟合公式的偏移量和斜率为例,以下通过表2对实验数据进行展示。
表2
（表2的具体实验数据在原文中以图片形式给出：其比较了两组不同分辨率的视频序列下，本申请方案与仅采用第二神经网络的方案所得到的图像质量。）
其中,参阅上述表2可知,在两组不同的分辨率的视频序列中,采用本申请实施例提供的方案对视频序列中的视频帧进行压缩,相对于仅采用第二神经网络对视频序列中的视频帧进行压缩,图像质量均得到了提升。
接下来介绍本申请实施例还提供了一种视频编解码系统,请参阅图13,图13为本申请实施例提供的视频编解码系统的一种系统架构图,图13为示例性视频编解码系统10的示意性框图,视频编解码系统10中的视频编码器20(或简称为编码器20)和视频解码器30(或简称为解码器30)代表可用于基于本申请中描述的各种示例执行各技术的设备等。
如图13所示,视频编解码系统10包括源设备12,源设备12用于将编码图像等编码 图像数据21提供给用于对编码图像数据21进行解码的目的设备14。
源设备12包括编码器20,另外即可选地,可包括图像源16、图像预处理器等预处理器(或预处理单元)18、通信接口(或通信单元)22。
图像源16可包括或可以为任意类型的用于捕获现实世界图像等的图像捕获设备,和/或任意类型的图像生成设备,例如用于生成计算机动画图像的计算机图形处理器或任意类型的用于获取和/或提供现实世界图像、计算机生成图像(例如,屏幕内容、虚拟现实(virtual reality,VR)图像和/或其任意组合(例如增强现实(augmented reality,AR)图像)的设备。所述图像源可以为存储上述图像中的任意图像的任意类型的内存或存储器。
为了区分预处理器(或预处理单元)18执行的处理,图像(或图像数据17)也可称为原始图像(或原始图像数据)17。
预处理器18用于接收(原始)图像数据17,并对图像数据17进行预处理,得到预处理图像(或预处理图像数据)19。例如,预处理器18执行的预处理可包括修剪、颜色格式转换(例如从RGB转换为YCbCr)、调色或去噪。可以理解的是,预处理单元18可以为可选组件。
视频编码器(或编码器)20用于接收预处理图像数据19并提供编码图像数据21。
源设备12中的通信接口22可用于:接收编码图像数据21并通过通信信道13向目的设备14等另一设备或任何其它设备发送编码图像数据21(或其它任意处理后的版本),以便存储或直接重建。
目的设备14包括解码器30,另外即可选地,可包括通信接口(或通信单元)28、后处理器(或后处理单元)32和显示设备34。
目的设备14中的通信接口28用于直接从源设备12或从存储设备等任意其它源设备接收编码图像数据21(或其它任意处理后的版本),例如,存储设备为编码图像数据存储设备,并将编码图像数据21提供给解码器30。
通信接口22和通信接口28可用于通过源设备12与目的设备14之间的直连通信链路,例如直接有线或无线连接等,或者通过任意类型的网络,例如有线网络、无线网络或其任意组合、任意类型的私网和公网或其任意类型的组合,发送或接收编码图像数据(或编码数据)21。
例如,通信接口22可用于将编码图像数据21封装为报文等合适的格式,和/或使用任意类型的传输编码或处理来处理所述编码后的图像数据,以便在通信链路或通信网络上进行传输。
通信接口28与通信接口22对应,例如,可用于接收传输数据,并使用任意类型的对应传输解码或处理和/或解封装对传输数据进行处理,得到编码图像数据21。
通信接口22和通信接口28均可配置为如图13中从源设备12指向目的设备14的对应通信信道13的箭头所指示的单向通信接口,或双向通信接口,并且可用于发送和接收消息等,以建立连接,确认并交换与通信链路和/或例如编码后的图像数据传输等数据传输相关的任何其它信息,等等。
视频解码器(或解码器)30用于接收编码图像数据21并提供解码图像数据(或解码 图像数据)31,其中,解码图像数据也可以称为重建后的图像数据、视频帧的重建帧或其他名称等,指的是基于编码图像数据21进行解压缩操作后得到的图像数据。
后处理器32用于对解码后的图像等解码图像数据31进行后处理,得到后处理后的图像等后处理图像数据33。后处理器32执行的后处理可以包括例如颜色格式转换(例如从YCbCr转换为RGB)、调色、修剪或重采样,或者用于产生供显示设备34等显示的解码图像数据31等任何其它处理。
显示设备34用于接收后处理图像数据33,以向用户或观看者等显示图像。显示设备34可以为或包括任意类型的用于表示重建后图像的显示器,例如,集成或外部显示屏或显示器。例如,显示屏可包括液晶显示器(liquid crystal display,LCD)、有机发光二极管(organic light emitting diode,OLED)显示器、等离子显示器、投影仪、微型LED显示器、硅基液晶显示器(liquid crystal on silicon,LCoS)、数字光处理器(digital light processor,DLP)或任意类型的其它显示屏。
视频编解码系统10还包括训练引擎25,训练引擎25用于训练编码器20或解码器30中的神经网络,也即上述方法实施例中示出的第一神经网络、第二神经网络、第三神经网络和第四神经网络。训练数据可以存入数据库(未示意)中,训练引擎25基于训练数据训练得到神经网络。需要说明的是,本申请实施例对于训练数据的来源不做限定,例如可以是从云端或其他地方获取训练数据进行神经网络训练。
训练引擎25训练得到的神经网络可以应用于视频编解码系统10以及视频编解码系统40中,例如,应用于图13所示的源设备12(例如编码器20)或目的设备14(例如解码器30)。训练引擎25可以在云端训练得到上述神经网络,然后视频编解码系统10从云端下载并使用该神经网络。
尽管图13示出了源设备12和目的设备14作为独立的设备,但设备实施例也可以同时包括源设备12和目的设备14或同时包括源设备12和目的设备14的功能,即同时包括源设备12或对应功能和目的设备14或对应功能。在这些实施例中,源设备12或对应功能和目的设备14或对应功能可以使用相同硬件和/或软件或通过单独的硬件和/或软件或其任意组合来实现。
基于描述,图13所示的源设备12和/或目的设备14中的不同单元或功能的存在和(准确)划分可能基于实际设备和应用而有所不同,这对技术人员来说是显而易见的。
请参阅图14,图14为本申请实施例提供的视频编解码系统的另一种系统架构图,结合上述图13进行描述,编码器20(例如视频编码器20)或解码器30(例如视频解码器30)或两者都可通过如图14所示的处理电路实现,例如一个或多个微处理器、数字信号处理器(digital signal processor,DSP)、专用集成电路(application-specific integrated circuit,ASIC)、现场可编程门阵列(field-programmable gate array,FPGA)、离散逻辑、硬件、视频编码专用处理器或其任意组合。编码器20可以通过处理电路46实现,以包含参照图14编码器20论述的各种模块和/或本文描述的任何其它解码器系统或子系统。解码器30可以通过处理电路46实现,以包含参照图15解码器30论述的各种模块和/或本文描述的任何其它解码器系统或子系统。所述处理电路46可用于执行下文论述的各种操作。如图16所示,如 果部分技术在软件中实施,则设备可以将软件的指令存储在合适的非瞬时性计算机可读存储介质中,并且使用一个或多个处理器在硬件中执行指令,从而执行本申请技术。视频编码器20和视频解码器30中的其中一个可作为组合编解码器(encoder/decoder,CODEC)的一部分集成在单个设备中,如图14所示。
源设备12和目的设备14可包括各种设备中的任一种,包括任意类型的手持设备或固定设备,例如,笔记本电脑或膝上型电脑、手机、智能手机、平板或平板电脑、相机、台式计算机、机顶盒、电视机、显示设备、数字媒体播放器、视频游戏控制台、视频流设备(例如,内容业务服务器或内容分发服务器)、广播接收设备、广播发射设备,等等,并可以不使用或使用任意类型的操作系统。在一些情况下,源设备12和目的设备14可配备用于无线通信的组件。因此,源设备12和目的设备14可以是无线通信设备。
在一些情况下,图13所示的视频编解码系统10仅仅是示例性的,本申请提供的技术可适用于视频编码设置(例如,视频编码或视频解码),这些设置不一定包括编码设备与解码设备之间的任何数据通信。在其它示例中,数据从本地存储器中检索,通过网络发送,等等。视频编码设备可以对数据进行编码并将数据存储到存储器中,和/或视频解码设备可以从存储器中检索数据并对数据进行解码。在一些示例中,编码和解码由相互不通信而只是编码数据到存储器和/或从存储器中检索并解码数据的设备来执行。
图14是基于一示例性实施例的包含视频编码器20和/或视频解码器30的视频编解码系统40的实例的说明图。视频编解码系统40可以包含成像设备41、视频编码器20、视频解码器30(和/或藉由处理电路46实施的视频编/解码器)、天线42、一个或多个处理器43、一个或多个内存存储器44和/或显示设备45。
如图14所示,成像设备41、天线42、处理电路46、视频编码器20、视频解码器30、处理器43、内存存储器44和/或显示设备45能够互相通信。在不同实例中,视频编解码系统40可以只包含视频编码器20或只包含视频解码器30。
在一些实例中,天线42可以用于传输或接收视频数据的经编码比特流。另外,在一些实例中,显示设备45可以用于呈现视频数据。处理电路46可以包含专用集成电路(application-specific integrated circuit,ASIC)逻辑、图形处理器、通用处理器等。视频编解码系统40也可以包含可选的处理器43,该可选处理器43类似地可以包含专用集成电路(application-specific integrated circuit,ASIC)逻辑、图形处理器、通用处理器等。另外,内存存储器44可以是任何类型的存储器,例如易失性存储器(例如,静态随机存取存储器(static random access memory,SRAM)、动态随机存储器(dynamic random access memory,DRAM)等)或非易失性存储器(例如,闪存等)等。在非限制性实例中,内存存储器44可以由超速缓存内存实施。在其它实例中,处理电路46可以包含存储器(例如,缓存等)用于实施图像缓冲器等。
在一些实例中,通过逻辑电路实施的视频编码器20可以包含(例如,通过处理电路46或内存存储器44实施的)图像缓冲器和(例如,通过处理电路46实施的)图形处理单元。图形处理单元可以通信耦合至图像缓冲器。图形处理单元可以包含通过处理电路46实施的视频编码器20,以实施参照图14的视频解码器20和/或本文中所描述的任何其它编码 器系统或子系统所论述的各种模块。逻辑电路可以用于执行本文所论述的各种操作。
在一些实例中,视频解码器30可以以类似方式通过处理电路46实施,以实施参照图14的视频解码器30和/或本文中所描述的任何其它解码器系统或子系统所论述的各种模块。在一些实例中,逻辑电路实施的视频解码器30可以包含(通过处理电路46或内存存储器44实施的)图像缓冲器和(例如,通过处理电路46实施的)图形处理单元。图形处理单元可以通信耦合至图像缓冲器。图形处理单元可以包含通过处理电路46实施的视频解码器30。
在一些实例中,天线42可以用于接收视频数据的经编码比特流。如所论述,经编码比特流可以包含本文所论述的与编码视频帧相关的数据、指示符、索引值、模式选择数据等,例如与编码分割相关的数据(例如,变换系数或经量化变换系数,(如所论述的)可选指示符,和/或定义编码分割的数据)。视频编解码系统40还可包含耦合至天线42并用于解码经编码比特流的视频解码器30。显示设备45用于呈现视频帧。
应理解,本申请实施例中对于参考视频编码器20所描述的实例,视频解码器30可以用于执行相反过程。关于信令语法元素,视频解码器30可以用于接收并解析这种语法元素,相应地解码相关视频数据。在一些例子中,视频编码器20可以将语法元素熵编码成经编码视频比特流。在此类实例中,视频解码器30可以解析这种语法元素,并相应地解码相关视频数据。
需要说明的是,本申请所描述的编解码过程存在于绝大部分视频编解码器中,例如H.263、H.264、MPEG-2、MPEG-4、VP8、VP9、基于AI的端到端的图像编码等对应的编解码器中。
图15为本申请实施例提供的视频译码设备400的一种示意图。视频译码设备400适用于实现本文描述的公开实施例。在一个实施例中,视频译码设备400可以是解码器,例如图14中的视频解码器30,也可以是编码器,例如图14中的视频编码器20。
视频译码设备400包括:用于接收数据的入端口410(或输入端口410)和接收单元(receiver unit,Rx)420;用于处理数据的处理器、逻辑单元或中央处理器(central processing unit,CPU)430;例如,这里的处理器430可以是神经网络处理器430;用于传输数据的发送单元(transmitter unit,Tx)440和出端口450(或输出端口450);用于存储数据的存储器460。视频译码设备400还可包括耦合到入端口410、接收单元420、发送单元440和出端口450的光电(optical-to-electrical,OE)组件和电光(electrical-to-optical,EO)组件,用于光信号或电信号的出口或入口。
处理器430通过硬件和软件实现。处理器430可实现为一个或多个处理器芯片、核(例如,多核处理器)、FPGA、ASIC和DSP。处理器430与入端口410、接收单元420、发送单元440、出端口450和存储器460通信。处理器430包括译码模块470(例如,基于神经网络NN的译码模块470)。译码模块470实施上文所公开的实施例。例如,译码模块470执行、处理、准备或提供各种编码操作。因此,通过译码模块470为视频译码设备400的功能提供了实质性的改进,并且影响了视频译码设备400到不同状态的切换。或者,以存储在存储器460中并由处理器430执行的指令来实现译码模块470。
存储器460包括一个或多个磁盘、磁带机和固态硬盘,可以用作溢出数据存储设备,用于在选择执行程序时存储此类程序,并且存储在程序执行过程中读取的指令和数据。存储器460可以是易失性和/或非易失性的,可以是只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、三态内容寻址存储器(ternary content-addressable memory,TCAM)和/或静态随机存取存储器(static random-access memory,SRAM)。
请参阅图16,图16为示例性实施例提供的装置500的一种简化框图,装置500可用作图13中的源设备12和目的设备14中的任一个或两个。
装置500中的处理器502可以是中央处理器。或者,处理器502可以是现有的或今后将研发出的能够操控或处理信息的任何其它类型设备或多个设备。虽然可以使用如图所示的处理器502等单个处理器来实施已公开的实现方式,但使用一个以上的处理器速度更快和效率更高。
在一种实现方式中,装置500中的存储器504可以是只读存储器(ROM)设备或随机存取存储器(RAM)设备。任何其它合适类型的存储设备都可以用作存储器504。存储器504可以包括处理器502通过总线512访问的代码和数据506。存储器504还可包括操作系统508和应用程序510,应用程序510包括允许处理器502执行本文所述方法的至少一个程序。例如,应用程序510可以包括应用1至N,还包括执行本文所述方法的视频译码应用。
装置500还可以包括一个或多个输出设备,例如显示器518。在一个示例中,显示器518可以是将显示器与可用于感测触摸输入的触敏元件组合的触敏显示器。显示器518可以通过总线512耦合到处理器502。
虽然装置500中的总线512在本文中描述为单个总线,但是总线512可以包括多个总线。此外,辅助储存器可以直接耦合到装置500的其它组件或通过网络访问,并且可以包括存储卡等单个集成单元或多个存储卡等多个单元。因此,装置500可以具有各种各样的配置。

Claims (21)

  1. 一种视频帧的压缩方法,其特征在于,所述方法包括:
    根据网络选择策略从多个神经网络中确定目标神经网络,所述多个神经网络包括第一神经网络和第二神经网络;
    通过所述目标神经网络对当前视频帧进行压缩编码,以得到与所述当前视频帧对应的压缩信息;
    其中,若所述压缩信息通过所述第一神经网络得到,则所述压缩信息包括所述当前视频帧的第一特征的第一压缩信息,所述当前视频帧的参考帧用于所述当前视频帧的第一特征的压缩过程;
    若所述压缩信息通过所述第二神经网络得到,则所述压缩信息包括所述当前视频帧的第二特征的第二压缩信息,所述当前视频帧的参考帧用于当前视频帧的第二特征的生成过程。
  2. 根据权利要求1所述的方法,其特征在于,
    所述第一神经网络包括编码Encoding网络和熵编码层,其中,通过编码网络从所述当前视频帧中获取所述当前视频帧的第一特征;通过熵编码层利用所述当前视频帧的参考帧执行所述当前视频帧的第一特征的熵编码过程,以输出所述第一压缩信息;
    和/或
    所述第二神经网络包括卷积网络和熵编码层,卷积网络包括多个卷积层和激励ReLU层,其中,通过卷积网络利用所述当前视频帧的参考帧得到所述当前视频帧的残差,通过所述熵编码层对所述当前视频帧的残差进行熵编码处理,以输出所述第二压缩信息,其中所述第二特征为所述残差。
  3. The method according to claim 1 or 2, characterized in that the network selection policy is related to any one or more of the following factors: position information of the current video frame, or an amount of data carried by the current video frame.
  4. The method according to claim 3, characterized in that the determining a target neural network from a plurality of neural networks according to a network selection policy comprises:
    selecting the target neural network from the plurality of neural networks according to position information of the current video frame in a current video sequence, the position information indicating that the current video frame is an X-th frame of the current video sequence; or,
    the determining a target neural network from a plurality of neural networks according to a network selection policy comprises:
    selecting the target neural network from the plurality of neural networks according to an attribute of the current video frame, wherein the attribute of the current video frame indicates the amount of data carried by the current video frame, and the attribute of the current video frame comprises any one or a combination of more than one of the following: the entropy, contrast, and saturation of the current video frame.
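As a non-limiting illustration of the two selection strategies in claim 4 (a minimal Python sketch; the period of 4, the entropy threshold of 6.0, and the helper frame_entropy are assumptions of this example, and contrast or saturation could stand in for entropy as the attribute):

    import numpy as np

    def frame_entropy(frame):
        # Shannon entropy of the 8-bit pixel histogram, one possible
        # proxy for the amount of data carried by the frame.
        hist, _ = np.histogram(frame, bins=256, range=(0, 256), density=True)
        hist = hist[hist > 0]
        return float(-(hist * np.log2(hist)).sum())

    def select_by_position(frame_index, period=4):
        # Position-based policy: e.g., route every 'period'-th frame of
        # the sequence to the first neural network, others to the second.
        return "first" if frame_index % period == 0 else "second"

    def select_by_attribute(frame, threshold=6.0):
        # Attribute-based policy: frames carrying more data (higher
        # entropy here) are routed to the first neural network.
        return "first" if frame_entropy(frame) >= threshold else "second"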
  5. The method according to any one of claims 1 to 4, characterized in that the method further comprises:
    generating and sending indication information corresponding to the compression information, the indication information indicating the target neural network, among the first neural network and the second neural network, by which the compression information is obtained.
  6. The method according to any one of claims 1 to 5, characterized in that, if the target neural network is the first neural network, the performing compression encoding on a current video frame by using the target neural network, to obtain compression information corresponding to the current video frame comprises:
    obtaining the first feature of the current video frame from the current video frame through the encoding network;
    predicting, through the entropy encoding layer, the first feature of the current video frame according to the reference frame of the current video frame, to generate a prediction feature of the current video frame, the prediction feature of the current video frame being a prediction result of the first feature of the current video frame;
    generating, through the entropy encoding layer, a probability distribution of the first feature of the current video frame according to the prediction feature of the current video frame;
    performing, through the entropy encoding layer, entropy encoding on the first feature of the current video frame according to the probability distribution of the first feature of the current video frame, to obtain the first compression information.
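The four steps of claim 6 can be illustrated with a conditional Gaussian entropy model (a hedged sketch only: the predictor architecture, the mean/scale Gaussian, and the bit-count shortcut are assumptions of this example; an actual codec would drive an arithmetic coder with these probabilities to emit the first compression information):

    import torch
    import torch.nn as nn

    class ConditionalEntropyModel(nn.Module):
        # Entropy-encoding-layer sketch: predicts the first feature of
        # the current frame from the reference frame's feature, derives a
        # per-element probability distribution, and estimates coded length.
        def __init__(self, ch=64):
            super().__init__()
            self.predictor = nn.Sequential(
                nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                nn.Conv2d(ch, 2 * ch, 3, padding=1),
            )

        def forward(self, reference_feature, current_feature):
            # Steps 1-2: prediction feature -> mean/scale of the
            # probability distribution of the first feature.
            mean, scale = self.predictor(reference_feature).chunk(2, dim=1)
            scale = nn.functional.softplus(scale) + 1e-6
            dist = torch.distributions.Normal(mean, scale)
            # Step 3: probability of each quantized latent over its
            # quantization bin [-0.5, 0.5).
            q = torch.round(current_feature)
            prob = dist.cdf(q + 0.5) - dist.cdf(q - 0.5)
            # Step 4: an arithmetic coder driven by these probabilities
            # approaches this bit count; here it is only estimated.
            bits = -torch.log2(prob.clamp_min(1e-9)).sum()
            return q, bits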
  7. A video frame compression method, characterized in that the method comprises:
    performing compression encoding on a current video frame through a first neural network, to obtain first compression information of a first feature of the current video frame, a reference frame of the current video frame being used in a compression process of the first feature of the current video frame;
    generating a first video frame through the first neural network, the first video frame being a reconstructed frame of the current video frame;
    performing compression encoding on the current video frame through a second neural network, to obtain second compression information of a second feature of the current video frame, the reference frame of the current video frame being used in a generation process of the second feature of the current video frame;
    generating a second video frame through the second neural network, the second video frame being a reconstructed frame of the current video frame;
    determining, according to the first compression information, the first video frame, the second compression information, and the second video frame, compression information corresponding to the current video frame, wherein the determined compression information is obtained through the first neural network and is the first compression information; or, the determined compression information is obtained through the second neural network and is the second compression information.
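One natural reading of claim 7 is an encoder-side competition between the two networks, decided by a rate-distortion cost. The sketch below assumes a Lagrangian cost J = D + λ·R with mean squared error as the distortion; the weight lam and the (payload, size_in_bits) convention are assumptions of this example.

    import numpy as np

    def rd_cost(bits, reconstruction, original, lam=0.1):
        # Lagrangian cost J = D + lambda * R, with mean squared error as
        # the distortion D and the bitstream length as the rate R.
        mse = float(np.mean((reconstruction.astype(np.float64)
                             - original.astype(np.float64)) ** 2))
        return mse + lam * bits

    def choose_compression(first_info, first_frame, second_info,
                           second_frame, current_frame, lam=0.1):
        # first_info/second_info: (payload, size_in_bits) produced by the
        # first and second neural networks for the same current frame;
        # the encoding with the lower rate-distortion cost is kept.
        j1 = rd_cost(first_info[1], first_frame, current_frame, lam)
        j2 = rd_cost(second_info[1], second_frame, current_frame, lam)
        return first_info if j1 <= j2 else second_info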
  8. The method according to claim 7, characterized in that
    the first neural network comprises an encoding network and an entropy encoding layer, wherein the first feature of the current video frame is obtained from the current video frame through the encoding network, and entropy encoding is performed on the first feature of the current video frame through the entropy encoding layer, to output the first compression information;
    and/or
    the second neural network comprises a convolutional network and an entropy encoding layer, the convolutional network comprising a plurality of convolutional layers and rectified linear unit (ReLU) layers, wherein a residual of the current video frame is obtained through the convolutional network by using the reference frame of the current video frame, and entropy encoding is performed on the residual of the current video frame through the entropy encoding layer, to output the second compression information.
  9. A video frame compression method, characterized in that the method comprises:
    performing compression encoding on a third video frame through a first neural network, to obtain first compression information corresponding to the third video frame, the first compression information comprising compression information of a first feature of the third video frame, a reference frame of the third video frame being used in a compression process of the first feature of the third video frame;
    performing compression encoding on a fourth video frame through a second neural network, to obtain second compression information corresponding to the fourth video frame, the second compression information comprising compression information of a second feature of the fourth video frame, a reference frame of the fourth video frame being used in a generation process of the second feature of the fourth video frame, the third video frame and the fourth video frame being different video frames of a same video sequence.
  10. The method according to claim 9, characterized in that
    the first neural network comprises an encoding network and an entropy encoding layer, wherein the first feature of the third video frame is obtained from the third video frame through the encoding network, and entropy encoding is performed on the first feature of the third video frame through the entropy encoding layer, to output the first compression information;
    and/or
    the second neural network comprises a convolutional network and an entropy encoding layer, the convolutional network comprising a plurality of convolutional layers and rectified linear unit (ReLU) layers, wherein a residual of the fourth video frame is obtained through the convolutional network by using the reference frame of the fourth video frame, and entropy encoding is performed on the residual of the fourth video frame through the entropy encoding layer, to output the second compression information.
  11. A video frame decompression method, characterized in that the method comprises:
    obtaining compression information of a current video frame;
    selecting a target neural network corresponding to the current video frame from a plurality of neural networks, the plurality of neural networks comprising a third neural network and a fourth neural network;
    performing a decompression operation according to the compression information through the target neural network, to obtain a reconstructed frame of the current video frame;
    wherein, if the target neural network is the third neural network, the compression information comprises first compression information of a first feature of the current video frame, a reference frame of the current video frame is used in a decompression process of the first compression information to obtain the first feature of the current video frame, and the first feature of the current video frame is used in a generation process of the reconstructed frame of the current video frame;
    if the target neural network is the fourth neural network, the compression information comprises second compression information of a second feature of the current video frame, the second compression information is used by a decoder to perform a decompression operation to obtain the second feature of the current video frame, and the reference frame of the current video frame and the second feature of the current video frame are used in a generation process of the reconstructed frame of the current video frame.
  12. The method according to claim 11, characterized in that
    the third neural network comprises an entropy decoding layer and a decoding network, wherein an entropy decoding process of the first compression information of the current video frame is performed through the entropy decoding layer by using the reference frame of the current video frame, and the reconstructed frame of the current video frame is generated through the decoding network by using the first feature of the current video frame;
    and/or
    the fourth neural network comprises an entropy decoding layer and a convolutional network, wherein entropy decoding is performed on the second compression information through the entropy decoding layer, and a generation process of the reconstructed frame of the current video frame is performed through the convolutional network by using the reference frame of the current video frame and the second feature of the current video frame.
  13. The method according to claim 11 or 12, characterized in that the method further comprises:
    obtaining indication information corresponding to the compression information; and
    the selecting a target neural network corresponding to the current video frame from a plurality of neural networks comprises:
    determining the target neural network from the plurality of neural networks according to the indication information.
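For illustration, the decoder-side behavior of claims 11 to 13 can be sketched as a dispatch on the indication information (a minimal Python sketch; entropy_decode, decoding_network, and conv_network are hypothetical callables standing in for the entropy decoding layer, the decoding network, and the convolutional network of claim 12):

    def decompress(indication, compression_info, reference_frame,
                   entropy_decode, decoding_network, conv_network):
        # The indication information tells the decoder which network
        # produced the bitstream for the current frame.
        if indication == "third":
            # Third neural network: the reference frame conditions the
            # entropy decoding of the first feature, and the decoding
            # network turns that feature into the reconstructed frame.
            feature = entropy_decode(compression_info, reference_frame)
            return decoding_network(feature)
        # Fourth neural network: entropy-decode the second feature (the
        # residual), then let the convolutional network combine it with
        # the reference frame into the reconstructed frame.
        residual = entropy_decode(compression_info, None)
        return conv_network(reference_frame, residual)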
  14. A video frame decompression method, characterized in that the method comprises:
    decompressing first compression information of a third video frame through a third neural network, to obtain a reconstructed frame of the third video frame, the first compression information comprising compression information of a first feature of the third video frame, a reference frame of the third video frame being used in a decompression process of the first compression information to obtain the first feature of the third video frame, the first feature of the third video frame being used in a generation process of the reconstructed frame of the third video frame;
    decompressing second compression information of a fourth video frame through a fourth neural network, to obtain a reconstructed frame of the fourth video frame, the second compression information comprising compression information of a second feature of the fourth video frame, the second compression information being used by a decoder to perform a decompression operation to obtain the second feature of the fourth video frame, a reference frame of the fourth video frame and the second feature of the fourth video frame being used in a generation process of the reconstructed frame of the fourth video frame.
  15. The method according to claim 14, characterized in that
    the third neural network comprises an entropy decoding layer and a decoding network, wherein an entropy decoding process of the first compression information of the third video frame is performed through the entropy decoding layer by using the reference frame of the third video frame, and the reconstructed frame of the third video frame is generated through the decoding network by using the first feature of the third video frame;
    and/or
    the fourth neural network comprises an entropy decoding layer and a convolutional network, wherein entropy decoding is performed on the second compression information through the entropy decoding layer, and a generation process of the reconstructed frame of the fourth video frame is performed through the convolutional network by using the reference frame of the fourth video frame and the second feature of the fourth video frame.
  16. An encoder, characterized by comprising a processing circuit configured to perform the method according to any one of claims 1 to 10.
  17. A decoder, characterized by comprising a processing circuit configured to perform the method according to any one of claims 11 to 15.
  18. A computer program product, characterized by comprising program code that, when executed on a computer or a processor, is used to perform the method according to any one of claims 1 to 15.
  19. An encoder, characterized by comprising:
    one or more processors; and
    a non-transitory computer-readable storage medium, coupled to the processors and storing program instructions for execution by the processors, wherein the program instructions, when executed by the processors, cause the encoder to perform the method according to any one of claims 1 to 10.
  20. A decoder, characterized by comprising:
    one or more processors; and
    a non-transitory computer-readable storage medium, coupled to the processors and storing program instructions for execution by the processors, wherein the program instructions, when executed by the processors, cause the decoder to perform the method according to any one of claims 11 to 15.
  21. A non-transitory computer-readable storage medium, characterized by comprising program code that, when executed by a computer device, is used to perform the method according to any one of claims 1 to 15.
PCT/CN2021/112077 2020-11-13 2021-08-11 Video frame compression and video frame decompression method and apparatus WO2022100173A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
EP21890702.0A EP4231644A4 (en) 2020-11-13 2021-08-11 METHOD AND APPARATUS FOR VIDEO FRAME COMPRESSION AND METHOD AND APPARATUS FOR VIDEO FRAME DECOMPRESSION
JP2023528362A JP2023549210A (ja) 2020-11-13 2021-08-11 Video frame compression method, video frame decompression method, and apparatus
CN202180076647.0A CN116918329A (zh) 2020-11-13 2021-08-11 Video frame compression and video frame decompression method and apparatus
US18/316,750 US20230281881A1 (en) 2020-11-13 2023-05-12 Video Frame Compression Method, Video Frame Decompression Method, and Apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011271217.8 2020-11-13
CN202011271217.8A CN114501031B (zh) 2020-11-13 2020-11-13 Compression encoding and decompression method and apparatus

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/316,750 Continuation US20230281881A1 (en) 2020-11-13 2023-05-12 Video Frame Compression Method, Video Frame Decompression Method, and Apparatus

Publications (1)

Publication Number Publication Date
WO2022100173A1 true WO2022100173A1 (zh) 2022-05-19

Family

ID=81491074

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/CN2021/107500 WO2022100140A1 (zh) 2020-11-13 2021-07-21 Compression encoding and decompression method and apparatus
PCT/CN2021/112077 WO2022100173A1 (zh) 2020-11-13 2021-08-11 Video frame compression and video frame decompression method and apparatus

Family Applications Before (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/107500 WO2022100140A1 (zh) 2020-11-13 2021-07-21 Compression encoding and decompression method and apparatus

Country Status (5)

Country Link
US (1) US20230281881A1 (zh)
EP (1) EP4231644A4 (zh)
JP (1) JP2023549210A (zh)
CN (2) CN114501031B (zh)
WO (2) WO2022100140A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115529457B * 2022-09-05 2024-05-14 清华大学 Deep learning-based video compression method and apparatus

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107105278A * 2017-04-21 2017-08-29 中国科学技术大学 Video encoding and decoding framework with automatically generated motion vectors
CN107172428A * 2017-06-06 2017-09-15 西安万像电子科技有限公司 Image transmission method, apparatus, and system
US20190124346A1 * 2017-10-19 2019-04-25 Arizona Board Of Regents On Behalf Of Arizona State University Real time end-to-end learning system for a high frame rate video compressive sensing network
CN110401836A * 2018-04-25 2019-11-01 杭州海康威视数字技术股份有限公司 Image decoding and encoding method, apparatus, and device
US20200236349A1 * 2019-01-22 2020-07-23 Apple Inc. Predictive coding with neural networks
CN110913220A * 2019-11-29 2020-03-24 合肥图鸭信息科技有限公司 Video frame encoding method, apparatus, and terminal device
CN111447449A * 2020-04-01 2020-07-24 北京奥维视讯科技有限责任公司 ROI-based video encoding method and system, and video transmission and encoding system

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2646575A1 (fr) * 1989-04-26 1990-11-02 Labo Electronique Physique Method and structure for data compression
KR102124714B1 (ko) * 2015-09-03 2020-06-19 MediaTek Inc. Method and apparatus of neural network based processing in video coding
US11593632B2 (en) * 2016-12-15 2023-02-28 WaveOne Inc. Deep learning based on image encoding and decoding
CN107197260B (zh) * 2017-06-12 2019-09-13 清华大学深圳研究生院 Convolutional neural network-based post-filtering method for video coding
CN107396124B (zh) * 2017-08-29 2019-09-20 南京大学 Deep neural network-based video compression method
CN111641832B (zh) * 2019-03-01 2022-03-25 杭州海康威视数字技术股份有限公司 Encoding method, decoding method, apparatus, electronic device, and storage medium
CN111083494A (zh) * 2019-12-31 2020-04-28 合肥图鸭信息科技有限公司 Video encoding method, apparatus, and terminal device
CN111263161B (zh) * 2020-01-07 2021-10-26 北京地平线机器人技术研发有限公司 Video compression processing method, apparatus, storage medium, and electronic device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4231644A4

Also Published As

Publication number Publication date
CN114501031B (zh) 2023-06-02
US20230281881A1 (en) 2023-09-07
CN114501031A (zh) 2022-05-13
EP4231644A4 (en) 2024-03-20
WO2022100140A1 (zh) 2022-05-19
CN116918329A (zh) 2023-10-20
JP2023549210A (ja) 2023-11-22
EP4231644A1 (en) 2023-08-23

Similar Documents

Publication Publication Date Title
EP3571841B1 (en) Dc coefficient sign coding scheme
TWI806199B (zh) Indication method, device, and computer program for feature map information
US10506258B2 (en) Coding video syntax elements using a context tree
TW202228081A (zh) Method and apparatus for reconstructing a picture from a bitstream and for encoding a picture into a bitstream, and computer program product
US11558619B2 (en) Adaptation of scan order for entropy coding
US20230362378A1 (en) Video coding method and apparatus
US20230209096A1 (en) Loop filtering method and apparatus
WO2023279961A1 (zh) Video picture encoding and decoding method and apparatus
US20240105193A1 (en) Feature Data Encoding and Decoding Method and Apparatus
WO2022063267A1 (zh) Intra prediction method and apparatus
WO2022100173A1 (zh) Video frame compression and video frame decompression method and apparatus
US20230396810A1 (en) Hierarchical audio/video or picture compression method and apparatus
WO2023193629A1 (zh) Encoding and decoding method and apparatus for region enhancement layer
WO2022194137A1 (zh) Video picture encoding and decoding method and related device
CN114554205B (zh) Image encoding and decoding method and apparatus
KR20230145096A (ko) Independent positioning of auxiliary information in neural network based picture processing
CN110731082B (zh) Compressing groups of video frames using reversed ordering
WO2023279968A1 (zh) Video picture encoding and decoding method and apparatus
WO2024007820A1 (zh) Data encoding and decoding method and related device
WO2022140937A1 (zh) Point cloud encoding and decoding method and system, and point cloud encoder and point cloud decoder
WO2023091040A1 (en) Generalized difference coder for residual coding in video compression
KR20240064698A (ko) Feature map encoding and decoding method and apparatus
WO2023059689A1 (en) Systems and methods for predictive coding

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 21890702; Country of ref document: EP; Kind code of ref document: A1
WWE Wipo information: entry into national phase
    Ref document number: 202180076647.0; Country of ref document: CN
    Ref document number: 2023528362; Country of ref document: JP
ENP Entry into the national phase
    Ref document number: 2021890702; Country of ref document: EP; Effective date: 20230516
NENP Non-entry into the national phase
    Ref country code: DE