WO2022100173A1 - Video frame compression and video frame decompression method and apparatus - Google Patents

Video frame compression and video frame decompression method and apparatus
- Publication number: WO2022100173A1 (application PCT/CN2021/112077)
- Authority: WIPO (PCT)
- Prior art keywords: video frame, current video, neural network, frame, feature
Classifications

- H04N19/61 — Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding in combination with predictive coding
- G06T9/002 — Image coding using neural networks
- G06N3/02 — Neural networks
- G06N3/04 — Architecture, e.g. interconnection topology
- G06N3/044 — Recurrent networks, e.g. Hopfield networks
- G06N3/045 — Combinations of networks
- G06N3/047 — Probabilistic or stochastic networks
- G06N3/048 — Activation functions
- G06N3/08 — Learning methods
- G06N3/084 — Backpropagation, e.g. using gradient descent
- G06N3/088 — Non-supervised learning, e.g. competitive learning
- H04N19/103 — Selection of coding mode or of prediction mode
- H04N19/124 — Quantisation
- H04N19/136 — Incoming video signal characteristics or properties
- H04N19/91 — Entropy coding, e.g. variable length coding [VLC] or arithmetic coding
Definitions
- the present application relates to the field of artificial intelligence, and in particular, to a method and apparatus for compressing video frames and decompressing video frames.
- Artificial intelligence (AI) is the theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results.
- In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that responds in a way similar to human intelligence.
- Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines can perceive, reason, and make decisions.
- Video frame compression based on deep learning neural networks is a common application of artificial intelligence.
- In a typical scheme, the encoder computes, through a neural network, the optical flow of the original current video frame relative to the reference frame of the current video frame, and then compresses and encodes that optical flow to obtain the compressed optical flow.
- the reference frame of the current video frame and the current video frame belong to the current video sequence, and the reference frame of the current video frame is the video frame that needs to be referenced when the current video frame is compressed and encoded.
- The encoder then decompresses the compressed optical flow to obtain the decompressed optical flow, generates the predicted current video frame from the decompressed optical flow and the reference frame, computes the residual between the original current video frame and the predicted current video frame through the neural network, and compresses and encodes that residual.
- The compressed optical flow and the compressed residual are sent to the decoder, so that the decoder can reconstruct the current video frame through the neural network from the decompressed reference frame, the decompressed optical flow, and the decompressed residual.
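To make this background pipeline concrete, here is a minimal Python sketch of a learned inter-frame codec of the kind described above. All functions are hypothetical stand-ins of ours (the patent does not specify concrete operators); the flow estimate and warp are placeholders so the script runs end to end.

```python
import numpy as np

def estimate_optical_flow(current: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Stand-in for a flow network: returns an (H, W, 2) motion field."""
    h, w, _ = current.shape
    return np.zeros((h, w, 2), dtype=np.float32)  # placeholder: zero motion

def compress(x: np.ndarray) -> bytes:
    """Stand-in for learned compression: quantize and serialize.
    A real codec would transform and entropy-code instead."""
    return np.clip(np.round(x), -128, 127).astype(np.int8).tobytes()

def decompress(bitstream: bytes, shape: tuple) -> np.ndarray:
    return np.frombuffer(bitstream, dtype=np.int8).astype(np.float32).reshape(shape)

def warp(reference: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Stand-in for warping the reference frame along the flow."""
    return reference  # placeholder: zero motion means an identity warp

def encode_inter_frame(current: np.ndarray, reference: np.ndarray):
    flow = estimate_optical_flow(current, reference)
    flow_bits = compress(flow)
    # The encoder mirrors the decoder: it decompresses its own flow so the
    # prediction is built from exactly what the decoder will reconstruct.
    flow_hat = decompress(flow_bits, flow.shape)
    predicted = warp(reference, flow_hat)
    residual = current - predicted
    return flow_bits, compress(residual)
```

Because every reconstruction in this scheme is built from the reference frame, any error in the reconstructed reference propagates to subsequent frames — the inter-frame error accumulation the present application sets out to avoid.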
- the present application provides a video frame compression and video frame decompression method and device.
- With the scheme provided by the present application, the quality of the reconstructed frame of the current video frame does not depend on the quality of the reconstructed frame of the current video frame's reference frame, which avoids the accumulation of errors across frames and thus improves the quality of the reconstructed frames of the video; in addition, the advantages of the first neural network and the second neural network are combined, so that reconstructed-frame quality is improved while minimizing the amount of data that needs to be transmitted.
- In a first aspect, the present application provides a video frame compression method, which applies artificial intelligence technology to the field of video frame encoding and decoding.
- The method may include: an encoder determining a target neural network from a plurality of neural networks according to a network selection strategy, the plurality of neural networks including a first neural network and a second neural network; and compressing and encoding the current video frame through the target neural network to obtain compression information corresponding to the current video frame.
- If the compression information is obtained through the first neural network, the compression information includes first compression information of a first feature of the current video frame; the reference frame of the current video frame is used in the compression process of the first feature but not in the generation process of the first feature. That is, the first feature of the current video frame can be obtained from the current video frame alone, and the reference frame is not required to generate it.
- If the compression information is obtained through the second neural network, the compression information includes second compression information of a second feature of the current video frame, and the reference frame of the current video frame is used in the generation process of the second feature.
- the current video frame is the original video frame included in the current video sequence; the reference frame of the current video frame may be the original video frame in the current video sequence, or may not be the original video frame in the current video sequence.
- the reference frame of the current video frame may be a video frame obtained by transform-coding the original reference frame through an encoding network, and then performing inverse transformation and decoding through a decoding network; or, the reference frame of the current video frame is the original reference frame of the encoder.
- When the compression information is obtained through the first neural network, the compression information carries the compression information of the first feature of the current video frame, and the reference frame of the current video frame is used only in the compression process of the first feature, not in its generation. The decoder therefore does not need the reference frame after performing the decompression operation on the first compression information: once the first feature of the current video frame is recovered, the reconstructed frame of the current video frame can be obtained directly. Consequently, when the compression information is obtained through the first neural network, the quality of the reconstructed frame of the current video frame does not depend on the quality of the reconstructed frame of its reference frame, which prevents errors from accumulating across frames and improves reconstructed-frame quality.
- In addition, since the second feature of the current video frame is generated from the reference frame of the current video frame, the amount of data corresponding to the second compression information of the second feature is smaller than the amount of data corresponding to the first compression information of the first feature. The encoder can therefore apply the first neural network and the second neural network to different video frames in the current video sequence, combining the advantages of both networks: reconstructed-frame quality is improved while the amount of data to be transmitted is minimized.
- The first neural network includes an encoding network and an entropy encoding layer: the first feature of the current video frame is obtained from the current video frame through the encoding network, and the entropy encoding layer performs entropy encoding on the first feature to output the first compression information. More specifically, the first feature of the current video frame is obtained by transform-coding the current video frame through the encoding network and then quantizing the result.
- The second neural network includes a convolutional network and an entropy encoding layer. The convolutional network includes a plurality of convolutional layers and ReLU activation layers; through the convolutional network, the reference frame of the current video frame is used to obtain the residual of the current video frame, and entropy encoding is performed on that residual through the entropy encoding layer to output the second compression information.
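The following PyTorch sketch illustrates the structural difference between the two networks. Layer counts, channel widths, and strides are our illustrative assumptions; the patent fixes only that the first network pairs an encoding (transform plus quantization) step with entropy coding, while the second derives its feature using the reference frame through convolutional and ReLU layers.

```python
import torch
import torch.nn as nn

class FirstNetworkEncoder(nn.Module):
    """Encoding network: transform-codes the current frame on its own."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Conv2d(3, channels, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(channels, channels, 5, stride=2, padding=2),
        )

    def forward(self, current_frame: torch.Tensor) -> torch.Tensor:
        feature = self.transform(current_frame)
        # Quantization before entropy coding (training-time relaxation omitted).
        return torch.round(feature)

class SecondNetworkEncoder(nn.Module):
    """Convolutional network: derives a residual-like feature from the reference."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(6, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 3, 3, padding=1),
        )

    def forward(self, current_frame: torch.Tensor,
                reference_frame: torch.Tensor) -> torch.Tensor:
        # The reference frame participates in *generating* the feature here,
        # which is the key difference from the first network.
        stacked = torch.cat([current_frame, reference_frame], dim=1)
        return self.conv(stacked)
```

Note that in the first network the reference frame never enters `forward()` at all, while in the second it is concatenated with the current frame — exactly the distinction the claims draw between the two feature-generation processes.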
- In one implementation, the encoder compressing and encoding the current video frame through the target neural network to obtain the compression information corresponding to the current video frame may include the following.
- The encoder generates the optical flow of the original current video frame relative to the reference frame of the current video frame, and compresses and encodes that optical flow to obtain the compressed optical flow. The encoder can also decompress the compressed optical flow to obtain the decompressed optical flow, generate the predicted current video frame from the decompressed optical flow and the reference frame of the current video frame, and compute the residual between the original current video frame and the predicted current video frame. The second feature of the current video frame thus includes the optical flow of the original current video frame relative to the reference frame of the current video frame and the residual between the original current video frame and the predicted current video frame.
- the network selection strategy is related to any one or more of the following factors: location information of the current video frame or the amount of data carried by the current video frame.
- In one implementation, the encoder determining the target neural network from multiple neural networks according to a network selection strategy includes: the encoder obtains position information of the current video frame in the current video sequence, where the position information indicates that the current video frame is the Xth frame of the current video sequence; the position information may be expressed as an index number, and the index number may be expressed in the form of a character string.
- the encoder selects the target neural network from multiple neural networks according to the location information.
- In another implementation, the encoder determining the target neural network from multiple neural networks according to the network selection strategy includes: the encoder selects the target neural network from the multiple neural networks according to attributes of the current video frame, where the attributes reflect the amount of data carried by the current video frame and include any one or a combination of the following: the entropy, contrast, and saturation of the current video frame.
- In this way, the target neural network can be selected from the multiple neural networks according to the position information of the current video frame in the current video sequence, or according to at least one attribute of the current video frame, and the target neural network is then used to generate the compression information of the current video frame. This provides several simple and practical implementations and improves the flexibility of the scheme; a sketch of the attribute-based selection follows.
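As a sketch of the attribute-based route, the snippet below selects a network from the frame's entropy, one of the attributes listed above. The histogram-based entropy estimate and the threshold value are our illustrative assumptions, not something the patent fixes.

```python
import numpy as np

def frame_entropy(frame: np.ndarray) -> float:
    """Shannon entropy (bits) of the frame's 8-bit pixel histogram."""
    counts, _ = np.histogram(frame, bins=256, range=(0, 256))
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def select_by_attribute(frame: np.ndarray, threshold: float = 6.0) -> str:
    """Route frames carrying more data to the first (reference-free) network."""
    return "first" if frame_entropy(frame) > threshold else "second"
```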
- The method may further include: the encoder generating and sending at least one piece of indication information, in one-to-one correspondence with one or more pieces of compression information.
- Each piece of indication information indicates which of the first neural network and the second neural network (that is, which target neural network) produced the corresponding piece of compression information.
- The decoder can thus obtain the indication information corresponding to each piece of compression information, and therefore knows, for each video frame in the current video sequence, which of the first neural network and the second neural network should perform the decompression operation. This shortens the time the decoder needs to decode the compression information, improving the efficiency of video frame transmission across the encoder and decoder as a whole; a sketch of such per-frame indication bits follows.
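A minimal sketch of one-bit-per-frame indication information follows. The bit layout (one bit per piece of compression information, little-endian packing) is an assumption of ours; the patent requires only a one-to-one indication.

```python
def pack_indications(network_choices: list) -> bytes:
    """One bit per piece of compression information: 0 = first network, 1 = second."""
    bits = 0
    for i, choice in enumerate(network_choices):
        if choice == "second":
            bits |= 1 << i
    return bits.to_bytes((len(network_choices) + 7) // 8, "little")

def unpack_indication(blob: bytes, frame_index: int) -> str:
    """Recover which network the decoder should use for a given frame."""
    bit = (int.from_bytes(blob, "little") >> frame_index) & 1
    return "second" if bit else "first"
```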
- In one implementation, the encoder compressing and encoding the current video frame through the target neural network to obtain the compression information corresponding to the current video frame may include the following.
- The encoder obtains the first feature of the current video frame from the current video frame through an encoding network, and, based on the reference frame of the current video frame, predicts the feature of the current video frame to generate the predicted feature of the current video frame; the predicted feature is a prediction of the first feature of the current video frame and has the same data shape as the first feature.
- Through the entropy coding layer, the encoder generates the probability distribution of the first feature of the current video frame according to the predicted feature of the current video frame; the probability distribution includes the mean and the variance of the first feature of the current video frame.
- the encoder performs entropy coding on the first feature of the current video frame according to the probability distribution of the first feature of the current video frame through the entropy coding layer to obtain the first compression information.
- Because the encoder generates the probability distribution of the first feature from the predicted feature of the current video frame and then compresses and encodes the first feature according to that probability distribution, the higher the similarity between the predicted feature and the first feature, the greater the achievable compression rate and the smaller the resulting first compression information. Since the predicted feature is obtained by predicting the feature of the current video frame from its reference frame, the similarity between the predicted feature and the first feature is improved, and the size of the first compression information is reduced: the quality of the reconstructed frame obtained by the decoder is preserved while the amount of data transmitted between the encoder and the decoder is reduced.
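The rate argument above can be made concrete with the discretized-Gaussian likelihood commonly used in learned compression. The exact form of the entropy model is our assumption; the patent states only that the distribution comprises a mean and a variance. The closer the predicted mean is to the actual feature, the fewer bits the feature costs.

```python
import torch

def gaussian_bits(feature: torch.Tensor, mean: torch.Tensor,
                  scale: torch.Tensor) -> torch.Tensor:
    """Estimated bits to entropy-code a quantized `feature` under N(mean, scale^2),
    using the probability mass of the unit-width bin around each value."""
    dist = torch.distributions.Normal(mean, scale)
    p = dist.cdf(feature + 0.5) - dist.cdf(feature - 0.5)
    return -torch.log2(p.clamp_min(1e-9)).sum()

feature = torch.tensor([2.0, -1.0, 0.0])
good = gaussian_bits(feature, mean=feature, scale=torch.full((3,), 0.5))
poor = gaussian_bits(feature, mean=torch.zeros(3), scale=torch.full((3,), 0.5))
assert good < poor  # a better prediction of the feature costs fewer bits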
- The first neural network and the second neural network are both trained neural networks, and the model parameters of the first neural network are updated based on a first loss function of the first neural network.
- The first loss function includes a loss term for the similarity between a first training video frame and a first training reconstructed frame, and a loss term for the data size of the compression information of the first training video frame, where the first training reconstructed frame is the reconstructed frame of the first training video frame.
- The training target of the first loss function is to increase the similarity between the first training video frame and the first training reconstructed frame while reducing the size of the first compression information of the first training video frame.
- Similarly, the model parameters of the second neural network are updated based on a second loss function. The second loss function includes a loss term for the similarity between a second training video frame and a second training reconstructed frame, and a loss term for the data size of the compression information of the second training video frame, where the second training reconstructed frame is the reconstructed frame of the second training video frame and the reference frame of the second training video frame is a video frame processed by the first neural network.
- The training target of the second loss function is to increase the similarity between the second training video frame and the second training reconstructed frame while reducing the size of the second compression information of the second training video frame.
- Since, in the execution stage, the reference frame used by the second neural network may be one processed by the first neural network, using a reference frame processed by the first neural network when training the second neural network keeps the training stage consistent with the execution stage, which improves accuracy in the execution stage; a sketch of the two losses follows.
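A sketch of the rate-distortion form of both losses follows, assuming MSE as the similarity term and a bit estimate as the rate term; the trade-off weight `lam` and the concrete terms are our assumptions, as the patent fixes only the two loss items and their training targets.

```python
import torch
import torch.nn.functional as F

def rd_loss(train_frame: torch.Tensor, train_recon: torch.Tensor,
            bits: torch.Tensor, lam: float = 0.01) -> torch.Tensor:
    """Shared form of both loss functions: a distortion (similarity) term
    plus a weighted rate (compressed-size) term."""
    return F.mse_loss(train_recon, train_frame) + lam * bits

# First loss: frame reconstructed by the first network, bits of the first
# compression information. Second loss: same form, except the second network
# is fed a reference frame already processed by the first network, keeping
# the training and execution stages consistent.
```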
- In a second aspect, the embodiments of the present application provide a video frame compression method, which applies artificial intelligence technology to the field of video frame encoding and decoding.
- The encoder compresses and encodes the current video frame through the first neural network to obtain the first compression information of the first feature of the current video frame, where the reference frame of the current video frame is used in the compression process of the first feature; a first video frame, which is a reconstructed frame of the current video frame, is generated through the first neural network.
- The encoder also compresses and encodes the current video frame through the second neural network to obtain the second compression information of the second feature of the current video frame, where the reference frame of the current video frame is used in the generation process of the second feature; a second video frame, which is likewise a reconstructed frame of the current video frame, is generated through the second neural network.
- The encoder then determines the compression information corresponding to the current video frame according to the first compression information, the first video frame, the second compression information, and the second video frame: either the determined compression information is obtained through the first neural network and is the first compression information, or it is obtained through the second neural network and is the second compression information.
- In other words, the final compression information is selected from the first compression information and the second compression information, so that the performance of the compression information corresponding to the entire current video sequence can be improved as much as possible.
- For each video frame in the current video sequence, the encoder may adopt the same method of selecting the target compression information. Specifically, the encoder calculates a first score value corresponding to the first compression information (that is, to the first neural network) from the first compression information and the first video frame, and calculates a second score value corresponding to the second compression information (that is, to the second neural network) from the second compression information and the second video frame. The encoder selects the lower of the first score value and the second score value, determines the compression information corresponding to that score value as the compression information of the current video frame, and thus determines the neural network corresponding to the lower score value as the target neural network.
- In this way, the encoder first compresses the current video frame through both the first neural network and the second neural network, obtains the first score value corresponding to the first compression information and the second score value corresponding to the second compression information, and keeps the lower one. This makes the score values of all video frames in the entire current video sequence as low as possible, improving the performance of the compression information corresponding to the entire current video sequence.
- Alternatively, the encoder may use one cycle as the calculation unit: from a plurality of first score values it generates the coefficient and offset values of a first fitting formula, and from a plurality of second score values it generates the coefficient and offset values of a second fitting formula.
- The encoder then determines the compression information of the current video frame from the first compression information and the second compression information according to the first fitting formula and the second fitting formula, where the optimization objective is to minimize the average value of the total score values within one cycle, that is, to minimize the total score value in one cycle.
- During research, the technicians found the pattern by which the first score value and the second score value change within a single cycle, and took minimizing the average of the total score values in a cycle as the optimization goal. That is, when determining the target compression information for each current video frame, not only the score value of that frame but also the average score value over the whole cycle is considered, which further reduces the score values associated with all video frames in the entire current video sequence and further improves the performance of the compression information corresponding to the entire sequence; a sketch of the per-frame scoring follows.
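Below is a sketch of the per-frame, two-pass selection under the assumption that the score is a rate-distortion cost (coded bits plus weighted distortion); the patent does not fix the scoring formula, and the cycle-level fitting variant would replace this direct comparison with one based on the fitted formulas.

```python
import numpy as np

def score(bitstream: bytes, recon: np.ndarray, original: np.ndarray,
          lam: float = 100.0) -> float:
    """Assumed rate-distortion score: coded bits plus weighted MSE."""
    return 8 * len(bitstream) + lam * float(np.mean((recon - original) ** 2))

def choose_compression(original: np.ndarray, first_result, second_result):
    """Each *_result is a (bitstream, reconstructed_frame) pair; keep the
    result whose score value is lower."""
    s1 = score(first_result[0], first_result[1], original)
    s2 = score(second_result[0], second_result[1], original)
    return ("first", first_result[0]) if s1 <= s2 else ("second", second_result[0])
```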
- the encoder may also perform the steps performed by the encoder in each possible implementation manner of the first aspect.
- In a third aspect, the embodiments of the present application provide a video frame compression method, which applies artificial intelligence technology to the field of video frame encoding and decoding.
- The method may include: the encoder compresses and encodes a third video frame through the first neural network to obtain first compression information corresponding to the third video frame, where the first compression information includes compression information of the first feature of the third video frame and the reference frame of the third video frame is used in the compression process of the first feature of the third video frame; and the encoder compresses and encodes a fourth video frame through the second neural network to obtain second compression information corresponding to the fourth video frame.
- the second compression information includes compression information of the second feature of the fourth video frame, and the reference frame of the fourth video frame is used for the generation process of the second feature of the fourth video frame.
- the encoder may also perform the steps performed by the encoder in each possible implementation manner of the first aspect.
- In a fourth aspect, an embodiment of the present application provides a video frame decompression method, which applies artificial intelligence technology to the field of video frame encoding and decoding.
- the decoder obtains the compression information of the current video frame, and performs a decompression operation through the target neural network according to the compression information of the current video frame, so as to obtain the reconstructed frame of the current video frame.
- the target neural network is a neural network selected from a plurality of neural networks, and the plurality of neural networks includes a third neural network and a fourth neural network.
- If the target neural network is the third neural network, the compression information includes the first compression information of the first feature of the current video frame; the reference frame of the current video frame is used in the decompression process of the first compression information to obtain the first feature of the current video frame, and the first feature is used in the generation process of the reconstructed frame of the current video frame.
- If the target neural network is the fourth neural network, the compression information includes the second compression information of the second feature of the current video frame; the decoder performs a decompression operation on the second compression information to obtain the second feature of the current video frame, and the reference frame of the current video frame together with the second feature is used in the generation process of the reconstructed frame of the current video frame.
- The reconstructed frame of the current video frame and the reference frame of the current video frame belong to the current video sequence.
- The third neural network includes an entropy decoding layer and a decoding network: the entropy decoding layer uses the reference frame of the current video frame to perform entropy decoding of the first compression information of the current video frame, and the decoding network generates the reconstructed frame of the current video frame from the first feature of the current video frame.
- In this case, the decoder performing the decompression operation through the target neural network according to the compression information of the current video frame to obtain the reconstructed frame may include the following: the decoder generates the probability distribution of the first feature according to the predicted feature of the current video frame, where the predicted feature is obtained by predicting the first feature from the reference frame of the current video frame.
- The decoder then performs entropy decoding on the compression information according to the probability distribution of the first feature to obtain the first feature, and performs inverse-transform decoding on the first feature to obtain the reconstructed frame of the current video frame.
- The fourth neural network includes an entropy decoding layer and a convolutional network: entropy decoding is performed on the second compression information through the entropy decoding layer, and the convolutional network uses the reference frame of the current video frame and the second feature of the current video frame to generate the reconstructed frame of the current video frame.
- In this case, the decoder performing the decompression operation through the target neural network according to the compression information of the current video frame to obtain the reconstructed frame may include the following: the decoder decompresses the second compression information to obtain the second feature of the current video frame, that is, the optical flow of the original current video frame relative to the reference frame of the current video frame and the residual between the original current video frame and the predicted current video frame.
- The decoder predicts the current video frame according to that optical flow and the reference frame of the current video frame to obtain the predicted current video frame, and then generates the reconstructed frame of the current video frame from the residual between the original current video frame and the predicted current video frame; a sketch of this reconstruction follows.
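A minimal sketch of this decode path follows: warp the reference frame along the decoded optical flow, then add the decoded residual. The nearest-neighbour warp is an illustrative simplification of ours, and entropy decoding is assumed to have already produced `flow` and `residual`.

```python
import numpy as np

def warp(reference: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Nearest-neighbour warp of an (H, W, C) reference by an (H, W, 2) flow."""
    h, w, _ = reference.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    return reference[src_y, src_x]

def reconstruct(reference: np.ndarray, flow: np.ndarray,
                residual: np.ndarray) -> np.ndarray:
    predicted = warp(reference, flow)  # the predicted current video frame
    return predicted + residual        # the reconstructed current video frame
```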
- The method may further include: the decoder obtains at least one piece of indication information corresponding to the at least one piece of compression information, and, according to the indication information and the compression information of the current video frame, determines the target neural network corresponding to the current video frame from the plurality of neural networks including the third neural network and the fourth neural network.
- In a fifth aspect, the embodiments of the present application provide a video frame decompression method, which applies artificial intelligence technology to the field of video frame encoding and decoding.
- The decoder decompresses the first compression information of a third video frame through the third neural network to obtain a reconstructed frame of the third video frame, where the first compression information includes compression information of the first feature of the third video frame; the reference frame of the third video frame is used in the decompression process of the first compression information to obtain the first feature of the third video frame, and the first feature is used in the generation process of the reconstructed frame of the third video frame.
- The decoder also decompresses the second compression information of a fourth video frame through the fourth neural network to obtain the decompressed fourth video frame, where the second compression information includes compression information of the second feature of the fourth video frame; the decoder performs a decompression operation on the second compression information to obtain the second feature of the fourth video frame, and the reference frame of the fourth video frame together with the second feature is used in the generation process of the reconstructed frame of the fourth video frame.
- the decoder may also perform the steps performed by the decoder in each possible implementation manner of the fourth aspect.
- An embodiment of the present application provides an encoder that includes a processing circuit configured to execute the method described in any one of the first aspect, the second aspect, the third aspect, the fourth aspect, or the fifth aspect.
- An embodiment of the present application provides a decoder that includes a processing circuit configured to execute the method described in any one of the first aspect, the second aspect, the third aspect, the fourth aspect, or the fifth aspect.
- An embodiment of the present application provides a computer program product which, when run on a computer, causes the computer to execute the method of any one of the first to fifth aspects.
- Embodiments of the present application provide an encoder, which may include one or more processors and a non-transitory computer-readable storage medium coupled to the processors and storing program instructions; when executed by the processors, the program instructions cause the encoder to implement the video frame compression method described in the first, second, or third aspect.
- Embodiments of the present application provide a decoder, which may include one or more processors and a non-transitory computer-readable storage medium coupled to the processors and storing program instructions; when executed by the processors, the program instructions cause the decoder to implement the video frame decompression method described in the fourth or fifth aspect.
- An embodiment of the present application provides a non-transitory computer-readable storage medium that includes program code which, when run on a computer, causes the computer to execute the method of any one of the first to fifth aspects.
- An embodiment of the present application provides a circuit system that includes a processing circuit configured to execute the method of any one of the first to fifth aspects.
- An embodiment of the present application provides a chip system that includes a processor for implementing the functions involved in the above aspects, for example, sending or processing the data and/or information involved in the above methods.
- the chip system further includes a memory for storing necessary program instructions and data of the server or the communication device.
- the chip system may be composed of chips, or may include chips and other discrete devices.
- FIG. 1a is a schematic structural diagram of an artificial intelligence main framework provided by an embodiment of the present application;
- FIG. 1b is an application scenario diagram of the video frame compression and decompression methods provided by an embodiment of the present application;
- FIG. 1c is another application scenario diagram of the video frame compression and decompression methods provided by an embodiment of the present application;
- FIG. 2 is a schematic diagram of a principle of a video frame compression method provided by an embodiment of the present application;
- FIG. 3 is a schematic flowchart of a video frame compression method provided by an embodiment of the present application;
- FIG. 4 is a schematic diagram of the correspondence between the position of the current video frame and the adopted target neural network in the video frame compression method provided by an embodiment of the present application;
- FIG. 5a is a schematic structural diagram of a first neural network provided by an embodiment of the present application;
- FIG. 5b is a schematic structural diagram of a second neural network provided by an embodiment of the present application;
- FIG. 5c is a schematic diagram comparing the first feature and the second feature in the video frame compression method provided by an embodiment of the present application;
- FIG. 6 is a schematic diagram of another principle of a video frame compression method provided by an embodiment of the present application;
- FIG. 7a is another schematic flowchart of a video frame compression method provided by an embodiment of the present application;
- FIG. 7b is a schematic diagram of a first score value and a second score value in a video frame compression method provided by an embodiment of the present application;
- FIG. 7c is a schematic diagram of calculating the coefficients and offsets of the first fitting formula and the coefficients and offsets of the second fitting formula in the video frame compression method provided by an embodiment of the present application;
- FIG. 8 is another schematic flowchart of a video frame compression method provided by an embodiment of the present application;
- FIG. 9 is a schematic diagram of a video frame compression method provided by an embodiment of the present application;
- FIG. 10a is a schematic flowchart of a video frame decompression method provided by an embodiment of the present application;
- FIG. 10b is another schematic flowchart of a video frame decompression method provided by an embodiment of the present application;
- FIG. 11 is another schematic flowchart of a video frame decompression method provided by an embodiment of the present application;
- FIG. 12 is a schematic flowchart of a training method for a video frame compression and decompression system provided by an embodiment of the present application;
- FIG. 13 is a system architecture diagram of a video encoding and decoding system provided by an embodiment of the present application;
- FIG. 14 is another system architecture diagram of a video encoding and decoding system provided by an embodiment of the present application;
- FIG. 15 is a schematic diagram of a video decoding device provided by an embodiment of the present application;
- FIG. 16 is a simplified block diagram of an apparatus provided by an embodiment of the present application.
- Figure 1a shows a schematic structural diagram of the main frame of artificial intelligence.
- The above artificial intelligence main framework is explained along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
- the "intelligent information chain” reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, data has gone through the process of "data-information-knowledge-wisdom".
- the "IT value chain” reflects the value brought by artificial intelligence to the information technology industry from the underlying infrastructure of human intelligence, information (providing and processing technology implementation) to the industrial ecological process of the system.
- the infrastructure provides computing power support for artificial intelligence systems, realizes communication with the outside world, and supports through the basic platform. Communicate with the outside through sensors; computing power is provided by a smart chip, as an example, the smart chip includes a central processing unit (CPU), a neural-network processing unit (NPU), a graphics processor (graphics unit) processing unit, GPU), application specific integrated circuit (ASIC), field programmable gate array (FPGA) and other hardware acceleration chips; the basic platform includes distributed computing framework and network-related platforms Guarantee and support can include cloud storage and computing, interconnection networks, etc. For example, sensors communicate with external parties to obtain data, and these data are provided to the intelligent chips in the distributed computing system provided by the basic platform for calculation.
- the data on the upper layer of the infrastructure is used to represent the data sources in the field of artificial intelligence.
- the data involves graphics, images, voice, and text, as well as IoT data from traditional devices, including business data from existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
- Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, etc.
- machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, etc. on data.
- Reasoning refers to the process of simulating human's intelligent reasoning method in a computer or intelligent system, using formalized information to carry out machine thinking and solving problems according to the reasoning control strategy, and the typical function is search and matching.
- Decision-making refers to the process of making decisions after intelligent information is reasoned, usually providing functions such as classification, sorting, and prediction.
- After the data is processed as above, some general capabilities can be formed based on the results, such as algorithms or a general system, for example translation, text analysis, computer-vision processing, speech recognition, image recognition, and so on.
- Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they encapsulate the overall artificial intelligence solution, productize intelligent information decision-making, and realize practical applications. The main application areas include intelligent terminals, intelligent manufacturing, smart transportation, smart home, smart healthcare, smart security, autonomous driving, smart city, and so on.
- FIG. 1b is an application scenario diagram of the video frame compression and decompression method provided by the embodiment of the present application.
- For example, videos stored in a client's album may need to be sent to a cloud server for storage. Before sending a video, the client (i.e., the encoder) needs to use AI technology to compress and encode the video frames in the video; correspondingly, the cloud server (i.e., the decoder) can use AI technology to decompress the received information to obtain the reconstructed frames of the video.
- As another example, a surveillance device may need to send the captured video to a management center. The surveillance device (that is, the encoder) needs to compress the video frames before sending the video; correspondingly, the management center (that is, the decoder) needs to decompress the video frames to recover them.
- FIG. 1 c is another application scenario diagram of the video frame compression and decompression method provided by the embodiment of the present application.
- In a live-streaming scenario, the anchor uses a client to capture video, the client sends the captured video to a server, and the server distributes the video to the viewing users. The anchor's client (that is, the encoder) needs to use AI technology to compress and encode the video frames before sending the video, and each viewing user's client (that is, the decoder) needs to use AI technology to decompress the video frames.
- the example in FIG. 1c is only for the convenience of understanding this solution, and is not used to limit this solution.
- In all of the above scenarios, the AI technology used is a neural network. The embodiments of the present application cover both the inference stage and the training stage of this neural network; since the two stages proceed differently, they are described separately below.
- In the inference stage, the compression encoding operation is performed by the encoder, and the decompression operation is performed by the decoder; the operations of the encoder and the decoder are described in turn below.
- Since multiple neural networks are configured in the encoder, there are two ways the encoder can generate the target compression information corresponding to the current video frame. In one implementation, the encoder first determines a target neural network from the multiple neural networks according to a network selection strategy, and then generates the target compression information of the current video frame through the target neural network.
- In another implementation, the encoder separately generates multiple pieces of compression information of the current video frame through the multiple neural networks, and determines the target compression information corresponding to the current video frame from the generated pieces. Since the implementation processes of the two manners differ, they are described separately below.
- 1. The encoder first selects the target neural network from the multiple neural networks
- In this implementation, for any video frame in the current video sequence (that is, the current video frame in FIG. 2), the encoder selects a target neural network from the multiple neural networks according to a network selection strategy, compresses and encodes the current video frame through the target neural network, and obtains the target compression information corresponding to the current video frame.
- FIG. 3 is a schematic flowchart of a video frame compression method provided by an embodiment of the present application.
- the video frame compression method provided by the embodiment of the present application may include:
- the encoder determines a target neural network from multiple neural networks according to a network selection strategy.
- The encoder is configured with multiple neural networks, which include at least a first neural network, a second neural network, or other neural networks for performing compression operations; the first neural network, the second neural network, and the other neural networks are all trained neural networks.
- The encoder can determine the target neural network from the multiple neural networks according to the network selection strategy, and compress and encode the current video frame through the target neural network to obtain the target compression information corresponding to the current video frame.
- The target compression information refers to the compression information that the encoder finally decides to send to the decoder; that is, the target compression information is generated by the target neural network among the multiple neural networks.
- video coding generally refers to the processing of image sequences that form a video or a video sequence.
- In the field of video coding, the terms "picture", "video frame", and "image" may be used as synonyms.
- Video encoding is performed on the source side and typically involves processing (eg, compressing) the original video frame to reduce the amount of data required to represent the video frame (and thus store and/or transmit more efficiently).
- Video decoding is performed on the destination side and typically involves inverse processing relative to the encoder to reconstruct video frames.
- the encoding part and the decoding part are also collectively referred to as codec (encoding and decoding, CODEC).
- the network selection strategy is related to any one or more of the following factors: the position information of the current video frame or the amount of data carried by the current video frame.
- step 301 may include: the encoder may obtain position information of the current video frame in the current video sequence, where the position information is used to indicate that the current video frame is the Xth frame of the current video sequence; the encoder then selects, according to the network selection strategy, a target neural network corresponding to the position information from a plurality of neural networks including the first neural network and the second neural network.
- the position information of the current video frame in the current video sequence may specifically be represented as an index number, and the index number may be specifically represented in the form of a character string.
- the index number of the current video frame may specifically be 00000223, 00000368 or another string, etc., which is not exhaustive here.
- the network selection strategy may be to alternately select the first neural network or the second neural network according to a certain rule; that is, the encoder uses the first neural network to compress and encode n video frames of the current video sequence, and then uses the second neural network to compress and encode the following m video frames; or, the encoder first uses the second neural network to compress and encode m video frames, and then uses the first neural network to compress and encode the following n video frames.
- the values of n and m may both be integers greater than or equal to 1, and the values of n and m may be the same or different.
- when the values of n and m are both 1, the network selection strategy may be to use the first neural network to compress and encode the odd-numbered frames in the current video sequence and the second neural network to compress and encode the even-numbered frames; alternatively, the network selection strategy may be to use the second neural network for the odd-numbered frames and the first neural network for the even-numbered frames.
- when the value of n is 1 and the value of m is 3, the network selection strategy may be that after one video frame in the current video sequence is compressed and encoded by the first neural network, the second neural network compresses and encodes the next three consecutive video frames in the current video sequence, and so on, which will not be exhaustive here. A sketch of this alternation rule is given below.
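- As a hedged illustration (not part of the patent text), the alternation rule above can be expressed in a few lines of Python; the function name and the 0-based frame indexing are assumptions of the sketch.

```python
# A minimal sketch of the alternating network-selection strategy: the first
# neural network handles n frames, then the second handles m frames, repeating.
def select_network_by_position(frame_index: int, n: int = 1, m: int = 3) -> str:
    """frame_index: 0-based position of the current video frame in the sequence."""
    return "first" if frame_index % (n + m) < n else "second"

# With n = 1 and m = 3 the pattern matches FIG. 4: first, second, second, second, first, ...
assert [select_network_by_position(t) for t in range(5)] == [
    "first", "second", "second", "second", "first"]
```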
- FIG. 4 is a schematic diagram of the correspondence between the position of the current video frame and the adopted target neural network in the video frame compression method provided by the embodiment of the present application.
- as shown in FIG. 4, the encoder uses the first neural network to compress and encode the t-th video frame, then uses the second neural network to compress and encode the t+1-th, t+2-th and t+3-th video frames respectively, and uses the first neural network again to compress and encode the t+4-th video frame. That is, after the first neural network performs one compression encoding on one current video frame, the second neural network performs compression encoding on three current video frames. It should be understood that the example in FIG. 4 is only to facilitate understanding of the present solution, and is not intended to limit the present solution.
- step 301 may include: the encoder may acquire attributes of the current video frame, and select a target neural network from the first neural network and the second neural network according to these attributes, wherein the attributes of the current video frame are used to indicate the amount of data carried by the current video frame, and the attributes of the current video frame include any one or a combination of the following: entropy, contrast, saturation, and other types of attributes of the current video frame, which are not exhaustive here. The higher the entropy of the current video frame, the greater the amount of data carried by the current video frame and the greater the probability that the second neural network is selected as the target neural network; the lower the entropy of the current video frame, the greater the probability that the first neural network is selected as the target neural network. A sketch of an entropy-based selection follows.
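- The entropy-based rule can be sketched as follows; this is an illustration under assumptions (8-bit grayscale pixels, a hand-picked threshold), not the patent's concrete criterion.

```python
import numpy as np

# A hedged sketch of attribute-based selection: estimate the entropy of the
# frame's pixel histogram and favor the second neural network for frames that
# carry more data. The threshold of 6.0 bits is an illustrative assumption.
def select_network_by_entropy(frame: np.ndarray, threshold: float = 6.0) -> str:
    counts, _ = np.histogram(frame, bins=256, range=(0, 256))
    p = counts / counts.sum()
    p = p[p > 0]
    entropy = -np.sum(p * np.log2(p))  # bits per pixel
    return "second" if entropy > threshold else "first"
```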
- in this way, the target neural network is selected from the multiple neural networks according to the position information of the current video frame in the current video sequence; or, the target neural network may be selected from the multiple neural networks according to at least one attribute of the current video frame, and the target neural network is then used to generate the compression information of the current video frame. This provides a variety of simple and easy-to-operate implementation schemes and improves the implementation flexibility of the scheme.
- the encoder can arbitrarily select one neural network from the first neural network and the second neural network as the target neural network, so as to use the target neural network to generate target compression information of the current video frame.
- alternatively, the encoder can configure a first selection probability for the first neural network and a second selection probability for the second neural network, where the value of the second selection probability is greater than or equal to the first selection probability, and then perform the selection of the target neural network according to the first selection probability and the second selection probability. For example, the value of the first selection probability may be 0.2 and the value of the second selection probability 0.8; or the first selection probability may be 0.3 and the second selection probability 0.7, etc.; the values of the first selection probability and the second selection probability are not exhaustively enumerated here. A sketch is given below.
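- A minimal sketch of the probability-based selection, using the example values above:

```python
import random

# The probabilities 0.2/0.8 are example values from the text; any pair with the
# second probability greater than or equal to the first fits the scheme.
def select_network_by_probability(p_first: float = 0.2) -> str:
    return "first" if random.random() < p_first else "second"
```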
- the encoder compresses and encodes the current video frame through the target neural network, so as to obtain target compression information corresponding to the current video frame.
- the target neural network may be the first neural network, the second neural network, or another network for compressing video frames. If the target compression information is obtained through the first neural network, the target compression information includes the first compression information of the first feature of the current video frame; the reference frame of the current video frame is used in the compression process of the first feature of the current video frame, but is not used in the generation process of the first feature of the current video frame.
- the reference frame of the current video frame and the current video frame are both derived from the current video sequence; the current video frame is the original video frame included in the current video sequence.
- the reference frame of the current video frame may be an original video frame in the current video sequence, and its sorting position in the current video sequence may be before or after the current video frame; that is, when the current video sequence is played, the reference frame may appear earlier or later than the current video frame.
- alternatively, the reference frame of the current video frame may not be an original video frame in the current video sequence; in that case, the sorting position in the current video sequence of the original reference frame corresponding to the reference frame of the current video frame may be before or after the current video frame.
- the reference frame of the current video frame may be a video frame obtained after the encoder performs transform coding on the original reference frame and then inversely transforms and decodes it, or a video frame obtained after the original reference frame is compressed and then decompressed.
- the aforementioned compression operation may be implemented by a first neural network, or may be implemented by a second neural network.
- the first neural network may at least include an encoding (Encoding) network and an entropy encoding layer, wherein the first feature of the current video frame is obtained from the current video frame through the encoding network; the compression process of the first feature of the current video frame is performed by using the reference frame of the current video frame through the entropy coding layer, and the first compression information corresponding to the current video frame is output.
- FIG. 5a is a schematic structural diagram of the first neural network provided by the embodiment of the present application.
- the current video frame is encoded through an encoding network and quantized to obtain the first feature of the current video frame.
- using the reference frame of the current video frame, the entropy coding layer compresses the first feature of the current video frame and outputs the first compression information corresponding to the current video frame (that is, an example of the target compression information corresponding to the current video frame).
- the example in FIG. 5a is only to facilitate understanding of the present solution, and is not intended to limit the present solution.
- the following is directed to the process in which the encoder generates the first compression information corresponding to the current video frame through the first neural network. The encoder can transform and encode the current video frame through the first encoding network (Encoding Network) and, after transform encoding, quantize the result to obtain the first feature of the current video frame. The first feature can thus be obtained based on the current video frame alone; the reference frame of the current video frame does not need to be used in the generation process of the first feature.
- the first encoding network may specifically be represented as a multi-layer convolutional network.
- the first feature includes the features of M pixels and can be expressed as an L-dimensional tensor, for example a one-dimensional tensor, a two-dimensional tensor, a three-dimensional tensor or a higher-dimensional tensor, etc.; no exhaustive list is made here. A minimal sketch of such an encoding network follows.
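- A hedged sketch of a first encoding network, assuming a PyTorch implementation; the channel count, kernel sizes and strides are illustrative assumptions, not the patent's architecture.

```python
import torch
import torch.nn as nn

class FirstEncodingNetwork(nn.Module):
    """Multi-layer convolutional transform followed by quantization."""
    def __init__(self, channels: int = 128):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=5, stride=2, padding=2),
        )

    def forward(self, current_frame: torch.Tensor) -> torch.Tensor:
        # The reference frame is not used here: the first feature is obtained
        # from the current video frame alone, then quantized by rounding.
        return torch.round(self.transform(current_frame))

# first_feature = FirstEncodingNetwork()(torch.rand(1, 3, 256, 256))
```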
- the encoder predicts the features of the current video frame according to the N reference frames of the current video frame to generate the first predicted feature of the current video frame, and generates the probability distribution of the first feature of the current video frame according to the first predicted feature of the current video frame.
- the encoder performs entropy coding on the first feature of the current video frame according to the probability distribution of the first feature of the current video frame to obtain the first compression information, as sketched below.
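- A hedged sketch of why the probability distribution matters for entropy coding: the per-element Gaussian parameters predicted from the reference frames determine how many bits the quantized first feature costs. A real codec would feed these probabilities to an arithmetic coder; the function name here is illustrative.

```python
import numpy as np
from scipy.stats import norm

def estimated_bits(first_feature: np.ndarray, mean: np.ndarray, variance: np.ndarray) -> float:
    """Theoretical code length of the quantized first feature under a Gaussian model."""
    std = np.sqrt(variance)
    # Probability mass of the quantization bin [x - 0.5, x + 0.5) per element.
    p = norm.cdf(first_feature + 0.5, mean, std) - norm.cdf(first_feature - 0.5, mean, std)
    return float(-np.log2(np.maximum(p, 1e-12)).sum())

# The closer the predicted mean is to the feature (and the smaller the variance),
# the larger p becomes and the smaller the first compression information.
```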
- the first prediction feature of the current video frame is the prediction result of the first feature of the current video frame; it also includes the features of M pixels and may specifically be represented as a tensor. The data shape of the first prediction feature of the current video frame is the same as the data shape of the first feature of the current video frame; that is, the first prediction feature and the first feature are both L-dimensional tensors, and the first dimension in the L dimensions of the first prediction feature has the same size as the second dimension in the L dimensions of the first feature, where L is an integer greater than or equal to 1, the first dimension is any one of the L dimensions of the first prediction feature, and the second dimension is the same dimension as the first dimension among the L dimensions of the first feature.
- the probability distribution of the first feature of the current video frame includes the mean value of the first feature of the current video frame and the variance of the first feature of the current video frame. Both the mean value of the first feature and the variance of the first feature can be expressed as L-dimensional tensors; the data shape of the mean value of the first feature is the same as the data shape of the first feature, and the data shape of the variance of the first feature is the same as the data shape of the first feature, so that the mean of the first feature includes a value corresponding to each of the M pixels, and the variance of the first feature includes a value corresponding to each of the M pixels.
- the encoder predicts the features of the current video frame according to the N reference frames of the current video frame to generate the first prediction feature of the current video frame.
- similarly, the features of the first video frame are predicted based on N second video frames, so as to generate the first predicted feature of the first video frame, and a probability distribution of the first feature of the first video frame is generated according to the first predicted feature of the first video frame; that is, the features of the current video frame are predicted to generate the first prediction feature of the current video frame, and the probability distribution of the first feature of the current video frame is generated according to the first prediction feature of the current video frame.
- since the encoder generates the probability distribution of the first feature of the current video frame according to the first prediction feature corresponding to the current video frame, and then compresses and encodes the first feature of the current video frame according to that probability distribution so as to obtain the first compression information of the current video frame, the higher the similarity between the first prediction feature and the first feature, the higher the compression rate of the first feature and the smaller the finally obtained first compression information of the current video frame. The first prediction feature of the current video frame is obtained by predicting the features of the current video frame according to the N reference frames of the current video frame, which improves the similarity between the first prediction feature and the first feature of the current video frame and can therefore reduce the size of the compressed first compression information; that is, it can not only ensure the quality of the reconstructed frame obtained by the decoder, but also reduce the amount of data transmitted between the encoder and the decoder.
- if the target compression information is obtained through the second neural network, the target compression information includes the second compression information of the second feature of the current video frame, and the reference frame of the current video frame is used in the generation process of the second feature of the current video frame.
- the second neural network includes a convolutional network and an entropy coding layer, and the convolutional network includes a plurality of convolutional layers and an excitation (ReLU) layer, wherein the generation process of the second feature of the current video frame is performed by using the reference frame of the current video frame through the convolutional network, the second feature of the current video frame is compressed by the entropy coding layer, and the second compression information corresponding to the current video frame is output.
- the encoder may perform compression encoding on the aforementioned optical flow to obtain the compressed optical flow.
- the second feature of the current video frame may only include the optical flow of the original current video frame relative to the reference frame of the current video frame.
- the encoder can also generate the predicted current video frame according to the optical flow of the original current video frame relative to the reference frame of the current video frame and the reference frame of the current video frame; the encoder then calculates the residual between the original current video frame and the predicted current video frame, performs compression encoding on the optical flow of the original current video frame relative to the reference frame of the current video frame and on the residual between the original current video frame and the predicted current video frame, and outputs the second compression information corresponding to the current video frame. In this case, the second feature of the current video frame includes both the optical flow of the original current video frame relative to the reference frame of the current video frame and the residual between the original current video frame and the predicted current video frame; a sketch of forming such a second feature is given below.
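- A hedged sketch of forming the second feature under these assumptions: single-channel frames, a dense optical flow given in pixels, and bilinear warping; the function name is illustrative.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def second_feature(current: np.ndarray, reference: np.ndarray, flow: np.ndarray):
    """current, reference: (H, W) arrays; flow: (2, H, W) as (dx, dy) in pixels."""
    h, w = reference.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    # Warp the reference frame along the optical flow to predict the current frame.
    predicted = map_coordinates(reference, [ys + flow[1], xs + flow[0]], order=1)
    residual = current - predicted
    return flow, residual  # both parts form the second feature to be compressed
```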
- the encoder can directly perform a compression operation on the second feature of the current video frame to obtain the second compression information corresponding to the current video frame.
- the aforementioned compression operation may be implemented by a neural network or a non-neural network manner.
- the aforementioned compression coding manner may be entropy coding.
- FIG. 5b is a schematic structural diagram of the second neural network provided by the embodiment of the present application.
- the encoder inputs the current video frame and the reference frame of the current video frame to the convolutional network, and performs optical flow estimation through the convolutional network to obtain the optical flow of the current video frame relative to the reference frame of the current video frame.
- the encoder generates the reconstructed frame of the current video frame through the convolutional network according to the optical flow of the current video frame relative to the reference frame of the current video frame and the reference frame of the current video frame, and obtains the residual between the reconstructed frame of the current video frame and the current video frame.
- the encoder can compress, through the entropy coding layer, the optical flow of the current video frame relative to the reference frame of the current video frame and the residual between the reconstructed frame of the current video frame and the current video frame, and output the second compression information of the current video frame.
- the example in FIG. 5b is only for the convenience of understanding this solution, and is not used to limit this solution.
- FIG. 5c is a schematic diagram of a comparison of the first feature and the second feature in the video frame compression method provided by the embodiment of the present application.
- Figure 5c includes two sub-diagrams (a) and (b): sub-diagram (a) of Figure 5c is a schematic diagram of generating the first feature of the current video frame, and sub-diagram (b) of Figure 5c is a schematic diagram of generating the second feature of the current video frame.
- the current video frame is input to the encoding network, and after transform encoding is performed by the encoding network, quantization (Q) is performed to obtain the first feature of the current video frame.
- the content in the dotted box in sub-schematic diagram (b) of FIG. 5c represents the second feature of the current video frame. Since sub-schematic diagram (b) of FIG. 5c shows the case where the second feature of the current video frame includes not only the optical flow of the original current video frame relative to the reference frame of the current video frame but also the residual between the original current video frame and the predicted current video frame, the process of generating the second feature of the current video frame is not described in detail one by one here.
- the encoder can also be configured with other neural networks for compressing and encoding video frames (hereinafter referred to as the "fifth neural network" for convenience of description), but the encoder is configured with at least the first neural network and the second neural network; the detailed process of using the first neural network and the second neural network to perform compression coding will be described in subsequent embodiments, and is not introduced here for the time being.
- the fifth neural network can be a neural network that directly compresses the current video frame, that is, the encoder can input the current video frame into the fifth neural network, and directly compress the current video frame through the fifth neural network to obtain The third compression information corresponding to the current video frame output by the fifth neural network.
- the fifth neural network may specifically use a convolutional neural network.
- the encoder generates indication information corresponding to the target compression information, where the indication information is used to indicate which target neural network among the first neural network and the second neural network the target compression information is obtained through. Specifically, the encoder may further generate at least one piece of indication information corresponding to the target compression information of at least one current video frame; the aforementioned at least one piece of indication information is used to indicate, for each piece of target compression information, through which of the first neural network and the second neural network it was obtained.
- the multiple indication information corresponding to the target compression information of multiple video frames in the current video sequence may specifically be expressed as a character string or other forms.
- the multiple indication information corresponding to the target compression information of multiple video frames in the current video sequence may specifically be 0010110101, where one character in the aforementioned character string represents one piece of indication information: when a piece of indication information is 0, it means that the current video frame corresponding to the indication information is compressed by the first neural network; when it is 1, it means that the current video frame corresponding to the indication information is compressed by the second neural network. A sketch follows.
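- A minimal sketch of reading such an indication string (the string below is the example from the text; the function name is illustrative):

```python
indications = "0010110101"

def network_for_frame(indications: str, frame_index: int) -> str:
    # '0' marks a frame compressed by the first neural network, '1' by the second.
    return "first" if indications[frame_index] == "0" else "second"

assert network_for_frame(indications, 2) == "second"  # third frame used the second network
```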
- the encoder may generate an indication information corresponding to the target compression information of a current video frame after obtaining the target compression information of a current video frame, that is, the encoder may alternately execute Step 303 and Steps 301 to 302.
- alternatively, the encoder may generate, after generating the target compression information of a preset number of current video frames through steps 301 and 302, a preset number of pieces of indication information corresponding to the preset number of current video frames.
- the preset number is an integer greater than 1. As an example, it may be 3, 4, 5, 6, or other values, which are not limited here.
- the encoder can also generate multiple target compression information corresponding to the entire current video sequence through steps 301 and 302, and then generate multiple indication information corresponding to the entire current video sequence through step 303.
- the implementation method is not limited here.
- the encoder sends target compression information of the current video frame.
- the encoder may send target compression information of at least one current video frame in the current video sequence to the decoder based on the constraints of a file transfer protocol (FTP).
- in one implementation, the encoder can directly send the at least one piece of target compression information to the decoder; in another implementation, the encoder can also send the at least one piece of target compression information to an intermediate device such as a server or a management center, which then sends it to the decoder.
- according to the foregoing method of generating the first prediction feature of the current video frame, the encoder can also, while sending the first compression information of the current video frame to the decoder, send one or two of the first inter-frame side information, the second inter-frame side information, the first intra-frame side information and the second intra-frame side information corresponding to the current video frame; correspondingly, the decoder can receive one or two of the first inter-frame side information, the second inter-frame side information, the first intra-frame side information and the second intra-frame side information corresponding to the current video frame. The specific type of information to be sent needs to be determined according to which type of information is required in the process of decompressing the first compression information of the current video frame.
- the encoder sends indication information corresponding to the target compression information of the current video frame.
- step 305 is an optional step: if step 303 is not performed, step 305 is not performed, and if step 303 is performed, step 305 is performed. If step 305 is performed, step 305 and step 304 may be performed simultaneously; that is, the encoder sends to the decoder, based on the constraints of the FTP protocol (that is, the file transfer protocol), the target compression information of at least one current video frame in the current video sequence and at least one piece of indication information corresponding one-to-one to that target compression information. Alternatively, step 304 and step 305 may be executed separately; the embodiment of the present application does not limit the execution order of step 304 and step 305.
- the decoder can obtain multiple pieces of indication information corresponding to multiple pieces of target compression information, so that the decoder can know which of the first neural network and the second neural network is to be used to perform the decompression operation for each video frame in the current video sequence; this is beneficial to reducing the time taken by the decoder to decode the compressed information, that is, to improving the efficiency of video frame transmission between the encoder and the decoder.
- when the compression information is obtained through the first neural network, the compression information carries the compression information of the first feature of the current video frame, and the reference frame of the current video frame is only used in the compression process of the first feature of the current video frame, not in the generation process of the first feature of the current video frame. Thus, after performing the decompression operation according to the first compression information to obtain the first feature of the current video frame, the decoder can obtain the reconstructed frame of the current video frame without using the reference frame of the current video frame. Therefore, when the compression information is obtained through the first neural network, the quality of the reconstructed frame of the current video frame will not depend on the quality of the reconstructed frame of the reference frame of the current video frame, and the accumulation of errors between frames is avoided, improving the quality of the reconstructed frame of the video frame. In addition, since the second feature of the current video frame is generated according to the reference frame of the current video frame, the data volume corresponding to the second compression information of the second feature is smaller than the data volume corresponding to the first compression information of the first feature. The encoder can therefore use the first neural network and the second neural network to process different video frames in the current video sequence, so as to combine the advantages of the first neural network and the second neural network and improve the quality of the reconstructed frames of the video frames while reducing the amount of data to be transmitted as much as possible.
- the encoder generates multiple pieces of compression information of the current video frame, from which the target compression information is determined.
- the encoder first compresses and encodes the current video frame through a plurality of different neural networks, and then determines the target compression information corresponding to the current video frame.
- FIG. 6 is a schematic diagram of another principle of a video frame compression method provided by an embodiment of the present application.
- the encoder compresses and encodes the current video frame through the first neural network, obtaining the first compression information of the first feature of the current video frame (that is, r p in FIG. 6 ), and generates a reconstructed frame of the current video frame according to the first compression information (that is, d p in FIG. 6 ).
- the current video frame is also compressed and encoded by the second neural network to obtain the second compression information of the second feature of the current video frame (that is, r r in FIG. 6 ), and a reconstructed frame of the current video frame is generated according to the second compression information (that is, d r in FIG. 6 ).
- the encoder determines the target compression information corresponding to the current video frame from the first compression information and the second compression information according to r p , d p , r r , d r and the network selection strategy. It should be understood that the example in FIG. 6 is only for the convenience of understanding this scheme, and is not used to limit this scheme.
- FIG. 7a is another schematic flowchart of the video frame compression method provided by the embodiment of the present application.
- the video frame compression method provided by the embodiment of the present application may include:
- the encoder compresses and encodes the current video frame through the first neural network to obtain the first compression information of the first feature of the current video frame; the reference frame of the current video frame is used in the compression process of the first feature of the current video frame.
- the encoder may compress and encode the current video frame through the first neural network in the multiple neural networks, so as to obtain the first compression information of the first feature of the current video frame.
- the meaning of the first feature of the current video frame, the meaning of the first compression information of the first feature of the current video frame, and the specific implementation of step 701 can all be found in the description of the corresponding embodiment of FIG. 3 , and are not repeated here.
- the encoder generates a first video frame through the first neural network, where the first video frame is a reconstructed frame of the current video frame.
- the encoder may further perform decompression processing through the first neural network to generate the first video frame, where the first video frame is a reconstructed frame of the current video frame.
- the first compression information includes the compression information of the first feature of the current video frame; the reference frame of the current video frame is used in the decompression process of the first compression information to obtain the first feature of the current video frame, and the first feature is used in the generation process of the reconstructed frame of the current video frame. That is, after decompressing the first compression information, the encoder can obtain the reconstructed frame of the current video frame without using the reference frame of the current video frame.
- the first neural network may also include an entropy decoding layer and a decoding (Decoding) network, wherein the decompression process of the first compression information of the current video frame is performed by using the reference frame of the current video frame through the entropy decoding layer, and the reconstructed frame of the current video frame is generated from the first feature of the current video frame through the decoding network.
- the encoder can predict the features of the current video frame according to the reconstructed frames of the N reference frames of the current video frame through the entropy decoding layer, so as to obtain the first prediction feature of the current video frame, and generate the probability distribution of the first feature of the current video frame according to the first prediction feature of the current video frame through the entropy decoding layer.
- the encoder performs entropy decoding on the first compressed information of the current video frame according to the probability distribution of the first feature of the current video frame through the entropy decoding layer, and obtains the first feature of the current video frame.
- the encoder also performs inverse transform decoding on the first feature of the current video frame through the first decoding network to obtain the reconstructed frame of the current video frame.
- the first decoding network corresponds to the first encoding network, and the first decoding network can also be expressed as a multi-layer convolutional network.
- the specific implementation manner in which the encoder generates the first prediction feature of the current video frame according to the reconstructed frames of the N reference frames of the current video frame is similar to that in which the encoder generates the first prediction feature according to the N reference frames of the current video frame; likewise, the specific implementation manner in which the encoder generates the probability distribution of the first feature of the current video frame according to the first prediction feature of the current video frame is as described before. For the specific implementation of the foregoing steps, reference may be made to the description of step 302 in the corresponding embodiment of FIG. 3 , which is not repeated here.
- the encoder compresses and encodes the current video frame through the second neural network to obtain the second compression information of the second feature of the current video frame; the reference frame of the current video frame is used in the generation process of the second feature of the current video frame.
- the encoder may compress and encode the current video frame through the second neural network in the plurality of neural networks, so as to obtain the second compression information of the second feature of the current video frame.
- the meaning of the second feature of the current video frame, the meaning of the second compression information of the second feature of the current video frame, and the specific implementation of this step can all be found in the description of the corresponding embodiment of FIG. 3 , and are not repeated here.
- the encoder generates a second video frame through the second neural network, where the second video frame is a reconstructed frame of the current video frame.
- the encoder may further perform decompression processing through the second neural network to generate the second video frame, where the second video frame is a reconstructed frame of the current video frame.
- the second neural network may also include an entropy decoding layer and a convolutional network; entropy decoding is performed on the second compression information through the entropy decoding layer, and the generation process of the reconstructed frame of the current video frame is performed through the convolutional network by using the reference frame of the current video frame and the second feature of the current video frame.
- the encoder can perform entropy decoding on the second compressed information through the entropy decoding layer to obtain the second feature of the current video frame, that is, obtain the optical flow of the original current video frame relative to the reference frame of the current video frame;
- the second feature of the current video frame further includes a residual between the original current video frame and the predicted current video frame.
- the encoder predicts the current video frame according to the optical flow of the original current video frame relative to the reference frame of the current video frame and the reference frame of the current video frame, obtaining the predicted current video frame; the encoder then generates the second video frame (that is, the reconstructed frame of the current video frame) according to the residual between the original current video frame and the predicted current video frame, and the predicted current video frame.
- the encoder determines the target compression information corresponding to the current video frame according to the first compression information, the first video frame, the second compression information and the second video frame, wherein either the determined target compression information is obtained through the first neural network, in which case the determined target compression information is the first compression information; or the determined target compression information is obtained through the second neural network, in which case the determined target compression information is the second compression information.
- the encoder may calculate, according to the first compression information and the first video frame, the first score value corresponding to the first compression information (that is, the first score value corresponding to the first neural network), and calculate, according to the second compression information and the second video frame, the second score value corresponding to the second compression information (that is, the second score value corresponding to the second neural network); the encoder then determines the target compression information corresponding to the current video frame according to the first score value and the second score value.
- if the target compression information is the first compression information obtained through the first neural network, the target neural network is the first neural network; if the target compression information is the second compression information obtained through the second neural network, the target neural network is the second neural network.
- the first score value is used to reflect the performance of performing the compression operation on the current video frame by using the first neural network
- the second score value is used to reflect the performance of the compression operation performed on the current video frame by using the second neural network.
- the calculation process of the first score value and the second score value: after obtaining the first compression information and the first video frame, the encoder can obtain the data amount of the first compression information, calculate the first compression rate of the first compression information relative to the current video frame, and calculate the image quality of the first video frame; the first score value is then generated according to the first compression rate of the first compression information relative to the current video frame and the image quality of the first video frame.
- the larger the data volume of the first compression information, the larger the value of the first score value; the smaller the data volume of the first compression information, the smaller the value of the first score value.
- the first compression ratio of the first compression information relative to the current video frame may refer to a ratio between the data amount of the first compression information and the data amount of the current video frame.
- the encoder can calculate the structural similarity (structural similarity index, SSIM) between the current video frame and the first video frame to indicate the image quality of the first video frame. It should be noted that the encoder can also measure the image quality of the first video frame by other indicators; as an example, the "structural similarity" indicator can be replaced by the multiscale structural similarity index (MS-SSIM), the peak signal-to-noise ratio (peak signal to noise ratio, PSNR) or other indicators, etc., which are not exhaustive here.
- the encoder may perform a weighted summation of the first compression rate and the image quality of the first video frame to generate the first score value corresponding to the first neural network. It should be noted that, after obtaining the first compression rate and the image quality of the first video frame, the encoder may also obtain the first score value in other ways, for example by multiplying the first compression rate and the image quality, etc.; how exactly the first score value is obtained from the first compression rate and the image quality of the first video frame can be flexibly determined in combination with the actual application scenario, and no exhaustive list is given here. A sketch of one such scoring follows.
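- A hedged sketch of one possible scoring, assuming a PSNR-based quality term and illustrative weights (the exact combination is not fixed by the text):

```python
import numpy as np

def score_value(compressed_bits: int, original: np.ndarray, reconstructed: np.ndarray,
                w_rate: float = 1.0, w_quality: float = 0.05) -> float:
    rate = compressed_bits / original.size  # compression rate, in bits per pixel
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    psnr = 10.0 * np.log10(255.0 ** 2 / max(mse, 1e-12))
    # Lower is better: a small bitstream and a high PSNR give a small score value.
    return w_rate * rate - w_quality * psnr

# The encoder computes this for both networks and keeps the output of the
# network whose score value is smaller.
```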
- similarly, the encoder can calculate the data amount of the second compression information and the image quality of the second video frame, and then generate the second score value according to the data amount of the second compression information and the image quality of the second video frame; the generation method of the second score value is similar to that of the first score value, which can be referred to above and is not repeated here.
- the process of determining the target compression information corresponding to the current video frame according to the first score value and the second score value: after calculating the first score value corresponding to the first compression information and the second score value corresponding to the second compression information, the encoder can select the target score value with the smaller value from the first score value and the second score value, and determine the compression information corresponding to the target score value as the target compression information. The encoder performs the foregoing operations on each video frame in the video sequence to obtain the target compression information corresponding to each video frame.
- FIG. 7b is a schematic diagram of the first score value and the second score value in the video frame compression method provided by the embodiment of the present application.
- the abscissa of Fig. 7b represents the position information of a video frame in the current video sequence
- the ordinate of Fig. 7b represents the score value corresponding to each video frame
- A1 represents the polyline corresponding to the first score value in the process of compressing multiple video frames in the current video sequence.
- A2 represents the polyline corresponding to the second score value in the process of compressing multiple video frames in the current video sequence.
- A3 represents the first score value and the second score value obtained when the first neural network and the second neural network are used respectively to compress video frame 1. It can be seen from Figure 7b that the score value obtained by using the first neural network to compress video frame 1 is lower, so the encoder will use the first neural network to process video frame 1. After the first neural network is used to process video frame 1, the first score value and the second score value corresponding to video frame 2 (that is, the next video frame after video frame 1 in the current video sequence) both drop significantly; that is, every time a video frame is compressed by the first neural network, a new cycle is triggered.
- within one cycle, the value of the first score value increases linearly, the value of the second score value also increases linearly, and the growth rate of the second score value is higher than the growth rate of the first score value.
- the multiple first score values within one cycle can be fitted into the following formula: l pi +k pi *t (1), where l pi represents the starting point of the straight line corresponding to the multiple first score values in one cycle, that is, the offset of the first fitting formula corresponding to the multiple first score values; k pi represents the slope of the straight line corresponding to the multiple first score values in one cycle, that is, the coefficient of the first fitting formula; and t represents the number of video frames in the interval between the current video frame and the first video frame of the cycle. As an example, the value of t corresponding to the second video frame in a period is 1.
- similarly, the multiple second score values within one cycle can be fitted into the following formula: l pr +k pr *t (2), where l pr represents the starting point of the straight line corresponding to the multiple second score values in one cycle, that is, the offset of the second fitting formula corresponding to the multiple second score values, and k pr represents the slope of the straight line corresponding to the multiple second score values in one cycle, that is, the coefficient of the second fitting formula corresponding to the multiple second score values.
- the total score value corresponding to one cycle can then be fitted into the formula: loss = l pr +(l pr +k pr )+...+(l pr +(T-2)*k pr )+l pi +(T-1)*k pi (3), where loss represents the sum of all score values in one cycle and T represents the total number of video frames in one cycle. The first T-1 video frames in a cycle are compressed by the second neural network, and the last video frame is compressed by the first neural network; therefore, l pr +(l pr +k pr )+...+(l pr +(T-2)*k pr ) represents the sum of the second score values corresponding to all video frames compressed by the second neural network in one cycle, and l pi +(T-1)*k pi represents the first score value corresponding to the last video frame in each period.
- the encoder can use one cycle as the calculation unit, with the goal of minimizing the average value of the total score values in each cycle, which is shown in the form of a formula as: min T (loss/T) (4), where T and loss have the meanings given for formula (3) above. Substituting formula (3) into formula (4), the following formula can be obtained: min T [((T-1)*l pr +((T-1)*(T-2)/2)*k pr +l pi +(T-1)*k pi )/T] (5). A numeric sketch of solving formula (5) is given below.
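- A hedged numeric sketch of formula (5): given the fitted offsets and coefficients, search for the cycle length T that minimizes the average score per frame; the search bound t_max and the function names are assumptions of the sketch.

```python
def average_score(T: int, l_pi: float, k_pi: float, l_pr: float, k_pr: float) -> float:
    # Formula (3): T-1 frames on the second fit plus the last frame on the first fit.
    loss = (T - 1) * l_pr + (T - 1) * (T - 2) / 2 * k_pr + l_pi + (T - 1) * k_pi
    return loss / T  # formula (4)

def optimal_cycle_length(l_pi: float, k_pi: float, l_pr: float, k_pr: float,
                         t_max: int = 100) -> int:
    return min(range(2, t_max + 1),
               key=lambda T: average_score(T, l_pi, k_pi, l_pr, k_pr))
```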
- in one implementation, the encoder first obtains the two first score values corresponding to the first two current video frames in a cycle, and obtains the two second score values corresponding to the first two current video frames in the cycle; the method of generating the first score value and the second score value corresponding to one current video frame has been described above.
- the encoder generates, according to the two first score values corresponding to the first two current video frames in a cycle, the coefficients and offset values of the first fitting formula corresponding to the multiple first score values in a cycle, That is, the values of l pi and k pi can be generated.
- the encoder generates, according to the two second score values corresponding to the first two current video frames in one cycle, the values of the coefficients and offsets of the second fitting formula corresponding to the plurality of second score values in one cycle, That is, the values of l pr and k pr can be generated.
- after obtaining the values of the coefficients and offsets of the first fitting formula and the values of the coefficients and offsets of the second fitting formula, the encoder performs the process of determining the target compression information of the current video frame.
- when t is equal to 0, the encoder determines the second compression information corresponding to the first video frame in a period as the target compression information of the current video frame (that is, the first video frame in the period); that is, the target neural network corresponding to the first video frame in a cycle is the second neural network, and the encoder continues to process the situation when t is equal to 1.
- when t is equal to 1, that is, after the encoder has obtained the two first score values corresponding to the first two current video frames in a cycle and the two second score values corresponding to the first two current video frames in the cycle, the value of T can be calculated based on formula (5). If T ≤ 3, the encoder determines the first compression information corresponding to the second video frame in the period as the target compression information of the current video frame (that is, the second video frame in the period); that is, the target neural network corresponding to the second video frame in the cycle is the first neural network, and the encoder is triggered to enter the next cycle.
- if T > 3, the encoder determines the second compression information corresponding to the second video frame in the period as the target compression information of the current video frame (that is, the second video frame in the period); that is, the target neural network corresponding to the second video frame in the cycle is the second neural network, and the encoder continues to process the situation when t is equal to 2.
- when t is equal to 2, the encoder obtains the first score value and the second score value corresponding to the third video frame in the period (that is, an example of the current video frame); the method of generating the first score value and the second score value corresponding to one current video frame will not be repeated here.
- the encoder recalculates the coefficients and offset values of the first fitting formula (that is, the recalculated values of l pi and k pi ) according to the three first score values corresponding to the first three video frames in the cycle, recalculates the coefficients and offset values of the second fitting formula (that is, the recalculated values of l pr and k pr ) according to the three second score values corresponding to the first three video frames in the cycle, and then recalculates the value of T according to the recalculated coefficients and offset values of the first fitting formula and the recalculated coefficients and offset values of the second fitting formula.
- if, according to the recalculated value of T, the cycle should end at the third video frame, the encoder may determine the first compression information corresponding to the third video frame in the period as the target compression information of the current video frame (that is, the third video frame in the period); that is, the target neural network corresponding to the third video frame in the cycle is the first neural network, and the encoder is triggered to enter the next cycle.
- otherwise, the encoder may determine the second compression information corresponding to the third video frame in the period as the target compression information of the current video frame (that is, the third video frame in the period); that is, the target neural network corresponding to the third video frame in the cycle is the second neural network, and the encoder continues to process the situation when t is equal to 3.
- when t is equal to 3 or more, the processing method of the encoder is similar to the processing method when t is equal to 2, and is not repeated here. A simplified sketch of this per-cycle decision procedure follows.
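- A simplified, hedged sketch of the per-cycle procedure just described, reusing optimal_cycle_length from the sketch above; the exact stopping rule (here t >= T - 1) is an assumption, as the text applies slightly different thresholds at each step.

```python
import numpy as np

def should_close_cycle(t: int, first_scores: list, second_scores: list) -> bool:
    """t: 0-based index of the current frame within the cycle; the score lists
    hold the first/second score values observed for frames 0..t of the cycle."""
    if t < 1:
        return False  # the first frame of a cycle always uses the second network
    xs = np.arange(t + 1)
    k_pi, l_pi = np.polyfit(xs, first_scores, 1)   # slope and offset of the first fit
    k_pr, l_pr = np.polyfit(xs, second_scores, 1)  # slope and offset of the second fit
    T = optimal_cycle_length(l_pi, k_pi, l_pr, k_pr)
    return t >= T - 1  # close the cycle by compressing this frame with the first network
```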
- in another implementation, when t is equal to 0, the encoder determines the second compression information corresponding to the first video frame in a period as the target compression information of the current video frame (that is, the first video frame in the period); that is, the target neural network corresponding to the first video frame in a cycle is the second neural network, and the encoder continues to process the situation when t is equal to 1.
- when t is equal to 1, the encoder can obtain the two first score values corresponding to the first two current video frames in the cycle and the two second score values corresponding to the first two current video frames in the cycle, and generate the coefficients and offset values of the first fitting formula (that is, the values of l pi and k pi ) and the coefficients and offset values of the second fitting formula (that is, the values of l pr and k pr ).
- depending on which choice yields the smaller average of the total score value over the whole cycle, the encoder either determines the first compression information corresponding to the second video frame in the period as the target compression information of the current video frame, that is, the target neural network corresponding to the second video frame in the period is the first neural network, and the encoder is triggered to enter a new cycle; or determines the second compression information corresponding to the second video frame in the period as the target compression information of the current video frame, that is, the target neural network corresponding to the second video frame in the period is the second neural network, and the encoder continues to process the case where t is equal to 2.
- when t is equal to 2, the encoder can obtain the first score value and the second score value corresponding to the third video frame in the cycle; the method of generating the first score value and the second score value corresponding to one current video frame is not described here again. The encoder recalculates the coefficients and offset values of the first fitting formula (that is, the recalculated values of l pi and k pi ) and of the second fitting formula (that is, the recalculated values of l pr and k pr ) according to the three first score values and the three second score values corresponding to the first three video frames in the cycle, and then computes the updated first average value and the updated second average value.
- the updated first average value is the average value of the total score value of the whole cycle obtained if the third video frame in the cycle (that is, an example of the current video frame) is compressed by the first neural network; the updated second average value is the average value of the total score value of the whole cycle obtained if the second neural network is used to compress the third video frame in the period and the first neural network is used to compress the fourth video frame in the period.
- if the updated first average value is smaller, the encoder determines the first compression information corresponding to the third video frame in the period as the target compression information of the current video frame; that is, the target neural network corresponding to the third video frame in the period is the first neural network, and the encoder is triggered to enter a new cycle. Otherwise, the encoder determines the second compression information corresponding to the third video frame in the period as the target compression information of the current video frame; that is, the target neural network corresponding to the third video frame in the period is the second neural network, and the encoder continues to process the case where t is equal to 3.
- the processing method of the encoder is similar to the processing method when t is equal to 2, which is not repeated here.
- during research, the technicians identified how the first score value and the second score value vary within a single cycle, and took minimizing the average value of the total score value over a cycle as the optimization goal. That is, when determining the target compression information corresponding to each current video frame, not only the score value of the current video frame but also the average score value over the whole cycle is considered, so as to further reduce the score values corresponding to all the video frames in the entire current video sequence and thereby further improve the performance of the compression information corresponding to the entire current video sequence; in addition, two different implementations are provided, which improves the implementation flexibility of this solution.
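As a concrete illustration of this selection strategy, the following is a minimal sketch under the assumption that both fitting formulas are linear in the frame index t, of the form score(t) = k·t + l; all identifiers (fit_line, choose_network, and so on) are hypothetical and do not come from the disclosure. It refits the first fitting formula from the first score values observed so far, then compares the average total score for ending the cycle at the current frame against extending it by one frame.

```python
import numpy as np

def fit_line(ts, scores):
    # Least-squares fit of score(t) = k * t + l (coefficient k, offset l).
    k, l = np.polyfit(ts, scores, 1)
    return k, l

def choose_network(first_scores, second_scores, t):
    """Decide, at frame index t of the current cycle, whether to end the
    cycle with the first neural network or extend it with the second one.

    first_scores / second_scores: score values observed for frames 0..t.
    Returns "first" (end the cycle) or "second" (continue the cycle).
    """
    ts = np.arange(t + 1)
    k_pi, l_pi = fit_line(ts, first_scores)    # first fitting formula
    k_pr, l_pr = fit_line(ts, second_scores)   # second fitting formula

    # Average total score if the cycle ends at frame t: frames 0..t-1 used
    # the second network, frame t uses the first network.
    end_now = (sum(second_scores[:t]) + first_scores[t]) / (t + 1)

    # Predicted average if frame t uses the second network and the cycle
    # ends at frame t+1 with the first network.
    pred_first_next = k_pi * (t + 1) + l_pi
    extend = (sum(second_scores[:t + 1]) + pred_first_next) / (t + 2)

    return "first" if end_now <= extend else "second"
```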
- the encoder also uses one cycle as a calculation unit, and the goal is to minimize the average value of the total score values in each cycle.
- for the cases where t equals 0 and 1, the processing is similar to that described above and is not repeated here.
- the encoder may determine the second compression information corresponding to the third video frame in a cycle (that is, the current video frame) as the target compression information, that is, the target neural network corresponding to the third video frame in the cycle is the second neural network, and continue to process the case where t equals 3.
- the processing method of the encoder is similar to the processing method when t is equal to 2, which is not repeated here.
- FIG. 7c is a schematic diagram of calculating the values of the coefficients and offsets of the first fitting formula and the second fitting formula in the video frame compression method provided by this embodiment of the application. In FIG. 7c, the region between the two vertical dashed lines represents the processing of the video frames in one cycle; one cycle includes compressing and encoding multiple video frames through the second neural network and compressing and encoding the last video frame in the cycle through the first neural network.
- the encoder first obtains the two first score values and the two second score values corresponding to the first two video frames in a cycle (that is, the first video frame and the second video frame), and calculates the values of the coefficient and offset of the first fitting formula (that is, the values of l_pi and k_pi) and the values of the coefficient and offset of the second fitting formula (that is, the values of l_pr and k_pr).
- the values of the coefficients and offsets of the first fitting formula and the second fitting formula are calculated only from the two first score values and the two second score values corresponding to the first two video frames in a cycle; then, taking the lowest average value of the total score value over the entire cycle as the optimization goal, the optimal number of video frames in the current cycle is obtained. This saves the time spent calculating the parameters of the first fitting formula and the second fitting formula, thereby improving the efficiency of generating the compression information of the current video sequence.
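A two-point linear fit is fully determined, so no least-squares step is needed in this variant. The sketch below, again with purely hypothetical names, derives k and l of each fitting formula from the scores of the first two frames and then searches for the cycle length minimizing the predicted average total score.

```python
def fit_from_two_points(s0, s1):
    # With scores at t = 0 and t = 1, the line score(t) = k * t + l
    # is determined exactly: k is the slope, l the value at t = 0.
    k = s1 - s0
    l = s0
    return k, l

def optimal_cycle_length(first_two_first_scores, first_two_second_scores,
                         max_len=32):
    k_pi, l_pi = fit_from_two_points(*first_two_first_scores)
    k_pr, l_pr = fit_from_two_points(*first_two_second_scores)

    best_len, best_avg = None, float("inf")
    for n in range(1, max_len + 1):
        # Frames 0..n-2 predicted to use the second network, frame n-1
        # the first network; average the predicted total score.
        total = sum(k_pr * t + l_pr for t in range(n - 1))
        total += k_pi * (n - 1) + l_pi
        avg = total / n
        if avg < best_avg:
            best_len, best_avg = n, avg
    return best_len
```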
- the encoder also uses one cycle as a calculation unit, and the goal is to minimize the average value of the total score values in each cycle.
- for the cases where t equals 0 and 1, the processing is similar to that described above and is not repeated here.
- according to the score values corresponding to the first three video frames in a cycle (the third being an example of the current video frame), only the values of the coefficient and offset of the second fitting formula are recalculated; the values of the first fitting formula are not recalculated.
- the encoder may determine the first compression information corresponding to the third video frame in a cycle (that is, the current video frame) as the target compression information, that is, the target neural network corresponding to the third video frame in the cycle is the first neural network, and entry into the next cycle is triggered.
- the encoder may determine the second compression information corresponding to the third video frame in a cycle (that is, the current video frame) as the target compression information, that is, the target neural network corresponding to the third video frame in the cycle is the second neural network, and continue to process the case where t equals 3.
- the processing method of the encoder is similar to the processing method when t is equal to 2, which is not repeated here.
- in this way, the compression information that finally needs to be sent is selected;
- the network selection strategy determines the target neural network from the first neural network and the second neural network, and the target neural network is then used to generate the target compression information, which improves the performance of the compression information corresponding to the entire current video sequence as much as possible.
- the encoder generates indication information corresponding to the target compression information, where the indication information is used to indicate which of the first neural network and the second neural network is the target neural network through which the target compression information is obtained.
- the encoder sends target compression information of the current video frame.
- the encoder sends indication information corresponding to the target compression information of the current video frame.
- steps 706 and 708 are mandatory steps.
- for the specific implementation of steps 706 to 708, reference may be made to the description of steps 303 to 305 in the embodiment corresponding to FIG. 3, which is not repeated here. It should be noted that this embodiment of the present application does not limit the execution order of steps 707 and 708: steps 707 and 708 may be executed simultaneously, step 707 may be executed first and then step 708, or step 708 may be executed first and then step 707.
- the final compression information is selected from the first compression information and the second compression information.
- the performance of the compressed information corresponding to the entire current video sequence can be improved as much as possible.
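To make the signalling concrete: one plausible, entirely hypothetical bitstream layout is a one-bit flag per frame telling the decoder which decompression network to use, followed by the selected compression payload. The names below are illustrative and not taken from the disclosure.

```python
from dataclasses import dataclass

@dataclass
class EncodedFrame:
    use_first_network: bool   # indication information: which network produced it
    payload: bytes            # target compression information

def emit(frame_payloads):
    """frame_payloads: iterable of (network_id, compressed_bytes) pairs,
    where network_id is 1 or 2. Builds the per-frame signalled stream."""
    stream = []
    for network_id, payload in frame_payloads:
        stream.append(EncodedFrame(use_first_network=(network_id == 1),
                                   payload=payload))
    return stream
```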
- FIG. 8 is another schematic flowchart of a video frame compression method provided by an embodiment of the present application.
- the video frame compression method provided by the embodiment of the present application may include:
- the encoder compresses and encodes the third video frame through the first neural network to obtain first compression information corresponding to the third video frame, where the first compression information includes compression information of the first feature of the third video frame, and the reference frame of the third video frame is used in the compression process of the first feature of the third video frame.
- when the encoder processes the third video frame in the current video sequence, it determines that the target compression information of the third video frame is the first compression information corresponding to the third video frame generated by the first neural network.
- the third video frame is a video frame in the current video sequence, and the concept of the third video frame is similar to that of the current video frame.
- the encoder compresses and encodes the fourth video frame through the second neural network to obtain second compression information corresponding to the fourth video frame, where the second compression information includes compression information of the second feature of the fourth video frame, the reference frame of the fourth video frame is used in the process of generating the second feature of the fourth video frame, and the third video frame and the fourth video frame are different video frames in the same video sequence.
- when the encoder processes the fourth video frame in the current video sequence, it determines that the target compression information of the fourth video frame is the second compression information corresponding to the fourth video frame generated by the second neural network.
- the fourth video frame is a video frame in the current video sequence, the concept of the fourth video frame is similar to that of the current video frame, and the third video frame and the fourth video frame are different video frames in the same current video sequence.
- for the meaning of the second feature of the fourth video frame, refer to the description of the meaning of the "second feature of the current video frame" in the embodiment corresponding to FIG. 3; for the meaning of the "reference frame of the fourth video frame", the specific implementation in which the encoder generates the second compression information corresponding to the fourth video frame, and the specific implementation in which the encoder determines the compression information of the fourth video frame that finally needs to be sent to the decoder, reference may likewise be made to the descriptions in the embodiment corresponding to FIG. 3, which are not elaborated here.
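The two steps above can be read as one loop over the sequence in which each frame is routed to one of the two compression networks. A minimal sketch under that reading, with hypothetical network objects and a pluggable selection strategy:

```python
def compress_sequence(frames, first_nn, second_nn, select_network):
    """Route each frame to the first or second compression network.

    select_network(index) returns 1 or 2 according to the chosen network
    selection strategy (for example, the cycle-based one sketched above).
    """
    results = []
    reference = None
    for i, frame in enumerate(frames):
        if select_network(i) == 1:
            # Reference frame used only when compressing the first feature.
            info = first_nn.compress(frame, entropy_context=reference)
        else:
            # Reference frame used when generating the second feature.
            info = second_nn.compress(frame, reference=reference)
        reference = frame  # placeholder; a real codec would use the reconstruction
        results.append(info)
    return results
```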
- step 801 may be performed first and then step 802, or step 802 may be performed first and then step 801; this needs to be determined in combination with the actual application scenario and is not limited here.
- the encoder generates indication information, where the indication information is used to indicate that the first compressed information is obtained through the first neural network and the second compressed information is obtained through the second neural network.
- the encoder may generate indication information in one-to-one correspondence with the one or more pieces of target compression information, where the target compression information is embodied as the first compression information or the second compression information.
- the encoder may first perform steps 801 and 802 multiple times, and then generate the indication information corresponding to the target compression information of each video frame in the entire current video sequence through step 803 one-to-one.
- the encoder may also perform step 803 once every time step 801 or step 802 is performed.
- the encoder may also perform step 803 once after performing step 801 and/or step 802 a preset number of times, where the preset number of times is an integer greater than 1, for example 3, 4, 5, 6, or another value, which is not limited here.
- step 803 is a mandatory step; however, if in step 801 or 802 the encoder obtains the target compression information of the current video frame (that is, the third video frame or the fourth video frame) in the manner shown in the embodiment corresponding to FIG. 3, then step 803 is an optional step.
- for a specific implementation of step 803, reference may be made to the description of step 303 in the embodiment corresponding to FIG. 3, which is not repeated here.
- the encoder sends target compression information corresponding to the current video frame, where the target compression information is the first compression information or the second compression information.
- after the encoder generates at least one piece of first compression information in one-to-one correspondence with at least one third video frame, and/or at least one piece of second compression information in one-to-one correspondence with at least one fourth video frame, the encoder sends at least one piece of target compression information (that is, the first compression information and/or the second compression information) in one-to-one correspondence with at least one current video frame (that is, the third video frame and/or the fourth video frame).
- FIG. 9 is a schematic diagram of a video frame compression method provided by an embodiment of the present application.
- the encoder uses the first neural network to compress and encode some video frames in the current video sequence, uses the second neural network to compress and encode another part of the video frames in the current video sequence, and then sends the target compression information corresponding to all the current video frames in the same video sequence, where the target compression information is the first compression information or the second compression information. It should be understood that the example in FIG. 9 is only for the convenience of understanding this solution and is not intended to limit it.
- the encoder sends indication information corresponding to the current video frame.
- step 805 is an optional step. If step 803 is not performed, step 805 is not performed, and if step 803 is performed, step 805 is performed. If step 805 is performed, step 805 and step 804 may be performed simultaneously.
- for a specific implementation of step 805, reference may be made to the description of step 305 in the embodiment corresponding to FIG. 3 above, which is not repeated here.
- when the third video frame is compressed and encoded by the first neural network, since the first compression information carries the compression information of the first feature of the current video frame and the reference frame is used only in the compression process of the first feature of the current video frame, not in the generation process of the first feature of the current video frame, the decoder, performing a decompression operation according to the first compression information to obtain the first feature of the current video frame, can obtain the reconstructed frame of the current video frame without the reference frame of the current video frame. Therefore, when the target compression information is obtained through the first neural network, the quality of the reconstructed frame of the current video frame does not depend on the quality of the reconstructed frame of the reference frame of the current video frame, which avoids the accumulation of errors between frames and improves the quality of the reconstructed frames of the video frames. When the fourth video frame is compressed and encoded by the second neural network, since the second feature of the fourth video frame is generated according to the reference frame of the fourth video frame, the amount of data corresponding to the second compression information is smaller than that of the first compression information.
- the first neural network and the second neural network are used to process different video frames in the current video sequence, in order to integrate the advantages of the two networks, reduce the amount of data to be transmitted as much as possible, and improve the quality of the reconstructed frames of the video frames.
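The asymmetry between the two compression networks can be sketched as follows; the injected callables stand in for the encoding network and entropy layers described above and are hypothetical, not the disclosed implementation.

```python
def compress_with_first_network(frame, reference, encode, entropy_encode):
    # The reference frame conditions only the entropy model; the feature
    # itself is computed from the current frame alone, so reconstructing
    # the frame never depends on a (possibly degraded) reference.
    feature = encode(frame)
    return entropy_encode(feature, context=reference)

def compress_with_second_network(frame, reference, encode, entropy_encode):
    # The reference frame participates in generating the (inter) feature,
    # which makes the payload smaller but lets reference errors propagate.
    feature = encode(frame, reference)
    return entropy_encode(feature, context=None)
```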
- FIG. 10a is a schematic flowchart of a method for decompressing a video frame provided by an embodiment of the present application. The video frame decompression method provided by this embodiment of the application may include:
- a decoder receives target compression information corresponding to at least one current video frame.
- the encoder may send at least one piece of target compression information in one-to-one correspondence with at least one current video frame in the current video sequence to the decoder; correspondingly, the decoder may receive the at least one piece of target compression information in one-to-one correspondence with the at least one current video frame in the current video sequence.
- in one implementation, the decoder may directly receive the target compression information corresponding to the at least one current video frame from the encoder; in another implementation, the decoder may receive the target compression information corresponding to the at least one current video frame from an intermediate device such as a server or a management center.
- the decoder receives indication information corresponding to the target compression information.
- the decoder receives at least one indication information corresponding to at least one target compressed information.
- for the meaning of the indication information, reference may be made to the description in the embodiment corresponding to FIG. 3, which is not repeated here.
- step 1002 is an optional step. If step 1002 is executed, the embodiment of the present application does not limit the execution order of steps 1001 and 1002, and steps 1001 and 1002 may be executed simultaneously.
- the decoder selects a target neural network corresponding to the current video frame from multiple neural networks, where the multiple neural networks include a third neural network and a fourth neural network.
- the decoder after obtaining at least one target compression information corresponding to at least one current video frame, the decoder needs to perform a decompression operation by selecting a target neural network from multiple neural networks to obtain each reconstructed frame of the current video frame.
- the plurality of neural networks include a third neural network and a fourth neural network, and both the third neural network and the fourth neural network are neural networks for performing decompression operations.
- the third neural network corresponds to the first neural network; that is, if the target compression information of a current video frame is the first compression information of the current video frame obtained through the first neural network, the decoder needs to perform a decompression operation on the first compression information of the current video frame through the third neural network to obtain the reconstructed frame of the current video frame.
- the fourth neural network corresponds to the second neural network; that is, if the target compression information of a current video frame is the second compression information of the current video frame obtained through the second neural network, the decoder needs to perform a decompression operation on the second compression information of the current video frame through the fourth neural network to obtain the reconstructed frame of the current video frame.
- if step 1002 is performed, the decoder can, according to the plurality of pieces of indication information in one-to-one correspondence with the plurality of pieces of target compression information, directly determine whether each piece of target compression information was obtained through the first neural network or the second neural network, and accordingly select the third neural network or the fourth neural network as the target neural network.
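A minimal dispatch sketch of this selection, reusing the hypothetical EncodedFrame from the encoder-side sketch above and hypothetical objects for the two decompression networks:

```python
def decompress_all(encoded_frames, third_nn, fourth_nn):
    """encoded_frames: EncodedFrame objects carrying the indication flag.
    The flag says which compression network produced the payload, so the
    decoder picks the matching decompression network."""
    reconstructed = []
    reference = None
    for ef in encoded_frames:
        if ef.use_first_network:
            frame = third_nn.decompress(ef.payload, entropy_context=reference)
        else:
            frame = fourth_nn.decompress(ef.payload, reference=reference)
        reference = frame
        reconstructed.append(frame)
    return reconstructed
```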
- FIG. 10b is another schematic flowchart of a method for decompressing a video frame provided by an embodiment of the present application.
- after the decoder obtains the target compression information corresponding to the current video frame and the indication information corresponding to the target compression information, it can determine the target neural network from the third neural network and the fourth neural network according to the indication information, and use the target neural network to decompress the target compression information corresponding to the current video frame to obtain the reconstructed frame of the current video frame.
- the example in FIG. 10b is only for the convenience of understanding this solution and is not intended to limit it.
- the decoder may obtain position information, in one-to-one correspondence with each piece of target compression information, of the current video frame in the current video sequence, where the position information is used to indicate that the current video frame corresponding to each piece of target compression information is the Xth frame in the current video sequence; the decoder then selects, from the third neural network and the fourth neural network according to a preset rule, the target neural network corresponding to the position information.
- the preset rule may be to alternately select the third neural network or the fourth neural network according to a certain pattern; that is, after the decoder uses the third neural network to decompress n video frames of the current video sequence, it uses the fourth neural network to decompress the next m video frames; or, after the decoder uses the fourth neural network to decompress m video frames of the current video sequence, it uses the third neural network to decompress the next n video frames.
- the values of n and m may both be integers greater than or equal to 1, and the values of n and m may be the same or different.
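Under this preset rule, the network choice is a pure function of the frame position; a sketch follows, where n and m are illustrative values rather than values from the disclosure.

```python
def network_for_position(x, n=1, m=7):
    """Return 'third' or 'fourth' for the Xth frame (0-based) under the
    alternating rule: n frames via the third network, then m frames via
    the fourth network, repeating."""
    return "third" if (x % (n + m)) < n else "fourth"

# Example: with n=1, m=7, frame 0 of every 8-frame group uses the third
# network and frames 1..7 use the fourth network.
```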
- the specific implementation in which the decoder selects, according to the preset rule, the target neural network corresponding to the position information from the multiple neural networks including the third neural network and the fourth neural network is similar to the specific implementation in which the encoder selects, according to the network selection strategy, the target neural network from the multiple neural networks including the first neural network and the second neural network. The difference is that the "first neural network" in the embodiment corresponding to FIG. 3 is replaced with the "third neural network" in this embodiment, and the "second neural network" in the embodiment corresponding to FIG. 3 is replaced with the "fourth neural network" in this embodiment. Reference may be made directly to the description in the embodiment corresponding to FIG. 3, which is not repeated here.
- the decoder performs a decompression operation through the target neural network according to the target compression information to obtain the reconstructed frame of the current video frame. If the target neural network is the third neural network, the target compression information includes the first compression information of the first feature of the current video frame, the reference frame of the current video frame is used in the decompression process of the first compression information to obtain the first feature of the current video frame, and the first feature of the current video frame is used in the generation process of the reconstructed frame of the current video frame. If the target neural network is the fourth neural network, the target compression information includes the second compression information of the second feature of the current video frame, the second compression information is used by the decoder to perform a decompression operation to obtain the second feature of the current video frame, and the reference frame of the current video frame and the second feature of the current video frame are used in the generation process of the reconstructed frame of the current video frame.
- if the target compression information includes the first compression information of the first feature of the current video frame, the third neural network includes an entropy decoding layer and a decoding network; the entropy decoding layer performs the entropy decoding process of the first compression information of the current video frame using the reference frame of the current video frame, and the decoding network generates the reconstructed frame of the current video frame using the first feature of the current video frame.
- for a specific implementation of the decoder performing step 1004, reference may be made to the description of step 702 in the embodiment corresponding to FIG. 7a; the difference is that in step 702, the encoder performs decompression processing through the first neural network according to the first compression information corresponding to the current video frame to obtain the reconstructed frame of the current video frame, whereas in step 1004, the decoder performs decompression processing through the third neural network according to the first compression information corresponding to the current video frame to obtain the reconstructed frame of the current video frame.
- if the target compression information includes the second compression information of the second feature of the current video frame, the fourth neural network includes an entropy decoding layer and a convolutional network; entropy decoding is performed on the second compression information through the entropy decoding layer, and the process of generating the reconstructed frame of the current video frame using the reference frame of the current video frame and the second feature of the current video frame is performed through the convolutional network.
- for a specific implementation of the decoder performing step 1004 in this case, reference may be made to the description of step 704 in the embodiment corresponding to FIG. 7a; the difference is that in step 704, the encoder performs decompression processing through the second neural network according to the second compression information corresponding to the current video frame to obtain the reconstructed frame of the current video frame, whereas in step 1004, the decoder performs decompression processing through the fourth neural network according to the second compression information corresponding to the current video frame to obtain the reconstructed frame of the current video frame.
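Putting the two structures side by side, a sketch of the decoder-side step 1004 with all module names passed in as hypothetical callables: the entropy model of the third network is conditioned on the reference frame, while the convolutional network of the fourth consumes the reference frame directly.

```python
def decompress_step_1004(target_info, network, reference,
                         entropy_decode, decoding_network,
                         convolutional_network):
    if network == "third":
        # The reference frame is only context for the entropy model; the
        # decoding network then needs only the decoded first feature.
        feature = entropy_decode(target_info, context=reference)
        return decoding_network(feature)
    else:  # "fourth"
        # The second feature is entropy-decoded without the reference, but
        # the reconstruction itself combines feature and reference frame.
        feature = entropy_decode(target_info, context=None)
        return convolutional_network(feature, reference)
```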
- FIG. 11 is a schematic flowchart of another method for decompressing a video frame provided by an embodiment of the present application.
- the video frame decompression method provided by this embodiment of the application may include:
- the decoder receives target compression information corresponding to the current video frame, where the target compression information is the first compression information or the second compression information.
- the decoder receives indication information corresponding to the current video frame, where the indication information is used to instruct the first compressed information to be decompressed by the third neural network and the second compressed information to be decompressed by the fourth neural network.
- for the specific implementation of steps 1101 and 1102, reference may be made to the description of steps 1001 and 1002 in the embodiment corresponding to FIG. 10a, which is not repeated here.
- the decoder decompresses the first compressed information of the third video frame through the third neural network to obtain a reconstructed frame of the third video frame.
- for the specific implementation in which the decoder selects the third neural network from the multiple neural networks to decompress the first compression information of the third video frame (that is, "selecting, from the multiple neural networks, the third neural network corresponding to the first compression information of the third video frame"), reference may be made to the description in step 1003 of the embodiment corresponding to FIG. 10a, which is not repeated here.
- the third neural network includes an entropy decoding layer and a decoding network: the entropy decoding layer uses the reference frame of the current video frame to perform the entropy decoding process of the first compression information of the current video frame, and the decoding network uses the first feature of the current video frame to generate the reconstructed frame of the current video frame.
- the first compression information includes the compression information of the first feature of the third video frame; the reference frame of the third video frame is used in the decompression process of the first compression information to obtain the first feature of the third video frame, the first feature of the third video frame is used in the generation process of the reconstructed frame of the third video frame, and both the reconstructed frame of the third video frame and the reference frame of the third video frame are included in the current video sequence. That is, after decompressing the first compression information, the decoder can obtain the reconstructed frame of the third video frame without using the reference frame of the third video frame.
- the meaning of "the first feature of the third video frame” can be understood by referring to the above-mentioned meaning of "the first feature of the current video frame”
- the meaning of "the reference frame of the third video frame” can be understood by referring to the above-mentioned “current video frame”.
- the meaning of the "reference frame of the video frame” is understood, and will not be repeated here.
- the reconstructed frame of the third video frame refers to a video frame corresponding to the third video frame obtained by performing a decompression operation using the first compression information.
- the decoder decompresses the second compression information of the fourth video frame through the fourth neural network, so as to obtain a reconstructed frame of the fourth video frame.
- for the specific implementation in which the decoder selects the fourth neural network from the multiple neural networks to decompress the second compression information of the fourth video frame (that is, "selecting, from the multiple neural networks, the fourth neural network corresponding to the second compression information of the fourth video frame"), reference may be made to the description in step 1003 of the embodiment corresponding to FIG. 10a, which is not repeated here.
- the fourth neural network includes an entropy decoding layer and a convolutional network: entropy decoding is performed on the second compression information through the entropy decoding layer, and the process of generating the reconstructed frame of the current video frame using the reference frame of the current video frame and the second feature of the current video frame is performed through the convolutional network. For a specific implementation in which the decoder decompresses the second compression information of the fourth video frame through the fourth neural network, reference may be made to the description of step 704 in the embodiment corresponding to FIG. 7a, which is not repeated here.
- the second compression information includes the compression information of the second feature of the fourth video frame; the second compression information is used by the decoder to perform a decompression operation to obtain the second feature of the fourth video frame, the reference frame of the fourth video frame and the second feature of the fourth video frame are used in the process of generating the reconstructed frame of the fourth video frame, and both the reconstructed frame of the fourth video frame and the reference frame of the fourth video frame are included in the current video sequence.
- the meaning of "the second feature of the fourth video frame" can be understood with reference to the above-mentioned meaning of "the second feature of the current video frame", and the meaning of "the reference frame of the fourth video frame" can be understood with reference to the above-mentioned meaning of "the reference frame of the current video frame", which are not repeated here.
- the reconstructed frame of the fourth video frame refers to a video frame corresponding to the fourth video frame obtained by performing a decompression operation using the second compression information.
- FIG. 12 is a schematic flowchart of a training method for a video frame compression and decompression system provided by an embodiment of the present application.
- the video frame compression and decompression system training method provided by the embodiment of the present application may include:
- the training device compresses and encodes the first training video frame by using the first neural network, so as to obtain first compression information corresponding to the first training video frame.
- a training data set is pre-stored in the training device, and the training data set includes a plurality of first training video frames.
- for a specific implementation of step 1201, refer to the description of step 801 in the embodiment corresponding to FIG. 8, which is not repeated here.
- the difference is that in step 1201, the training device does not need to perform the step of selecting the target neural network from the first neural network and the second neural network; in other words, in step 1201, the training device does not need to perform the step of selecting the target compression information from the first compression information and the second compression information.
- the training device decompresses the first compression information of the first training video frame by using a third neural network to obtain a first training reconstruction frame.
- for the specific implementation of the training device performing step 1202, reference may be made to the description of step 1103 in the embodiment corresponding to FIG. 11, which is not repeated here. The difference is that, first, the "third video frame" in step 1103 is replaced with the "first training video frame" in this embodiment; second, in step 1202, the training device does not need to perform the step of selecting the target neural network from the third neural network and the fourth neural network.
- the training device trains the first neural network and the third neural network according to the first training video frame, the first training reconstruction frame, the first compression information, and the first loss function until the preset conditions are met.
- the training device may, according to the first training video frame, the first training reconstruction frame, and the first compression information corresponding to the first training video frame, use the first loss function to iteratively train the first neural network and the third neural network until the convergence condition of the first loss function is satisfied.
- the first loss function includes a loss term for the similarity between the first training video frame and the first training reconstruction frame and a loss term for the data size of the first compression information of the first training video frame, where the first training reconstruction frame is the reconstructed frame of the first training video frame.
- the training objective of the first loss function includes increasing the similarity between the first training video frame and the first training reconstruction frame.
- the training objective of the first loss function further includes reducing the size of the first compressed information of the first training video frame.
- the first neural network refers to a neural network used in the process of compressing and encoding video frames; the third neural network refers to a neural network that performs decompression operations based on compression information.
- the training device can calculate the function value of the first loss function according to the first training video frame, the first training reconstruction frame, and the first compression information corresponding to the first training video frame, generate a gradient value according to the function value of the first loss function, and then update the weight parameters of the first neural network and the third neural network in the reverse direction, so as to complete one iteration of training of the first neural network and the third neural network.
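This is the familiar rate-distortion objective of learned codecs. A minimal PyTorch-style sketch follows, assuming mean-squared error as the similarity term, a rate estimate in bits, and a trade-off weight lam; all of these choices and names are illustrative, not taken from the disclosure. The second loss function described later in this embodiment has the same form, applied to the second training video frame and the second compression information.

```python
import torch

def first_loss(frame, reconstruction, bits, lam=0.01):
    """Rate-distortion loss: similarity term + data-size term.

    frame, reconstruction: image tensors; bits: estimated size of the
    first compression information; lam trades distortion against rate.
    """
    distortion = torch.mean((frame - reconstruction) ** 2)  # similarity loss term
    rate = bits / frame.numel()                             # bits per pixel
    return distortion + lam * rate
```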
- the training device compresses and encodes the second training video frame through the second neural network according to the reference frame of the second training video frame to obtain the second compression information corresponding to the second training video frame, where the reference frame of the second training video frame is a video frame processed by the trained first neural network.
- for a specific implementation of the training device performing step 1204, reference may be made to the description of step 802 in the embodiment corresponding to FIG. 8, which is not repeated here.
- the difference is that, first, the "fourth video frame" in step 802 is replaced with the "second training video frame" in this embodiment; second, in step 1204, the training device does not need to The step of selecting the target neural network in the second neural network, or in other words, in step 1204, the training device does not need to perform the step of selecting the target compressed information from the first compressed information and the second compressed information.
- the reference frame of the second training video frame may be an original video frame in the training data set, or may be a video frame processed by the mature first neural network (that is, the first neural network on which the training operation has been performed).
- in one implementation, the training device can input the original reference frame of the second training video frame into the first encoding network of the mature first neural network (that is, the first neural network on which the training operation has been performed) to perform an encoding operation and obtain an encoding result, and then input the encoding result into the first decoding network of the mature third neural network (that is, the third neural network on which the training operation has been performed) to perform a decoding operation on the encoding result and obtain the processed reference frame of the second training video frame. Further, the training device inputs the processed reference frame of the second training video frame and the second training video frame into the second neural network, so as to generate the second compression information corresponding to the second training video frame through the second neural network.
- in another implementation, the training device may input the original reference frame of the second training video frame into the mature first neural network, so as to generate, through the mature first neural network, the first compression information corresponding to the original reference frame of the second training video frame, and then use the mature third neural network to perform a decompression operation according to that first compression information to obtain the processed reference frame of the second training video frame.
- the training device inputs the processed reference frame of the second training video frame and the second training video frame into the second neural network, so as to generate the second compression information corresponding to the second training video frame through the second neural network.
- since in the execution stage the reference frame used by the second neural network may have been processed by the first neural network, using reference frames processed by the first neural network to perform the training operation on the second neural network is conducive to maintaining consistency between the training stage and the execution stage, so as to improve the accuracy of the execution stage.
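A sketch of this reference-frame preparation under the second implementation above, with hypothetical trained-model objects standing in for the mature networks:

```python
def processed_reference(raw_reference, mature_first_nn, mature_third_nn):
    """Pass a raw reference frame through the trained compress/decompress
    pair so the second network trains on the kind of (slightly degraded)
    reference it will actually see at execution time."""
    info = mature_first_nn.compress(raw_reference)
    return mature_third_nn.decompress(info)

def training_pair(second_training_frame, raw_reference,
                  mature_first_nn, mature_third_nn):
    ref = processed_reference(raw_reference, mature_first_nn, mature_third_nn)
    return second_training_frame, ref
```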
- the training device decompresses the second compression information of the second training video frame through the fourth neural network to obtain the second training reconstruction frame.
- for a specific implementation of the training device performing step 1205, reference may be made to the description of step 1104 in the embodiment corresponding to FIG. 11, which is not repeated here. The difference is that, first, the "fourth video frame" in step 1104 is replaced with the "second training video frame" in this embodiment; second, in step 1205, the training device does not need to perform the step of selecting the target neural network from the third neural network and the fourth neural network.
- the training device trains the second neural network and the fourth neural network according to the second training video frame, the second training reconstruction frame, the second compression information, and the second loss function until the preset conditions are met.
- the training device may, according to the second training video frame, the second training reconstruction frame, and the second compression information corresponding to the second training video frame, use the second loss function to iteratively train the second neural network and the fourth neural network until the convergence condition of the second loss function is satisfied.
- the second loss function includes a loss term for the similarity between the second training video frame and the second training reconstruction frame and a loss term for the data size of the second compression information of the second training video frame, where the second training reconstruction frame is the reconstructed frame of the second training video frame.
- the training objective of the second loss function includes increasing the similarity between the second training video frame and the second training reconstruction frame.
- the training objective of the second loss function further includes reducing the size of the second compression information of the second training video frame.
- the second neural network refers to a neural network used in the process of compressing and encoding video frames; the fourth neural network refers to a neural network that performs decompression operations based on compressed information.
- the training device can calculate the function value of the second loss function according to the second training video frame, the second training reconstruction frame, and the second compression information corresponding to the second training video frame, generate a gradient value according to the function value of the second loss function, and then update the weight parameters of the second neural network and the fourth neural network in the reverse direction, so as to complete one iteration of training of the second neural network and the fourth neural network.
- the independent neural network module refers to a neural network module with independent functions.
- the first encoding network in the first neural network is an independent neural network module, and the first decoding network in the second neural network is an independent neural network module.
- the parameters of the trained first neural network and the trained third neural network can be used to initialize the parameters of the second neural network and the fourth neural network; that is, the parameters of the trained first neural network and the trained third neural network are assigned to the same neural network modules described above in the second neural network and the fourth neural network. During the training process of the second neural network and the fourth neural network, the parameters of these same neural network modules are kept unchanged, and only the parameters of the remaining neural network modules in the second neural network and the fourth neural network are adjusted, which reduces the total duration of the training process of the second neural network and the fourth neural network and improves their training efficiency.
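In PyTorch-like terms, this initialize-and-freeze strategy might look as follows; the shared module names are hypothetical placeholders for whichever modules the two networks actually have in common.

```python
import torch.nn as nn

def init_and_freeze(second_nn: nn.Module, trained_first_nn: nn.Module,
                    shared_names=("first_encoding_network",
                                  "first_decoding_network")):
    """Copy shared-module weights from the trained network, then freeze
    them so stage-two training only updates the remaining modules."""
    for name in shared_names:
        src = getattr(trained_first_nn, name, None)
        dst = getattr(second_nn, name, None)
        if src is None or dst is None:
            continue  # module not shared in this configuration
        dst.load_state_dict(src.state_dict())
        for p in dst.parameters():
            p.requires_grad = False
```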
- taking the case where the second neural network is used to perform the compression operation on a video frame as an example, the experimental data are shown in Table 1 below.
- FIG. 13 is a system architecture diagram of the video encoding and decoding system provided by an embodiment of the present application, that is, an exemplary schematic block diagram of the video codec system 10; the video encoder 20 (or encoder 20 for short) and the video decoder 30 (or decoder 30 for short) in the video codec system 10 represent examples of devices that may be used to perform the techniques of this application.
- the video codec system 10 includes a source device 12 for supplying encoded image data 21 such as encoded images to a destination device 14 for decoding the encoded image data 21.
- the source device 12 includes an encoder 20 and may additionally, that is, optionally, include an image source 16, a preprocessor (or preprocessing unit) 18 such as an image preprocessor, and a communication interface (or communication unit) 22.
- the image source 16 may include or be any type of image capture device for capturing real-world images and the like, and/or any type of image generation device, such as a computer graphics processor for generating computer-animated images, or any type of device for acquiring and/or providing real-world images or computer-generated images (for example, screen content, virtual reality (VR) images), and/or any combination thereof (for example, augmented reality (AR) images).
- the image source may be any type of memory or storage that stores any of the above-mentioned images.
- the image (or image data 17 ) may also be referred to as the original image (or original image data) 17 .
- the preprocessor 18 is used to receive the (raw) image data 17 and preprocess the image data 17 to obtain a preprocessed image (or preprocessed image data) 19 .
- the preprocessing performed by the preprocessor 18 may include trimming, color format conversion (eg, from RGB to YCbCr), toning, or denoising. It is understood that the preprocessing unit 18 may be an optional component.
- a video encoder (or encoder) 20 is used to receive preprocessed image data 19 and provide encoded image data 21 .
- the communication interface 22 in the source device 12 may be used to receive the encoded image data 21 and send the encoded image data 21 (or any other processed version thereof) over the communication channel 13 to another device such as the destination device 14, or to any other device, for storage or direct reconstruction.
- the destination device 14 includes a decoder 30 and may additionally, that is, optionally, include a communication interface (or communication unit) 28, a post-processor (or post-processing unit) 32, and a display device 34.
- the communication interface 28 in the destination device 14 is used to receive the encoded image data 21 (or any other processed version thereof) directly from the source device 12 or from any other source device such as a storage device (for example, an encoded-image-data storage device), and to supply the encoded image data 21 to the decoder 30.
- the communication interface 22 and the communication interface 28 may be used to send or receive the encoded image data (or encoded data) 21 through a direct communication link between the source device 12 and the destination device 14, such as a direct wired or wireless connection, or through any type of network, such as a wired network, a wireless network, or any combination thereof, or any type of private network and public network or any combination thereof.
- the communication interface 22 may be used to encapsulate the encoded image data 21 into a suitable format such as a message, and/or to process the encoded image data using any type of transfer encoding or processing, for transmission over a communication link or communication network.
- the communication interface 28 corresponds to the communication interface 22 and may be used, for example, to receive transmission data and process the transmission data using any type of corresponding transmission decoding or processing and/or decapsulation to obtain encoded image data 21 .
- both the communication interface 22 and the communication interface 28 may be configured as a one-way communication interface, as indicated by the arrow in FIG. 13 pointing from the source device 12 to the destination device 14 along the corresponding communication channel 13, or as a two-way communication interface, and may be used to send and receive messages and the like, so as to establish a connection, acknowledge, and exchange any other information related to the communication link and/or data transfer, such as the transfer of encoded image data.
- the video decoder (or decoder) 30 is configured to receive the encoded image data 21 and provide decoded image data 31, where the decoded image data may also be referred to as reconstructed image data, a reconstructed frame of a video frame, a video frame, or other names, and refers to the image data obtained by performing a decompression operation based on the encoded image data 21.
- the post-processor 32 is configured to perform post-processing on the decoded image data 31 such as the decoded image to obtain post-processed image data 33 such as the post-processed image.
- the post-processing performed by the post-processor 32 may include, for example, color format conversion (for example, from YCbCr to RGB), toning, trimming, or resampling, or any other processing used to prepare the decoded image data 31 for display by the display device 34.
- a display device 34 is used to receive post-processed image data 33 to display the image to a user or viewer or the like.
- Display device 34 may be or include any type of display for representing the reconstructed image, eg, an integrated or external display screen or display.
- the display screen may include a liquid crystal display (LCD), an organic light emitting diode (OLED) display, a plasma display, a projector, a micro LED display, a liquid crystal on silicon (LCoS) display, a digital light processor (DLP), or any other type of display.
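The end-to-end data path of FIG. 13 can be summarized as a simple pipeline; the stage names below mirror the components just described and are purely illustrative placeholders, not the disclosed implementation.

```python
def codec_pipeline(raw_image, preprocess, encode, channel, decode, postprocess):
    """Source device: image source -> preprocessor 18 -> encoder 20;
    communication channel 13; destination device: decoder 30 ->
    post-processor 32 -> display device 34."""
    preprocessed = preprocess(raw_image)   # e.g., RGB -> YCbCr, denoising
    encoded = encode(preprocessed)         # encoded image data 21
    received = channel(encoded)            # transfer over channel 13
    decoded = decode(received)             # decoded image data 31
    return postprocess(decoded)            # ready for display device 34
```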
- the video encoding and decoding system 10 further includes a training engine 25, and the training engine 25 is used to train the neural networks in the encoder 20 or the decoder 30, that is, the first neural network, the second neural network, the third neural network, and the fourth neural network.
- the training data may be stored in a database (not shown), and the training engine 25 trains the neural network based on the training data. It should be noted that the embodiments of the present application do not limit the source of the training data, for example, the training data may be obtained from the cloud or other places to perform neural network training.
- the neural network trained by the training engine 25 can be applied to the video codec system 10 and the video codec system 40, for example, applied to the source device 12 (such as the encoder 20) or the destination device 14 (such as the decoder shown in FIG. 13 ) 30).
- the training engine 25 can train the above-mentioned neural network in the cloud, and then the video encoding and decoding system 10 downloads and uses the neural network from the cloud.
- although FIG. 13 shows the source device 12 and the destination device 14 as independent devices, a device embodiment may also include both the source device 12 and the destination device 14, or the functions of both, that is, the source device 12 or corresponding functionality and the destination device 14 or corresponding functionality at the same time. In these embodiments, the source device 12 or corresponding functionality and the destination device 14 or corresponding functionality may be implemented using the same hardware and/or software, by separate hardware and/or software, or by any combination thereof.
- as will be apparent to the skilled person based on the description, the existence and (exact) division of the different units or functions in the source device 12 and/or the destination device 14 shown in FIG. 13 may vary based on the actual device and application.
- FIG. 14 is another system architecture diagram of a video encoding and decoding system provided by an embodiment of the present application.
- the encoder 20 (for example, the video encoder 20) or the decoder 30 (for example, the video decoder 30), or both, may be implemented by a processing circuit as shown in FIG. 14, such as one or more microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), discrete logic, hardware, special-purpose processors for video encoding, or any combination thereof.
- the encoder 20 may be implemented by the processing circuit 46 to include the various modules discussed with reference to the encoder 20 of FIG. 14 and/or any other encoder system or subsystem described herein.
- Decoder 30 may be implemented by processing circuit 46 to include the various modules discussed with reference to decoder 30 of FIG. 15 and/or any other decoder system or subsystem described herein.
- the processing circuit 46 may be used to perform the various operations discussed below. As shown in FIG. 16, if parts of the techniques are implemented in software, a device may store the instructions of the software in a suitable non-transitory computer-readable storage medium and execute the instructions in hardware using one or more processors, thereby implementing the techniques of this application.
- the video encoder 20 and the video decoder 30 may be integrated in a single device as part of a combined codec (encoder/decoder, CODEC), as shown in FIG. 14.
- the source device 12 and the destination device 14 may include any of a variety of devices, including any type of handheld or stationary device, such as a notebook or laptop computer, a mobile phone, a smartphone, a tablet computer, a camera, a desktop computer, a set-top box, a television, a display device, a digital media player, a video game console, a video streaming device (for example, a content service server or a content distribution server), a broadcast receiving device, or a broadcast transmitting device, and may use no operating system or any type of operating system.
- source device 12 and destination device 14 may be equipped with components for wireless communication.
- source device 12 and destination device 14 may be wireless communication devices.
- the video codec system 10 shown in FIG. 13 is merely exemplary, and the techniques provided in this application may be applicable to video coding settings (for example, video encoding or video decoding) that do not necessarily include any data communication between the encoding device and the decoding device.
- data is retrieved from local storage, sent over a network, and so on.
- the video encoding device may encode and store the data in memory, and/or the video decoding device may retrieve and decode the data from the memory.
- encoding and decoding are performed by devices that do not communicate with each other but merely encode data to and/or retrieve and decode data from memory.
- the video codec system 40 may include an imaging device 41, the video encoder 20, the video decoder 30 (and/or a video codec implemented by the processing circuit 46), an antenna 42, one or more processors 43, one or more memory stores 44, and/or a display device 45.
- the imaging device 41, the antenna 42, the processing circuit 46, the video encoder 20, the video decoder 30, the processor 43, the memory storage 44 and/or the display device 45 can communicate with each other.
- video codec system 40 may include only video encoder 20 or only video decoder 30 .
- antenna 42 may be used to transmit or receive an encoded bitstream of video data.
- display device 45 may be used to present video data.
- Processing circuitry 46 may include application-specific integrated circuit (ASIC) logic, graphics processors, general purpose processors, and the like.
- the video codec system 40 may also include an optional processor 43, which may similarly include application-specific integrated circuit (ASIC) logic, a graphics processor, a general-purpose processor, and the like.
- the memory store 44 may be any type of memory, such as volatile memory (for example, static random access memory (SRAM), dynamic random access memory (DRAM), etc.) or non-volatile memory (for example, flash memory, etc.).
- memory storage 44 may be implemented by cache memory.
- processing circuitry 46 may include memory (eg, cache memory, etc.) for implementing image buffers, and the like.
- the video encoder 20 implemented by a logic circuit may include an image buffer (for example, implemented by the processing circuit 46 or the memory store 44) and a graphics processing unit (for example, implemented by the processing circuit 46).
- the graphics processing unit may be communicatively coupled to the image buffer.
- the graphics processing unit may include the video encoder 20 implemented by processing circuitry 46 to embody the various modules discussed with reference to the video encoder 20 of FIG. 14 and/or any other encoder system or subsystem described herein.
- Logic circuits may be used to perform the various operations discussed herein.
- video decoder 30 may be implemented by processing circuitry 46 in a similar manner to embody the various modules discussed with reference to the video decoder 30 of FIG. 14 and/or any other decoder system or subsystem described herein.
- the logic circuit-implemented video decoder 30 may include an image buffer (e.g., implemented by processing circuitry 46 or memory 44) and a graphics processing unit (e.g., implemented by processing circuitry 46).
- the graphics processing unit may be communicatively coupled to the image buffer.
- the graphics processing unit may include video decoder 30 implemented by processing circuitry 46 .
- antenna 42 may be used to receive an encoded bitstream of video data.
- the encoded bitstream may include data related to encoded video frames as discussed herein, such as data related to coding partitions (e.g., transform coefficients or quantized transform coefficients, optional indicators as discussed, and/or data defining the coding partitions), indicators, index values, mode selection data, and the like.
- Video codec system 40 may also include video decoder 30 coupled to antenna 42 for decoding the encoded bitstream.
- Display device 45 is used to present video frames.
- video decoder 30 may be used to perform the opposite process.
- video decoder 30 may be operable to receive and parse such syntax elements and decode the associated video data accordingly.
- video encoder 20 may entropy encode the syntax elements into an encoded video bitstream. In such instances, video decoder 30 may parse such syntax elements and decode related video data accordingly.
- the codec process described in this application exists in most video codecs, such as those corresponding to H.263, H.264, MPEG-2, MPEG-4, VP8, VP9, and AI-based end-to-end image coding.
- FIG. 15 is a schematic diagram of a video coding apparatus 400 provided by an embodiment of this application.
- Video coding apparatus 400 is suitable for implementing the disclosed embodiments described herein.
- the video coding apparatus 400 may be a decoder, such as the video decoder 30 in FIG. 14 , or an encoder, such as the video encoder 20 in FIG. 14 .
- the video coding apparatus 400 includes: an ingress port 410 (or input port 410) and a receiver unit (Rx) 420 for receiving data; a processor, logic unit, or central processing unit (CPU) 430 for processing data, where the processor 430 may be, for example, a neural network processor 430; a transmitter unit (Tx) 440 and an egress port 450 (or output port 450) for transmitting data; and a memory 460 for storing data.
- the video coding apparatus 400 may also include optical-to-electrical (OE) components and electrical-to-optical (EO) components coupled to the ingress port 410, the receiver unit 420, the transmitter unit 440, and the egress port 450, serving as the egress or ingress for optical or electrical signals.
- the processor 430 is implemented by hardware and software.
- Processor 430 may be implemented as one or more processor chips, cores (e.g., multi-core processors), FPGAs, ASICs, or DSPs.
- the processor 430 communicates with the ingress port 410, the receiver unit 420, the transmitter unit 440, the egress port 450, and the memory 460.
- the processor 430 includes a coding module 470 (e.g., a neural network (NN) based coding module 470).
- the coding module 470 implements the embodiments disclosed above. For example, the coding module 470 performs, processes, prepares, or provides various coding operations.
- alternatively, the coding module 470 is implemented as instructions stored in the memory 460 and executed by the processor 430.
- Memory 460 includes one or more disks, tape drives, and solid-state drives, and may serve as an overflow data storage device to store programs when such programs are selected for execution and to store instructions and data read during program execution.
- Memory 460 may be volatile and/or non-volatile, and may be read-only memory (ROM), random access memory (RAM), ternary content-addressable memory (TCAM), and/or static random-access memory (SRAM).
- FIG. 16 is a simplified block diagram of an apparatus 500 provided by an exemplary embodiment.
- the apparatus 500 can be used as either or both of the source device 12 and the destination device 14 in FIG. 13 .
- the processor 502 in the apparatus 500 may be a central processing unit.
- the processor 502 may be any other type of device or devices, existing or to be developed in the future, capable of manipulating or processing information.
- although the disclosed implementations may be practiced with a single processor, such as the processor 502 shown, advantages in speed and efficiency may be achieved by using more than one processor.
- the memory 504 in the apparatus 500 may be a read-only memory (ROM) device or a random access memory (RAM) device. Any other suitable type of storage device may be used as the memory 504.
- Memory 504 may include code and data 506 accessed by processor 502 via bus 512 .
- the memory 504 may also include an operating system 508 and application programs 510 including at least one program that allows the processor 502 to perform the methods described herein.
- applications 510 may include applications 1 through N, and also include video coding applications that perform the methods described herein.
- Apparatus 500 may also include one or more output devices, such as display 518 .
- display 518 may be a touch-sensitive display that combines a display with touch-sensitive elements that may be used to sense touch input.
- Display 518 may be coupled to processor 502 through bus 512 .
- although the bus 512 in the apparatus 500 is described herein as a single bus, the bus 512 may include multiple buses. Additionally, secondary storage may be directly coupled to the other components of the apparatus 500 or accessed through a network, and may include a single integrated unit, such as one memory card, or multiple units, such as multiple memory cards. Accordingly, the apparatus 500 may be implemented in a wide variety of configurations.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Probability & Statistics with Applications (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Claims (21)
- A video frame compression method, characterized in that the method comprises: determining a target neural network from a plurality of neural networks according to a network selection policy, the plurality of neural networks comprising a first neural network and a second neural network; and performing compression encoding on a current video frame through the target neural network to obtain compression information corresponding to the current video frame; wherein, if the compression information is obtained through the first neural network, the compression information comprises first compression information of a first feature of the current video frame, and a reference frame of the current video frame is used in the compression process of the first feature of the current video frame; and if the compression information is obtained through the second neural network, the compression information comprises second compression information of a second feature of the current video frame, and the reference frame of the current video frame is used in the generation process of the second feature of the current video frame.
- The method according to claim 1, characterized in that the first neural network comprises an encoding network and an entropy encoding layer, wherein the first feature of the current video frame is obtained from the current video frame through the encoding network, and an entropy encoding process of the first feature of the current video frame is performed through the entropy encoding layer using the reference frame of the current video frame, so as to output the first compression information; and/or the second neural network comprises a convolutional network and an entropy encoding layer, the convolutional network comprising a plurality of convolutional layers and rectified linear unit (ReLU) activation layers, wherein a residual of the current video frame is obtained through the convolutional network using the reference frame of the current video frame, and entropy encoding processing is performed on the residual of the current video frame through the entropy encoding layer to output the second compression information, the second feature being the residual. (The residual branch is illustrated in the first sketch following the claims.)
- The method according to claim 1 or 2, characterized in that the network selection policy is related to any one or more of the following factors: position information of the current video frame, or the amount of data carried by the current video frame.
- The method according to claim 3, characterized in that the determining a target neural network from a plurality of neural networks according to a network selection policy comprises: selecting the target neural network from the plurality of neural networks according to position information of the current video frame in the current video sequence, the position information indicating that the current video frame is the X-th frame of the current video sequence; or, the determining a target neural network from a plurality of neural networks according to a network selection policy comprises: selecting the target neural network from the plurality of neural networks according to an attribute of the current video frame, wherein the attribute of the current video frame indicates the amount of data carried by the current video frame, and the attribute of the current video frame comprises any one or a combination of more of the following: the entropy, contrast, and saturation of the current video frame. (Both selection rules are illustrated in the second sketch following the claims.)
- The method according to any one of claims 1 to 4, characterized in that the method further comprises: generating and sending indication information corresponding to the compression information, the indication information indicating that the compression information is obtained through the target neural network among the first neural network and the second neural network.
- The method according to any one of claims 1 to 5, characterized in that, if the target neural network is the first neural network, the performing compression encoding on the current video frame through the target neural network to obtain compression information corresponding to the current video frame comprises: obtaining the first feature of the current video frame from the current video frame through the encoding network; predicting the first feature of the current video frame through the entropy encoding layer according to the reference frame of the current video frame to generate a predicted feature of the current video frame, the predicted feature of the current video frame being a prediction result of the first feature of the current video frame; generating a probability distribution of the first feature of the current video frame through the entropy encoding layer according to the predicted feature of the current video frame; and performing entropy encoding on the first feature of the current video frame through the entropy encoding layer according to the probability distribution of the first feature of the current video frame to obtain the first compression information. (This prediction-then-entropy-coding chain is illustrated in the third sketch following the claims.)
- A video frame compression method, characterized in that the method comprises: performing compression encoding on a current video frame through a first neural network to obtain first compression information of a first feature of the current video frame, a reference frame of the current video frame being used in the compression process of the first feature of the current video frame; generating a first video frame through the first neural network, the first video frame being a reconstructed frame of the current video frame; performing compression encoding on the current video frame through a second neural network to obtain second compression information of a second feature of the current video frame, the reference frame of the current video frame being used in the generation process of the second feature of the current video frame; generating a second video frame through the second neural network, the second video frame being a reconstructed frame of the current video frame; and determining, according to the first compression information, the first video frame, the second compression information, and the second video frame, compression information corresponding to the current video frame, wherein the determined compression information is obtained through the first neural network and is the first compression information; or the determined compression information is obtained through the second neural network and is the second compression information. (One way to perform this selection is illustrated in the fourth sketch following the claims.)
- The method according to claim 7, characterized in that the first neural network comprises an encoding network and an entropy encoding layer, wherein the first feature of the current video frame is obtained from the current video frame through the encoding network, and entropy encoding is performed on the first feature of the current video frame through the entropy encoding layer to output the first compression information; and/or the second neural network comprises a convolutional network and an entropy encoding layer, the convolutional network comprising a plurality of convolutional layers and rectified linear unit (ReLU) activation layers, wherein a residual of the current video frame is obtained through the convolutional network using the reference frame of the current video frame, and entropy encoding processing is performed on the residual of the current video frame through the entropy encoding layer to output the second compression information.
- A video frame compression method, characterized in that the method comprises: performing compression encoding on a third video frame through a first neural network to obtain first compression information corresponding to the third video frame, the first compression information comprising compression information of a first feature of the third video frame, a reference frame of the third video frame being used in the compression process of the first feature of the third video frame; and performing compression encoding on a fourth video frame through a second neural network to obtain second compression information corresponding to the fourth video frame, the second compression information comprising compression information of a second feature of the fourth video frame, a reference frame of the fourth video frame being used in the generation process of the second feature of the fourth video frame, the third video frame and the fourth video frame being different video frames in a same video sequence.
- The method according to claim 9, characterized in that the first neural network comprises an encoding network and an entropy encoding layer, wherein the first feature of the current video frame is obtained from the current video frame through the encoding network, and entropy encoding is performed on the first feature of the current video frame through the entropy encoding layer to output the first compression information; and/or the second neural network comprises a convolutional network and an entropy encoding layer, the convolutional network comprising a plurality of convolutional layers and rectified linear unit (ReLU) activation layers, wherein a residual of the current video frame is obtained through the convolutional network using the reference frame of the current video frame, and entropy encoding processing is performed on the residual of the current video frame through the entropy encoding layer to output the second compression information.
- A video frame decompression method, characterized in that the method comprises: obtaining compression information of a current video frame; selecting a target neural network corresponding to the current video frame from a plurality of neural networks, the plurality of neural networks comprising a third neural network and a fourth neural network; and performing a decompression operation through the target neural network according to the compression information to obtain a reconstructed frame of the current video frame; wherein, if the target neural network is the third neural network, the compression information comprises first compression information of a first feature of the current video frame, a reference frame of the current video frame is used in the decompression process of the first compression information to obtain the first feature of the current video frame, and the first feature of the current video frame is used in the generation process of the reconstructed frame of the current video frame; and if the target neural network is the fourth neural network, the compression information comprises second compression information of a second feature of the current video frame, the second compression information is used by the decoder to perform a decompression operation to obtain the second feature of the current video frame, and the reference frame of the current video frame and the second feature of the current video frame are used in the generation process of the reconstructed frame of the current video frame.
- The method according to claim 11, characterized in that the third neural network comprises an entropy decoding layer and a decoding network, wherein an entropy decoding process of the first compression information of the current video frame is performed through the entropy decoding layer using the reference frame of the current video frame, and the reconstructed frame of the current video frame is generated through the decoding network using the first feature of the current video frame; and/or the fourth neural network comprises an entropy decoding layer and a convolutional network, wherein entropy decoding is performed on the second compression information through the entropy decoding layer, and the generation process of the reconstructed frame of the current video frame is performed through the convolutional network using the reference frame of the current video frame and the second feature of the current video frame.
- The method according to claim 11 or 12, characterized in that the method further comprises: obtaining indication information corresponding to the compression information; and the selecting a target neural network corresponding to the current video frame from a plurality of neural networks comprises: determining the target neural network from the plurality of neural networks according to the indication information. (Decoder-side routing by this indication information is illustrated in the fifth sketch following the claims.)
- A video frame decompression method, characterized in that the method comprises: decompressing first compression information of a third video frame through a third neural network to obtain a reconstructed frame of the third video frame, the first compression information comprising compression information of a first feature of the third video frame, a reference frame of the third video frame being used in the decompression process of the first compression information to obtain the first feature of the third video frame, the first feature of the third video frame being used in the generation process of the reconstructed frame of the third video frame; and decompressing second compression information of a fourth video frame through a fourth neural network to obtain a reconstructed frame of the fourth video frame, the second compression information comprising compression information of a second feature of the fourth video frame, the second compression information being used by the decoder to perform a decompression operation to obtain the second feature of the fourth video frame, a reference frame of the fourth video frame and the second feature of the fourth video frame being used in the generation process of the reconstructed frame of the fourth video frame.
- The method according to claim 14, characterized in that the third neural network comprises an entropy decoding layer and a decoding network, wherein an entropy decoding process of the first compression information of the current video frame is performed through the entropy decoding layer using the reference frame of the current video frame, and the reconstructed frame of the current video frame is generated through the decoding network using the first feature of the current video frame; and/or the fourth neural network comprises an entropy decoding layer and a convolutional network, wherein entropy decoding is performed on the second compression information through the entropy decoding layer, and the generation process of the reconstructed frame of the current video frame is performed through the convolutional network using the reference frame of the current video frame and the second feature of the current video frame.
- An encoder, characterized by comprising a processing circuit configured to perform the method according to any one of claims 1 to 10.
- A decoder, characterized by comprising a processing circuit configured to perform the method according to any one of claims 11 to 15.
- A computer program product, characterized by comprising program code which, when executed on a computer or a processor, is used to perform the method according to any one of claims 1 to 15.
- An encoder, characterized by comprising: one or more processors; and a non-transitory computer-readable storage medium coupled to the processors and storing program instructions for execution by the processors, wherein the program instructions, when executed by the processors, cause the encoder to perform the method according to any one of claims 1 to 10.
- A decoder, characterized by comprising: one or more processors; and a non-transitory computer-readable storage medium coupled to the processors and storing program instructions for execution by the processors, wherein the program instructions, when executed by the processors, cause the decoder to perform the method according to any one of claims 11 to 15.
- A non-transitory computer-readable storage medium, characterized by comprising program code which, when executed by a computer device, is used to perform the method according to any one of claims 1 to 15.
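The claims above describe the two compression branches in prose; the sketches below illustrate them in code. First, a minimal sketch of the convolutional residual branch recited in claims 2, 8, and 10: a stack of convolutional and ReLU layers that predicts the current frame from the reference frame and emits the residual as the second feature. This is an illustration under assumptions; the channel width, kernel sizes, and layer count are not taken from the application.

```python
# Sketch of the second network's residual branch (claims 2/8/10).
# Layer widths and kernel sizes are illustrative assumptions.
import torch
import torch.nn as nn

class ResidualBranch(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, channels, kernel_size=3, padding=1),  # current + reference frame, stacked
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3, kernel_size=3, padding=1),  # prediction of the current frame
        )

    def forward(self, frame: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
        prediction = self.net(torch.cat([frame, reference], dim=1))
        return frame - prediction  # the "second feature": the residual to be entropy-coded

# Usage: residual for a 64x64 RGB frame against its reference frame.
res = ResidualBranch()(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
```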
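Second, a minimal sketch of the network selection policy of claims 3 and 4, assuming a fixed period X for the position rule and a Shannon-entropy threshold as the data-amount proxy for the attribute rule; the period and the threshold are illustrative assumptions, and contrast or saturation could stand in for entropy.

```python
# Sketch of the network selection policy (claims 3 and 4).
# The period and the entropy threshold are illustrative assumptions.
import numpy as np

def select_network(frame: np.ndarray, frame_index: int,
                   period: int = 4, entropy_threshold: float = 6.0) -> str:
    # Position rule: every X-th frame of the sequence uses the first network.
    if frame_index % period == 0:
        return "first"
    # Attribute rule: estimate the amount of data the frame carries via the
    # Shannon entropy of its 8-bit pixel values.
    counts, _ = np.histogram(frame, bins=256, range=(0, 256))
    p = counts[counts > 0] / counts.sum()
    entropy = float(-(p * np.log2(p)).sum())
    return "first" if entropy > entropy_threshold else "second"

# Usage with a random 8-bit frame at position 5 in the sequence.
print(select_network(np.random.randint(0, 256, (64, 64)), frame_index=5))
```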
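Third, claim 6 chains feature extraction, prediction from the reference frame, a probability distribution derived from that prediction, and entropy coding against the distribution. The sketch below replaces a real arithmetic coder with the ideal code length of the quantized feature and assumes a Gaussian entropy model; both substitutions are assumptions, not details fixed by the application.

```python
# Sketch of claim 6's entropy-coding chain: the predicted feature
# parameterizes a probability distribution (assumed Gaussian here), and
# the ideal code length under it stands in for an arithmetic coder.
import torch

def estimated_bits(feature: torch.Tensor, predicted_feature: torch.Tensor,
                   scale: float = 1.0) -> torch.Tensor:
    q = torch.round(feature)  # quantized first feature
    model = torch.distributions.Normal(predicted_feature, scale)
    # Probability mass of each quantized symbol under the prediction.
    pmf = model.cdf(q + 0.5) - model.cdf(q - 0.5)
    return -torch.log2(pmf.clamp_min(1e-9)).sum()  # total bits under an ideal coder

# Usage: a feature map and a prediction of it derived from the reference frame.
feat, pred = torch.randn(1, 8, 16, 16), torch.randn(1, 8, 16, 16)
print(float(estimated_bits(feat, pred)))
```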
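Fourth, claim 7 compresses the same frame through both networks and keeps one result. One natural reading, sketched below, is a per-frame rate-distortion comparison; the Lagrangian cost, the MSE distortion measure, and the multiplier value are all assumptions rather than details given in the claim.

```python
# Sketch of one reading of claim 7: run both networks and keep whichever
# compression has the lower rate-distortion cost. The multiplier lam and
# the MSE distortion measure are illustrative assumptions.
import torch

def choose_compression(frame, first_net, second_net, lam: float = 1e-4):
    bits1, recon1 = first_net(frame)   # each branch returns (bit cost, reconstruction)
    bits2, recon2 = second_net(frame)
    cost1 = torch.mean((frame - recon1) ** 2) + lam * bits1
    cost2 = torch.mean((frame - recon2) ** 2) + lam * bits2
    return ("first", bits1) if cost1 <= cost2 else ("second", bits2)

# Usage with stub branches that return a fixed bit cost and a rough reconstruction.
f = torch.rand(1, 3, 64, 64)
stub = lambda x: (torch.tensor(1000.0), x * 0.9)
print(choose_compression(f, stub, stub)[0])
```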
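Finally, on the decompression side, claims 11 and 13 route the compression information to the third or fourth neural network according to the accompanying indication information. A minimal sketch follows, assuming the indication information is a single bit and that each network is exposed as a callable; both assumptions are for illustration only.

```python
# Sketch of decoder-side routing (claims 11 and 13). The one-bit
# indication flag and the callable network interfaces are assumptions.
def decompress_frame(compression_info, indication_bit: int,
                     third_net, fourth_net, reference_frame):
    if indication_bit == 0:
        # Third network: entropy-decode the first feature with the help of
        # the reference frame, then decode it into the reconstructed frame.
        return third_net(compression_info, reference_frame)
    # Fourth network: entropy-decode the second feature (the residual) and
    # combine it with a prediction derived from the reference frame.
    return fourth_net(compression_info, reference_frame)
```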
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP21890702.0A EP4231644A4 (en) | 2020-11-13 | 2021-08-11 | METHOD AND APPARATUS FOR VIDEO FRAME COMPRESSION AND METHOD AND APPARATUS FOR VIDEO FRAME DECOMPRESSION |
- JP2023528362A JP2023549210A (ja) | 2020-11-13 | 2021-08-11 | Video frame compression method, video frame decompression method, and apparatus |
- CN202180076647.0A CN116918329A (zh) | 2020-11-13 | 2021-08-11 | Video frame compression and video frame decompression method and apparatus |
US18/316,750 US20230281881A1 (en) | 2020-11-13 | 2023-05-12 | Video Frame Compression Method, Video Frame Decompression Method, and Apparatus |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011271217.8 | 2020-11-13 | ||
- CN202011271217.8A CN114501031B (zh) Compression encoding and decompression method and apparatus
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US18/316,750 Continuation US20230281881A1 (en) | 2020-11-13 | 2023-05-12 | Video Frame Compression Method, Video Frame Decompression Method, and Apparatus |
Publications (1)
Publication Number | Publication Date |
---|---|
- WO2022100173A1 (zh) | 2022-05-19 |
Family
ID=81491074
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
- PCT/CN2021/107500 WO2022100140A1 (zh) Compression encoding and decompression method and apparatus | 2020-11-13 | 2021-07-21
- PCT/CN2021/112077 WO2022100173A1 (zh) Video frame compression and video frame decompression method and apparatus | 2020-11-13 | 2021-08-11
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/107500 WO2022100140A1 (zh) | 2020-11-13 | 2021-07-21 | 一种压缩编码、解压缩方法以及装置 |
Country Status (5)
Country | Link |
---|---|
US (1) | US20230281881A1 (zh) |
EP (1) | EP4231644A4 (zh) |
JP (1) | JP2023549210A (zh) |
CN (2) | CN114501031B (zh) |
WO (2) | WO2022100140A1 (zh) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- CN115529457B (zh) * | 2022-09-05 | 2024-05-14 | Tsinghua University | Video compression method and apparatus based on deep learning |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- FR2646575A1 (fr) * | 1989-04-26 | 1990-11-02 | Labo Electronique Physique | Method and structure for data compression |
- KR102124714B1 (ko) * | 2015-09-03 | 2020-06-19 | MediaTek Inc. | Method and apparatus of neural network based processing in video coding |
- US11593632B2 (en) * | 2016-12-15 | 2023-02-28 | WaveOne Inc. | Deep learning based on image encoding and decoding |
- CN107197260B (zh) * | 2017-06-12 | 2019-09-13 | Graduate School at Shenzhen, Tsinghua University | Post-filtering method for video coding based on convolutional neural network |
- CN107396124B (zh) * | 2017-08-29 | 2019-09-20 | Nanjing University | Video compression method based on deep neural network |
- CN111641832B (zh) * | 2019-03-01 | 2022-03-25 | Hangzhou Hikvision Digital Technology Co., Ltd. | Encoding method, decoding method, apparatus, electronic device, and storage medium |
- CN111083494A (zh) * | 2019-12-31 | 2020-04-28 | Hefei Tuya Information Technology Co., Ltd. | Video encoding method, apparatus, and terminal device |
- CN111263161B (zh) * | 2020-01-07 | 2021-10-26 | Beijing Horizon Robotics Technology Research and Development Co., Ltd. | Video compression processing method, apparatus, storage medium, and electronic device |
Application events:
- 2020-11-13: CN application CN202011271217.8A, publication CN114501031B (zh), status: active
- 2021-07-21: WO application PCT/CN2021/107500, publication WO2022100140A1 (zh), status: application filing
- 2021-08-11: EP application EP21890702.0A, publication EP4231644A4 (en), status: pending
- 2021-08-11: JP application JP2023528362A, publication JP2023549210A (ja), status: pending
- 2021-08-11: CN application CN202180076647.0A, publication CN116918329A (zh), status: pending
- 2021-08-11: WO application PCT/CN2021/112077, publication WO2022100173A1 (zh), status: application filing
- 2023-05-12: US application US18/316,750, publication US20230281881A1 (en), status: pending
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- CN107105278A (zh) * | 2017-04-21 | 2017-08-29 | University of Science and Technology of China | Video encoding and decoding framework with automatically generated motion vectors |
- CN107172428A (zh) * | 2017-06-06 | 2017-09-15 | Xi'an Wanxiang Electronics Technology Co., Ltd. | Image transmission method, apparatus, and system |
- US20190124346A1 (en) * | 2017-10-19 | 2019-04-25 | Arizona Board Of Regents On Behalf Of Arizona State University | Real time end-to-end learning system for a high frame rate video compressive sensing network |
- CN110401836A (zh) * | 2018-04-25 | 2019-11-01 | Hangzhou Hikvision Digital Technology Co., Ltd. | Image decoding and encoding method, apparatus, and device |
- US20200236349A1 (en) * | 2019-01-22 | 2020-07-23 | Apple Inc. | Predictive coding with neural networks |
- CN110913220A (zh) * | 2019-11-29 | 2020-03-24 | Hefei Tuya Information Technology Co., Ltd. | Video frame encoding method, apparatus, and terminal device |
- CN111447449A (zh) * | 2020-04-01 | 2020-07-24 | Beijing Aowei Shixun Technology Co., Ltd. | ROI-based video encoding method and system, and video transmission and encoding system |
Non-Patent Citations (1)
Title |
---|
See also references of EP4231644A4 |
Also Published As
Publication number | Publication date |
---|---|
CN114501031B (zh) | 2023-06-02 |
US20230281881A1 (en) | 2023-09-07 |
CN114501031A (zh) | 2022-05-13 |
EP4231644A4 (en) | 2024-03-20 |
WO2022100140A1 (zh) | 2022-05-19 |
CN116918329A (zh) | 2023-10-20 |
JP2023549210A (ja) | 2023-11-22 |
EP4231644A1 (en) | 2023-08-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP3571841B1 (en) | Dc coefficient sign coding scheme | |
- TWI806199B (zh) | Indication method of feature map information, device, and computer program | |
US10506258B2 (en) | Coding video syntax elements using a context tree | |
- TW202228081A (zh) | Method and apparatus for reconstructing an image from a bitstream and for encoding an image into a bitstream, and computer program product | |
US11558619B2 (en) | Adaptation of scan order for entropy coding | |
US20230362378A1 (en) | Video coding method and apparatus | |
US20230209096A1 (en) | Loop filtering method and apparatus | |
- WO2023279961A1 (zh) | Video image encoding and decoding method and apparatus | |
US20240105193A1 (en) | Feature Data Encoding and Decoding Method and Apparatus | |
- WO2022063267A1 (zh) | Intra prediction method and apparatus | |
- WO2022100173A1 (zh) | Video frame compression and video frame decompression method and apparatus | |
US20230396810A1 (en) | Hierarchical audio/video or picture compression method and apparatus | |
- WO2023193629A1 (zh) | Encoding and decoding method and apparatus for a region enhancement layer | |
- WO2022194137A1 (zh) | Video image encoding and decoding method and related device | |
- CN114554205B (zh) | Image encoding and decoding method and apparatus | |
KR20230145096A (ko) | 신경망 기반 픽처 프로세싱에서의 보조 정보의 독립적 위치결정 | |
- CN110731082B (zh) | Compressing groups of video frames using reversed ordering | |
- WO2023279968A1 (zh) | Video image encoding and decoding method and apparatus | |
- WO2024007820A1 (zh) | Data encoding and decoding method and related device | |
- WO2022140937A1 (zh) | Point cloud encoding and decoding method and system, point cloud encoder, and point cloud decoder | |
WO2023091040A1 (en) | Generalized difference coder for residual coding in video compression | |
KR20240064698A (ko) | 특징 맵 인코딩 및 디코딩 방법 및 장치 | |
WO2023059689A1 (en) | Systems and methods for predictive coding |
Legal Events
Code | Title | Details
---|---|---
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 21890702; Country of ref document: EP; Kind code of ref document: A1
WWE | WIPO information: entry into national phase | Ref document number: 202180076647.0; Country of ref document: CN. Ref document number: 2023528362; Country of ref document: JP
ENP | Entry into the national phase | Ref document number: 2021890702; Country of ref document: EP; Effective date: 20230516
NENP | Non-entry into the national phase | Ref country code: DE