WO2023050433A1 - Video encoding and decoding method, encoder, decoder and storage medium - Google Patents

Video encoding and decoding method, encoder, decoder and storage medium

Info

Publication number
WO2023050433A1
Authority
WO
WIPO (PCT)
Prior art keywords
network
feature
feature information
intermediate layer
decoding
Prior art date
Application number
PCT/CN2021/122473
Other languages
English (en)
French (fr)
Inventor
虞露
贾可
何淇淇
王东
Original Assignee
浙江大学
Oppo广东移动通信有限公司
Priority date
Filing date
Publication date
Application filed by 浙江大学, Oppo广东移动通信有限公司
Priority to PCT/CN2021/122473 priority Critical patent/WO2023050433A1/zh
Priority to CN202180102730.0A priority patent/CN118020306A/zh
Publication of WO2023050433A1 publication Critical patent/WO2023050433A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/85 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression

Definitions

  • the present application relates to the technical field of video coding and decoding, and in particular to a video coding and decoding method, an encoder, a decoder, and a storage medium.
  • Digital video technology can be incorporated into a variety of video devices, such as digital televisions, smartphones, computers, e-readers, or video players, among others.
  • video devices implement video compression technology to enable more effective transmission or storage of video data.
  • Embodiments of the present application provide a video encoding and decoding method, an encoder, a decoder, and a storage medium, so as to save the time and calculation amount of task analysis, and further improve the efficiency of task analysis.
  • the present application provides a video coding method, including:
  • the embodiment of the present application provides a video decoding method, including:
  • the encoding network and the decoding network perform end-to-end training together, and the first feature information output by the i-th intermediate layer of the decoding network is input into the j-th intermediate layer of the task analysis network.
  • the present application provides a video encoder, configured to execute the method in the above first aspect or various implementations thereof.
  • the encoder includes a functional unit configured to execute the method in the above first aspect or its implementations.
  • the present application provides a video decoder, configured to execute the method in the above second aspect or various implementations thereof.
  • the decoder includes a functional unit configured to execute the method in the above second aspect or its various implementations.
  • a video encoder including a processor and a memory.
  • the memory is used to store a computer program, and the processor is used to call and run the computer program stored in the memory, so as to execute the method in the above first aspect or its various implementations.
  • a sixth aspect provides a video decoder, including a processor and a memory.
  • the memory is used to store a computer program
  • the processor is used to invoke and run the computer program stored in the memory, so as to execute the method in the above second aspect or its various implementations.
  • a video codec system including a video encoder and a video decoder.
  • the video encoder is configured to execute the method in the first aspect or its implementations
  • the video decoder is configured to execute the method in the second aspect or its implementations.
  • the chip includes: a processor, configured to call and run a computer program from the memory, so that the device installed with the chip executes the method in any one of the above-mentioned first to second aspects or any of the implementations thereof.
  • a computer-readable storage medium for storing a computer program, and the computer program causes a computer to execute any one of the above-mentioned first to second aspects or the method in each implementation manner thereof.
  • a computer program product including computer program instructions, the computer program instructions cause a computer to execute any one of the above first to second aspects or the method in each implementation manner.
  • a computer program which, when running on a computer, causes the computer to execute any one of the above-mentioned first to second aspects or the method in each implementation manner thereof.
  • a twelfth aspect provides a code stream, which is generated by the method described in the second aspect above.
  • the first feature information output by the i-th intermediate layer of the decoding network is obtained, where i is a positive integer; the first feature information is input into the j-th intermediate layer of the task analysis network, and the task analysis result output by the task analysis network is obtained, where j is a positive integer.
  • the feature information output by the intermediate layer of the decoding network is input into the task analysis network, so that the task analysis network performs task analysis based on the feature information output by the decoding network, which saves the time and computing resources occupied by task analysis and improves the efficiency of task analysis.
  • FIG. 1 is a schematic block diagram of a video encoding and decoding system involved in an embodiment of the present application
  • Fig. 2 is a schematic flow chart of image compression and analysis
  • FIG. 3 is a schematic block diagram of a video encoder involved in an embodiment of the present application.
  • Fig. 4 is a schematic block diagram of a video decoder involved in an embodiment of the present application.
  • FIG. 5A is a schematic flow diagram of an end-to-end codec network
  • FIG. 5B is a schematic diagram of the division of the Cheng codec network
  • FIG. 5C is a schematic diagram of the division of the Lee encoding and decoding network
  • FIG. 5D is a schematic diagram of the division of the Hu codec network
  • FIG. 6A is a schematic flow diagram of a task analysis network
  • FIG. 6B is a schematic diagram of the division of the target recognition network
  • FIG. 6C is a schematic diagram of the division of the target detection network
  • FIG. 7 is a schematic flowchart of a video decoding method provided in an embodiment of the present application.
  • FIG. 8A is a schematic diagram of a network model involved in an embodiment of the present application.
  • FIG. 8B is a schematic diagram of a decoding network involved in an embodiment of the present application.
  • FIG. 8C is a schematic diagram of another decoding network involved in an embodiment of the present application.
  • FIG. 8D is a schematic diagram of another decoding network involved in an embodiment of the present application.
  • FIG. 9A is a schematic diagram of a target detection network involved in an embodiment of the present application.
  • Fig. 9B is a schematic diagram of a part of the network in Fig. 9A;
  • Figure 9C is a network schematic diagram of some components in the Cheng2020 network
  • Figure 9D is a network example diagram of an end-to-end encoding and decoding network and a task analysis network
  • FIG. 9E is a schematic diagram of a connection between an end-to-end codec network and a task analysis network
  • FIG. 9F is another schematic diagram of connection between the end-to-end codec network and the task analysis network
  • Fig. 9G is another connection diagram of the end-to-end codec network and the task analysis network
  • FIG. 9H is a schematic diagram of another model involved in the embodiment of the present application.
  • FIG. 10A is a schematic diagram of a connection between an end-to-end codec network and a task analysis network
  • Fig. 10B is another schematic diagram of connection between the end-to-end codec network and the task analysis network;
  • Fig. 10C is another schematic diagram of the connection between the end-to-end codec network and the task analysis network;
  • FIG. 11 is a schematic flowchart of a video decoding method provided by an embodiment of the present application.
  • Fig. 12 is a schematic structural diagram of a decoding network and a task analysis network involved in an embodiment of the present application
  • FIG. 13 is a schematic flowchart of a video encoding method provided by an embodiment of the present application.
  • FIG. 14A is a schematic structural diagram of an encoding network involved in the present application.
  • FIG. 14B is another schematic structural diagram of the encoding network involved in the present application.
  • FIG. 14C is another schematic structural diagram of the encoding network involved in this application.
  • Fig. 14D is a schematic diagram of a model of the encoding network involved in the present application.
  • Fig. 14E is a network schematic diagram of an attention model in an encoding network
  • Figure 15 is a schematic diagram of a general-purpose end-to-end codec network
  • Fig. 16 is a schematic block diagram of a video decoder provided by an embodiment of the present application.
  • Fig. 17 is a schematic block diagram of a video encoder provided by an embodiment of the present application.
  • Fig. 18 is a schematic block diagram of an electronic device provided by an embodiment of the present application.
  • Fig. 19 is a schematic block diagram of a video codec system provided by an embodiment of the present application.
  • This application can be applied to various video coding and decoding fields for machine vision and human-machine hybrid vision, combining technologies such as 5G, AI, deep learning, feature extraction and video analysis with existing video processing and coding technologies.
  • the 5G era has spawned a large number of machine-oriented applications, such as the Internet of Vehicles, autonomous driving, the industrial Internet, smart and safe cities, wearables and video surveillance. Compared with the increasingly saturated human-oriented video applications, these machine vision application scenarios are more extensive, and video coding for machine vision will become one of the main sources of incremental traffic in the 5G and post-5G era.
  • the solution of the present application can be combined with video coding standards, for example, the audio video coding standard (AVS), the H.264/advanced video coding (AVC) standard, the H.265/high efficiency video coding (HEVC) standard and the H.266/versatile video coding (VVC) standard.
  • the solutions of the present application may also operate in conjunction with other proprietary or industry standards, including ITU-T H.261, ISO/IEC MPEG-1 Visual, ITU-T H.262 or ISO/IEC MPEG-2 Visual, ITU-T H.263, ISO/IEC MPEG-4 Visual and ITU-T H.264 (also known as ISO/IEC MPEG-4 AVC), including its scalable video codec (SVC) and multi-view video codec (MVC) extensions.
  • FIG. 1 is a schematic block diagram of a video encoding and decoding system involved in an embodiment of the present application. It should be noted that FIG. 1 is only an example, and the video codec system in the embodiment of the present application includes but is not limited to what is shown in FIG. 1 .
  • the video codec system 100 includes an encoding device 110 and a decoding device 120 .
  • the encoding device is used to encode (can be understood as compression) the video data to generate a code stream, and transmit the code stream to the decoding device.
  • the decoding device decodes the code stream generated by the encoding device to obtain decoded video data.
  • the encoding device 110 in the embodiment of the present application can be understood as a device having a video encoding function
  • the decoding device 120 can be understood as a device having a video decoding function; that is, the embodiment of the present application covers a wide range of devices for the encoding device 110 and the decoding device 120, including, for example, smartphones, desktop computers, mobile computing devices, notebook (e.g., laptop) computers, tablet computers, set-top boxes, televisions, cameras, display devices, digital media players, video game consoles, vehicle-mounted computers, and the like.
  • the encoding device 110 may transmit the encoded video data (such as code stream) to the decoding device 120 via the channel 130 .
  • Channel 130 may include one or more media and/or devices capable of transmitting encoded video data from encoding device 110 to decoding device 120 .
  • channel 130 includes one or more communication media that enable encoding device 110 to transmit encoded video data directly to decoding device 120 in real-time.
  • encoding device 110 may modulate the encoded video data according to a communication standard and transmit the modulated video data to decoding device 120 .
  • the communication medium includes a wireless communication medium, such as a radio frequency spectrum.
  • the communication medium may also include a wired communication medium, such as one or more physical transmission lines.
  • the channel 130 includes a storage medium that can store video data encoded by the encoding device 110 .
  • the storage medium includes a variety of locally accessible data storage media, such as optical discs, DVDs, flash memory, and the like.
  • the decoding device 120 may acquire encoded video data from the storage medium.
  • channel 130 may include a storage server that may store video data encoded by encoding device 110 .
  • the decoding device 120 may download the stored encoded video data from the storage server.
  • the storage server may store the encoded video data and may transmit the encoded video data to the decoding device 120, such as a web server (eg, for a website), a file transfer protocol (FTP) server, and the like.
  • the encoding device 110 includes a video encoder 112 and an output interface 113 .
  • the output interface 113 may include a modulator/demodulator (modem) and/or a transmitter.
  • the encoding device 110 may include a video source 111 in addition to the video encoder 112 and the output interface 113.
  • the video source 111 may include at least one of a video capture device (for example, a video camera), a video archive, a video input interface and a computer graphics system, where the video input interface is used to receive video data from a video content provider and the computer graphics system is used to generate video data.
  • the video encoder 112 encodes the video data from the video source 111 to generate a code stream.
  • Video data may include one or more pictures or a sequence of pictures.
  • the code stream contains the encoding information of an image or image sequence in the form of a bit stream.
  • Encoding information may include encoded image data and associated data.
  • the associated data may include a sequence parameter set (SPS for short), a picture parameter set (PPS for short) and other syntax structures.
  • An SPS may contain parameters that apply to one or more sequences.
  • a PPS may contain parameters applied to one or more images.
  • the syntax structure refers to a set of zero or more syntax elements arranged in a specified order in the code stream.
  • the video encoder 112 directly transmits encoded video data to the decoding device 120 via the output interface 113 .
  • the encoded video data can also be stored on a storage medium or a storage server for subsequent reading by the decoding device 120 .
  • the decoding device 120 includes an input interface 121 and a video decoder 122 .
  • the decoding device 120 may include a display device 123 in addition to the input interface 121 and the video decoder 122 .
  • the input interface 121 includes a receiver and/or a modem.
  • the input interface 121 can receive encoded video data through the channel 130 .
  • the video decoder 122 is used to decode the encoded video data to obtain decoded video data, and transmit the decoded video data to the display device 123 .
  • the display device 123 displays the decoded video data.
  • the display device 123 may be integrated with the decoding device 120 or external to the decoding device 120 .
  • the display device 123 may include various display devices, such as a liquid crystal display (LCD), a plasma display, an organic light emitting diode (OLED) display, or other types of display devices.
  • FIG. 1 is only an example, and the technical solutions of the embodiments of the present application are not limited to FIG. 1 .
  • the technology of the present application may also be applied to one-sided video encoding or one-sided video decoding.
  • the neural network originates from the interdisciplinary research of cognitive neuroscience and mathematics.
  • the multi-layer perceptron (MLP) structure, constructed by alternately cascading multiple layers of neurons and nonlinear activation functions, can approximate arbitrary continuous functions with sufficiently small error.
  • the learning methods of neural networks have gone through the perceptron learning algorithm proposed in the 1960s, the MLP learning process established by the chain rule and the backpropagation algorithm in the 1980s, and the stochastic gradient descent method that is widely used today.
  • in long short-term memory (LSTM) networks, gradient transfer is controlled by the recurrent network structure to achieve efficient learning of sequence signals.
  • Deep neural network training was made practical by hierarchically pre-training each layer as a restricted Boltzmann machine (RBM). This showed that the MLP structure has excellent feature learning ability and that the complexity of MLP training can be effectively alleviated by layer-by-layer initialization and pre-training. Since then, research on MLP structures with multiple hidden layers has become a hot topic again, and neural networks have acquired a new name - deep learning (DL).
  • Neural networks, as optimization algorithms and compact representations of signals, can be combined with image and video compression.
  • end-to-end image/video codecs based on deep learning adopt deep-neural-network-assisted coding tools and, with the help of the layered model architecture and large-scale data priors of deep learning methods, achieve better performance than conventional codecs.
  • the encoding and decoding frameworks used for compression can generally be divided into: the traditional hybrid coding framework; improvements of the traditional hybrid coding framework (for example, using neural networks to replace some modules in the traditional framework); and the end-to-end encoding and decoding network framework.
  • the outputs of these compression methods are decoded and reconstructed images or videos.
  • the video encoder shown in FIG. 3 and the video decoder shown in FIG. 4 may be used.
  • Fig. 3 is a schematic block diagram of a video encoder involved in an embodiment of the present application. It should be understood that the video encoder 200 can be used to perform lossy compression on images, and can also be used to perform lossless compression on images.
  • the lossless compression may be visually lossless compression or mathematically lossless compression.
  • the video encoder 200 may include: a prediction unit 210, a residual unit 220, a transform/quantization unit 230, an inverse transform/quantization unit 240, a reconstruction unit 250, a loop filter unit 260, a decoded image cache 270 and an entropy encoding unit 280. It should be noted that the video encoder 200 may include more, fewer or different functional components.
  • the current block may be called a current coding unit (CU) or a current prediction unit (PU).
  • a predicted block may also be called a predicted image block or an image predicted block, and a reconstructed image block may also be called a reconstructed block or an image reconstructed image block.
  • the prediction unit 210 includes an inter prediction unit 211 and an intra estimation unit 212 . Because there is a strong correlation between adjacent pixels in a video frame, the intra-frame prediction method is used in video coding and decoding technology to eliminate the spatial redundancy between adjacent pixels. Due to the strong similarity between adjacent frames in video, the inter-frame prediction method is used in video coding and decoding technology to eliminate time redundancy between adjacent frames, thereby improving coding efficiency.
  • the inter-frame prediction unit 211 can be used for inter-frame prediction.
  • the inter-frame prediction can refer to image information of different frames.
  • the inter-frame prediction uses motion information to find a reference block from the reference frame, and generates a prediction block according to the reference block to eliminate temporal redundancy;
  • Frames used for inter-frame prediction may be P frames and/or B frames, P frames refer to forward predictive frames, and B frames refer to bidirectional predictive frames.
  • the motion information includes the reference frame list where the reference frame is located, the reference frame index, and the motion vector.
  • the motion vector can be an integer pixel or a sub-pixel. If the motion vector is sub-pixel, then it is necessary to use interpolation filtering in the reference frame to make the required sub-pixel block.
  • the block of whole pixels or sub-pixels found in the reference frame according to the motion vector is called a reference block.
  • Some technologies directly use the reference block as the prediction block, while others further process the reference block to generate the prediction block. Further processing the reference block to generate a prediction block can also be understood as taking the reference block as the prediction block and then processing it to generate a new prediction block.
  • the intra-frame estimation unit 212 refers only to information within the same frame to predict pixel information in the current image block, in order to eliminate spatial redundancy.
  • a frame used for intra prediction may be an I frame.
  • the intra prediction modes used by HEVC include planar mode (Planar), DC and 33 angle modes, a total of 35 prediction modes.
  • the intra-frame modes used by VVC include Planar, DC and 65 angle modes, with a total of 67 prediction modes.
  • newer intra prediction tools include matrix-based intra prediction (MIP) and the cross-component linear model (CCLM) prediction mode.
  • with more prediction modes, intra-frame prediction becomes more accurate and better meets the demands of high-definition and ultra-high-definition digital video.
  • the residual unit 220 may generate a residual block of the CU based on the pixel blocks of the CU and the prediction blocks of the PUs of the CU. For example, the residual unit 220 may generate a residual block for a CU such that each sample in the residual block has a value equal to the difference between a sample in the pixel block of the CU and the corresponding sample in the prediction block of a PU of the CU.
  • Transform/quantization unit 230 may quantize the transform coefficients. Transform/quantization unit 230 may quantize transform coefficients associated with TUs of a CU based on quantization parameter (QP) values associated with the CU. Video encoder 200 may adjust the degree of quantization applied to transform coefficients associated with a CU by adjusting the QP value associated with the CU.
  • Inverse transform/quantization unit 240 may apply inverse quantization and inverse transform to the quantized transform coefficients, respectively, to reconstruct a residual block from the quantized transform coefficients.
  • the reconstruction unit 250 may add samples of the reconstructed residual block to corresponding samples of one or more prediction blocks generated by the prediction unit 210 to generate a reconstructed image block associated with the TU. By reconstructing the sample blocks of each TU of the CU in this way, the video encoder 200 can reconstruct the pixel blocks of the CU.
  • Loop filtering unit 260 may perform deblocking filtering operations to reduce blocking artifacts of pixel blocks associated with a CU.
  • the loop filtering unit 260 includes a deblocking filtering unit and a sample adaptive compensation/adaptive loop filtering (SAO/ALF) unit, wherein the deblocking filtering unit is used for deblocking, and the SAO/ALF unit Used to remove ringing effects.
  • the decoded image buffer 270 may store reconstructed pixel blocks.
  • Inter prediction unit 211 may use reference pictures containing reconstructed pixel blocks to perform inter prediction on PUs of other pictures.
  • intra estimation unit 212 may use the reconstructed pixel blocks in decoded picture cache 270 to perform intra prediction on other PUs in the same picture as the CU.
  • Entropy encoding unit 280 may receive the quantized transform coefficients from transform/quantization unit 230 . Entropy encoding unit 280 may perform one or more entropy encoding operations on the quantized transform coefficients to generate entropy encoded data.
  • Fig. 4 is a schematic block diagram of a video decoder involved in an embodiment of the present application.
  • the video decoder 300 includes: an entropy decoding unit 310 , a prediction unit 320 , an inverse quantization transformation unit 330 , a reconstruction unit 340 , a loop filter unit 350 and a decoded image buffer 360 . It should be noted that the video decoder 300 may include more, less or different functional components.
  • the video decoder 300 can receive code streams.
  • the entropy decoding unit 310 may parse the codestream to extract syntax elements from the codestream. As part of parsing the codestream, the entropy decoding unit 310 may parse the entropy-encoded syntax elements in the codestream.
  • the prediction unit 320 , the inverse quantization transformation unit 330 , the reconstruction unit 340 and the loop filter unit 350 can decode video data according to the syntax elements extracted from the code stream, that is, generate decoded video data.
  • the prediction unit 320 includes an intra estimation unit 321 and an inter prediction unit 322 .
  • Intra estimation unit 321 may perform intra prediction to generate a predictive block for a PU.
  • Intra estimation unit 321 may use an intra prediction mode to generate a prediction block for a PU based on pixel blocks of spatially neighboring PUs.
  • Intra estimation unit 321 may also determine the intra prediction mode of the PU from one or more syntax elements parsed from the codestream.
  • the inter prediction unit 322 may construct a first reference picture list (list 0) and a second reference picture list (list 1) according to the syntax elements parsed from the codestream. Furthermore, if the PU is encoded using inter prediction, entropy decoding unit 310 may parse the motion information for the PU. Inter prediction unit 322 may determine one or more reference blocks for the PU according to the motion information of the PU. Inter prediction unit 322 may generate a predictive block for the PU from one or more reference blocks for the PU.
  • Inverse quantization transform unit 330 may inverse quantize (ie, dequantize) the transform coefficients associated with a TU. Inverse quantization transform unit 330 may use a QP value associated with a CU of a TU to determine the degree of quantization.
  • inverse quantized transform unit 330 may apply one or more inverse transforms to the inverse quantized transform coefficients in order to generate a residual block associated with the TU.
  • Reconstruction unit 340 uses the residual blocks associated with the TUs of the CU and the prediction blocks of the PUs of the CU to reconstruct the pixel blocks of the CU. For example, the reconstruction unit 340 may add the samples of the residual block to the corresponding samples of the prediction block to reconstruct the pixel block of the CU to obtain the reconstructed image block.
  • Loop filtering unit 350 may perform deblocking filtering operations to reduce blocking artifacts of pixel blocks associated with a CU.
  • Video decoder 300 may store the reconstructed picture of the CU in decoded picture cache 360 .
  • the video decoder 300 may use the reconstructed picture in the decoded picture buffer 360 as a reference picture for subsequent prediction, or transmit the reconstructed picture to a display device for presentation.
  • the basic flow of video encoding and decoding is as follows: at the encoding end, a frame of image is divided into blocks, and for the current block, the prediction unit 210 uses intra-frame prediction or inter-frame prediction to generate the prediction block of the current block .
  • the residual unit 220 may calculate a residual block based on the predicted block and the original block of the current block, for example, subtract the predicted block from the original block of the current block to obtain a residual block, which may also be referred to as residual information.
  • the residual block can be transformed and quantized by the transformation/quantization unit 230 to remove information that is not sensitive to human eyes, so as to eliminate visual redundancy.
  • the residual block before being transformed and quantized by the transform/quantization unit 230 may be called a time domain residual block, and the time domain residual block after being transformed and quantized by the transform/quantization unit 230 may be called a frequency residual block or a frequency-domain residual block.
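  • As a hedged numerical illustration of the relationship between the time-domain residual and the quantized coefficients described above (the block values and quantization step are invented for illustration, and the transform step is omitted, so the frequency-domain part is not shown), a minimal Python sketch is:

```python
import numpy as np

# Illustrative values only: a 2x2 original block and its prediction block.
original = np.array([[52, 55], [61, 59]], dtype=np.float32)
prediction = np.array([[50, 54], [60, 60]], dtype=np.float32)

residual = original - prediction            # time-domain residual block
step = 4.0                                  # quantization step, conceptually derived from the QP
quantized = np.round(residual / step)       # coefficients written to the code stream (transform omitted)
reconstructed_residual = quantized * step   # residual recovered after inverse quantization at the decoder
```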
  • the entropy encoding unit 280 receives the quantized transform coefficients output by the transform and quantization unit 230 , may perform entropy encoding on the quantized transform coefficients, and output a code stream.
  • the entropy coding unit 280 can eliminate character redundancy according to the target context model and the probability information of the binary code stream.
  • the entropy decoding unit 310 can analyze the code stream to obtain the prediction information of the current block, the quantization coefficient matrix, etc., and the prediction unit 320 uses intra prediction or inter prediction for the current block based on the prediction information to generate a prediction block of the current block.
  • the inverse quantization transformation unit 330 uses the quantization coefficient matrix obtained from the code stream to perform inverse quantization and inverse transformation on the quantization coefficient matrix to obtain a residual block.
  • the reconstruction unit 340 adds the predicted block and the residual block to obtain a reconstructed block.
  • the reconstructed blocks form a reconstructed image, and the loop filtering unit 350 performs loop filtering on the reconstructed image based on the image or based on the block to obtain a decoded image.
  • the encoding end also needs similar operations to the decoding end to obtain the decoded image.
  • the decoded image may also be referred to as a reconstructed image, and the reconstructed image may be a subsequent frame as a reference frame for inter-frame prediction.
  • the block division information determined by the encoder as well as mode information or parameter information such as prediction, transformation, quantization, entropy coding, and loop filtering, etc., are carried in the code stream when necessary.
  • the decoding end parses the code stream and analyzes the available information to determine the same block division information as the encoding end, as well as mode information or parameter information such as prediction, transformation, quantization, entropy coding and loop filtering, so as to ensure that the decoded image obtained by the encoding end is the same as the decoded image obtained by the decoding end.
  • the above is the basic process of the video codec under the block-based hybrid coding framework. With the development of technology, some modules or steps of this framework or process may be optimized. This application is applicable to the basic process of the video codec under the block-based hybrid coding framework, but is not limited to this framework and process.
  • the traditional hybrid coding framework may be improved through the methods shown in the following examples.
  • Example 1 using a sub-pixel interpolation filter based on Super Resolution Convolutional Neural Network (SRCNN) for half-pixel motion compensation of HEVC.
  • Example 2: using a new fully connected network, IPFCN (Intra Prediction using a Fully Connected Network), for intra prediction of HEVC, flattening the reference pixels of intra prediction into a vector as the network input, thus predicting the pixel values of the current block.
  • Example 3 using a convolutional neural network for intra-frame coding acceleration, and using the network to classify CUs of different depths to predict the CU partition method for intra-frame coding, thereby replacing the way of traversing different partitions in the traditional HEVC rate-distortion optimization method.
  • the codec framework used for compression may be an end-to-end codec network framework.
  • the end-to-end encoding and decoding network is an encoding and decoding network based on a recurrent neural network (Recurrent Neural Network, RNN).
  • the image is input into a recurrent neural network shared across multiple rounds; the reconstruction residual output of each round is used as the input of the next round, and the code rate is controlled by controlling the number of rounds, thereby achieving a scalable coding effect.
  • the end-to-end encoding and decoding network is an end-to-end image encoding network based on a convolutional neural network (Convolution Neural Network, CNN).
  • the generalized divisive normalization (GDN) method is used as the normalizing activation function, the transform coefficients output by the network are uniformly quantized, and the quantization process is simulated by adding uniform noise during training, so as to solve the problem that quantization is non-differentiable during network training.
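  • A minimal PyTorch-style sketch of the uniform-noise relaxation mentioned above (the class name and interface are our own illustration, not the patent's or any specific library's implementation) is:

```python
import torch


class NoisyQuantizer(torch.nn.Module):
    """Differentiable stand-in for scalar quantization: add uniform noise in
    [-0.5, 0.5) during training, hard-round at inference time."""

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        if self.training:
            # The noise approximates the rounding error while keeping the
            # operation differentiable for backpropagation.
            noise = torch.empty_like(latent).uniform_(-0.5, 0.5)
            return latent + noise
        return torch.round(latent)
```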
  • a Gaussian mixture model (GMM) super prior model can be used to replace the Gaussian scale mixture (GSM) model, and an autoregressive contextual conditional probability model based on the PixelCNN structure is used to reduce the bit rate and improve modeling accuracy.
  • the end-to-end encoding and decoding network is a Lee encoding and decoding network
  • the Lee encoding and decoding network adopts a transfer learning method to improve the quality of the image reconstructed by the network.
  • the end-to-end codec network is a Hu codec network, which successfully builds compact and expressive representations at low bitrates by exploiting the intrinsic transferability between different tasks, To support a diverse set of machine vision tasks including high-level semantic related tasks and intermediate geometric parsing tasks.
  • the codec network enhances low-level visual features by using high-level semantic maps, and verifies that this method can effectively improve the bit rate, accuracy and distortion performance of image compression.
  • videos and images are not only required to be presented to users for high-quality viewing, but are also increasingly used to analyze and understand the semantic information they contain.
  • the intelligent task network involved in the embodiment of the present application includes, but is not limited to, an object recognition network, an object detection network, and an instance segmentation network.
  • the end-to-end codec network usually firstly uses the neural network to compress the image/video, then transmits the compressed code stream to the decoder, and finally decodes and reconstructs the image/video at the decoder.
  • the flow of the end-to-end codec network is shown in Figure 5A, where E1 and E2 modules form the encoding end of the end-to-end codec network, and D2 and D1 modules form the decoding end of the end-to-end codec network.
  • the E1 module is the feature extraction network, which extracts features from the image;
  • the E2 module is the feature encoding module, which continues to extract features and encodes the extracted features into code streams;
  • the D2 module is the feature decoding module, which decodes the code stream back into features and reconstructs the low-level features;
  • the D1 module is the decoding network, which reconstructs the image from the features reconstructed by D2.
  • in these figures, FCN denotes a fully convolutional network, ReLU denotes the rectified linear unit activation function, leaky ReLU denotes the leaky variant of the ReLU activation function, abs denotes the absolute value, and exp denotes the exponential function with base e.
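  • The four-module split of FIG. 5A can be sketched as follows (a minimal illustration with placeholder layer shapes; the entropy coding between E2 and D2 is omitted and the layer definitions are not those of any specific codec network):

```python
import torch
from torch import nn


class EndToEndCodec(nn.Module):
    """Minimal sketch of the E1/E2/D2/D1 split from FIG. 5A; the layer
    definitions are placeholders, not the patent's networks."""

    def __init__(self, channels: int = 128):
        super().__init__()
        self.e1 = nn.Conv2d(3, channels, 3, stride=2, padding=1)         # E1: feature extraction
        self.e2 = nn.Conv2d(channels, channels, 3, stride=2, padding=1)  # E2: feature encoding (before entropy coding)
        self.d2 = nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1)  # D2: feature decoding
        self.d1 = nn.ConvTranspose2d(channels, 3, 4, stride=2, padding=1)          # D1: image reconstruction

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        latent = self.e2(self.e1(image))   # encoder side (entropy coding omitted)
        features = self.d2(latent)         # low-level features reconstructed by D2
        return self.d1(features)           # reconstructed image from D1
```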
  • the intelligent task network performs intelligent task analysis on the input image/video content, including but not limited to target recognition, instance segmentation and other tasks.
  • the flow of the intelligent task network is shown in Figure 6A, where the A1 module is a feature extraction network, which is used to extract low-level features from reconstructed images/videos.
  • the A2 module is an intelligent analysis network, which continues to extract features and perform intelligent analysis on the extracted features.
  • the above-mentioned intelligent task network is the target recognition network yolo_v3 (You Only Look Once, version 3) shown in Figure 6B; the division of the A1 module and the A2 module is shown by the dotted boxes in Figure 6B.
  • the above-mentioned intelligent task network is the target detection network ResNet-FPN (Residual Networks-Feature Pyramid Networks, residual-feature pyramid network) as shown in Figure 6C
  • optionally, the above-mentioned intelligent task network can also be an instance segmentation network, such as Mask RCNN (Mask Region-based CNN).
  • the image is compressed and stored first, and then decompressed for analysis.
  • the task analysis is based on the image; that is to say, the decoding network reconstructs the image, and the reconstructed image is input into the task analysis network for task analysis, resulting in long task analysis time, a large amount of calculation, and low efficiency.
  • the embodiment of the present application inputs the feature information output by the intermediate layer of the decoding network into the task analysis network, so that the task analysis network performs task analysis based on the feature information output by the decoding network, saving the time and computing resources occupied by task analysis and thereby improving the efficiency of task analysis.
  • FIG. 7 is a schematic flowchart of a video decoding method provided by an embodiment of the present application.
  • the execution body of the embodiment of the present application can be understood as the decoding device shown in FIG. 1; as shown in FIG. 7, the method includes:
  • Fig. 8A is a schematic diagram of a network model involved in an embodiment of the present application.
  • the network model includes a decoding network and a task analysis network, where the output of the i-th intermediate layer of the decoding network is connected to the input of the j-th intermediate layer of the task analysis network, so that the first feature information output by the i-th intermediate layer of the decoding network can be used as the input of the j-th intermediate layer of the task analysis network, and the task analysis network can perform task analysis based on the feature information input at the j-th intermediate layer.
  • compared with the approach in which the decoding network decodes the feature information of all layers, reconstructs the image, and inputs the reconstructed image into the task analysis network so that the task analysis network performs task analysis based on the reconstructed image, the embodiment of the present application only needs to decode part of the feature information, for example the feature information of the i-th intermediate layer, without decoding the feature information of all layers or rebuilding the image, thereby saving the time and computing resources occupied by task analysis and improving the efficiency of task analysis.
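  • The following sketch illustrates this connection (helper name, layer indexing and module containers are our own assumptions, not the patent's code): the decoding network is run only up to its i-th intermediate layer, and the resulting first feature information is fed into the task analysis network starting at its j-th intermediate layer, without reconstructing the image.

```python
import torch
from torch import nn


def analyze_from_decoder_features(decoder_layers: nn.ModuleList,
                                  task_layers: nn.ModuleList,
                                  initial_features: torch.Tensor,
                                  i: int, j: int) -> torch.Tensor:
    """Run the decoder only up to layer i (0-based) and continue through the
    task analysis network from layer j onward."""
    features = initial_features
    for layer in decoder_layers[:i]:   # partial decoding: stop at the i-th intermediate layer
        features = layer(features)
    for layer in task_layers[j:]:      # task analysis continues from the j-th intermediate layer
        features = layer(features)
    return features                    # task analysis result
```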
  • the embodiment of the present application does not limit the specific network structure of the decoding network.
  • the above decoding network may be a separate neural network. During model training, the decoding network is trained separately.
  • the above-mentioned decoding network is a decoding part of an end-to-end codec network.
  • the decoding part and the encoding part are trained end-to-end together.
  • the end-to-end codec network is also called an autoencoder.
  • the decoding network includes a decoding unit and a first decoding sub-network
  • the decoding unit is used to decode the feature code stream
  • the first decoding sub-network is used to further process the feature information decoded by the decoding unit so as to reconstruct the image.
  • the decoding unit can be understood as an entropy decoding unit, which can perform entropy decoding on the feature code stream to obtain initial feature information of the current image
  • the decoding unit can be a neural network.
  • the above i-th intermediate layer is any layer of the first decoding sub-network other than its output layer; that is, the i-th intermediate layer is the input layer or any intermediate layer of the first decoding sub-network.
  • S701 includes the following S701-A and S701-B:
  • S701-A: Input the feature code stream of the current image into the decoding unit, and obtain the initial feature information of the current image output by the decoding unit;
  • the decoding network may include an inverse quantization unit in addition to the decoding unit and the first decoding sub-network.
  • the above S701-B includes the following steps S701-B1 and S701-B2:
  • the decoding unit in the decoding network decodes the feature code stream to obtain the initial feature information, which was quantized in the encoding network, so the decoding network needs to dequantize the initial feature information. Specifically, the initial feature information is input into the dequantization unit for inverse quantization to obtain dequantized feature information, and the dequantized feature information is then input into the first decoding sub-network to obtain the first feature information output by the i-th intermediate layer of the first decoding sub-network.
  • when the encoding network encodes the current image, it not only encodes the feature information of the current image to form a feature code stream, but also estimates the probability distribution of the decoding points of the current image and encodes this probability distribution to form a decoding point probability distribution code stream (also called a probability estimation code stream) of the current image. In this way, in addition to decoding the feature code stream, the decoding network also needs to decode the decoding point probability distribution code stream.
  • the decoding network further includes a second decoding subnetwork, and the second decoding subnetwork is used to decode the decoding point probability distribution code stream.
  • the embodiment of the present application further includes: inputting the probability distribution code stream of the decoding points of the current image into the second decoding subnetwork to obtain the probability distribution of the decoding points of the current image.
  • the above S701-A includes: inputting the feature code stream of the current image and the probability distribution of the decoding points of the current image into the decoding unit to obtain the initial feature information of the current image output by the decoding unit.
  • the above-mentioned second decoding sub-network may be a super prior network.
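  • A hedged sketch of such a super prior ("second decoding sub-network") is shown below; the layer shapes are placeholders and the arithmetic decoding of the feature code stream itself is not shown. The sub-network expands the decoded hyper latent into per-element entropy parameters (for example a mean and a scale) that the decoding unit then uses to decode the feature code stream:

```python
import torch
from torch import nn


class HyperPriorDecoder(nn.Module):
    """Illustrative second decoding sub-network: maps the hyper latent decoded
    from the probability-distribution code stream to entropy parameters."""

    def __init__(self, channels: int = 128):
        super().__init__()
        self.hyper_synthesis = nn.Sequential(
            nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(channels, 2 * channels, 4, stride=2, padding=1),
        )

    def forward(self, hyper_latent: torch.Tensor):
        params = self.hyper_synthesis(hyper_latent)
        mean, scale = params.chunk(2, dim=1)   # per-element probability parameters
        return mean, scale
```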
  • the embodiment of the present application does not limit the specific network structure of the task analysis network.
  • the task analysis network may be an object recognition network, an object detection network, an instance segmentation network, a classification network, and the like.
  • the embodiment of the present application does not limit the specific selection of the i-th intermediate layer and the j-th intermediate layer.
  • the above-mentioned i-th intermediate layer can be any intermediate layer in the decoding network except the input layer and the output layer, and the j-th intermediate layer can be any intermediate layer in the task analysis network except the input layer and the output layer.
  • the i-th intermediate layer and the j-th intermediate layer are the two intermediate layers with the highest feature similarity and/or the smallest model loss in the decoding network and the task analysis network.
  • the calculation process of the feature similarity can be as follows: in the network model building stage, image A is input into the encoding network to obtain the code stream of image A, the code stream of image A is input into the decoding network to obtain the feature information output by each intermediate layer of the decoding network, the reconstructed image is input into the task analysis network to obtain the feature information input by each intermediate layer of the task analysis network, and the similarity between the feature information output by each intermediate layer of the decoding network and the feature information input by each intermediate layer of the task analysis network is calculated.
  • take the end-to-end encoding and decoding network as the Cheng2020 network and the task analysis network as the target detection network shown in Figure 9A as an example.
  • this target detection network is also called the Faster RCNN R50C4 (faster region-based convolutional neural network with a ResNet50 Conv4 backbone) network.
  • the target detection network includes the backbone network ResNet50-C4 (Residual Network 50, C4 stage), an RPN (region proposal network) and ROI-Heads (region-of-interest heads), where the backbone network ResNet50-C4 includes 4 stages, namely Conv1, Conv2_X, Conv3_X and Conv4_X, and Conv is the abbreviation of convolution.
  • Conv1 includes at least one convolutional layer
  • Conv2_X includes a maximum pooling layer
  • BTINK is an abbreviation for a bottleneck (Bottom Neck) block
  • Conv3_X includes a BTINK1 and 3 BTINK2s
  • Conv4_X includes a BTINK1 and 5 BTINK2s.
  • the network structures of BTINK1 and BTINK2 are shown in FIG. 9B
  • BTINK1 includes 4 convolutional layers
  • BTINK2 includes 3 convolutional layers.
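  • The BTINK blocks of FIG. 9B follow the standard residual bottleneck pattern; a hedged sketch (channel widths and strides are assumptions, not values from the patent) is:

```python
import torch
from torch import nn


class Bottleneck(nn.Module):
    """Sketch of the BTINK blocks in FIG. 9B: with a projection shortcut the
    block has 4 convolutions (BTINK1); with an identity shortcut it has 3 (BTINK2)."""

    def __init__(self, in_ch: int, mid_ch: int, out_ch: int, stride: int = 1,
                 project_shortcut: bool = False):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1),
        )
        # BTINK1: extra 1x1 convolution on the shortcut; BTINK2: identity shortcut.
        self.shortcut = (nn.Conv2d(in_ch, out_ch, 1, stride=stride)
                         if project_shortcut else nn.Identity())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.body(x) + self.shortcut(x))
```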
  • the Cheng2020 network consists of the Enc_GDNM (encoder generalized divisive normalization module), Enc_NoGDNM (encoder module without generalized divisive normalization), Dec_IGDNM (decoder inverse generalized divisive normalization module) and Dec_NoIGDNM (decoder module without inverse generalized divisive normalization) modules shown in Figure 9C.
  • Figure 9D is a network example diagram of an end-to-end encoding and decoding network and a task analysis network, wherein the end-to-end encoding and decoding network is Cheng2020, and the task analysis network is a Faster RCNN R50C4 network.
  • the end-to-end codec network includes an encoding network and a decoding network, wherein the encoding network includes 9 network layers including nodes e0 to e9, and the decoding network includes 10 network layers including nodes d10 to node d0.
  • the backbone network of the task analysis network includes 4 network layers, including node F0 to node F15.
  • node e0 is the input node of the encoding network
  • node d0 is the output node of the decoding network
  • F0 is the input node of the task analysis network
  • the data corresponding to these three nodes is image data, for example, an image with a size of W×H×3, where W×H is the spatial size of the image and 3 is the number of channels of the image.
  • the sizes of the convolution kernels of each layer in the network shown in FIG. 9D are as shown in Table 1:
  • the convolution kernel in Table 1 is “[3 ⁇ 3,N],/2”, 3 ⁇ 3 is the size of the convolution kernel, N is the number of channels, /2 means downsampling, and 2 is the multiple of downsampling.
  • the convolution kernel in Table 1 is “[3 ⁇ 3,N] ⁇ 2”, 3 ⁇ 3 is the size of the convolution kernel, N is the number of channels, and ⁇ 2 indicates that the number of convolution kernels is 2.
  • the convolution kernel in Table 1 is "[3 ⁇ 3,3],*2", *2 means upsampling, and 2 is the multiple of upsampling.
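  • Purely as an illustration of this notation (the channel count N and padding choices are assumptions, not values taken from Table 1), the entries could map to layers as follows:

```python
from torch import nn

N = 128  # example channel count; the real value depends on the Cheng2020 configuration

# "[3x3,N],/2": a 3x3 convolution with N output channels and 2x downsampling.
conv_down = nn.Conv2d(3, N, kernel_size=3, stride=2, padding=1)

# "[3x3,N]x2": two stacked 3x3 convolutions with N channels each.
conv_pair = nn.Sequential(
    nn.Conv2d(N, N, kernel_size=3, padding=1),
    nn.Conv2d(N, N, kernel_size=3, padding=1),
)

# "[3x3,3],*2": a 3x3 transposed convolution with 3 output channels and 2x upsampling.
deconv_up = nn.ConvTranspose2d(N, 3, kernel_size=3, stride=2, padding=1, output_padding=1)
```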
  • the feature information output by the middle layer corresponding to node d7 in the decoding network is determined to have the highest similarity with the feature information input by the middle layer corresponding to node F9 in the task analysis network, as shown in FIG. 9E .
  • the intermediate layer corresponding to node d7 is used as the i-th intermediate layer
  • the intermediate layer corresponding to F9 is used as the j-th intermediate layer
  • the output end of the intermediate layer corresponding to node d7 is connected to the input end of the intermediate layer corresponding to F9.
  • the feature information output by the middle layer corresponding to node d5 in the decoding network is determined to have the highest similarity with the feature information input by the middle layer corresponding to node F5 in the task analysis network, as shown in FIG. 9F .
  • the intermediate layer corresponding to node d5 is used as the i-th intermediate layer, and the intermediate layer corresponding to F5 is used as the j-th intermediate layer, and then the output end of the intermediate layer corresponding to node d5 is connected to the input end of the intermediate layer corresponding to F5.
  • the feature information output by the middle layer corresponding to node d2 in the decoding network is determined to have the highest similarity with the feature information input by the middle layer corresponding to node F1 in the task analysis network, as shown in FIG. 9G .
  • the intermediate layer corresponding to node d2 is used as the i-th intermediate layer, and the intermediate layer corresponding to F1 is used as the j-th intermediate layer, and then the output end of the intermediate layer corresponding to node d2 is connected to the input end of the intermediate layer corresponding to F1.
  • the feature similarity between the i-th intermediate layer and the j-th intermediate layer includes at least one of the following: the similarity between the feature map output by the i-th intermediate layer and the feature map input by the j-th intermediate layer, the similarity between the feature size output by the i-th intermediate layer and the feature size input by the j-th intermediate layer, and the similarity between the statistical histogram of the feature map output by the i-th intermediate layer and the statistical histogram of the feature map input by the j-th intermediate layer.
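  • A hedged sketch of these three measures is given below; the concrete formulas (cosine similarity, element-count ratio and histogram intersection) are our own illustrative choices, since this passage only names the measures:

```python
import torch
import torch.nn.functional as F


def feature_similarity(dec_feat: torch.Tensor, task_feat: torch.Tensor, bins: int = 64) -> dict:
    """Example scores for the three measures named above; inputs are assumed
    to have shape (batch, channels, height, width)."""
    # 1) Feature-map similarity: cosine similarity after resizing the decoder
    #    feature map to the task-network feature map's spatial size.
    resized = F.interpolate(dec_feat, size=task_feat.shape[-2:], mode="bilinear", align_corners=False)
    c = min(resized.shape[1], task_feat.shape[1])          # compare only the overlapping channels
    map_sim = F.cosine_similarity(resized[:, :c].flatten(1), task_feat[:, :c].flatten(1)).mean().item()

    # 2) Feature-size similarity: ratio of the smaller to the larger element count.
    size_sim = min(dec_feat.numel(), task_feat.numel()) / max(dec_feat.numel(), task_feat.numel())

    # 3) Statistical-histogram similarity: intersection of the normalized value histograms.
    lo = float(torch.minimum(dec_feat.min(), task_feat.min()))
    hi = float(torch.maximum(dec_feat.max(), task_feat.max()))
    h_dec = torch.histc(dec_feat.float(), bins=bins, min=lo, max=hi)
    h_task = torch.histc(task_feat.float(), bins=bins, min=lo, max=hi)
    hist_sim = torch.minimum(h_dec / h_dec.sum(), h_task / h_task.sum()).sum().item()

    return {"feature_map": map_sim, "feature_size": size_sim, "histogram": hist_sim}
```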
  • node d5 is connected to node F5
  • the connection of nodes here can be understood as the connection of two intermediate layers; for example, node d5 is the output terminal of an intermediate layer in the decoding network, and node F5 is the input terminal of an intermediate layer of the task analysis network.
  • Input an image B into the model shown in Figure 9D, and the encoding network performs feature encoding on the image B to obtain a code stream.
  • the decoding network decodes the code stream to obtain the feature information 1 of node d5, and inputs the feature information 1 into node F5 for task analysis, and obtains the classification result 1 predicted by the task analysis network based on the feature information 1.
  • the loss 1 between the classification result 1 predicted by the task analysis network and the ground-truth classification result corresponding to image B is calculated, and the loss of the current model is determined according to loss 1.
  • to connect node d5 to node F9, refer to the above process and calculate the loss of the model when node d5 is connected to node F9.
  • the loss of the model can be calculated when different nodes in the decoding network are connected to different nodes in the task analysis network.
  • the two connected intermediate layers (or two nodes) corresponding to the minimum model loss may be determined as the i-th intermediate layer and the j-th intermediate layer.
  • the i-th intermediate layer and the j-th intermediate layer may be determined according to both the feature similarity and the model loss between the two intermediate layers. For example, according to the above calculation method of feature similarity, the feature similarity between an intermediate layer of the decoding network and an intermediate layer of the task analysis network is calculated, the loss of the model when the two intermediate layers are connected is calculated, and the two intermediate layers with the smallest sum of feature similarity and model loss are determined as the i-th intermediate layer and the j-th intermediate layer.
  • the i-th intermediate layer and the j-th intermediate layer may be determined through the following several examples.
• Example 1: first randomly select an intermediate layer from the decoding network as the i-th intermediate layer, and determine the intermediate layer in the task analysis network with the highest feature similarity to the i-th intermediate layer as the j-th intermediate layer.
• Example 2: first randomly select an intermediate layer from the decoding network as the i-th intermediate layer, try connecting each intermediate layer in the task analysis network to the i-th intermediate layer, determine the model loss of the network model after each intermediate layer of the task analysis network is connected to the i-th intermediate layer of the decoding network, and determine the intermediate layer corresponding to the minimum model loss as the j-th intermediate layer.
• Example 3: first randomly select an intermediate layer from the decoding network as the i-th intermediate layer, determine the feature similarity between each intermediate layer in the task analysis network and the i-th intermediate layer, determine the model loss of the network model after each intermediate layer of the task analysis network is connected to the i-th intermediate layer of the decoding network, compute the sum of the feature similarity and the model loss for each intermediate layer of the task analysis network, and determine the intermediate layer corresponding to the minimum sum as the j-th intermediate layer.
• Example 4: first randomly select an intermediate layer from the task analysis network as the j-th intermediate layer, and determine the intermediate layer in the decoding network with the highest feature similarity to the j-th intermediate layer as the i-th intermediate layer.
• Example 5: first randomly select an intermediate layer from the task analysis network as the j-th intermediate layer, try connecting each intermediate layer in the decoding network to the j-th intermediate layer, determine the model loss of the network model after each intermediate layer of the decoding network is connected to the j-th intermediate layer of the task analysis network, and determine the intermediate layer corresponding to the minimum model loss as the i-th intermediate layer.
• Example 6: first randomly select an intermediate layer from the task analysis network as the j-th intermediate layer, determine the feature similarity between each intermediate layer in the decoding network and the j-th intermediate layer, determine the model loss of the network model after each intermediate layer of the decoding network is connected to the j-th intermediate layer of the task analysis network, compute the sum of the feature similarity and the model loss for each intermediate layer of the decoding network, and determine the intermediate layer corresponding to the minimum sum as the i-th intermediate layer.
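• A hedged sketch of how these selection strategies could be implemented as an exhaustive search over candidate layer pairs; `evaluate_model_loss` and `feature_distance` are hypothetical callbacks that the training code would have to supply, and combining them by simple addition is an assumption rather than something fixed by the text:
```python
def select_layer_pair(decoder_layers, task_layers, evaluate_model_loss, feature_distance):
    """Try every (decoder layer, task layer) pair and keep the pair with the
    smallest combined score of feature distance and model loss."""
    best_pair, best_score = None, float("inf")
    for i, dec_layer in enumerate(decoder_layers):
        for j, task_layer in enumerate(task_layers):
            score = feature_distance(dec_layer, task_layer) + evaluate_model_loss(i, j)
            if score < best_score:
                best_pair, best_score = (i, j), score
    return best_pair
```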
  • the process of determining the i-th intermediate layer and the j-th intermediate layer is performed during the network construction process.
  • S702 includes the following S702-A1 and S702-A2:
  • a feature adapter is set between the i-th intermediate layer of the decoding network and the j-th intermediate layer of the task analysis network.
  • the size of the feature information input by the input end of the jth intermediate layer can be set in advance.
  • the feature adapter may be a neural network unit, such as including a pooling layer or a convolutional layer, and this type of feature adapter is called a neural network-based feature adapter.
  • the feature adapter may be an algorithm unit, which is used to perform one or several kinds of calculations to realize the conversion of feature information size, and this type of feature adapter is called a non-neural network-based feature adapter.
• The size of the feature information includes the spatial size of the feature information and/or the number of channels of the feature information.
  • the above-mentioned feature adapter is used to adapt the number of channels. That is, in the above S702-A1, inputting the first feature information into the feature adapter for feature adaptation includes the following situations:
  • ways to reduce the number of channels of the first feature information to the same number of input channels as the j-th intermediate layer include but are not limited to the following:
• Method 1: if the feature adapter is a non-neural-network-based feature adapter, input the first feature information into the feature adapter, so that the feature adapter uses principal component analysis (PCA) or random selection to select, from the channels of the first feature information, a number of channels equal to the number of input channels of the j-th intermediate layer.
• For example, if the number of channels of the first feature information is 64 and the number of input channels of the j-th intermediate layer is 32, then 32 channels can be randomly selected from the 64 channels of the first feature information and input into the j-th intermediate layer.
• Alternatively, the first feature information is input into the feature adapter, so that the feature adapter selects, by means of PCA, a number of principal feature channels equal to the number of input channels of the j-th intermediate layer from the channels of the first feature information.
  • PCA is a common data analysis method, which is often used for dimensionality reduction of high-dimensional data and can be used to extract the main feature components of data.
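• A minimal sketch of this non-neural-network adapter, assuming C x H x W numpy arrays; treating each pixel as an observation with C features for the PCA option is one plausible reading of the description, not the only one:
```python
import numpy as np

def reduce_channels(feat, target_channels, mode="random", rng=None):
    """feat: C x H x W array; returns target_channels x H x W.
    'random' keeps a random channel subset; 'pca' projects the C channels
    onto their top principal components."""
    c, h, w = feat.shape
    if mode == "random":
        rng = rng or np.random.default_rng(0)
        keep = rng.choice(c, size=target_channels, replace=False)
        return feat[keep]
    # PCA over the channel dimension: each pixel is an observation of length C.
    x = feat.reshape(c, -1).T                      # (H*W, C)
    x_centered = x - x.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(x_centered, full_matrices=False)
    projected = x_centered @ vt[:target_channels].T  # (H*W, target_channels)
    return projected.T.reshape(target_channels, h, w)
```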
• Method 2: if the feature adapter is a neural-network-based feature adapter, input the first feature information into the feature adapter, and reduce the number of channels of the first feature information, through at least one convolutional layer in the feature adapter, to the same number as the input channels of the j-th intermediate layer.
  • the number of channels of the first feature information can be reduced by reducing the number of convolution layers in the feature adapter and/or reducing the number of convolution kernels.
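• One simple way to realize "at least one convolutional layer" for channel adaptation is a single 1x1 convolution; this PyTorch sketch is only an illustration of the idea, not the adapter structure prescribed by the embodiment:
```python
import torch.nn as nn

class ChannelAdapter(nn.Module):
    """Neural-network-based feature adapter: a 1x1 convolution maps the
    decoder-layer channels to the channel count expected by the task layer."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):          # x: (N, in_channels, H, W)
        return self.proj(x)        # (N, out_channels, H, W)

# e.g. adapt 64 decoder channels to a task layer expecting 32 channels
adapter = ChannelAdapter(64, 32)
```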
  • the ways to increase the number of channels of the first feature information to be the same as the number of input channels of the j-th intermediate layer include but are not limited to the following:
• Method 1: if the number of input channels of the j-th intermediate layer is an integer multiple of the number of channels of the first feature information, copy the channels of the first feature information by that integer multiple, so that the number of channels of the copied first feature information is the same as the number of input channels of the j-th intermediate layer.
  • the number of channels of the first feature information is 32, and the number of input channels of the j-th intermediate layer is 64, then the 32 channels of the first feature information are copied to obtain feature information of 64 channels.
• Method 2: if the number of input channels of the j-th intermediate layer is not an integer multiple of the number of channels of the first feature information, replicate the channels of the first feature information N times, select M channels from the channels of the first feature information, copy the M channels, and merge them with the N replicated copies of the first feature information, so that the number of channels of the merged first feature information is the same as the number of input channels of the j-th intermediate layer, where N is the quotient of dividing the number of input channels of the j-th intermediate layer by the number of channels of the first feature information, M is the remainder of that division, and N and M are both positive integers.
• For example, if the number of channels of the first feature information is 64 and the number of input channels of the j-th intermediate layer is 224, the quotient of dividing 224 by 64 is 3 and the remainder is 32, that is, N is 3 and M is 32. The original channels of the first feature information are replicated 3 times to obtain 192 channels; then 32 channels are selected from the original 64 channels of the first feature information, those 32 channels are copied, and the copied 32 channels are merged with the 192 channels obtained above, giving 224 channels, which are used as the channels of the merged first feature information.
• The above 32 channels may be selected from the original 64 channels of the first feature information at random, by PCA, or by other methods, which is not limited in this application.
• The above 32 channels may be merged with the 192 channels by placing them after the 192 channels, before the 192 channels, or interspersed among the 192 channels, which is not limited in this application.
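• A hedged sketch of the replication rule just described (N = quotient, M = remainder), matching the 64-to-224 example above; choosing the M extra channels at random is one of the options the text allows, PCA-based selection would fit equally well:
```python
import numpy as np

def expand_channels_by_copy(feat, target_channels, rng=None):
    """feat: C x H x W.  Replicates the original channels N times (N = target // C)
    and appends copies of M extra channels (M = target % C), e.g. 64 -> 224 gives
    N = 3, M = 32."""
    c = feat.shape[0]
    n, m = divmod(target_channels, c)
    parts = [feat] * n
    if m:
        rng = rng or np.random.default_rng(0)
        extra = rng.choice(c, size=m, replace=False)
        parts.append(feat[extra])
    out = np.concatenate(parts, axis=0)
    assert out.shape[0] == target_channels
    return out
```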
• Method 3: select P principal feature channels from the channels of the first feature information, copy the P principal feature channels, and merge them with the channels of the first feature information, so that the number of channels of the merged first feature information is the same as the number of input channels of the j-th intermediate layer, where P is the difference between the number of input channels of the j-th intermediate layer and the number of channels of the first feature information, and P is a positive integer.
• For example, if the number of channels of the first feature information is 192 and the number of input channels of the j-th intermediate layer is 256, 64 channels are selected from the 192 channels of the first feature information, copied, and merged with the original 192 channels of the first feature information to obtain 256 channels.
• The above 64 channels may be selected from the original 192 channels of the first feature information at random, by PCA, or by other methods, which is not limited in this application.
• In other embodiments, the first feature information can be input into the feature adapter, and the number of channels of the first feature information can be increased, through at least one convolutional layer in the feature adapter, to the same number as the input channels of the j-th intermediate layer.
  • the number of channels of the first feature information may be increased by increasing the number of convolution layers in the feature adapter and/or increasing the number of convolution kernels.
• For example, if the size of the first feature information is smaller than the input size of the j-th intermediate layer, the size of the first feature information can be increased, through at least one convolutional layer in the feature adapter, to match the input size of the j-th intermediate layer.
  • the feature adapter is used for size adaptation. That is, the implementation of inputting the first feature information into the feature adapter in the above S702-A1 for feature adaptation includes the following situations:
  • the first feature information is down-sampled to the same size as the input of the j-th intermediate layer through the feature adapter.
  • the way of downsampling the first feature information to the same size as the input of the j-th intermediate layer through the feature adapter includes but is not limited to the following:
• Method 1: if the feature adapter is a non-neural-network-based feature adapter, the first feature information is downsampled through the feature adapter, so that the size of the downsampled first feature information is the same as the input size of the j-th intermediate layer.
• For example, when the size of the first feature information does not match the input size of the j-th intermediate layer, the feature dimensions can be matched by duplicating the number of channels and resampling the feature map as needed.
  • Method 2 if the feature adapter is a neural network-based feature adapter, at least one pooling layer in the feature adapter is used to downsample the size of the first feature information to be the same as the input size of the jth intermediate layer.
  • the aforementioned pooling layer may be a maximum pooling layer, an average pooling layer, an overlapping pooling layer, and the like.
  • the feature adapter is used to upsample the first feature information to be the same as the input size of the j-th intermediate layer.
  • the way of upsampling the first feature information to the same input size as the j-th intermediate layer through the feature adapter includes but not limited to the following:
  • Method 1 if the feature adapter is a non-neural network-based feature adapter, then use the feature adapter to upsample the first feature information so that the size of the upsampled first feature information is the same as the input size of the jth intermediate layer .
• For example, if the size of the first feature information is smaller than the input size of the j-th intermediate layer, upsampling can be used to match the feature dimensions.
• Method 2: if the feature adapter is a neural-network-based feature adapter, at least one unpooling (up-pooling) layer in the feature adapter is used to upsample the size of the first feature information to the same size as the input of the j-th intermediate layer.
• In this case the feature adapter can be understood as an upsampling unit; for example, the feature adapter can include a bilinear interpolation layer and/or a deconvolution layer and/or an unpooling layer and/or an up-pooling layer, etc.
  • the feature adapter upsamples the first feature information, so that the size of the upsampled first feature information is the same as the input size of the jth intermediate layer.
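• A small PyTorch sketch of the spatial-size adaptation described above; adaptive max pooling for shrinking and bilinear interpolation for enlarging are just two of the listed options (pooling, deconvolution, unpooling and interpolation layers would all fit), so this is an illustrative assumption rather than the prescribed adapter:
```python
import torch
import torch.nn.functional as F

def adapt_spatial_size(x, target_hw):
    """x: (N, C, H, W) tensor; resamples to target_hw = (H', W')."""
    h, w = x.shape[-2:]
    th, tw = target_hw
    if th <= h and tw <= w:
        # Downsampling branch: adaptive max pooling to the target size.
        return F.adaptive_max_pool2d(x, output_size=(th, tw))
    # Upsampling branch: bilinear interpolation to the target size.
    return F.interpolate(x, size=(th, tw), mode="bilinear", align_corners=False)

x = torch.randn(1, 32, 64, 64)
print(adapt_spatial_size(x, (32, 32)).shape)    # torch.Size([1, 32, 32, 32])
print(adapt_spatial_size(x, (128, 128)).shape)  # torch.Size([1, 32, 128, 128])
```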
• The input end of the feature adapter is connected to the output end of the i-th intermediate layer of the decoding network, and the output end of the feature adapter is connected to the input end of the j-th intermediate layer of the task analysis network; the feature adapter converts the size of the first feature information output by the i-th intermediate layer to match the input size of the j-th intermediate layer.
• For example, when the intermediate layer corresponding to node d7 is used as the i-th intermediate layer and the intermediate layer corresponding to node F9 is used as the j-th intermediate layer, a feature adapter is further connected between node d7 and node F9; the feature adapter converts the first feature information output by the intermediate layer corresponding to d7 into second feature information and inputs it into the intermediate layer corresponding to node F9.
• For example, when the intermediate layer corresponding to node d5 is used as the i-th intermediate layer and the intermediate layer corresponding to node F5 is used as the j-th intermediate layer, a feature adapter is further connected between node d5 and node F5; the feature adapter converts the first feature information output by the intermediate layer corresponding to d5 into second feature information and inputs it into the intermediate layer corresponding to node F5.
• For example, when the intermediate layer corresponding to node d2 is used as the i-th intermediate layer and the intermediate layer corresponding to node F1 is used as the j-th intermediate layer, a feature adapter is further connected between node d2 and node F1; the feature adapter converts the first feature information output by the intermediate layer corresponding to d2 into second feature information and inputs it into the intermediate layer corresponding to node F1.
• In the embodiments of the present application, the first feature information output by the i-th intermediate layer of the decoding network is obtained by inputting the feature code stream of the current image into the decoding network, where i is a positive integer; the first feature information is input into the j-th intermediate layer of the task analysis network to obtain the task analysis result output by the task analysis network, where j is a positive integer.
• In this way, the feature information output by an intermediate layer of the decoding network is input into the task analysis network, so that the task analysis network performs task analysis based on the feature information output by the decoding network, which saves the time and computing resources occupied by task analysis and thereby improves the efficiency of task analysis.
  • Fig. 11 is a schematic flow chart of a video decoding method provided by an embodiment of the present application. As shown in Fig. 11, the method of the embodiment of the present application includes:
  • the first feature information output by the i-th intermediate layer of the decoding network can be input into the j-th intermediate layer of the task analysis network, so that The task analysis network performs task analysis based on the first feature information, and outputs task analysis results.
  • the decoding network continues to perform subsequent feature recovery to realize the reconstruction of the current image, and output the reconstructed image of the current image, which can meet the task analysis and image display scenarios.
  • this embodiment of the present application may further include the following steps of S802 and S803.
  • Figure 12 is a schematic structural diagram of a decoding network and a task analysis network involved in an embodiment of the present application. As shown in Figure 12, the i-th intermediate layer of the decoding network is connected to the j-th intermediate layer of the task analysis network, and the decoding network's The output is connected to the input of the task analysis network.
  • the first feature information output by the i-th intermediate layer of the decoding network and the reconstructed image of the current image finally output by the decoding network are obtained.
• The task analysis network performs task analysis based on the third feature information and the first feature information. Since the third feature information is obtained from the reconstructed image, it can reflect the features of the reconstructed image; in this way, performing task analysis based on both the first feature information and the third feature information can improve the accuracy of the task analysis.
  • the above-mentioned S803 input the third feature information and the first feature information into the jth intermediate layer, and obtain the task analysis result output by the task analysis network, including but not limited to the following:
• Method 1: combine the third feature information with the first feature information, input the combined feature information into the j-th intermediate layer, and obtain the task analysis result output by the task analysis network.
  • the above combination may be operations such as cascading of different weights, fusion of different weights, or weighted average.
  • the aforementioned feature converter may be used to convert the third feature information and the first feature information to be of the same size and then concatenate them.
• In some embodiments, the above feature converter can be used to convert the size of the concatenated feature information to be consistent with the input size of the j-th intermediate layer before it is input into the j-th intermediate layer.
  • the size of the third feature information and/or the first feature information may be converted first, so that the size of the converted first feature information and/or the third feature information after concatenation is the same as The input size of the jth intermediate layer is the same.
• Method 2: add the third feature information to the first feature information, input the added feature information into the j-th intermediate layer, and obtain the task analysis result output by the task analysis network.
• Method 3: multiply the third feature information by the first feature information, input the multiplied feature information into the j-th intermediate layer, and obtain the task analysis result output by the task analysis network.
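• A minimal sketch of the three combination options, assuming the two feature tensors have already been brought to matching shapes by the feature converter; the weighting scheme in the concatenation branch is only illustrative:
```python
import torch

def combine_features(third_feat, first_feat, mode="concat", w=0.5):
    """third_feat, first_feat: (N, C, H, W) tensors of matching size."""
    if mode == "concat":            # Method 1: concatenation / weighted fusion
        return torch.cat([w * third_feat, (1.0 - w) * first_feat], dim=1)
    if mode == "add":               # Method 2: element-wise addition
        return third_feat + first_feat
    return third_feat * first_feat  # Method 3: element-wise multiplication
```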
  • the decoding network and the task analysis network are trained end-to-end together.
  • the decoding network and the encoding network are trained end-to-end together.
  • the encoding network, decoding network and task analysis network are trained end-to-end together.
• The target loss of the encoding network, decoding network, and task analysis network during training is determined according to at least one of the bit rate of the feature information code stream output by the encoding network, the bit rate of the decoding point probability distribution code stream, and the task analysis result loss of the task analysis network.
  • the target loss is the sum of the task analysis result loss, the bit rate of the feature information code stream and the bit rate of the decoding point probability distribution code stream.
• In some embodiments, the target loss is the sum of the product of a preset parameter and the task analysis result loss, the bit rate of the feature information code stream, and the bit rate of the decoding point probability distribution code stream.
• For example, the target loss when the encoding network, decoding network, and task analysis network are trained end-to-end together can be determined by the following formula (1):

loss = λ · loss_task + R_feature + R_side        (1)

• where λ is the preset parameter, R_feature is the bit rate of the feature information code stream, R_side is the bit rate of the decoding point probability distribution code stream, and loss_task is the task analysis result loss of the task analysis network, for example the loss between the task analysis result predicted by the task analysis network and the ground-truth task analysis result.
• The preset parameter is related to the network model of at least one of the decoding network and the task analysis network; for example, different preset parameters λ correspond to different models, that is, to different total bit rates, where the total bit rate is the sum of the bit rate of the feature code stream and the bit rate of the side information.
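• A hedged sketch of formula (1) as a training objective; estimating the two bit-rate terms from entropy-model likelihoods (in bits per pixel) is a common convention assumed here, not something the text specifies:
```python
import torch

def target_loss(task_loss, y_likelihoods, z_likelihoods, lam, num_pixels):
    """loss = lam * task_loss + R_feature + R_side, with the rates estimated from
    the likelihoods of the quantized features (y) and of the decoding-point
    probability distribution (z)."""
    rate_feature = -torch.log2(y_likelihoods).sum() / num_pixels
    rate_side = -torch.log2(z_likelihoods).sum() / num_pixels
    return lam * task_loss + rate_feature + rate_side
```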
• In the embodiments of the present application, the first feature information output by the i-th intermediate layer of the decoding network and the reconstructed image of the current image output by the decoding network are obtained; the reconstructed image is input into the task analysis network to obtain the third feature information output by the (j-1)-th layer of the task analysis network; and the third feature information and the first feature information are input into the j-th intermediate layer to obtain the task analysis result output by the task analysis network, which improves the accuracy of task analysis.
  • the video decoding method of the application is introduced above, and the video coding method involved in the embodiment of the application is introduced below in combination with the embodiments.
  • FIG. 13 is a schematic flowchart of a video encoding method provided by an embodiment of the present application, and the execution body of the embodiment of the present application may be the encoder shown in FIG. 1 .
  • the method of the embodiment of the present application includes:
  • the encoding network and the decoding network are trained end-to-end together, wherein the first feature information output by the ith intermediate layer of the decoding network is input into the jth intermediate layer of the task analysis network.
  • the current image in this application can be understood as a frame of image to be encoded or a part of the frame of image in the video stream; or, the current image can be understood as a single image to be encoded or a part of the image to be encoded.
  • the encoding network includes a first encoding subnetwork and an encoding unit.
  • the above S902 includes:
  • the above encoding unit may be an entropy encoding unit, configured to perform entropy encoding on the initial feature information to obtain the feature code stream of the current image.
  • the encoding unit is a neural network.
  • the encoding network further includes a quantization unit.
  • the above S902-A2 includes:
  • the embodiment of the present application does not limit the quantization step size.
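• A minimal sketch of uniform quantization with a configurable step size; replacing rounding by additive uniform noise during training, so that gradients can flow, is a common practice assumed here rather than something stated in the embodiment:
```python
import torch

def quantize(y, step=1.0, training=False):
    """Uniform quantization of the initial feature information with step size `step`."""
    if training:
        noise = torch.empty_like(y).uniform_(-0.5, 0.5) * step
        return y + noise
    return torch.round(y / step) * step
```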
  • the encoding network further includes a second encoding subnetwork.
• In some embodiments, the method of the embodiment of the present application further includes: inputting the initial feature information into the second encoding sub-network to estimate the probability distribution of decoding points, and obtaining the decoding point probability distribution code stream of the current image output by the second encoding sub-network.
  • the above-mentioned second encoding sub-network is a super prior network.
  • the coding network is the coding part of the above-mentioned Cheng2020 coding and decoding network, and its network structure is shown in FIG. 14D.
• The second encoding sub-network performs probability distribution estimation on the feature information to obtain the probability distribution of the occurrence of decoding points, which is quantized and entropy-encoded to generate the decoding point probability distribution code stream.
  • the attention module (attention module) in FIG. 14D is replaced by a simplified attention module (simplified attention module), and its structure is shown in FIG. 14E , wherein RB (Residual block) represents a residual block.
  • the attention module is usually used to improve the performance of image compression, but the commonly used attention module is very time-consuming during training, so the general attention module is simplified by removing non-local blocks to reduce the complexity of training.
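• A structural sketch consistent with this description: only residual blocks (RB) remain and the non-local block is removed, with a mask branch gating a trunk branch. The exact layer counts and the mask-branch layout are assumptions and not taken from Fig. 14E:
```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class SimplifiedAttention(nn.Module):
    """Simplified attention: residual-block trunk, sigmoid mask branch, no non-local block."""
    def __init__(self, channels, num_rb=3):
        super().__init__()
        self.trunk = nn.Sequential(*[ResidualBlock(channels) for _ in range(num_rb)])
        self.mask = nn.Sequential(*[ResidualBlock(channels) for _ in range(num_rb)],
                                  nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, x):
        return x + self.trunk(x) * self.mask(x)
```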
  • the encoding network, the decoding network and the task analysis network perform end-to-end training together.
  • the target loss of the encoding network, decoding network and task analysis network during training is based on the bit rate of the feature information code stream output by the encoding network, the bit rate of the decoding point probability distribution code stream and the task of the task analysis network At least one of the analysis result losses is determined.
• In some embodiments, the target loss is the sum of the product of the preset parameter and the task analysis result loss, the bit rate of the feature information code stream, and the bit rate of the decoding point probability distribution code stream.
  • the preset parameters are related to the network model of at least one of the decoding network and the task analysis network.
  • the encoding network and the decoding network in the embodiments of the present application are end-to-end encoding and decoding networks.
  • end-to-end codec networks that may be involved in the embodiments of the present application are introduced below.
  • Figure 15 is a schematic diagram of a general end-to-end codec network, where ga can be understood as the first encoding subnetwork, ha is the second encoding subnetwork, gs is the first decoding subnetwork, and hs is the second decoding subnetwork.
• The first encoding subnetwork ga is also called the main encoding network or main encoder, the first decoding subnetwork gs is called the main decoding network or main decoder, and the second encoding subnetwork ha and the second decoding subnetwork hs are called the super prior (hyperprior) networks.
• The compression flow of the general-purpose end-to-end codec network is as follows: the input original picture is passed through the first encoding sub-network ga to obtain the feature information y, and the feature information y passes through the quantizer Q to obtain the quantized feature information ŷ. The second encoding sub-network ha (i.e. the super prior network) estimates, from ŷ, the probability distribution z of the decoding points; z is quantized to ẑ and compressed to form the decoding point probability distribution code stream.
• The latent representation ŷ is modeled by a Gaussian with mean 0 and variance σ.
• The decoding point probability distribution code stream is input to the decoding end, which decodes it to obtain the quantized decoding point probability distribution ẑ; ẑ is input into the second decoding sub-network hs (i.e. the super prior network at the decoding end) to obtain the modeling distribution of the feature information.
• Combining this modeling distribution, the decoding unit decodes the feature code stream to obtain the feature information ŷ of the current image.
• The feature information ŷ of the current image is input into the first decoding subnetwork gs to obtain the reconstructed image; IGDN (inverse generalized divisive normalization) denotes the inverse generalized divisive normalization layer.
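• A minimal structural sketch of this flow, assuming PyTorch-style modules; the four sub-networks are passed in as placeholders (their internals depend on the specific codec, e.g. Cheng2020), and entropy coding of ŷ and ẑ is only indicated in comments rather than implemented:
```python
import torch
import torch.nn as nn

class EndToEndCodec(nn.Module):
    """Skeleton of the hyperprior-style flow described above."""
    def __init__(self, g_a, h_a, g_s, h_s):
        super().__init__()
        self.g_a, self.h_a, self.g_s, self.h_s = g_a, h_a, g_s, h_s

    def forward(self, x):
        y = self.g_a(x)            # main encoder g_a: image -> feature information y
        y_hat = torch.round(y)     # quantizer Q
        z = self.h_a(y_hat)        # hyper encoder h_a: decoding-point probability distribution
        z_hat = torch.round(z)
        # z_hat would be entropy-coded into the decoding-point probability distribution
        # code stream; y_hat would be entropy-coded into the feature code stream using
        # the zero-mean Gaussian scales predicted by h_s below.
        scales = self.h_s(z_hat)   # hyper decoder h_s: modeling distribution of y_hat
        x_hat = self.g_s(y_hat)    # main decoder g_s: feature -> reconstructed image
        return x_hat, y_hat, z_hat, scales
```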
  • the end-to-end codec network in the embodiment of the present application is the network shown in FIG. 5C above.
  • the end-to-end codec network is also called Lee codec network.
  • the Lee codec network uses the method of transfer learning to improve the quality of the network reconstructed image. By utilizing the intrinsic transferability between different tasks, the Lee codec network adds a quality enhancement module to the framework of the basic codec network, such as GRDN (Grouped Residual Dense Network, Grouped Residual Dense Network).
• The compression process of the Lee encoding and decoding network is as follows: the image x is input into the first encoding sub-network ga (namely the main encoding network, or transform analysis network) to obtain the latent representation y; y is quantized to ŷ, and ŷ is encoded to obtain the feature code stream. ŷ is input into the second encoding sub-network ha (that is, the super prior model), which further represents the spatial relationship of ŷ as z, where z describes the probability distribution of the occurrence of decoding points. Next, z is quantized and input into the entropy encoder (EC) for encoding, forming the code stream of the decoding point probability distribution.
  • the code stream with a probability distribution of decoding points is also called a parameter code stream.
• At the decoding end, the super prior parameters c′ i are reconstructed from the parameter code stream, and model parameters such as the global context parameters c″ i and c‴ i are obtained from the feature code stream; the first decoding sub-network then reconstructs the image based on the decoded feature information y.
  • the end-to-end codec network in the embodiment of the present application is the network shown in FIG. 5D above.
  • the end-to-end codec network is also called a Hu codec network.
  • the Hu encoder-decoder network successfully constructs compact and expressive representations at low bitrates to support a diverse set of machine vision tasks including high-level semantic-related tasks and intermediate-level geometric parsing tasks.
  • the Hu codec network enhances low-level visual features by using high-level semantic maps, and verifies that this method can effectively improve the rate-accuracy-distortion performance of image compression.
• The compression process of the Hu codec network is as follows: depth features h i are first extracted from the image, and the features h i are transformed into discrete values that are convenient for encoding and probability estimation. Since the feature distribution is unknown, a hidden variable z with a Gaussian model is introduced to estimate the feature distribution and simplify calculation; however, estimating the marginal probability distribution p z is very difficult, so a Hyper Analysis Transform module is used to establish a hyper prior v for z. The hyper prior v is input into the arithmetic codec, where a parameterized distribution model q v is used to approximate the probability distribution p v, and the estimated parameters of q v are used when decoding the output.
• The codebook {C 1 , C 2 , ...} and the coefficient sequence A l are used to jointly generate a super prior Z with spatial information.
• The arithmetic codec is used to estimate the mean and variance of the super prior Z, thereby reconstructing the feature h′ i, where the reconstructed features include a feature output that considers the spatial dimension and a feature output that does not consider the spatial dimension, which are used to perform intelligent tasks and to analyze statistical features of the image, respectively.
  • the end-to-end codec network in the embodiment of the present application is the network shown in FIG. 5B above.
  • the end-to-end codec network is also called Cheng2020 codec network.
  • the compression process of the Cheng2020 codec network is consistent with the compression process of the general end-to-end codec network shown in Figure 15. The difference is that it does not use a Gaussian model, but a discrete Gaussian mixture likelihood. For the specific compression process, refer to the above part of Figure 15. description and will not be repeated here.
  • the end-to-end codec network in the embodiment of the present application may also be another end-to-end codec network.
  • the encoding network and the decoding end are separate neural networks, not an end-to-end neural network.
  • FIGS. 7 to 15 are only examples of the present application, and should not be construed as limiting the present application.
• The sequence numbers of the above-mentioned processes do not imply an order of execution; the order of execution of the processes should be determined by their functions and internal logic, and shall not constitute any limitation on the implementation of the embodiments of the present application.
  • the term "and/or" is only an association relationship describing associated objects, indicating that there may be three relationships. Specifically, A and/or B may mean: A exists alone, A and B exist simultaneously, and B exists alone.
  • the character "/" in this article generally indicates that the contextual objects are an "or" relationship.
  • Fig. 16 is a schematic block diagram of a video decoder provided by an embodiment of the present application.
  • the video decoder 10 includes:
  • the decoding unit 11 is configured to input the feature code stream of the current image into the decoding network to obtain the first feature information output by the i-th intermediate layer of the decoding network, where i is a positive integer;
  • the task unit 12 is configured to input the first feature information into the jth intermediate layer of the task analysis network, and obtain a task analysis result output by the task analysis network, where j is a positive integer.
• In some embodiments, the task unit 12 is specifically configured to input the first feature information into the feature adapter for feature adaptation to obtain second feature information, where the size of the second feature information is the same as the preset input size of the j-th intermediate layer, and to input the second feature information into the j-th intermediate layer to obtain the task analysis result output by the task analysis network.
  • the size of the feature information includes the size and/or number of channels of the feature information.
  • the task unit 12 is specifically configured to: if the number of channels of the first feature information is greater than the number of input channels of the j-th intermediate layer, then Reduce the number of channels of the first feature information to be the same as the number of input channels of the jth intermediate layer through the feature adapter; if the number of channels of the first feature information is smaller than the number of channels of the jth intermediate layer The number of input channels of the first feature information is increased to be the same as the number of input channels of the j-th intermediate layer through the feature adapter.
  • the task unit 12 is specifically configured to input the first feature information into the feature adapter if the feature adapter is a non-neural network-based feature adapter, so that the feature adapter adopts principal component analysis PCA mode or random selection mode, select the number of channels of the input channel of the j-th intermediate layer from the channel of the first feature information; if the feature adapter is a feature adapter based on a neural network, the The first feature information is input into the feature adapter, and the number of channels of the first feature information is reduced to be the same as the number of input channels of the jth intermediate layer through at least one convolutional layer in the feature adapter. .
• In some embodiments, the task unit 12 is specifically configured to: if the number of input channels of the j-th intermediate layer is an integer multiple of the number of channels of the first feature information, copy the channels of the first feature information by that integer multiple, so that the number of channels of the copied first feature information is the same as the number of input channels of the j-th intermediate layer; or, if the number of input channels of the j-th intermediate layer is not an integer multiple of the number of channels of the first feature information, replicate the channels of the first feature information N times, select M channels from the channels of the first feature information, copy the M channels, and merge them with the N replicated copies of the first feature information, so that the number of channels of the merged first feature information is the same as the number of input channels of the j-th intermediate layer, where N is the quotient of dividing the number of input channels of the j-th intermediate layer by the number of channels of the first feature information, M is the remainder of that division, and N and M are both positive integers; or, select P principal feature channels from the channels of the first feature information, copy the P principal feature channels, and merge them with the channels of the first feature information, so that the number of channels of the merged first feature information is the same as the number of input channels of the j-th intermediate layer, where P is the difference between the number of input channels of the j-th intermediate layer and the number of channels of the first feature information, and P is a positive integer.
  • the task unit 12 is specifically configured to input the first feature information into the feature adapter, through at least one of the feature adapters A convolutional layer, increasing the number of channels of the first feature information to be the same as the number of input channels of the j-th intermediate layer.
• In some embodiments, the task unit 12 is specifically configured to input the first feature information into the feature adapter, so that the feature adapter uses principal component analysis (PCA) to select, from the channels of the first feature information, a number of principal feature channels equal to the number of input channels of the j-th intermediate layer.
  • the task unit 12 is specifically configured to, if the feature adapter is a non-neural network-based feature adapter, downsample the first feature information through the feature adapter, so that the downsampled The size of the first feature information is the same as the input size of the jth intermediate layer; if the feature adapter is a feature adapter based on a neural network, then through at least one pooling layer in the feature adapter, the The size of the first feature information is down-sampled to be the same as the input size of the jth intermediate layer.
  • the pooling layer is any one of a maximum pooling layer, an average pooling layer, and an overlapping pooling layer.
• In some embodiments, the task unit 12 is specifically configured to: if the feature adapter is a non-neural-network-based feature adapter, upsample the first feature information through the feature adapter, so that the size of the upsampled first feature information is the same as the input size of the j-th intermediate layer; if the feature adapter is a neural-network-based feature adapter, upsample the size of the first feature information, through at least one unpooling (up-pooling) layer in the feature adapter, to the same size as the input of the j-th intermediate layer.
  • the task unit 12 is specifically configured to: if the size of the first feature information is greater than the input size of the j-th intermediate layer, through the The feature adapter downsamples the first feature information to be the same as the input size of the jth intermediate layer; if the size of the first feature information is smaller than the input size of the jth intermediate layer, the The feature adapter upsamples the first feature information to the same size as the input of the jth intermediate layer.
  • the decoding unit 11 is further configured to input the feature code stream of the current image into the decoding network to obtain a reconstructed image of the current image output by the decoding network.
• In some embodiments, the task unit 12 is specifically configured to input the reconstructed image into the task analysis network to obtain the third feature information output by the (j-1)-th layer of the task analysis network, and to input the third feature information and the first feature information into the j-th intermediate layer to obtain the task analysis result output by the task analysis network.
  • the task unit 12 is specifically configured to combine the third feature information with the first feature information, input the combined feature information into the jth intermediate layer, and obtain the task Analyze the task analysis results output by the network.
• In some embodiments, the decoding network includes a decoding unit and a first decoding sub-network, and the decoding unit 11 is specifically configured to input the feature code stream of the current image into the decoding unit to obtain the initial feature information of the current image output by the decoding unit, and to input the initial feature information into the first decoding sub-network to obtain the first feature information output by the i-th intermediate layer of the first decoding sub-network.
  • the decoding network further includes an inverse quantization unit, and the decoding unit 11 is specifically configured to input the initial feature information into the inverse quantization unit to obtain dequantized feature information;
  • the dequantized feature information is input into the first decoding sub-network to obtain the first feature information output by the i-th intermediate layer of the first decoding sub-network.
  • the decoding network further includes a second decoding sub-network
  • the decoding unit 11 is further configured to input the decoding point probability distribution code stream of the current image into the second decoding sub-network to obtain the Probability distribution of the decoding points of the current image; input the feature code stream of the current image and the probability distribution of the decoding points of the current image into the decoding unit to obtain the initial features of the current image output by the decoding unit information.
  • the decoding network and the task analysis network perform end-to-end training together.
  • the decoding network and the encoding network perform end-to-end training together.
  • the encoding network, the decoding network and the task analysis network perform end-to-end training together.
  • the target loss of the encoding network, the decoding network, and the task analysis network during training is based on the bit rate of the feature information code stream output by the encoding network, and the probability distribution code stream of decoding points. At least one of a bit rate and a task analysis result loss of the task analysis network is determined.
  • the target loss is the product of the preset parameter and the loss of the task analysis result, the sum of the bit rate of the feature information code stream and the bit rate of the decoding point probability distribution code stream.
  • the preset parameters are related to a network model of at least one of the decoding network and the task analysis network.
  • the i-th intermediate layer and the j-th intermediate layer are the two intermediate layers with the highest feature similarity and/or the smallest model loss among the decoding network and the task analysis network.
  • the feature similarity between the i-th intermediate layer and the j-th intermediate layer includes at least one of the following: the feature map output by the i-th intermediate layer and the j-th intermediate The similarity between the feature maps input by layers, the similarity between the feature size output by the i-th intermediate layer and the feature size input by the j-th intermediate layer, the feature output by the i-th intermediate layer The similarity between the statistical histogram of the graph and the statistical histogram of the feature map input by the jth intermediate layer.
  • the device embodiment and the method embodiment may correspond to each other, and similar descriptions may refer to the method embodiment. To avoid repetition, details are not repeated here.
  • the video decoder 10 shown in FIG. 16 may correspond to a corresponding body that executes the decoding method of the embodiment of the present application, and the aforementioned and other operations and/or functions of each unit in the video decoder 10 are for realizing the decoding method For the sake of brevity, the corresponding processes in each method are not repeated here.
  • Fig. 17 is a schematic block diagram of a video encoder provided by an embodiment of the present application.
  • the video encoder 20 includes:
• The encoding unit 22 is configured to input the current image into the encoding network to obtain the feature code stream output by the encoding network, wherein, during model training, the encoding network and the decoding network are trained end-to-end together, and the first feature information output by the i-th intermediate layer of the decoding network is input into the j-th intermediate layer of the task analysis network.
• In some embodiments, the encoding network includes a first encoding sub-network and an encoding unit, and the encoding unit 22 is specifically configured to input the current image into the first encoding sub-network to obtain the initial feature information of the current image, and to input the initial feature information into the encoding unit to obtain the feature code stream output by the encoding unit.
• In some embodiments, the encoding network further includes a quantization unit, and the encoding unit 22 is specifically configured to input the initial feature information into the quantization unit for quantization to obtain quantized feature information, and to input the quantized feature information into the encoding unit to obtain the feature code stream output by the encoding unit.
  • the encoding network further includes a second encoding subnetwork
  • the encoding unit 22 is also configured to input the initial feature information into the second encoding subnetwork to estimate the probability distribution of decoding points, and obtain the The decoding point probability distribution code stream of the current image output by the second encoding subnetwork.
  • the encoding network, decoding network and task analysis network are trained end-to-end together.
  • the target loss of the encoding network, the decoding network, and the task analysis network during training is based on the bit rate of the feature information code stream output by the encoding network, and the probability distribution code stream of decoding points. At least one of a bit rate and a task analysis result loss of the task analysis network is determined.
  • the target loss is the product of the preset parameter and the loss of the task analysis result, the sum of the bit rate of the feature information code stream and the bit rate of the decoding point probability distribution code stream.
  • the preset parameters are related to a network model of at least one of the decoding network and the task analysis network.
  • the device embodiment and the method embodiment may correspond to each other, and similar descriptions may refer to the method embodiment. To avoid repetition, details are not repeated here.
  • the video encoder 20 shown in FIG. 17 may correspond to a corresponding body that implements the encoding method of the embodiment of the present application, and the aforementioned and other operations and/or functions of each unit in the video encoder 20 are for realizing the encoding method For the sake of brevity, the corresponding processes in each method are not repeated here.
  • the functional unit may be implemented in the form of hardware, may also be implemented by instructions in the form of software, and may also be implemented by a combination of hardware and software units.
• Each step of the method embodiments in the embodiments of the present application can be completed by an integrated logic circuit of hardware in the processor and/or by instructions in the form of software; the steps of the methods disclosed in the embodiments of the present application may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software units in the decoding processor.
  • the software unit may be located in a mature storage medium in the field such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, and registers.
  • the storage medium is located in the memory, and the processor reads the information in the memory, and completes the steps in the above method embodiments in combination with its hardware.
  • Fig. 18 is a schematic block diagram of an electronic device provided by an embodiment of the present application.
  • the electronic device 30 may be the video encoder or video decoder described in the embodiment of the present application, and the electronic device 30 may include:
• a memory 33 and a processor 32, where the memory 33 is used to store a computer program 34 and to transmit the program code 34 to the processor 32.
  • the processor 32 can call and run the computer program 34 from the memory 33 to implement the method in the embodiment of the present application.
  • the processor 32 can be used to execute the steps in the above method according to the instructions in the computer program 34 .
  • the processor 32 may include, but is not limited to:
• a Digital Signal Processor (DSP)
• an Application Specific Integrated Circuit (ASIC)
• a Field Programmable Gate Array (FPGA)
  • the memory 33 includes but is not limited to:
  • non-volatile memory can be read-only memory (Read-Only Memory, ROM), programmable read-only memory (Programmable ROM, PROM), erasable programmable read-only memory (Erasable PROM, EPROM), electronically programmable Erase Programmable Read-Only Memory (Electrically EPROM, EEPROM) or Flash.
  • the volatile memory can be Random Access Memory (RAM), which acts as external cache memory.
• By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), Synchlink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DR RAM).
  • the computer program 34 can be divided into one or more units, and the one or more units are stored in the memory 33 and executed by the processor 32 to complete the present application.
  • the one or more units may be a series of computer program instruction segments capable of accomplishing specific functions, and the instruction segments are used to describe the execution process of the computer program 34 in the electronic device 30 .
  • the electronic device 30 may also include:
• a transceiver 33, where the transceiver 33 can be connected to the processor 32 or the memory 33.
  • the processor 32 can control the transceiver 33 to communicate with other devices, specifically, can send information or data to other devices, or receive information or data sent by other devices.
  • Transceiver 33 may include a transmitter and a receiver.
  • the transceiver 33 may further include antennas, and the number of antennas may be one or more.
  • bus system includes not only a data bus, but also a power bus, a control bus and a status signal bus.
  • Fig. 19 is a schematic block diagram of a video codec system provided by an embodiment of the present application.
  • the video codec system 40 may include: a video encoder 41 and a video decoder 42, wherein the video encoder 41 is used to execute the video encoding method involved in the embodiment of the present application, and the video decoder 42 is used to execute The video decoding method involved in the embodiment of the present application.
  • the present application also provides a code stream, which is generated by the above encoding method.
  • the present application also provides a computer storage medium, on which a computer program is stored, and when the computer program is executed by a computer, the computer can execute the methods of the above method embodiments.
  • the embodiments of the present application further provide a computer program product including instructions, and when the instructions are executed by a computer, the computer executes the methods of the foregoing method embodiments.
  • the computer program product includes one or more computer instructions.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable device.
  • the computer instructions may be stored in or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transferred from a website, computer, server, or data center by wire (such as coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (such as infrared, wireless, microwave, etc.) to another website site, computer, server or data center.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center integrated with one or more available media.
  • the available medium may be a magnetic medium (such as a floppy disk, a hard disk, or a magnetic tape), an optical medium (such as a digital video disc (digital video disc, DVD)), or a semiconductor medium (such as a solid state disk (solid state disk, SSD)), etc.
  • the disclosed systems, devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
• The division of the units is only a logical function division; in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • a unit described as a separate component may or may not be physically separated, and a component displayed as a unit may or may not be a physical unit, that is, it may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.


Abstract

本申请提供一种视频编解码方法、编码器、解码器及存储介质,通过将当前图像的特征码流输入解码网络中,得到解码网络的第i个中间层输出的第一特征信息,i为正整数;将第一特征信息输入任务分析网络的第j个中间层中,得到任务分析网络输出的任务分析结果,j为正整数。本申请将解码网络中间层输出的特征信息输入任务分析网络中,使得任务分析网络基于解码网络输出的特征信息进行任务分析,节省了任务分析所占用的时间和计算资源,提高了任务分析的效率。

Description

视频编解码方法、编码器、解码器及存储介质 技术领域
本申请涉及视频编解码技术领域,尤其涉及一种视频编解码方法、编码器、解码器及存储介质。
背景技术
数字视频技术可以并入多种视频装置中,例如数字电视、智能手机、计算机、电子阅读器或视频播放器等。随着视频技术的发展,视频数据所包括的数据量较大,为了便于视频数据的传输,视频装置执行视频压缩技术,以使视频数据更加有效的传输或存储。
随着视觉分析技术的快速发展,将神经网络技术与图像视频压缩技术相结合,提出了面向机器视觉的视频编码框架。
但是,目前的基于神经网络的先压缩后分析的模型中,计算量大,耗时长。
发明内容
本申请实施例提供了一种视频编解码方法、编码器、解码器及存储介质,以节省任务分析的时间和计算量,进而提高任务分析的效率。
第一方面,本申请提供了一种视频编码方法,包括:
将当前图像的特征码流输入解码网络中,得到所述解码网络的第i个中间层输出的第一特征信息,所述i为正整数;
将所述第一特征信息输入任务分析网络的第j个中间层中,得到所述任务分析网络输出的任务分析结果,所述j为正整数。
第二方面,本申请实施例提供一种视频解码方法,包括:
获取待编码的当前图像;
将所述当前图像输入编码网络中,得到所述编码网络输出的特征码流;
其中,在模型训练时,所述编码网络和解码网络一起进行端到端训练,所述解码网络的第i个中间层输出的第一特征信息输入任务分析网络的第j个中间层中。
第三方面,本申请提供了一种视频编码器,用于执行上述第一方面或其各实现方式中的方法。具体地,该编码器包括用于执行上述第一方面或其各实现方式中的方法的功能单元。
第四方面,本申请提供了一种视频解码器,用于执行上述第二方面或其各实现方式中的方法。具体地,该解码器包括用于执行上述第二方面或其各实现方式中的方法的功能单元。
第五方面,提供了一种视频编码器,包括处理器和存储器。该存储器用于存储计算机程序,该处理器用于调用并运行该存储器中存储的计算机程序,以执行上述第一方面或其各实现方式中的方法。
第六方面,提供了一种视频解码器,包括处理器和存储器。该存储器用于存储计算机程序,该处理器用于调用并运行该存储器中存储的计算机程序,以执行上述第二方面或其各实现方式中的方法。
第七方面,提供了一种视频编解码系统,包括视频编码器和视频解码器。视频编码器用于执行上述第一方面或其各实现方式中的方法,视频解码器用于执行上述第二方面、或其各实现方式中的方法。
第八方面,提供了一种芯片,用于实现上述第一方面至第二方面中的任一方面或其各实现方式中的方法。具体地,该芯片包括:处理器,用于从存储器中调用并运行计算机程序,使得安装有该芯片的设备执行如上述第一方面至第二方面中的任一方面或其各实现方式中的方法。
第九方面,提供了一种计算机可读存储介质,用于存储计算机程序,该计算机程序使得计算机执行上述第一方面至第二方面中的任一方面或其各实现方式中的方法。
第十方面,提供了一种计算机程序产品,包括计算机程序指令,该计算机程序指令使得计算机执行上述第一方面至第二方面中的任一方面或其各实现方式中的方法。
第十一方面,提供了一种计算机程序,当其在计算机上运行时,使得计算机执行上述第一方面至第二方面中的任一方面或其各实现方式中的方法。
第十二方面,提供了一种码流,该码流通过上述第二方面所述的方法生成。
基于以上技术方案,通过将当前图像的特征码流输入解码网络中,得到解码网络的第i个中间层输出的第一特征信息,i为正整数;将第一特征信息输入任务分析网络的第j个中间层中,得到任务分析网络输出的任务分析结果,j为正整数。本申请将解码网络中间层输出的特征信息输入任务分析网络中,使得任务分析网络基于解码网络输出的特征信息进行任务分析,节省了任务分析所占用的时间和计算资源,进而提高任务分析的效率。
附图说明
图1为本申请实施例涉及的一种视频编解码系统的示意性框图;
图2为对图像先压缩后分析的流程示意图;
图3是本申请实施例涉及的视频编码器的示意性框图;
图4是本申请实施例涉及的视频解码器的示意性框图;
图5A为端到端编解码网络的流程示意图;
图5B为Cheng编解码网络的划分示意图;
图5C为Lee编解码网络的划分示意图;
图5D为Hu编解码网络的划分示意图;
图6A为任务分析网络的流程示意图;
图6B为目标识别网络的划分示意图;
图6C为目标检测网络的划分示意图;
图7为本申请实施例提供的视频解码方法的流程示意图;
图8A为本申请一实施例涉及的网络模型示意图;
图8B为本申请一实施例涉及的一解码网络的示意图;
图8C为本申请一实施例涉及的另一解码网络的示意图;
图8D为本申请一实施例涉及的另一解码网络的示意图;
图9A为本申请一实施例涉及的目标检测网络示意图;
图9B为图9A中的部分网络的示意图;
图9C为Cheng2020网络中部分组件的网络示意图;
图9D为端到端编解码网络和任务分析网络的网络示例图;
图9E为端到端编解码网络与任务分析网络的一种连接示意图;
图9F为端到端编解码网络与任务分析网络的另一种连接示意图;
图9G为端到端编解码网络与任务分析网络的另一种连接示意图;
图9H为本申请实施例涉及的另一种模型示意图;
图10A为端到端编解码网络与任务分析网络的一种连接示意图;
图10B为端到端编解码网络与任务分析网络的另一种连接示意图;
图10C为端到端编解码网络与任务分析网络的另一种连接示意图;
图11为本申请一实施例提供的视频解码方法流程示意图;
图12为本申请一实施例涉及的解码网络和任务分析网络的结构示意图;
图13为本申请一实施例提供的视频编码方法的流程示意图;
图14A为本申请涉及的编码网络的一种结构示意图;
图14B为本申请涉及的编码网络的另一种结构示意图;
图14C为本申请涉及的编码网络的另一种结构示意图;
图14D为本申请涉及的编码网络的一种模型示意图;
图14E为编码网络中的注意力模型的网络示意图;
图15为通用端到端编解码网络示意图;
图16是本申请实施例提供的视频解码器的示意性框图;
图17是本申请实施例提供的视频编码器的示意性框图;
图18是本申请实施例提供的电子设备的示意性框图;
图19是本申请实施例提供的视频编解码系统的示意性框图。
具体实施方式
本申请可应用于面向机器视觉以及人机混合视觉的各类视频编解码领域,将5G、AI、深度学习、特征提取与视频分析等技术与现有视频处理、编码技术相结合。5G时代催生出面向机器的海量应用,如车联网、无人驾驶、工业互联网、智慧与平安城市、可穿戴、视频监控等机器视觉内容,相比日趋饱和的面向人类视频,应用场景更为广泛,面向机器视觉的视频编码将成为5G和后5G时代的主要增量流量来源之一。
例如,本申请的方案可结合至音视频编码标准(audio video coding standard,简称AVS),例如,H.264/音视频编码(audio video coding,简称AVC)标准,H.265/高效视频编码(high efficiency video coding,简称HEVC)标准以及H.266/多功能视频编码(versatile video coding,简称VVC)标准。或者,本申请的方案可结合至其它专属或行业标准而操作,所述标准包含ITU-TH.261、ISO/IECMPEG-1Visual、ITU-TH.262或ISO/IECMPEG-2Visual、ITU-TH.263、ISO/IECMPEG-4Visual,ITU-TH.264(还称为ISO/IECMPEG-4AVC),包含可分级视频编解码(SVC)及多视图视频编解码(MVC)扩展。应理解,本申请的技术不限于任何特定编解码标准或技术。
图1为本申请实施例涉及的一种视频编解码系统的示意性框图。需要说明的是,图1只是一种示例,本申请实施例的视频编解码系统包括但不限于图1所示。如图1所示,该视频编解码系统100包含编码设备110和解码设备120。其中编码设备用于对视频数据进行编码(可以理解成压缩)产生码流,并将码流传输给解码设备。解码设备对编码设备编码产生的码流进行解码,得到解码后的视频数据。
本申请实施例的编码设备110可以理解为具有视频编码功能的设备,解码设备120可以理解为具有视频解码功能的设备,即本申请实施例对编码设备110和解码设备120包括更广泛的装置,例如包含智能手机、台式计算机、移动计算装置、笔记本(例如,膝上型)计算机、平板计算机、机顶盒、电视、相机、显示装置、数字媒体播放器、视频游戏控制台、车载计算机等。
在一些实施例中,编码设备110可以经由信道130将编码后的视频数据(如码流)传输给解码设备120。信道130可以包括能够将编码后的视频数据从编码设备110传输到解码设备120的一个或多个媒体和/或装置。
在一个实例中,信道130包括使编码设备110能够实时地将编码后的视频数据直接发射到解码设备120的一个或多个通信媒体。在此实例中,编码设备110可根据通信标准来调制编码后的视频数据,且将调制后的视频数据发射到解码设备120。其中通信媒体包含无线通信媒体,例如射频频谱,可选的,通信媒体还可以包含有线通信媒体,例如一根或多根物理传输线。
在另一实例中,信道130包括存储介质,该存储介质可以存储编码设备110编码后的视频数据。存储介质包含多 种本地存取式数据存储介质,例如光盘、DVD、快闪存储器等。在该实例中,解码设备120可从该存储介质中获取编码后的视频数据。
在另一实例中,信道130可包含存储服务器,该存储服务器可以存储编码设备110编码后的视频数据。在此实例中,解码设备120可以从该存储服务器中下载存储的编码后的视频数据。可选的,该存储服务器可以存储编码后的视频数据且可以将该编码后的视频数据发射到解码设备120,例如web服务器(例如,用于网站)、文件传送协议(FTP)服务器等。
一些实施例中,编码设备110包含视频编码器112及输出接口113。其中,输出接口113可以包含调制器/解调器(调制解调器)和/或发射器。
在一些实施例中,编码设备110除了包括视频编码器112和输入接口113外,还可以包括视频源111。
视频源111可包含视频采集装置(例如,视频相机)、视频存档、视频输入接口、计算机图形系统中的至少一个,其中,视频输入接口用于从视频内容提供者处接收视频数据,计算机图形系统用于产生视频数据。
视频编码器112对来自视频源111的视频数据进行编码,产生码流。视频数据可包括一个或多个图像(picture)或图像序列(sequence of pictures)。码流以比特流的形式包含了图像或图像序列的编码信息。编码信息可以包含编码图像数据及相关联数据。相关联数据可包含序列参数集(sequence parameter set,简称SPS)、图像参数集(picture parameter set,简称PPS)及其它语法结构。SPS可含有应用于一个或多个序列的参数。PPS可含有应用于一个或多个图像的参数。语法结构是指码流中以指定次序排列的零个或多个语法元素的集合。
视频编码器112经由输出接口113将编码后的视频数据直接传输到解码设备120。编码后的视频数据还可存储于存储介质或存储服务器上,以供解码设备120后续读取。
在一些实施例中,解码设备120包含输入接口121和视频解码器122。
在一些实施例中,解码设备120除包括输入接口121和视频解码器122外,还可以包括显示装置123。
其中,输入接口121包含接收器及/或调制解调器。输入接口121可通过信道130接收编码后的视频数据。
视频解码器122用于对编码后的视频数据进行解码,得到解码后的视频数据,并将解码后的视频数据传输至显示装置123。
显示装置123显示解码后的视频数据。显示装置123可与解码设备120整合或在解码设备120外部。显示装置123可包括多种显示装置,例如液晶显示器(LCD)、等离子体显示器、有机发光二极管(OLED)显示器或其它类型的显示装置。
此外,图1仅为实例,本申请实施例的技术方案不限于图1,例如本申请的技术还可以应用于单侧的视频编码或单侧的视频解码。
神经网络源于认知神经科学与数学的交叉研究,通过多层交替级联神经元与非线性激活函数构建的多层感知机(multi-layer perceptron,MLP)结构能够以足够小的误差实现对任意连续函数的逼近。神经网络的学习方法经历了由19世纪60年代提出感知机学习算法,到19世纪80年代通过链式法则和反向传播算法建立的MLP学习过程,再到如今被广泛使用的随机梯度下降方法。为了解决时域信号梯度计算复杂度过高以及信号依赖问题,提出长短期记忆(long short-term memory,LSTM)结构,通过循环网络结构控制梯度传递实现对序列信号的高效学习。通过对受限玻兹曼机(restricted Boltzmann machine,RBM)的每一层进行分层预训练,使得深层的神经网络训练变得可能。在解释了MLP结构具有更为优异的特征学习能力的同时,MLP在训练上的复杂度还可以通过逐层初始化和预训练来有效缓解。从此具有多隐含层的MLP结构研究再次成为热点,而神经网络也有了一个新的名称——深度学习(deep learning,DL)。
神经网络作为优化算法以及信号紧凑表征形式,可以与图像视频压缩相结合。
随着机器学习算法的发展和普及,基于深度学习的端到端图像/视频编解码器通过采用深度神经网络辅助编码工具,借助深度学习方法中的分层模型架构和大规模的数据先验信息,获得了比传统编解码器更优的性能。
通常用于图像压缩的编解码器和用于任务分析的智能任务网络是分开设计和优化的,流程如图2所示,在对压缩后的图像执行智能任务分析时,需要将不同的编解码器解码重建后的图像输入到智能任务网络中进行任务分析。
用于压缩的编解码框架方法通常可分为:传统混合编码框架;传统混合编码框架的改进(例如使用神经网络来替换传统框架中某几个模块);端到端编解码网络框架。这些压缩方法的输出端都为解码重建后的图像或视频。
在一些实施例中,用于压缩的编解码框架为传统混合编码框架时,可以采用如图3所示的视频编码器,和图4所示的视频解码器。
图3是本申请实施例涉及的视频编码器的示意性框图。应理解,该视频编码器200可用于对图像进行有损压缩(lossy compression),也可用于对图像进行无损压缩(lossless compression)。该无损压缩可以是视觉无损压缩(visually lossless compression),也可以是数学无损压缩(mathematically lossless compression)。
在一些实施例中,如图3所示,该视频编码器200可包括:预测单元210、残差单元220、变换/量化单元230、反变换/量化单元240、重建单元250、环路滤波单元260、解码图像缓存270和熵编码单元280。需要说明的是,视频编码器200可包含更多、更少或不同的功能组件。
可选的,在本申请中,当前块(current block)可以称为当前编码单元(CU)或当前预测单元(PU)等。预测块也可称为预测图像块或图像预测块,重建图像块也可称为重建块或图像重建图像块。
在一些实施例中,预测单元210包括帧间预测单元211和帧内估计单元212。由于视频的一个帧中的相邻像素之间存在很强的相关性,在视频编解码技术中使用帧内预测的方法消除相邻像素之间的空间冗余。由于视频中的相邻帧之间存在着很强的相似性,在视频编解码技术中使用帧间预测方法消除相邻帧之间的时间冗余,从而提高编码效率。
帧间预测单元211可用于帧间预测,帧间预测可以参考不同帧的图像信息,帧间预测使用运动信息从参考帧中找到参考块,根据参考块生成预测块,用于消除时间冗余;帧间预测所使用的帧可以为P帧和/或B帧,P帧指的是向前预测帧,B帧指的是双向预测帧。运动信息包括参考帧所在的参考帧列表,参考帧索引,以及运动矢量。运动矢量可以是整像素的或者是分像素的,如果运动矢量是分像素的,那么需要再参考帧中使用插值滤波做出所需的分像素的块, 这里把根据运动矢量找到的参考帧中的整像素或者分像素的块叫参考块。有的技术会直接把参考块作为预测块,有的技术会在参考块的基础上再处理生成预测块。在参考块的基础上再处理生成预测块也可以理解为把参考块作为预测块然后再在预测块的基础上处理生成新的预测块。
帧内估计单元212只参考同一帧图像的信息,预测当前码图像块内的像素信息,用于消除空间冗余。帧内预测所使用的帧可以为I帧。
帧内预测有多种预测模式,以国际数字视频编码标准H系列为例,H.264/AVC标准有8种角度预测模式和1种非角度预测模式,H.265/HEVC扩展到33种角度预测模式和2种非角度预测模式。HEVC使用的帧内预测模式有平面模式(Planar)、DC和33种角度模式,共35种预测模式。VVC使用的帧内模式有Planar、DC和65种角度模式,共67种预测模式。对于亮度分量有基于训练得到的预测矩阵(Matrix based intra prediction,MIP)预测模式,对于色度分量,有CCLM预测模式。
需要说明的是,随着角度模式的增加,帧内预测将会更加精确,也更加符合对高清以及超高清数字视频发展的需求。
残差单元220可基于CU的像素块及CU的PU的预测块来产生CU的残差块。举例来说,残差单元220可产生CU的残差块,使得残差块中的每一采样具有等于以下两者之间的差的值:CU的像素块中的采样,及CU的PU的预测块中的对应采样。
变换/量化单元230可量化变换系数。变换/量化单元230可基于与CU相关联的量化参数(QP)值来量化与CU的TU相关联的变换系数。视频编码器200可通过调整与CU相关联的QP值来调整应用于与CU相关联的变换系数的量化程度。
反变换/量化单元240可分别将逆量化及逆变换应用于量化后的变换系数,以从量化后的变换系数重建残差块。
重建单元250可将重建后的残差块的采样加到预测单元210产生的一个或多个预测块的对应采样,以产生与TU相关联的重建图像块。通过此方式重建CU的每一个TU的采样块,视频编码器200可重建CU的像素块。
环路滤波单元260可执行消块滤波操作以减少与CU相关联的像素块的块效应。
在一些实施例中,环路滤波单元260包括去块滤波单元和样点自适应补偿/自适应环路滤波(SAO/ALF)单元,其中去块滤波单元用于去方块效应,SAO/ALF单元用于去除振铃效应。
解码图像缓存270可存储重建后的像素块。帧间预测单元211可使用含有重建后的像素块的参考图像来对其它图像的PU执行帧间预测。另外,帧内估计单元212可使用解码图像缓存270中的重建后的像素块来对在与CU相同的图像中的其它PU执行帧内预测。
熵编码单元280可接收来自变换/量化单元230的量化后的变换系数。熵编码单元280可对量化后的变换系数执行一个或多个熵编码操作以产生熵编码后的数据。
图4是本申请实施例涉及的视频解码器的示意性框图。
如图4所示,视频解码器300包含:熵解码单元310、预测单元320、逆量化变换单元330、重建单元340、环路滤波单元350及解码图像缓存360。需要说明的是,视频解码器300可包含更多、更少或不同的功能组件。
视频解码器300可接收码流。熵解码单元310可解析码流以从码流提取语法元素。作为解析码流的一部分,熵解码单元310可解析码流中的经熵编码后的语法元素。预测单元320、逆量化变换单元330、重建单元340及环路滤波单元350可根据从码流中提取的语法元素来解码视频数据,即产生解码后的视频数据。
在一些实施例中,预测单元320包括帧内估计单元321和帧间预测单元322。
帧内估计单元321(也称为帧内预测单元)可执行帧内预测以产生PU的预测块。帧内估计单元321可使用帧内预测模式以基于空间相邻PU的像素块来产生PU的预测块。帧内估计单元321还可根据从码流解析的一个或多个语法元素来确定PU的帧内预测模式。
帧间预测单元322可根据从码流解析的语法元素来构造第一参考图像列表(列表0)及第二参考图像列表(列表1)。此外,如果PU使用帧间预测编码,则熵解码单元310可解析PU的运动信息。帧间预测单元322可根据PU的运动信息来确定PU的一个或多个参考块。帧间预测单元322可根据PU的一个或多个参考块来产生PU的预测块。
逆量化变换单元330(也称为反量化/变换单元)可逆量化(即,解量化)与TU相关联的变换系数。逆量化变换单元330可使用与TU的CU相关联的QP值来确定量化程度。
在逆量化变换系数之后,逆量化变换单元330可将一个或多个逆变换应用于逆量化变换系数,以便产生与TU相关联的残差块。
重建单元340使用与CU的TU相关联的残差块及CU的PU的预测块以重建CU的像素块。例如,重建单元340可将残差块的采样加到预测块的对应采样以重建CU的像素块,得到重建图像块。
环路滤波单元350可执行消块滤波操作以减少与CU相关联的像素块的块效应。
视频解码器300可将CU的重建图像存储于解码图像缓存360中。视频解码器300可将解码图像缓存360中的重建图像作为参考图像用于后续预测,或者,将重建图像传输给显示装置呈现。
由上述图3和图4可知,视频编解码的基本流程如下:在编码端,将一帧图像划分成块,对当前块,预测单元210使用帧内预测或帧间预测产生当前块的预测块。残差单元220可基于预测块与当前块的原始块计算残差块,例如将当前块的原始块减去预测块得到残差块,该残差块也可称为残差信息。该残差块经由变换/量化单元230变换与量化等过程,可以去除人眼不敏感的信息,以消除视觉冗余。可选的,经过变换/量化单元230变换与量化之前的残差块可称为时域残差块,经过变换/量化单元230变换与量化之后的时域残差块可称为频率残差块或频域残差块。熵编码单元280接收到变换量化单元230输出的量化后的变换系数,可对该量化后的变换系数进行熵编码,输出码流。例如,熵编码单元280可根据目标上下文模型以及二进制码流的概率信息消除字符冗余。
在解码端,熵解码单元310可解析码流得到当前块的预测信息、量化系数矩阵等,预测单元320基于预测信息对当前块使用帧内预测或帧间预测产生当前块的预测块。逆量化变换单元330使用从码流得到的量化系数矩阵,对量化 系数矩阵进行反量化、反变换得到残差块。重建单元340将预测块和残差块相加得到重建块。重建块组成重建图像,环路滤波单元350基于图像或基于块对重建图像进行环路滤波,得到解码图像。编码端同样需要和解码端类似的操作获得解码图像。该解码图像也可以称为重建图像,重建图像可以为后续的帧作为帧间预测的参考帧。
需要说明的是,编码端确定的块划分信息,以及预测、变换、量化、熵编码、环路滤波等模式信息或者参数信息等在必要时携带在码流中。解码端通过解析码流及根据已有信息进行分析确定与编码端相同的块划分信息,预测、变换、量化、熵编码、环路滤波等模式信息或者参数信息,从而保证编码端获得的解码图像和解码端获得的解码图像相同。
上述是基于块的混合编码框架下的视频编解码器的基本流程,随着技术的发展,该框架或流程的一些模块或步骤可能会被优化,本申请适用于该基于块的混合编码框架下的视频编解码器的基本流程,但不限于该框架及流程。
在一些实施例中,若用于压缩的编解码框架为传统混合编码框架的改进时,可以通过如下几种示例所示的方法对传统混合编码框架进行改进。
示例一,使用一种基于超分辨率卷积神经网络(Super Resolution Convolutional Neural Network,SRCNN)的分像素插值滤波器,用于HEVC的半像素运动补偿。
示例二,使用一种新的全连接网络IPFCN(Intra Prediction using Full connected Network,使用全连接网络的帧内预测)来进行HEVC的帧内预测,将帧内预测的参考像素展开作为向量的输入,从而预测当前块的像素值。
示例三,使用卷积神经网络来用于帧内编码加速,用网络对不同深度CU分类的方法来预测帧内编码的CU分割方法,从而代替传统HEVC率失真优化方法中遍历不同划分的方式。
在一些实施例中,用于压缩的编解码框架可以为端到端编解码网络框架。
传统的混合编解码框架或基于传统混合编码改进的方法都是分多个模块进行编解码的,每个模块主要依赖于模块内不同模型的优化从得到最优的率失真解。这种没有考虑模块间联动的方法常常会导致率失真优化陷入局部最优解。近年来端到端压缩网络的发展和广泛使用很大程度上缓解了这种分模块优化方法的弊端,通过训练网络的传输来实现率失真最优化,更显示地计算整体模型的率失真损失。
在一种示例中,该端到端编解码网络为基于循环神经网络(Recurrent Neural Network,RNN)的编解码网络。该方法将图像输入到一个多轮共享的循环神经网络中,将每轮输出的重建残差作为下一轮循环神经网络的输入,通过控制循环的次数来控制码率,从而获得可伸缩编码的效果。
在一些实施例中,该端到端编解码网络为基于卷积神经网络(Convolution Neural Network,CNN)的端到端图像编码网络。在该网络中采用了广义除法归一化激活函数,并对网络输出的变换函数系数采用均匀量化,在训练中通过添加均匀噪声来模拟量化过程,从而解决网络训练中量化不可导问题。可选的,可以采用高斯尺度混合(Gaussian Scale Mixture,GSM)超先验模型,来替换全分解模型来建模。
在一些实施例中,可以使用高斯混合(Gaussian Mixture Model,GMM)超先验模型来替代GSM,并使用了基于PixelCNN结构的自回归上下文条件概率模型来降低码率,提高建模精度。
在一些实施例中,该端到端编解码网络为Lee编解码网络,Lee编解码网络采用迁移学习的方法提升网络重建图像的质量。
在一些实施例中,该端到端编解码网络为Hu编解码网络,Hu编解码网络通过利用不同任务之间的内在可迁移性,成功地以低比特率构建了紧凑和表达性的表示,用以支持包括高级语义相关任务和中级几何解析任务等多样化的机器视觉任务集。编解码网络通过使用高层的语义图增强低级的视觉特征,并且验证了这种方法可以有效提升图像压缩的码率、精度和失真表现。
面向智能分析的应用场景下,视频及图像除了需要呈现给用户高质量地观看以外,还更多地被用于分析理解其中的语义信息。
本申请实施例涉及的智能任务网络包括但不限于目标识别网络、目标检测网络和实例分割网络等。
在一些实施例中,端到端编解码网络通常首先使用神经网络来压缩图像/视频,然后将压缩后的码流传输到解码器中,最后在解码端解码重建图像/视频。可选的,端到端编解码网络的流程如图5A所示,其中E1、E2模块构成端到端编解码网络的编码端,D2、D1模块构成端到端编解码网络的解码端。其中E1模块为特征提取网络,从图像中提取特征;E2模块为特征编码模块,继续提取特征并将提取的特征编码成码流;D2模块为特征解码模块,将码流解码还原成特征并重建到低层的特征;D1模块为解码网络,从D2重建的特征重建图像。
示例性的,若上述端到端编解码网络为如图5B所示的Cheng编解码网络时,E1模块、E2模块、D1模块和D2模块的划分方式如图5B虚线框所示。图5B中Conv为卷积的缩写,表示卷积层。
示例性的,若上述端到端编解码网络为如图5C所示的Lee编解码网络时,E1模块、E2模块、D1模块和D2模块的划分方式如图5C虚线框所示。图5C中FCN(Fully Convolutional Networks)表示全连接层、ReLU(Rectified Linear Unit,线性整流函数)为一种激活函数,leaky ReLU为leaky激活函数,abs表示求绝对值,exp表示e的次幂函数。
示例性的,若上述端到端编解码网络为如图5D所示的Hu编解码网络时,E1模块、E2模块、D1模块和D2模块的划分方式如图5D虚线框所示。
需要说明的是,上述各模块的划分方式只是一种示例,可以根据实际情况进行灵活划分。
在一些实施例中,智能任务网络是对输入的图像/视频内容进行智能任务分析,包括但不限于目标识别、实例分割等任务。可选的,智能任务网络的流程如图6A所示,其中A1模块为特征提取网络,用于从重建图像/视频中提取低层特征。A2模块为智能分析网络,继续提取特征并对提取的特征进行智能分析。
示例性的,若上述智能任务网络为如图6B所示的目标识别网络yolo_v3(you only look once version 3,只看一次第3版)时,A1模块和A2模块的划分方式如图6B虚线框所示。
示例性的,若上述智能任务网络为如图6C所示的目标检测网络ResNet-FPN(Residual Networks-Feature Pyramid Networks,残差-特征金字塔网络)时,A1模块和A2模块的划分方式如图6C虚线框所示。
示例性的,可选的上述智能任务网络还可以是实例分割网络Mask RCNN(Mask Region-CNN,基于掩膜的RCNN)。
目前当面临大量的数据和智能分析任务时,对图像先压缩存储,再解压进行分析的方法,任务分析均是基于图像的,也就是说解码网络重建图像,将重建图像输入任务分析网络进行任务分析,造成任务分析耗时长,计算量大,效率低。
为了解决上述技术问题,本申请实施例将解码网络中间层输出的特征信息输入任务分析网络中,使得任务分析网络基于解码网络输出的特征信息进行任务分析,节省了任务分析所占用的时间和计算资源,进而提高了任务分析的效率。
下面结合具体的示例,对本申请实施例涉及的视频解码方法进行详细描述。
首先以解码端为例,对图像解码过程进行介绍。
图7为本申请实施例提供的视频解码方法的流程示意图。本申请实施例的执行主体可以理解为图1所示的解码器,如图7所示,包括:
S701、将当前图像的特征码流输入解码网络中,得到解码网络的第i个中间层输出的第一特征信息,i为正整数;
S702、将第一特征信息输入任务分析网络的第j个中间层中,得到任务分析网络输出的任务分析结果,j为正整数。
图8A为本申请一实施例涉及的网络模型示意图,如图8A所示,该网络模型包括解码网络和任务分析网络,其中解码网络的第i个中间层的输出端与任务分析网络的第j个中间层的输入端连接,这样,解码网络的第i个中间层输出的第一特征信息可作为任务分析网络的第j个中间层的输入,进而使得任务分析网络可以根据第j个中间层所输入的特征信息进行任务分析。
如图8A所示,本申请实施例与解码网络通过解码出所有层的特征信息后,进行图像重建,并将重建图像输入任务分析网络,使得任务分析网络基于重建图像进行任务分析相比,只需解码出部分特征信息,例如解码出第i个中间层的特征信息,而无需解码出所有层的特征信息,也无需重建图像,进而节省了任务分析所占用的时间和计算资源,提高任务分析的效率。
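为便于理解上述流程,下面给出一段示意性的Python(PyTorch风格)代码草图,示出"解码网络只前向计算到第i个中间层,并将该层输出的第一特征信息送入任务分析网络的第j个中间层"的做法。需要说明的是,该代码仅为说明性示例,其中的DecoderNet、TaskNet等网络结构、层数、通道数以及函数名均为假设,并非本申请实施例的限定实现。

    import torch
    import torch.nn as nn

    class DecoderNet(nn.Module):
        # 假设的简化解码网络:若干反卷积层逐步恢复特征,最后一层输出重建图像
        def __init__(self):
            super().__init__()
            self.layers = nn.ModuleList([
                nn.ConvTranspose2d(192, 128, 3, stride=2, padding=1, output_padding=1),
                nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1),
                nn.ConvTranspose2d(64, 3, 3, stride=2, padding=1, output_padding=1),
            ])

        def forward_to_layer(self, x, i):
            # 只前向计算到第i个中间层,返回该层输出的第一特征信息,无需重建图像
            for layer in self.layers[:i]:
                x = torch.relu(layer(x))
            return x

    class TaskNet(nn.Module):
        # 假设的简化任务分析网络:骨干若干卷积层 + 分类头
        def __init__(self, num_classes=10):
            super().__init__()
            self.stages = nn.ModuleList([
                nn.Conv2d(3, 64, 3, padding=1),
                nn.Conv2d(64, 128, 3, padding=1),
                nn.Conv2d(128, 128, 3, padding=1),
            ])
            self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(128, num_classes))

        def forward_from_layer(self, feat, j):
            # 从第j个中间层开始前向计算,跳过前面基于图像的特征提取层
            for stage in self.stages[j:]:
                feat = torch.relu(stage(feat))
            return self.head(feat)

    decoder, task = DecoderNet(), TaskNet()
    latent = torch.randn(1, 192, 16, 16)               # 熵解码并反量化后的初始特征信息(假设)
    feat_i = decoder.forward_to_layer(latent, i=2)     # 第i个中间层输出的第一特征信息(64通道)
    result = task.forward_from_layer(feat_i, j=1)      # 任务分析网络输出的任务分析结果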
本申请实施例对解码网络的具体网络结构不做限制。
在一些实施例中,上述解码网络可以为单独的神经网络。在模型训练时,该解码网络进行单独训练。
在一些实施例中,上述解码网络为端到端编解码网络中的解码部分。在模型训练时,该端到端编解码网络中,解码部分和编码部分一起进行端到端的训练。其中端到端编解码网络也称为自编码器。
在一些实施例中,如图8B所示,解码网络包括解码单元和第一解码子网络,该解码单元用于对特征码流进行解码,第一解码子网络用于对解码单元解码出的特征信息进行特征再提取,以重建图像。在该实施例中,解码单元可以理解为熵解码单元,可以对特征码流进行熵解码,得到当前图像的初始特征信息,该解码单元可以为神经网络。上述第i个中间层为第一解码子网络中除输出层之外的其他层,即第i个中间层为第一解码子网络的输入层或任一中间层。
基于图8B,则上述S701包括如下S701-A和S701-B:
S701-A、将当前图像的特征码流输入解码单元中,得到解码单元输出的当前图像的初始特征信息;
S701-B、将初始特征信息输入第一解码子网络中,得到第一解码子网络的第i个中间层输出的第一特征信息。
在一些实施例中,如图8C所示,解码网络除了包括解码单元和第一解码子网络之外,还可以包括反量化单元,此时,上述S701-B包括如下S701-B1和S701-B2的步骤:
S701-B1、将初始特征信息输入反量化单元中,得到反量化后的特征信息;
S701-B2、将反量化后的特征信息输入第一解码子网络中,得到第一解码子网络的第i个中间层输出的第一特征信息。
即如图8C所示,解码网络中的解码单元对特征码流进行解码,得到初始特征信息,该初始特征信息在编码网络中经过了量化,因此,解码网络需要对该初始特征信息进行反量化,具体是,将该初始特征信息输入反量化单元中进行反量化,得到反量化后的特征信息,再将反量化后的特征信息输入第一解码子网络中,得到第一解码子网络的第i个中间层输出的第一特征信息。
在一些实施例中,编码网络在对当前图像进行编码时,不仅对当前图像的特征信息进行编码,形成特征码流,还估计了当前图像的解码点的出现概率分布,并对解码点的概率分布进行编码,形成当前图像的解码点概率分布码流(也称为概率估计码流)。这样,解码网络除了对特征码流进行解码外,还需要对解码点概率分布码流进行解码。
基于此,如图8D所示,解码网络还包括第二解码子网络,该第二解码子网络用于对解码点概率分布码流进行解码。此时,本申请实施例还包括:将当前图像的解码点概率分布码流输入第二解码子网络中,得到当前图像的解码点的概率分布。对应的,上述S701-A包括:将当前图像的特征码流和当前图像的解码点的概率分布输入解码单元中,得到解码单元输出的当前图像的初始特征信息。
可选的,上述第二解码子网络可以为超先验网络。
本申请实施例对任务分析网络的具体网络结构不做限制。
可选的,任务分析网络可以为目标识别网络、目标检测网络、实例分割网络、分类网络等。
本申请实施例对第i个中间层和第j个中间层的具体选择不做限制。
在一些实施例中,上述第i个中间层可以为解码网络中除输入层和输出层之外的任意一个中间层,第j个中间层可以为解码网络中除输入层和输出层之外的任意一个中间层。
在一些实施例中,第i个中间层和第j个中间层为解码网络和任务分析网络中,特征相似度最高和/或模型损失最小的两个中间层。
示例性的,特征相似度的计算过程可以是:在网络模型搭建阶段,将图像A输入编码网络中,得到图像A的码流,将图像A的码流输入解码网络中,得到解码网络的各中间层输出的特征信息,以及图像A的重建图像。将重建图 像输入任务分析网络中,得到任务分析网络的各中间层输入的特征信息。接着,计算解码网络的各中间输出的特征信息和任务分析网络的各中间层输入的特征信息中两两特征信息之间的相似度。
举例说明,以端到端编解码网络为Cheng2020网络,任务分析网络为图9A所示的目标检测网络为例,该目标检测网络也称为Faster RCNN R50C4(faster regions with conventional neural network Resnet50Conv4,更快速的区域卷积神经网络R50C4)网络。如图9A所示,该目标检测网络包括骨干网络ResNet50-C4(Residual Networks50-C4,残差网络50-C4)、RPN(Region proposal network,区域提取网络)和ROI-Heads(Region of interest_Heads,感兴趣区域头),其中骨干网络ResNet50-C4包括4层,分别为Conv1、Conv2_X、Conv 3_X和Conv4_X。其中Conv为卷积(convolution)的缩写。示例性的,Conv1包括至少一个卷积层,Conv2_X包括最大池化层、BTINK(Bottle Neck,瓶颈)1和2个BTINK2,Conv3_X包括一个BTINK1和3个BTINK2,Conv4_X包括一个BTINK1和5个BTINK2。可选的,BTINK1和BTINK2的网络结构如图9B所示,BTINK1包括4个卷积层,BTINK2包括3个卷积层。
在一些实施例中,Cheng2020网络由图9C所示的Enc_GDNM(encoder generalized divisive normalization module,编码器广义除法归一化模块)、Enc_NoGDNM(encoder no generalized divisive normalization module,编码器没有广义的除法归一化模块)、Dec_IGDNM(decoder inverse generalized divisive normalization module,解码器逆广义除法归一化模块)和Dec_NoIGDNM(decoder no inverse generalized divisive normalization module,解码器无逆广义除法归一化模块)组成。
图9D为端到端编解码网络和任务分析网络的网络示例图,其中,端到端编解码网络为Cheng2020,任务分析网络为Faster RCNN R50C4网络。该端到端编解码网络包括编码网络和解码网络,其中,编码网络包括9个网络层,包括节点e0到节点e9,解码网络包括10个网络层,包括节点d10到节点d0。任务分析网络的骨干网络包括4个网络层,包括节点F0到节点F15。如图9D所示,节点e0为编码网络的输入节点,节点d0为解码网络的输出节点,F0为任务分析网络的输入节点,这三个节点对应的数据为图像数据,例如大小为WXHX3的图像,其中,WXH为图像的尺度,3为图像的通道数。
在一些实施例中,图9D所示的网络中各层卷积核大小如表1所示:
[表1:图9D所示网络中各层的卷积核配置。原文中表1以图片(PCTCN2021122473-appb-000001)给出,具体数值无法恢复,此处仅保留占位说明。]
其中,表1中的卷积核“[3×3,N],/2”,3×3为卷积核的大小,N为通道数,/2表示下采样,2为下采样的倍数。表1中的卷积核“[3×3,N]×2”,3×3为卷积核的大小,N为通道数,×2表示卷积核的数量为2个。表1中的卷积核“[3×3,3],*2”,*2表示上采样,2为上采样的倍数。
需要说明的是,上述表1只是一种示例,图9D所示的网络中各层的卷积核包括但不限于上述表1所示。
图9D中编码网络、解码网络和任务分析网络中各节点对应的特征信息的大小如表2所示:
表2:图9D中编码网络、解码网络和任务分析网络各节点对应的特征信息的大小。[原文中表2以图片(PCTCN2021122473-appb-000002)给出,具体数值无法恢复,此处仅保留占位说明。]
根据上述表2所示的解码网络和任务分析网络中各中间层的特征信息的大小,计算解码网络的各中间输出的特征信息和任务分析网络的各中间层输入的特征信息中两两特征信息之间的相似度。
示例性的,假设根据上述确定出解码网络中节点d7对应的中间层输出的特征信息与任务分析网络的F9节点对应的中间层输入的特征信息的相似度最高,则如图9E所示,将节点d7对应的中间层作为第i个中间层,将F9对应的中间层作为第j个中间层,进而将节点d7对应的中间层的输出端与F9对应的中间层的输入端连接。
示例性的,假设根据上述确定出解码网络中节点d5对应的中间层输出的特征信息与任务分析网络的F5节点对应的中间层输入的特征信息的相似度最高,则如图9F所示,将节点d5对应的中间层作为第i个中间层,将F5对应的中间层作为第j个中间层,进而将节点d5对应的中间层的输出端与F5对应的中间层的输入端连接。
示例性的,假设根据上述确定出解码网络中节点d2对应的中间层输出的特征信息与任务分析网络的F1节点对应的中间层输入的特征信息的相似度最高,则如图9G所示,将节点d2对应的中间层作为第i个中间层,将F2对应的中间层作为第j个中间层,进而将节点d2对应的中间层的输出端与F1对应的中间层的输入端连接。
可选的,第i个中间层和第j个中间层之间的特征相似度包括如下至少一个:第i个中间层输出的特征图和第j个中间层输入的特征图之间的相似度、第i个中间层输出的特征大小和第j个中间层输入的特征大小之间的相似度、第i个中间层输出的特征图的统计直方图和第j个中间层输入的特征图的统计直方图之间的相似度。
上述特征相似度的计算过程进行了介绍,下面对模型损失的计算过程进行介绍。
继续以图9D所示的模型为例,假设将节点d5和节点F5相连,需要说明的是,这里的节点相连,可以理解为两个中间层相连,例如节点d5为解码网络中一个中间层的输出端,节点F5为任务分析网络的一个中间层的输入端。向图9D所示的模型中输入一张图像B,编码网络对该图像B进行特征编码,得到码流。解码网络对该码流进行解码,得到节点d5的特征信息1,将该特征信息1输入节点F5中进行任务分析,得到任务分析网络基于该特征信息1预测的分类结果1。计算任务分析网络预测的分类结果1与该图像B对应的分类结果真值之间的损失1,根据该损失1确定当前模型的损失。接着,将节点d5与节点F9连接,参照上述过程,计算出节点d5与节点F9连接时,模型的损失。依次类推,根据上述方法,可以计算出解码网络中不同节点与任务分析网络的不同节点连接时,模型的损失。
可选的,可以将模型损失最小时对应的相连的两个中间层(或两个节点),确定为第i个中间层和第j个中间层。
可选的,可以根据两个中间层之间的特征相似度和模型损失,来确定第i个中间层和第j个中间层。例如,根据上述特征相似度的计算方法,计算解码网络的中间层和任务分析网络的中间层之间的特征相似度,以及计算两个中间层连接时模型的损失,将特征相似度和模型损失之和最小的两个中间层,确定为第i个中间层和第j个中间层。
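为便于理解第i个中间层与第j个中间层的选取方式,下面给出一段示意性的代码草图,以展平特征的余弦相似度为例,在若干候选中间层之间逐一计算相似度并选出相似度最高的一对;统计直方图相似度、模型损失等度量可在同一框架下替换或叠加。该代码仅为说明性示例,函数名与简化处理均为假设。

    import torch
    import torch.nn.functional as F

    def feature_similarity(feat_a, feat_b):
        # 将两个特征图对齐到相同空间尺寸和通道数后,计算余弦相似度(假设的简化度量)
        if feat_a.shape[-2:] != feat_b.shape[-2:]:
            feat_b = F.interpolate(feat_b, size=feat_a.shape[-2:],
                                   mode="bilinear", align_corners=False)
        c = min(feat_a.shape[1], feat_b.shape[1])       # 通道数不同时截断到较小者
        a = feat_a[:, :c].flatten(1)
        b = feat_b[:, :c].flatten(1)
        return F.cosine_similarity(a, b, dim=1).mean().item()

    def select_layer_pair(dec_feats, task_feats):
        # dec_feats: {i: 解码网络第i个中间层输出}; task_feats: {j: 任务分析网络第j个中间层输入}
        best_score, best_pair = float("-inf"), None
        for i, fa in dec_feats.items():
            for j, fb in task_feats.items():
                score = feature_similarity(fa, fb)
                if score > best_score:
                    best_score, best_pair = score, (i, j)
        return best_pair, best_score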
在一些实施例中,为了降低第i个中间层和第j个中间层确定过程中的复杂度,可以通过如下几种示例,来确定出第i个中间层和第j个中间层。
示例1,先从解码网络中随机选择一个中间层作为第i个中间层,将任务分析网络中与该第i个中间层的特征相似度最高的中间层确定为第j个中间层。
示例2,先从解码网络中随机选择一个中间层作为第i个中间层,将任务分析网络中的各中间层分别与该第i个中间层尝试相连后,确定任务分析网络中的不同中间层与解码网络中的第i个中间层连接后,网络模型的模型损失,将最小模型损失对应的中间层,确定为第j个中间层。
示例3,先从解码网络中随机选择一个中间层作为第i个中间层,确定任务分析网络中的各中间层与该第i个中间层的特征相似度,以及确定任务分析网络中的不同中间层与解码网络中的第i个中间层连接后,网络模型的模型损失,确定任务分析网络中的各中间层对应的特征相似度和模型损失的和值,将最小和值对应的中间层,确定为第j个中间层。
示例4,先从任务分析网络中随机选择一个中间层作为第j个中间层,将解码网络中与该第j个中间层的特征相似度最高的中间层确定为第i个中间层。
示例5,先从任务分析网络中随机选择一个中间层作为第j个中间层,将解码网络中的各中间层分别与该第j个中间层尝试相连后,确定解码网络中的不同中间层与任务分析网络中的第j个中间层连接后,网络模型的模型损失,将最小模型损失对应的中间层,确定为第i个中间层。
示例6,先从任务分析网络中随机选择一个中间层作为第j个中间层,确定解码网络中的各中间层与该第j个中间层的特征相似度,以及确定解码网络中的不同中间层与任务分析网络中的第j个中间层连接后,网络模型的模型损失, 确定解码网络中的各中间层对应的特征相似度和模型损失的和值,将最小和值对应的中间层,确定为第i个中间层。
需要说明的是,上述第i个中间层和第j个中间层的确定过程在网络搭建过程执行。
在一些实施例中,如第i个中间层输出的特征信息和第j个中间层的输入特征的大小不同时,还包括特征大小转换的过程。即上述S702包括如下S702-A1和S702-A2:
S702-A1、将第一特征信息输入特征适配器中进行特征适配,得到第二特征信息,该第二特征信息的大小与第j个中间层的预设输入大小一致;
S702-A2、将第二特征信息输入第j个中间层中,得到任务分析网络输出的任务分析结果。
例如,图9H所示,在解码网络的第i个中间层与任务分析网络的第j个中间层之间设置有特征适配器。
其中,第j个中间层的输入端所输入的特征信息的大小可以预先进行设置。
在一些实施例中,该特征适配器可以为神经网络单元,例如包括池化层或卷积层等,将这类特征适配器称为基于神经网络的特征适配器。
在一些实施例中,该特征适配器可以为算法单元,用于执行某一种或几种算数,以实现特征信息大小的转换,将这类特征适配器称为基于非神经网络的特征适配器。
其中,特征信息的大小包括特征信息的尺寸和/或特征信息的通道数。
在一些实施例中,若特征信息的大小包括特征信息的通道数,则上述特征适配器用于通道数的适配。即上述S702-A1中将第一特征信息输入特征适配器中进行特征适配包括如下几种情况:
情况1,若第一特征信息的通道数大于第j个中间层的输入通道数,则通过特征适配器将第一特征信息的通道数减小至与第j个中间层的输入通道数相同。
该情况1中,将第一特征信息的通道数减少至与第j个中间层的输入通道数相同的方式包括但不限于如下几种:
方式一,若特征适配器为基于非神经网络的特征适配器,将第一特征信息输入特征适配器,以使特征适配器采用主成分分析(Principal Component Analysis,PCA)方式或随机选择的方式,从第一特征信息的通道中选出第j个中间层的输入通道数个通道。
例如,第一特征信息的通道数为64,第j个中间层的输入通道数为32,则可以从第一特征信息的64个通道数中随机选出32个通道,输入第j个中间层中。
再例如,将第一特征信息输入特征适配器,以使特征适配器采用PCA方式,从第一特征信息的通道中选择出与第j个中间层的输入通道数相同的主要特征通道。其中,PCA是一种常见的数据分析方式,常用于高维数据的降维,可用于提取数据的主要特征分量。
方式二,若特征适配器为基于神经网络的特征适配器,将第一特征信息输入特征适配器中,通过特征适配器中的至少一个卷积层,将第一特征信息的通道数减小至与第j个中间层的输入通道数相同。可选的,可以通过减少特征适配器中的卷积层的数量,和/或减少卷积核的数量,以减少第一特征信息的通道数。
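下面给出一段示意性的代码草图,分别示出非神经网络方式(随机选取通道)与神经网络方式(1×1卷积)把第一特征信息的通道数减小到第j个中间层的输入通道数。该代码仅为说明性示例,通道数等均为假设。

    import torch
    import torch.nn as nn

    def reduce_channels_by_selection(feat, target_c):
        # 非神经网络方式:从第一特征信息的通道中随机选出 target_c 个通道(也可改为PCA选取)
        idx = torch.randperm(feat.shape[1])[:target_c]
        return feat[:, idx]

    class ConvChannelReducer(nn.Module):
        # 神经网络方式:通过一个1x1卷积层将通道数减小到 target_c
        def __init__(self, in_c, target_c):
            super().__init__()
            self.proj = nn.Conv2d(in_c, target_c, kernel_size=1)

        def forward(self, feat):
            return self.proj(feat)

    feat = torch.randn(1, 64, 32, 32)                   # 第一特征信息:64通道(假设)
    out1 = reduce_channels_by_selection(feat, 32)       # 输出为(1, 32, 32, 32)
    out2 = ConvChannelReducer(64, 32)(feat)             # 输出为(1, 32, 32, 32)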
情况2,若第一特征信息的通道数小于第j个中间层的输入通道数,则通过特征适配器将第一特征信息的通道数增加至与第j个中间层的输入通道数相同。
该情况2中,若特征适配器为基于非神经网络的特征适配器,将第一特征信息的通道数增加至与第j个中间层的输入通道数相同的方式包括但不限于如下几种:
方式一,若第j个中间层的输入通道数是第一特征信息的通道数的整数倍,则将第一特征信息的通道复制整数倍,以使复制后的第一特征信息的通道数与第j个中间层的输入通道数相同。
例如,第一特征信息的通道数为32,第j个中间层的输入通道数为64,则对第一特征信息的32个通道复制一份,得到64个通道的特征信息。
方式二,若第j个中间层的输入通道数不是第一特征信息的通道数的整数倍,则将第一特征信息的通道复制N倍,且从第一特征信息的通道中选出M个通道,对M个通道进行复制后与复制N倍的第一特征信息的通道进行合并,以使合并后的第一特征信息的通道数与第j个中间层的输入通道数相同,N为第j个中间层的输入通道数与第一特征信息的通道数相除后的商,M为第j个中间层的输入通道数与第一特征信息的通道数相除后的余数,N、M均为正整数。
例如,第一特征信息的通道数是64,第j个中间层的输入通道数是224,则224与64相除的商为3,余数为32,即N为3,M为32,则将第一特征信息的原有通道复制3份,得到192个通道,接着,从第一特征信息原有的64个通道中选出32个通道,对这32个通道进行复制,将复制后的32个通道与上述复制得到的192个通道合并,得到224个通道,这224个通道作为合并后的第一特征信息的通道。
可选的,上述从第一特征信息原有的64个通道中选出32个通道的方式可以是随机选择,也可以是采用PCA方式进行选择,或者其他方式进行选择,本申请对此不做限制。
上述32个通道与192个通道的合并方式可以是,将32个通道放置在192个通道后,或者放置在92个通道前,或者穿插在192个通道中,本申请对此不做限制。
方式三,从第一特征信息的通道中选出P个主要特征通道,对P个主要特征通道进行复制后与第一特征信息的通道进行合并,以使合并后的第一特征信息的通道数与第j个中间层的输入通道数相同,P为第j个中间层的输入通道数与第一特征信息的通道数之间的差值,P为正整数。
例如,第一特征信息的通道数是192,第j个中间层的输入通道数是256,256与192的差值为64,即P=64。从第一特征信息的192个通道中选出64个通道,对这64个通道进行复制后与第一特征信息的原有的192通道进行合并,得到256个通道。
可选的,上述从第一特征信息原有的192个通道中选出64个通道的方式可以是随机选择,也可以是采用PCA方式进行选择,或者其他方式进行选择,本申请对此不做限制。
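下面给出一段示意性的代码草图,按照上述"整体复制N倍、再从原有通道中选出M个通道补齐"的规则扩充通道数,并以64个通道扩充到224个通道(224=3×64+32)为例。该代码仅为说明性示例,其中M个通道采用随机选取,实际也可采用PCA等方式。

    import torch

    def expand_channels_by_copy(feat, target_c):
        # feat: (B, C, H, W);将原有通道复制N倍,再从原有通道中选出M个通道复制后合并
        c = feat.shape[1]
        n, m = divmod(target_c, c)          # N为商,M为余数
        copies = [feat] * n                 # 复制N倍
        if m > 0:
            idx = torch.randperm(c)[:m]     # 随机选出M个通道(假设的选取方式)
            copies.append(feat[:, idx])
        out = torch.cat(copies, dim=1)      # 合并后的通道数与第j个中间层的输入通道数相同
        assert out.shape[1] == target_c
        return out

    feat = torch.randn(1, 64, 32, 32)
    out = expand_channels_by_copy(feat, 224)    # 224 = 3×64 + 32,输出为(1, 224, 32, 32)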
该情况2中,若特征适配器为基于神经网络的特征适配器,则可以将所述第一特征信息输入所述特征适配器中, 通过所述特征适配器中的至少一个卷积层,将所述第一特征信息的通道数增加至与所述第j个中间层的输入通道数相同。可选的,可以通过增加特征适配器中的卷积层的数量,和/或增加卷积核的数量,以增加第一特征信息的通道数。
例如,第一特征信息的大小为某一较小的尺寸,第j个中间层的输入大小为另一较大的尺寸(该示例中的具体特征大小在原文中以图片形式给出,此处无法恢复),此时通过特征适配器中的至少一个卷积层,将第一特征信息的大小增加至与第j个中间层的输入大小相同。
在一些实施例中,若特征信息的大小包括特征信息的尺寸,则上述特征适配器用于尺寸的适配。即上述S702-A1中将第一特征信息输入特征适配器中进行特征适配的实现方式包括如下几种情况:
情况1,若第一特征信息的尺寸大于第j个中间层的输入尺寸,则通过特征适配器将第一特征信息下采样至与第j个中间层的输入尺寸相同。
该情况1中,通过特征适配器将第一特征信息下采样至与第j个中间层的输入尺寸相同的方式包括但不限于如下几种:
方式一,若特征适配器为基于非神经网络的特征适配器,则通过特征适配器,对第一特征信息下采样,以使下采样后的第一特征信息的尺寸与第j个中间层的输入尺寸相同。
例如,第一特征信息的大小与第j个中间层的输入大小不同(该示例中的具体特征大小在原文中以图片形式给出,此处无法恢复),此时通过将通道数复制一倍,并进行上采样,来使特征维度匹配。
方式二,若特征适配器为基于神经网络的特征适配器,则通过特征适配器中的至少一个池化层,将第一特征信息的尺寸下采样至与第j个中间层的输入尺寸相同。
可选的,上述池化层可以为最大池化层、平均池化层、重叠池化层等。
情况2,若第一特征信息的尺寸小于第j个中间层的输入尺寸,则通过特征适配器将第一特征信息上采样至与第j个中间层的输入尺寸相同。
该情况2中,通过特征适配器将第一特征信息上采样至与第j个中间层的输入尺寸相同的方式包括但不限于如下几种:
方式一,若特征适配器为基于非神经网络的特征适配器,则通过特征适配器,对第一特征信息上采样,以使上采样后的第一特征信息的尺寸与第j个中间层的输入尺寸相同。
例如,第一特征信息的尺寸小于第j个中间层的输入尺寸(该示例中的具体特征大小在原文中以图片形式给出,此处无法恢复),此时可以通过上采样,来使特征维度匹配。
方式二,若特征适配器为基于神经网络的特征适配器,则通过特征适配器中的至少一个上池化层,将第一特征信息的尺寸上采样至与第j个中间层的输入尺寸相同。
可选的,在该方式二中,特征适配器可以理解为上采样单元,例如特征适配器可以包括双线性插值层和/或反卷积层和/或反池化层和/或上池化层等。该特征适配器对第一特征信息进行上采样,使得上采样后的第一特征信息的尺寸与第j个中间层的输入尺寸相同。
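下面给出一段示意性的代码草图,示出特征适配器对空间尺寸的适配:尺寸偏大时用池化下采样,尺寸偏小时用双线性插值上采样(也可替换为反卷积、上池化等)。该代码仅为说明性示例,尺寸均为假设。

    import torch
    import torch.nn.functional as F

    def resize_feature(feat, target_hw):
        # feat: (B, C, H, W);target_hw: 第j个中间层的输入尺寸(H', W')
        h, w = feat.shape[-2:]
        th, tw = target_hw
        if (h, w) == (th, tw):
            return feat
        if h > th and w > tw:
            # 尺寸大于输入尺寸:自适应平均池化下采样(也可用最大池化、重叠池化等)
            return F.adaptive_avg_pool2d(feat, (th, tw))
        # 尺寸小于输入尺寸:双线性插值上采样
        return F.interpolate(feat, size=(th, tw), mode="bilinear", align_corners=False)

    feat = torch.randn(1, 128, 16, 16)
    down = resize_feature(feat, (8, 8))       # 输出为(1, 128, 8, 8)
    up = resize_feature(feat, (32, 32))       # 输出为(1, 128, 32, 32)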
继续以图9D所示的网络模型为例,上述特征适配器的输入端可以与解码网络的第i个中间层的输出端连接,该特征适配器的输出端与任务分析网络的第j个中间层的输入端连接,以将第i个中间层输出的第一特征信息的大小进行转换,以适配第j个中间层的输入大小。
示例性的,假设解码网络中节点d7与任务分析网络的F9节点连接,则如图10A所示,将节点d7对应的中间层作为第i个中间层,将F9对应的中间层作为第j个中间层,进而在节点d7和节点F9之间连接特征适配器,该特征适配器用于将d7对应的中间层输出的第一特征信息转换为第二特征信息后输入节点F9对应的中间层。
示例性的,假设解码网络中节点d5与任务分析网络的F5节点连接,则如图10B所示,将节点d5对应的中间层作为第i个中间层,将F5对应的中间层作为第j个中间层,进而在节点d5和节点F5之间连接特征适配器,该特征适配器用于将d5对应的中间层输出的第一特征信息转换为第二特征信息后输入节点F5对应的中间层。
示例性的,假设解码网络中节点d2与任务分析网络的F2节点连接,则如图10C所示,将节点d2对应的中间层作为第i个中间层,将F1对应的中间层作为第j个中间层,进而在节点d2和节点F1之间连接特征适配器,该特征适配器用于将d2对应的中间层输出的第一特征信息转换为第二特征信息后输入节点F1对应的中间层。
本申请实施例提供的视频解码方法,通过将当前图像的特征码流输入解码网络中,得到解码网络的第i个中间层输出的第一特征信息,i为正整数;将第一特征信息输入任务分析网络的第j个中间层中,得到任务分析网络输出的任务分析结果,j为正整数。本申请将解码网络中间层输出的特征信息输入任务分析网络中,使得任务分析网络基于解码网络输出的特征信息进行任务分析,节省了任务分析所占用的时间和计算资源,进而提高了任务分析的效率。
图11为本申请一实施例提供的视频解码方法流程示意图,如图11所示,本申请实施例的方法包括:
S801、将当前图像的特征码流输入解码网络中,得到解码网络的第i个中间层输出的第一特征信息,以及解码网络输出的当前图像的重建图像。
在该实施例中,不仅获取解码网络的第i个中间层输出的第一特征信息,还需要获取该解码网络最终输出的当前图像的重建图像。
也就是说,该实施例中,一方面,可以将解码网络的第i个中间层输出的第一特征信息,并将该第一特征信息输入任务分析网络的第j个中间层中,以使任务分析网络基于该第一特征信息进行任务分析,输出任务分析结果。另一方面,该解码网络继续进行后续的特征恢复,以实现当前图像的重建,输出当图像的重建图像,可以满足任务分析和图像显示的场景。
在一些实施例中,本申请实施例还可包括如下S802和S803的步骤。
S802、将重建图像输入任务分析网络中,得到任务分析网络的第j-1层输出的第三特征信息。
S803、将第三特征信息和第一特征信息输入第j个中间层中,得到任务分析网络输出的任务分析结果。
图12为本申请一实施例涉及的解码网络和任务分析网络的结构示意图,如图12所示,解码网络的第i个中间层与任务分析网络的第j个中间层连接,且解码网络的输出端与任务分析网络的输入端连接。
即在一些实施例中,获得解码网络的第i个中间层输出的第一特征信息,以及解码网络最终输出的当前图像的重建图像。接着,将当前图像的重建图像输入任务分析网络的输入端进行特征分析,得到任务分析网络的第j-1个中间层输出的第三特征信息。接着,将任务分析网络的第j-1个中间层输出的第三特征信息和解码网络的第i个中间层输出的第一特征信息输入任务分析网络的第j个中间层中,以使任务分析网络基于第三特征信息和第一特征信息进行任务分析,由于第三特征信息是通过重建图像获得的,可以反映出重建图像的特征,这样基于第一特征信息和第三特征信息,进行任务分析时,可以提高任务分析的准确性。
在一些实施例中,上述S803将第三特征信息和第一特征信息输入第j个中间层中,得到任务分析网络输出的任务分析结果的实现方式包括但不限于如下几种:
方式一,将第三特征信息和第一特征信息进行联合,将联合后的特征信息输入第j个中间层中,得到任务分析网络输出的任务分析结果。
可选的,上述联合的方式可以是不同权重的级联、不同权重的融合或加权平均等操作。
可选的,若第三特征信息和第一特征信息的大小不一致,则可以采用上述特征转换器,将第三特征信息和第一特征信息转换为大小一致后进行级联。
可选的,若级联后的特征信息与第j个中间层的输入大小不一致时,可以采用上述特征转换器,将级联后的特征信息的大小转换为与第j个中间层的输入大小一致后,输入第j个中间层中。
可选的,还可以在级联之前,先将第三特征信息和/或第一特征信息的大小进行转换,使得转换后的第一特征信息和/或第三特征信息级联后的大小与第j个中间层的输入大小一致。
方式二,将第三特征信息和第一特征信息进行相加,将相加后的特征信息输入第j个中间层中,得到任务分析网络输出的任务分析结果。
方式三,将第三特征信息和第一特征信息进行相乘,将相乘后的特征信息输入第j个中间层中,得到任务分析网络输出的任务分析结果。
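下面给出一段示意性的代码草图,分别以级联(带权重)、相加、相乘三种方式联合第三特征信息与第一特征信息,联合后的特征信息再送入第j个中间层。该代码仅为说明性示例,假设两路特征的大小已经适配一致,权重取值亦为假设。

    import torch

    def fuse_features(feat3, feat1, mode="concat", w3=0.5, w1=0.5):
        # feat3: 任务分析网络第j-1层输出的第三特征信息
        # feat1: 解码网络第i个中间层输出的第一特征信息
        if mode == "concat":
            return torch.cat([w3 * feat3, w1 * feat1], dim=1)   # 级联(不同权重)
        if mode == "add":
            return feat3 + feat1                                # 相加
        if mode == "mul":
            return feat3 * feat1                                # 相乘
        raise ValueError("unknown mode: " + mode)

    feat3 = torch.randn(1, 64, 32, 32)
    feat1 = torch.randn(1, 64, 32, 32)
    fused = fuse_features(feat3, feat1, mode="add")     # 送入第j个中间层的联合特征信息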
在一些实施例中,在模型训练时,解码网络和任务分析网络一起进行端到端训练。
在一些实施例中,在模型训练时,解码网络和编码网络一起进行端到端训练。
在一些实施例中,在模型训练时,编码网络、解码网络和任务分析网络一起进行端到端训练。
在一些实施例中,若编码网络、解码网络和任务分析网络一起进行端到端训练,编码网络、解码网络和任务分析网络在训练时的目标损失是根据编码网络输出的特征信息码流的比特率、解码点概率分布码流的比特率和任务分析网络的任务分析结果损失中的至少一个确定的。
例如,目标损失为任务分析结果损失、特征信息码流的比特率和解码点概率分布码流的比特率三者之和。
再例如,目标损失为预设参数与任务分析结果损失的乘积、与特征信息码流的比特率和解码点概率分布码流的比特率之和。
示例性的,可以通过如下公式(1)确定编码网络、解码网络和任务分析网络一起进行端到端训练时的目标损失loss:
loss = λ·loss_task + R(ŷ) + R(ẑ)   (1)
其中,R(ŷ)与R(ẑ)分别是潜在特征表述的比特率(即特征码流的比特率)和边信息的比特率(即解码点概率分布码流的比特率),λ表示预设参数,也称为率失真权衡参数,loss_task为任务分析网络的任务分析结果损失,例如为任务分析网络的预测的任务分析结果和任务分析结果真值之间的损失。
可选的,预设参数与解码网络和任务分析网络中至少一个的网络模型相关,例如,不同的预设参数λ对应不同的模型,即不同的总比特率,总比特率为特征码流的比特率和边信息的比特率之和。
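下面给出一段示意性的代码草图,按公式(1)的形式,用熵模型给出的似然估计两路码流的比特率,并与任务分析结果损失加权求和得到端到端训练的目标损失。该代码仅为说明性示例,张量形状、λ取值等均为假设。

    import torch

    def rd_task_loss(task_loss, y_likelihoods, z_likelihoods, lam, num_pixels):
        # R(y_hat):特征码流的估计比特率;R(z_hat):解码点概率分布码流(边信息)的估计比特率
        bpp_y = torch.sum(-torch.log2(y_likelihoods)) / num_pixels
        bpp_z = torch.sum(-torch.log2(z_likelihoods)) / num_pixels
        # 目标损失 = λ × 任务分析结果损失 + R(y_hat) + R(z_hat)
        return lam * task_loss + bpp_y + bpp_z

    task_loss = torch.tensor(0.8)                               # 任务分析结果损失(假设)
    y_like = torch.rand(1, 192, 16, 16).clamp(1e-9, 1.0)        # 特征的似然(假设)
    z_like = torch.rand(1, 128, 4, 4).clamp(1e-9, 1.0)          # 边信息的似然(假设)
    loss = rd_task_loss(task_loss, y_like, z_like, lam=0.01, num_pixels=256 * 256)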
本申请实施例提供的视频解码方法,通过将当前图像的特征码流输入解码网络中,得到解码网络的第i个中间层输出的第一特征信息,以及解码网络输出的当前图像的重建图像;可选的,还将重建图像输入任务分析网络中,得到任务分析网络的第j-1层输出的第三特征信息;将第三特征信息和第一特征信息输入第j个中间层中,得到任务分析网络输出的任务分析结果,提高任务分析的准确性。
上文对申请视频解码方法进行了介绍,下面结合实施例,对本申请实施例涉及的视频编码方法进行介绍。
图13为本申请一实施例提供的视频编码方法的流程示意图,本申请实施例的执行主体可以为图1所示的编码器。如图13所示,本申请实施例的方法包括:
S901、获取待编码的当前图像。
S902、将当前图像输入编码网络中,得到编码网络输出的特征码流。
在模型训练时,上述编码网络和上述解码网络一起进行端到端训练,其中解码网络的第i个中间层输出的第一特征信息输入任务分析网络的第j个中间层中。
本申请的当前图像可以理解为视频流中待编码的一帧图像或该帧图像中的部分图像;或者,当前图像可以理解为单张的待编码图像或该张待编码图像中的部分图像。
在一些实施例中,如图14A所示,编码网络包括第一编码子网络和编码单元,此时,上述S902包括:
S902-A1、将当前图像输入第一编码子网络中,得到当前图像的初始特征信息;
S902-A2、将初始特征信息输入编码单元中,得到编码单元输出的特征码流。
可选的,上述编码单元可以为熵编码单元,用于对初始特征信息进行熵编码,得到当前图像的特征码流。可选的,该编码单元为神经网络。
在一些实施例中,如图14B所示,编码网络还包括量化单元,此时,上述S902-A2包括:
S902-A21、将初始特征信息输入量化单元中进行量化,得到量化后的特征信息;
S902-A22、将量化后的特征信息输入编码单元中,得到编码单元输出的特征码流。
本申请实施例对量化步长不做限制。
在一些实施例中,如图14C所示,编码网络还包括第二编码子网络,此时,本申请实施例的方法还包括:将初始特征信息输入第二编码子网络中进行解码点的概率分布估计,得到第二编码子网络输出的当前图像的解码点概率分布码流。
可选的,上述第二编码子网络为超先验网络。
示例性的,假设编码网络为上述Cheng2020编解码网络中的编码部分,其网络结构如图14D所示。将当前图像输入第一编码子网络中,经过卷积层以及注意力模块后,得到从当前图像中提取的特征信息,将该特征信息经过量化单元的量化和编码单元的熵编码生成特征码流。另外,将该特征信息经过第二编码子网络的概率分布估计,得到解码点出现的概率分布,对其进行量化和熵编码生成解码点概率分布码流。
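下面给出一段示意性的代码草图,概括编码端"特征提取、量化、熵编码生成特征码流,同时由第二编码子网络估计解码点概率分布并生成解码点概率分布码流"的流程。该代码仅为说明性示例:各子网络用极简卷积代替,量化用取整近似,实际的算术熵编码过程此处省略。

    import torch
    import torch.nn as nn

    class ToyEncoder(nn.Module):
        def __init__(self):
            super().__init__()
            # 第一编码子网络:从当前图像中提取特征(假设的极简结构)
            self.analysis = nn.Sequential(
                nn.Conv2d(3, 128, 5, stride=2, padding=2), nn.ReLU(),
                nn.Conv2d(128, 192, 5, stride=2, padding=2),
            )
            # 第二编码子网络(超先验):估计解码点概率分布的潜在表示
            self.hyper = nn.Sequential(
                nn.Conv2d(192, 128, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(128, 128, 3, stride=2, padding=1),
            )

        def forward(self, img):
            y = self.analysis(img)          # 初始特征信息
            y_hat = torch.round(y)          # 量化(训练时可用加均匀噪声来近似量化)
            z = self.hyper(y)               # 解码点概率分布的潜在表示
            z_hat = torch.round(z)
            # y_hat经熵编码形成特征码流,z_hat经熵编码形成解码点概率分布码流(此处省略熵编码)
            return y_hat, z_hat

    enc = ToyEncoder()
    y_hat, z_hat = enc(torch.randn(1, 3, 256, 256))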
可选的,图14D中的注意力模块(attention module)使用简化后的注意力模块(simplified attention module)来替代,其结构如图14E所示,其中RB(Residual block)表示残差块。注意力模块通常用于提高图像压缩的性能,但是通常使用的注意力模块在训练的时候非常耗时,所以通过去除非局部块来简化通用的注意力模块,以减小训练的复杂度。
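下面给出一段示意性的代码草图,示出一种去除非局部块后的简化注意力模块:主干分支与掩膜分支各由若干残差块RB组成,掩膜分支经1×1卷积和sigmoid得到门控权重。该代码仅为说明性示例,残差块内部结构、层数与通道数均为假设,与图14E所示结构不必完全一致。

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        # 简化的残差块RB(假设的内部结构)
        def __init__(self, c):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(c, c // 2, 1), nn.ReLU(),
                nn.Conv2d(c // 2, c // 2, 3, padding=1), nn.ReLU(),
                nn.Conv2d(c // 2, c, 1),
            )

        def forward(self, x):
            return x + self.body(x)

    class SimplifiedAttention(nn.Module):
        # 去除非局部块的简化注意力模块:输出 = 输入 + 主干分支(x) × 掩膜分支(x)
        def __init__(self, c, n_rb=3):
            super().__init__()
            self.trunk = nn.Sequential(*[ResidualBlock(c) for _ in range(n_rb)])
            self.mask = nn.Sequential(*[ResidualBlock(c) for _ in range(n_rb)],
                                      nn.Conv2d(c, c, 1), nn.Sigmoid())

        def forward(self, x):
            return x + self.trunk(x) * self.mask(x)

    attn = SimplifiedAttention(192)
    out = attn(torch.randn(1, 192, 16, 16))     # 输出与输入形状相同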
在本申请的一些实施例中,在模型训练时,编码网络、解码网络和任务分析网络一起进行端到端训练。
在一些实施例中,编码网络、解码网络和任务分析网络在训练时的目标损失是根据编码网络输出的特征信息码流的比特率、解码点概率分布码流的比特率和任务分析网络的任务分析结果损失中的至少一个确定的。
示例性的,目标损失为预设参数与任务分析结果损失的乘积、与特征信息码流的比特率和解码点概率分布码流的比特率之和。
可选的,预设参数与解码网络和任务分析网络中至少一个的网络模型相关。
在一些实施例中,本申请实施例的编码网络和解码网络为端到端编解码网络。下面对本申请实施例可能会涉及到的几种端到端编解码网络进行介绍。
图15为通用端到端编解码网络示意图,其中ga可以理解为第一编码子网络,ha为第二编码子网络,gs为第一解码子网络,hs为第二解码子网络。在一些实施例中,第一编码子网络ga也称为主编码网络或主编码器,第一解码子网络gs称为主解码网络或主解码器,第二编码子网络ha和第二解码子网络hs称为超先验网络。通用端到端编解码网络的压缩特征流程为:输入的原始图片通过第一编码子网络ga得到特征信息y,特征信息y经过量化器Q后得到量化后的特征信息ŷ。第二编码子网络ha(即超先验网络)对特征信息ŷ中的潜在表示用均值为0、方差为σ的高斯建模。由于编码单元AE以及解码单元AD在编解码阶段都需要得到解码点出现的概率分布,因此,第二编码子网络ha(即超先验网络)估计解码点的概率分布z,并将解码点的概率分布z量化为ẑ,并对ẑ进行压缩,形成解码点概率分布码流。接着,将解码点概率分布码流输入到解码端,解码端进行解码,得到量化后的解码点的概率分布ẑ。将解码点的概率分布ẑ输入解码端的第二解码子网络hs(即超先验网络)中,得到特征信息ŷ的建模分布,解码单元结合特征信息ŷ的建模分布解码特征码流,得到当前图像的特征信息ŷ。将当前图像的特征信息ŷ输入第一解码子网络gs中,得到重建图像x̂。
IGDN(inverse generalized divisive normalization)为逆广义除法归一化。
在一些实施例中,本申请实施例的端到端编解码网络为如上述图5C所示网络,在一些实施例中,该端到端编解码网络也称为Lee编解码网络。Lee编解码网络采用迁移学习的方法提升网络重建图像的质量。通过利用不同任务之间的内在可迁移性,Lee编解码网络在基础编解码网络的框架上增加了质量增强模块,例如GRDN(Grouped Residual Dense Network,分组残差密集网络)。Lee编解码网络的压缩流程为:将图像x输入到第一编码子网络ga(即主编码网络或变换分析网络)中得到隐含表示y,将y量化为ŷ,对ŷ进行编码,得到特征码流。将y输入到第二编码子网络ha(即超先验模型)中,使用超先验模型来进一步表示y的空间关系z,z为解码点出现的概率分布。接着,将z量化后的ẑ输入到熵编码器EC中对ẑ编码,形成解码点概率分布码流,可选的,该解码点概率分布码流也称为参数码流。通过第二解码子网络hs,从参数码流重建得到超先验参数c′_i,从特征码流中得到全局上下文参数等模型参数c″_i和c‴_i。将超先验参数c′_i、全局上下文参数等模型参数输入到参数估计器f中,并同特征码流输入解码单元中得到y。第一解码子网络基于特征信息y来重建图像x̂。
在一些实施例中,本申请实施例的端到端编解码网络为如上述图5D所示网络,在一些实施例中,该端到端编解码网络也称为Hu编解码网络。Hu编解码网络成功地以低比特率构建了紧凑和表达性的表示,用以支持包括高级语义相关任务和中级几何解析任务等多样化的机器视觉任务集。Hu编解码网络通过使用高层的语义图增强低级的视觉特征,并且验证了这种方法可以有效提升图像压缩的码率-精度-失真表现。Hu编解码网络的压缩流程为:首先从图像中提取出深度特征h_i,将特征h_i变换为便于编码和概率估计的离散值;由于特征分布未知,为了方便计算,引入了带隐变量z的高斯模型来估计特征分布,但是估计p_z的边缘概率分布非常困难,所以使用超分析转换(Hyper Analysis Transform)模块来为z建立超先验v,将超先验v输入到算术编码器中,使用参数化分布模型q_v来近似估计概率分布p_v,将q_v的估计参数系数序列A_l进行解码输出。接着使用码书{C_1,C_2,...,C_τ}和系数序列A_l共同生成带空间信息的超先验Z。最后使用算术编解码器对超先验Z的均值和方差进行估计,从而重建特征h′_i,其中重建的特征为考虑了空间维度的特征输出和不考虑空间维度的特征输出,分别用于执行智能任务和分析图像的统计特征。
在一些实施例中,本申请实施例的端到端编解码网络为如上述图5B所示网络,在一些实施例中,该端到端编解码网络也称为Cheng2020编解码网络。Cheng2020编解码网络的压缩过程与图15所示的通用端到端编解码网络的压缩过程一致,不同在于其使用的不是高斯模型,而是离散高斯混合似然,具体压缩过程参照上述图15部分的描述,在此不再赘述。
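下面给出一段示意性的代码草图,示出"离散化高斯混合似然"的基本计算方式:对量化后的特征,用K个高斯分量的CDF差值近似每个整数符号的概率,进而可用于估计码率。该代码仅为说明性示例,分量个数K、参数张量的来源与形状均为假设。

    import torch

    def discrete_gmm_likelihood(y_hat, weights, means, scales):
        # y_hat: (B, C, H, W) 量化后的特征;weights/means/scales: (B, K, C, H, W)
        scales = scales.clamp(min=1e-6)
        y = y_hat.unsqueeze(1)
        dist = torch.distributions.Normal(means, scales)
        # 每个整数符号的概率 ≈ CDF(y+0.5) - CDF(y-0.5)
        probs = dist.cdf(y + 0.5) - dist.cdf(y - 0.5)
        probs = (weights.softmax(dim=1) * probs).sum(dim=1)
        return probs.clamp(min=1e-9)            # 可用于计算 -log2(p) 以估计比特率

    B, K, C, H, W = 1, 3, 8, 4, 4
    y_hat = torch.randint(-5, 6, (B, C, H, W)).float()
    p = discrete_gmm_likelihood(y_hat,
                                torch.randn(B, K, C, H, W),
                                torch.randn(B, K, C, H, W),
                                torch.rand(B, K, C, H, W) + 0.1)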
本申请实施例的端到端编解码网络除了上述图所示的端到端编解码网络外,还可以是其他的端到端的编解码网络。
在一些实施例中,编码网络和解码网络是单独的神经网络,不是端到端的神经网络。
应理解,上述图7至图15仅为本申请的示例,不应理解为对本申请的限制。
以上结合附图详细描述了本申请的优选实施方式,但是,本申请并不限于上述实施方式中的具体细节,在本申请的技术构思范围内,可以对本申请的技术方案进行多种简单变型,这些简单变型均属于本申请的保护范围。例如,在上述具体实施方式中所描述的各个具体技术特征,在不矛盾的情况下,可以通过任何合适的方式进行组合,为了避免不必要的重复,本申请对各种可能的组合方式不再另行说明。又例如,本申请的各种不同的实施方式之间也可以进行任意组合,只要其不违背本申请的思想,其同样应当视为本申请所公开的内容。
还应理解,在本申请的各种方法实施例中,上述各过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。另外,本申请实施例中,术语“和/或”,仅仅是一种描述关联对象的关联关系,表示可以存在三种关系。具体地,A和/或B可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本文中字符“/”,一般表示前后关联对象是一种“或”的关系。
上文结合图7至图15,详细描述了本申请的方法实施例,下文结合图16到图18,详细描述本申请的装置实施例。
图16是本申请实施例提供的视频解码器的示意性框图。
如图16所示,视频解码器10包括:
解码单元11,用于将当前图像的特征码流输入解码网络中,得到所述解码网络的第i个中间层输出的第一特征信息,所述i为正整数;
任务单元12,用于将所述第一特征信息输入任务分析网络的第j个中间层中,得到所述任务分析网络输出的任务分析结果,所述j为正整数。
在一些实施例中,任务单元12,具体用于将所述第一特征信息输入特征适配器中进行特征适配,得到第二特征信息,所述第二特征信息的大小与所述第j个中间层的预设输入大小一致;将所述第二特征信息输入所述第j个中间层中,得到所述任务分析网络输出的任务分析结果。
可选的,特征信息的大小包括特征信息的尺寸和/或通道数。
在一些实施例中,若特征信息的大小包括特征信息的通道数,则任务单元12,具体用于若所述第一特征信息的通道数大于所述第j个中间层的输入通道数,则通过所述特征适配器将所述第一特征信息的通道数减小至与所述第j个中间层的输入通道数相同;若所述第一特征信息的通道数小于所述第j个中间层的输入通道数,则通过所述特征适配器将所述第一特征信息的通道数增加至与所述第j个中间层的输入通道数相同。
在一些实施例中,任务单元12,具体用于若所述特征适配器为基于非神经网络的特征适配器,将所述第一特征信息输入所述特征适配器,以使所述特征适配器采用主成分分析PCA方式或随机选择的方式,从所述第一特征信息的通道中选出所述第j个中间层的输入通道数个通道;若所述特征适配器为基于神经网络的特征适配器,将所述第一特征信息输入所述特征适配器中,通过所述特征适配器中的至少一个卷积层,将所述第一特征信息的通道数减小至与所述第j个中间层的输入通道数相同。
在一些实施例中,若所述特征适配器为基于非神经网络的特征适配器,则任务单元12,具体用于若所述第j个中间层的输入通道数是所述第一特征信息的通道数的整数倍,则将所述第一特征信息的通道复制所述整数倍,以使复制后的所述第一特征信息的通道数与所述第j个中间层的输入通道数相同;或者,
若所述第j个中间层的输入通道数不是所述第一特征信息的通道数的整数倍,则将所述第一特征信息的通道复制N倍,且从所述第一特征信息的通道中选出M个通道,对所述M个通道进行复制后与复制N倍的第一特征信息的通道进行合并,以使合并后的所述第一特征信息的通道数与所述第j个中间层的输入通道数相同,所述N为所述第j个中间层的输入通道数与所述第一特征信息的通道数相除后的商,所述M为所述第j个中间层的输入通道数与所述第一特征信息的通道数相除后的余数,所述N、M均为正整数;或者,
从所述第一特征信息的通道中选出P个主要特征通道,对所述P个主要特征通道进行复制后与所述第一特征信息的通道进行合并,以使合并后的所述第一特征信息的通道数与所述第j个中间层的输入通道数相同,所述P为所述第j个中间层的输入通道数与所述第一特征信息的通道数之间的差值,所述P为正整数;
在一些实施例中,若所述特征适配器为基于神经网络的特征适配器,则任务单元12,具体用于将所述第一特征信息输入所述特征适配器中,通过所述特征适配器中的至少一个卷积层,将所述第一特征信息的通道数增加至与所述第j个中间层的输入通道数相同。
在一些实施例中,任务单元12,具体用于将所述第一特征信息输入所述特征适配器,以使所述特征适配器采用主成分分析PCA方式,从所述第一特征信息的通道中选择出与所述第j个中间层的输入通道数相同的通道。
在一些实施例中,任务单元12,具体用于若所述特征适配器为基于非神经网络的特征适配器,则通过所述特征适配器,对所述第一特征信息下采样,以使下采样后的所述第一特征信息的尺寸与所述第j个中间层的输入尺寸相同;若所述特征适配器为基于神经网络的特征适配器,则通过所述特征适配器中的至少一个池化层,将所述第一特征信息的尺寸下采样至与所述第j个中间层的输入尺寸相同。
可选的,所述池化层为最大池化层、平均池化层、重叠池化层中的任意一个。
在一些实施例中,任务单元12,具体用于若所述特征适配器为基于非神经网络的特征适配器,则通过所述特征适配器,对所述第一特征信息上采样,以使上采样后的所述第一特征信息的尺寸与所述第j个中间层的输入尺寸相同; 若所述特征适配器为基于神经网络的特征适配器,则通过所述特征适配器中的至少一个上池化层,将所述第一特征信息的尺寸上采样至与所述第j个中间层的输入尺寸相同。
在一些实施例中,若特征信息的大小包括特征信息的尺寸,则任务单元12,具体用于若所述第一特征信息的尺寸大于所述第j个中间层的输入尺寸,则通过所述特征适配器将所述第一特征信息下采样至与所述第j个中间层的输入尺寸相同;若所述第一特征信息的尺寸小于所述第j个中间层的输入尺寸,则通过所述特征适配器将所述第一特征信息上采样至与所述第j个中间层的输入尺寸相同。
在一些实施例中,解码单元11,还用于将所述当前图像的特征码流输入所述解码网络中,得到所述解码网络输出的所述当前图像的重建图像。
在一些实施例中,任务单元12,具体用于将所述重建图像输入所述任务分析网络中,得到所述任务分析网络的第j-1层输出的第三特征信息;将所述第三特征信息和所述第一特征信息输入所述第j个中间层中,得到所述任务分析网络输出的任务分析结果。
在一些实施例中,任务单元12,具体用于将所述第三特征信息和所述第一特征信息进行联合,将联合后的特征信息输入所述第j个中间层中,得到所述任务分析网络输出的任务分析结果。
在一些实施例中,所述解码网络包括解码单元和第一解码子网络,所述解码单元11,具体用于将所述当前图像的特征码流输入所述解码单元中,得到所述解码单元输出的所述当前图像的初始特征信息;将所述初始特征信息输入所述第一解码子网络中,得到所述第一解码子网络的第i个中间层输出的第一特征信息。
在一些实施例中,所述解码网络还包括反量化单元,所述解码单元11,具体用于将所述初始特征信息输入所述反量化单元中,得到反量化后的特征信息;将所述反量化后的特征信息输入所述第一解码子网络中,得到所述第一解码子网络的第i个中间层输出的第一特征信息。
在一些实施例中,所述解码网络还包括第二解码子网络,解码单元11,还用于将所述当前图像的解码点概率分布码流输入所述第二解码子网络中,得到所述当前图像的解码点的概率分布;将所述当前图像的特征码流和所述当前图像的解码点的概率分布输入所述解码单元中,得到所述解码单元输出的所述当前图像的初始特征信息。
可选的,在模型训练时,所述解码网络和所述任务分析网络一起进行端到端训练。
可选的,在模型训练时,所述解码网络和编码网络一起进行端到端训练。
可选的,在模型训练时,编码网络、所述解码网络和所述任务分析网络一起进行端到端训练。
在一些实施例中,所述编码网络、所述解码网络和所述任务分析网络在训练时的目标损失是根据所述编码网络输出的特征信息码流的比特率、解码点概率分布码流的比特率和所述任务分析网络的任务分析结果损失中的至少一个确定的。
示例性的,所述目标损失为预设参数与所述任务分析结果损失的乘积、与所述特征信息码流的比特率和所述解码点概率分布码流的比特率之和。
可选的,所述预设参数与所述解码网络和所述任务分析网络中至少一个的网络模型相关。
在一些实施例中,所述第i个中间层和所述第j个中间层为所述解码网络和所述任务分析网络中,特征相似度最高和/或模型损失最小的两个中间层。
在一些实施例中,所述第i个中间层和所述第j个中间层之间的特征相似度包括如下至少一个:所述第i个中间层输出的特征图和所述第j个中间层输入的特征图之间的相似度、所述第i个中间层输出的特征大小和所述第j个中间层输入的特征大小之间的相似度、所述第i个中间层输出的特征图的统计直方图和所述第j个中间层输入的特征图的统计直方图之间的相似度。
应理解,装置实施例与方法实施例可以相互对应,类似的描述可以参照方法实施例。为避免重复,此处不再赘述。具体地,图16所示的视频解码器10可以对应于执行本申请实施例的解码方法的相应主体,并且视频解码器10中的各个单元的前述和其它操作和/或功能分别为了实现解码方法等各个方法中的相应流程,为了简洁,在此不再赘述。
图17是本申请实施例提供的视频编码器的示意性框图。
如图17所示,视频编码器20包括:
获取单元21,用于获取待编码的当前图像;
编码单元22,用于将所述当前图像输入编码网络中,得到所述编码网络输出的特征码流,其中在模型训练时,所述编码网络和解码网络一起进行端到端训练,所述解码网络的第i个中间层输出的第一特征信息输入任务分析网络的第j个中间层中。
在一些实施例中,所述编码网络包括第一编码子网络和编码单元,所述编码单元22,具体用于将所述当前图像输入所述第一编码子网络中,得到所述当前图像的初始特征信息;将所述初始特征信息输入所述编码单元中,得到所述编码单元输出的特征码流。
在一些实施例中,所述编码网络还包括量化单元,所述编码单元22,具体用于将所述初始特征信息输入所述量化单元中进行量化,得到量化后的特征信息;将所述量化后的特征信息输入所述编码单元中,得到所述编码单元输出的特征码流。
在一些实施例中,所述编码网络还包括第二编码子网络,编码单元22,还用于将所述初始特征信息输入所述第二编码子网络中进行解码点的概率分布估计,得到所述第二编码子网络输出的所述当前图像的解码点概率分布码流。
可选的,在模型训练时,所述编码网络、解码网络和任务分析网络一起进行端到端训练。
在一些实施例中,所述编码网络、所述解码网络和所述任务分析网络在训练时的目标损失是根据所述编码网络输出的特征信息码流的比特率、解码点概率分布码流的比特率和所述任务分析网络的任务分析结果损失中的至少一个确定的。
示例性的,所述目标损失为预设参数与所述任务分析结果损失的乘积、与所述特征信息码流的比特率和所述解码点概率分布码流的比特率之和。
可选的,所述预设参数与所述解码网络和所述任务分析网络中至少一个的网络模型相关。
应理解,装置实施例与方法实施例可以相互对应,类似的描述可以参照方法实施例。为避免重复,此处不再赘述。具体地,图17所示的视频编码器20可以对应于执行本申请实施例的编码方法的相应主体,并且视频编码器20中的各个单元的前述和其它操作和/或功能分别为了实现编码方法等各个方法中的相应流程,为了简洁,在此不再赘述。
上文中结合附图从功能单元的角度描述了本申请实施例的装置和系统。应理解,该功能单元可以通过硬件形式实现,也可以通过软件形式的指令实现,还可以通过硬件和软件单元组合实现。具体地,本申请实施例中的方法实施例的各步骤可以通过处理器中的硬件的集成逻辑电路和/或软件形式的指令完成,结合本申请实施例公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件单元组合执行完成。可选地,软件单元可以位于随机存储器,闪存、只读存储器、可编程只读存储器、电可擦写可编程存储器、寄存器等本领域的成熟的存储介质中。该存储介质位于存储器,处理器读取存储器中的信息,结合其硬件完成上述方法实施例中的步骤。
图18是本申请实施例提供的电子设备的示意性框图。
如图18所示,该电子设备30可以为本申请实施例所述的视频编码器,或者视频解码器,该电子设备30可包括:
存储器33和处理器32,该存储器33用于存储计算机程序34,并将该程序代码34传输给该处理器32。换言之,该处理器32可以从存储器33中调用并运行计算机程序34,以实现本申请实施例中的方法。
例如,该处理器32可用于根据该计算机程序34中的指令执行上述方法中的步骤。
在本申请的一些实施例中,该处理器32可以包括但不限于:
通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等等。
在本申请的一些实施例中,该存储器33包括但不限于:
易失性存储器和/或非易失性存储器。其中,非易失性存储器可以是只读存储器(Read-Only Memory,ROM)、可编程只读存储器(Programmable ROM,PROM)、可擦除可编程只读存储器(Erasable PROM,EPROM)、电可擦除可编程只读存储器(Electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(Random Access Memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(Static RAM,SRAM)、动态随机存取存储器(Dynamic RAM,DRAM)、同步动态随机存取存储器(Synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(Double Data Rate SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(Enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(synch link DRAM,SLDRAM)和直接内存总线随机存取存储器(Direct Rambus RAM,DR RAM)。
在本申请的一些实施例中,该计算机程序34可以被分割成一个或多个单元,该一个或者多个单元被存储在该存储器33中,并由该处理器32执行,以完成本申请提供的方法。该一个或多个单元可以是能够完成特定功能的一系列计算机程序指令段,该指令段用于描述该计算机程序34在该电子设备30中的执行过程。
如图18所示,该电子设备30还可包括:
收发器33,该收发器33可连接至该处理器32或存储器33。
其中,处理器32可以控制该收发器33与其他设备进行通信,具体地,可以向其他设备发送信息或数据,或接收其他设备发送的信息或数据。收发器33可以包括发射机和接收机。收发器33还可以进一步包括天线,天线的数量可以为一个或多个。
应当理解,该电子设备30中的各个组件通过总线系统相连,其中,总线系统除包括数据总线之外,还包括电源总线、控制总线和状态信号总线。
图19是本申请实施例提供的视频编解码系统的示意性框图。
如图19所示,该视频编解码系统40可包括:视频编码器41和视频解码器42,其中视频编码器41用于执行本申请实施例涉及的视频编码方法,视频解码器42用于执行本申请实施例涉及的视频解码方法。
本申请还提供了一种码流,该码流通过上述编码方法生成的。
本申请还提供了一种计算机存储介质,其上存储有计算机程序,该计算机程序被计算机执行时使得该计算机能够执行上述方法实施例的方法。或者说,本申请实施例还提供一种包含指令的计算机程序产品,该指令被计算机执行时使得计算机执行上述方法实施例的方法。
当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。该计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行该计算机程序指令时,全部或部分地产生按照本申请实施例该的流程或功能。该计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。该计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,该计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(digital subscriber line,DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。该计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。该可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质(例如数字视频光盘(digital video disc,DVD))、或者半导体介质(例如固态硬盘(solid state disk,SSD))等。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,该单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外 的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。例如,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
以上该,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以该权利要求的保护范围为准。

Claims (41)

  1. 一种视频解码方法,其特征在于,包括:
    将当前图像的特征码流输入解码网络中,得到所述解码网络的第i个中间层输出的第一特征信息,所述i为正整数;
    将所述第一特征信息输入任务分析网络的第j个中间层中,得到所述任务分析网络输出的任务分析结果,所述j为正整数。
  2. 根据权利要求1所述的方法,其特征在于,所述将所述第一特征信息输入任务分析网络的第j个中间层中,得到所述任务分析网络输出的任务分析结果,包括:
    将所述第一特征信息输入特征适配器中进行特征适配,得到第二特征信息,所述第二特征信息的大小与所述第j个中间层的预设输入大小一致;
    将所述第二特征信息输入所述第j个中间层中,得到所述任务分析网络输出的任务分析结果。
  3. 根据权利要求2所述的方法,其特征在于,特征信息的大小包括特征信息的尺寸和/或通道数。
  4. 根据权利要求3所述的方法,其特征在于,所述特征适配器包括基于神经网络的特征适配器和基于非神经网络的特征适配器。
  5. 根据权利要求4所述的方法,其特征在于,若特征信息的大小包括特征信息的通道数,则所述将所述第一特征信息输入特征适配器中进行特征适配,包括:
    若所述第一特征信息的通道数大于所述第j个中间层的输入通道数,则通过所述特征适配器将所述第一特征信息的通道数减小至与所述第j个中间层的输入通道数相同;
    若所述第一特征信息的通道数小于所述第j个中间层的输入通道数,则通过所述特征适配器将所述第一特征信息的通道数增加至与所述第j个中间层的输入通道数相同。
  6. 根据权利要求5所述的方法,其特征在于,所述通过所述特征适配器从所述第一特征信息的通道数减小至与所述第j个中间层的输入通道数相同,包括:
    若所述特征适配器为基于非神经网络的特征适配器,将所述第一特征信息输入所述特征适配器,以使所述特征适配器采用主成分分析PCA方式或随机选择的方式,从所述第一特征信息的通道中选出所述第j个中间层的输入通道数个通道;
    若所述特征适配器为基于神经网络的特征适配器,将所述第一特征信息输入所述特征适配器中,通过所述特征适配器中的至少一个卷积层,将所述第一特征信息的通道数减小至与所述第j个中间层的输入通道数相同。
  7. 根据权利要求5所述的方法,其特征在于,若所述特征适配器为基于非神经网络的特征适配器,则所述若所述第一特征信息的通道数小于所述第j个中间层的输入通道数,则通过所述特征适配器将所述第一特征信息的通道数增加至与所述第j个中间层的输入通道数相同,包括:
    若所述第j个中间层的输入通道数是所述第一特征信息的通道数的整数倍,则将所述第一特征信息的通道复制所述整数倍,以使复制后的所述第一特征信息的通道数与所述第j个中间层的输入通道数相同;或者,
    若所述第j个中间层的输入通道数不是所述第一特征信息的通道数的整数倍,则将所述第一特征信息的通道复制N倍,且从所述第一特征信息的通道中选出M个通道,对所述M个通道进行复制后与复制N倍的第一特征信息的通道进行合并,以使合并后的所述第一特征信息的通道数与所述第j个中间层的输入通道数相同,所述N为所述第j个中间层的输入通道数与所述第一特征信息的通道数相除后的商,所述M为所述第j个中间层的输入通道数与所述第一特征信息的通道数相除后的余数,所述N、M均为正整数;或者,
    从所述第一特征信息的通道中选出P个主要特征通道,对所述P个主要特征通道进行复制后与所述第一特征信息的通道进行合并,以使合并后的所述第一特征信息的通道数与所述第j个中间层的输入通道数相同,所述P为所述第j个中间层的输入通道数与所述第一特征信息的通道数之间的差值,所述P为正整数;
  8. 根据权利要求5所述的方法,其特征在于,若所述特征适配器为基于神经网络的特征适配器,则所述若所述第一特征信息的通道数小于所述第j个中间层的输入通道数,则通过所述特征适配器将所述第一特征信息的通道数增加至与所述第j个中间层的输入通道数相同,包括:
    将所述第一特征信息输入所述特征适配器中,通过所述特征适配器中的至少一个卷积层,将所述第一特征信息的通道数增加至与所述第j个中间层的输入通道数相同。
  9. 根据权利要求4所述的方法,其特征在于,若特征信息的大小包括特征信息的尺寸,则所述将所述第一特征信息输入特征适配器中进行特征适配,包括:
    若所述第一特征信息的尺寸大于所述第j个中间层的输入尺寸,则通过所述特征适配器将所述第一特征信息下采样至与所述第j个中间层的输入尺寸相同;
    若所述第一特征信息的尺寸小于所述第j个中间层的输入尺寸,则通过所述特征适配器将所述第一特征信息上采样至与所述第j个中间层的输入尺寸相同。
  10. 根据权利要求9所述的方法,其特征在于,所述若所述第一特征信息的尺寸大于所述第j个中间层的输入尺寸,则通过所述特征适配器将所述第一特征信息下采样至与所述第j个中间层的输入尺寸相同,包括:
    若所述特征适配器为基于非神经网络的特征适配器,则通过所述特征适配器,对所述第一特征信息下采样,以使下采样后的所述第一特征信息的尺寸与所述第j个中间层的输入尺寸相同;
    若所述特征适配器为基于神经网络的特征适配器,则通过所述特征适配器中的至少一个池化层,将所述第一特征信息的尺寸下采样至与所述第j个中间层的输入尺寸相同。
  11. 根据权利要求10所述的方法,其特征在于,所述池化层为最大池化层、平均池化层、重叠池化层中的任意一个。
  12. 根据权利要求9所述的方法,其特征在于,所述若所述第一特征信息的尺寸小于所述第j个中间层的输入尺 寸,则通过所述特征适配器将所述第一特征信息上采样至与所述第j个中间层的输入尺寸相同,包括:
    若所述特征适配器为基于非神经网络的特征适配器,则通过所述特征适配器,对所述第一特征信息上采样,以使上采样后的所述第一特征信息的尺寸与所述第j个中间层的输入尺寸相同;
    若所述特征适配器为基于神经网络的特征适配器,则通过所述特征适配器中的至少一个上池化层,将所述第一特征信息的尺寸上采样至与所述第j个中间层的输入尺寸相同。
  13. 根据权利要求1-12任一项所述的方法,其特征在于,所述方法还包括:
    将所述当前图像的特征码流输入所述解码网络中,得到所述解码网络输出的所述当前图像的重建图像。
  14. 根据权利要求13所述的方法,其特征在于,所述方法还包括:
    将所述重建图像输入所述任务分析网络中,得到所述任务分析网络的第j-1层输出的第三特征信息;
    所述将所述第一特征信息输入任务分析网络的第j个中间层中,得到所述任务分析网络输出的任务分析结果,包括:
    将所述第三特征信息和所述第一特征信息输入所述第j个中间层中,得到所述任务分析网络输出的任务分析结果。
  15. 根据权利要求14所述的方法,其特征在于,所述将所述第三特征信息和所述第一特征信息输入所述第j个中间层中,得到所述任务分析网络输出的任务分析结果,包括:
    将所述第三特征信息和所述第一特征信息进行联合,将联合后的特征信息输入所述第j个中间层中,得到所述任务分析网络输出的任务分析结果。
  16. 根据权利要求1-12任一项所述的方法,其特征在于,所述解码网络包括解码单元和第一解码子网络,所述将当前图像的特征码流输入解码网络中,得到所述解码网络的第i个中间层输出的第一特征信息,包括:
    将所述当前图像的特征码流输入所述解码单元中,得到所述解码单元输出的所述当前图像的初始特征信息;
    将所述初始特征信息输入所述第一解码子网络中,得到所述第一解码子网络的第i个中间层输出的第一特征信息。
  17. 根据权利要求16所述的方法,其特征在于,所述解码网络还包括反量化单元,所述将所述初始特征信息输入所述第一解码子网络中,得到所述第一解码子网络的第i个中间层输出的第一特征信息,包括:
    将所述初始特征信息输入所述反量化单元中,得到反量化后的特征信息;
    将所述反量化后的特征信息输入所述第一解码子网络中,得到所述第一解码子网络的第i个中间层输出的第一特征信息。
  18. 根据权利要求16所述的方法,其特征在于,所述解码网络还包括第二解码子网络,所述方法还包括:
    将所述当前图像的解码点概率分布码流输入所述第二解码子网络中,得到所述当前图像的解码点的概率分布;
    所述将所述当前图像的特征码流输入所述解码单元中,得到所述解码单元输出的所述当前图像的初始特征信息,包括:
    将所述当前图像的特征码流和所述当前图像的解码点的概率分布输入所述解码单元中,得到所述解码单元输出的所述当前图像的初始特征信息。
  19. 根据权利要求1-12任一项所述的方法,其特征在于,在模型训练时,所述解码网络和所述任务分析网络一起进行端到端训练。
  20. 根据权利要求1-12任一项所述的方法,其特征在于,在模型训练时,所述解码网络和编码网络一起进行端到端训练。
  21. 根据权利要求1-12任一项所述的方法,其特征在于,在模型训练时,编码网络、所述解码网络和所述任务分析网络一起进行端到端训练。
  22. 根据权利要求21所述的方法,其特征在于,所述编码网络、所述解码网络和所述任务分析网络在训练时的目标损失是根据所述编码网络输出的特征信息码流的比特率、解码点概率分布码流的比特率和所述任务分析网络的任务分析结果损失中的至少一个确定的。
  23. 根据权利要求22所述的方法,其特征在于,所述目标损失为预设参数与所述任务分析结果损失的乘积、与所述特征信息码流的比特率和所述解码点概率分布码流的比特率之和。
  24. 根据权利要求23所述的方法,其特征在于,所述预设参数与所述解码网络和所述任务分析网络中至少一个的网络模型相关。
  25. 根据权利要求1-12任一项所述的方法,其特征在于,所述第i个中间层和所述第j个中间层为所述解码网络和所述任务分析网络中,特征相似度最高和/或模型损失最小的两个中间层。
  26. 根据权利要求25所述的方法,其特征在于,所述第i个中间层和所述第j个中间层之间的特征相似度包括如下至少一个:所述第i个中间层输出的特征图和所述第j个中间层输入的特征图之间的相似度、所述第i个中间层输出的特征大小和所述第j个中间层输入的特征大小之间的相似度、所述第i个中间层输出的特征图的统计直方图和所述第j个中间层输入的特征图的统计直方图之间的相似度。
  27. 一种视频编码方法,其特征在于,包括:
    获取待编码的当前图像;
    将所述当前图像输入编码网络中,得到所述编码网络输出的特征码流;
    其中,在模型训练时,所述编码网络和解码网络一起进行端到端训练,所述解码网络的第i个中间层输出的第一特征信息输入任务分析网络的第j个中间层中。
  28. 根据权利要求27所述的方法,其特征在于,所述编码网络包括第一编码子网络和编码单元,所述将所述当前图像输入编码网络中,得到所述编码网络输出的特征码流,包括:
    将所述当前图像输入所述第一编码子网络中,得到所述当前图像的初始特征信息;
    将所述初始特征信息输入所述编码单元中,得到所述编码单元输出的特征码流。
  29. 根据权利要求28所述的方法,其特征在于,所述编码网络还包括量化单元,所述将所述初始特征信息输入所 述编码单元中,得到所述编码单元输出的特征码流,包括:
    将所述初始特征信息输入所述量化单元中进行量化,得到量化后的特征信息;
    将所述量化后的特征信息输入所述编码单元中,得到所述编码单元输出的特征码流。
  30. 根据权利要求28所述的方法,其特征在于,所述编码网络还包括第二编码子网络,所述方法还包括:
    将所述初始特征信息输入所述第二编码子网络中进行解码点的概率分布估计,得到所述第二编码子网络输出的所述当前图像的解码点概率分布码流。
  31. 根据权利要求27-30任一项所述的方法,其特征在于,在模型训练时,所述编码网络、解码网络和任务分析网络一起进行端到端训练。
  32. 根据权利要求31所述的方法,其特征在于,所述编码网络、所述解码网络和所述任务分析网络在训练时的目标损失是根据所述编码网络输出的特征信息码流的比特率、解码点概率分布码流的比特率和所述任务分析网络的任务分析结果损失中的至少一个确定的。
  33. 根据权利要求32所述的方法,其特征在于,所述目标损失为预设参数与所述任务分析结果损失的乘积、与所述特征信息码流的比特率和所述解码点概率分布码流的比特率之和。
  34. 根据权利要求33所述的方法,其特征在于,所述预设参数与所述解码网络和所述任务分析网络中至少一个的网络模型相关。
  35. 一种视频解码器,其特征在于,包括:
    解码单元,用于将当前图像的特征码流输入解码网络中,得到所述解码网络的第i个中间层输出的第一特征信息,所述i为正整数;
    任务单元,用于将所述第一特征信息输入任务分析网络的第j个中间层中,得到所述任务分析网络输出的任务分析结果,所述j为正整数。
  36. 一种视频编码器,其特征在于,包括:
    获取单元,用于获取待编码的当前图像;
    编码单元,用于将所述当前图像输入编码网络中,得到所述编码网络输出的特征码流,其中在模型训练时,所述编码网络和解码网络一起进行端到端训练,所述解码网络的第i个中间层输出的第一特征信息输入任务分析网络的第j个中间层中。
  37. 一种视频解码器,其特征在于,包括处理器和存储器;
    所述存储器用于存储计算机程序;
    所述处理器用于调用并运行所述存储器中存储的计算机程序,以实现上述权利要求1至26任一项所述的方法。
  38. 一种视频编码器,其特征在于,包括处理器和存储器;
    所述存储器用于存储计算机程序;
    所述处理器用于调用并运行所述存储器中存储的计算机程序,以实现如上述权利要求27至34任一项所述的方法。
  39. 一种视频编解码系统,其特征在于,包括:
    根据权利要求37所述的视频解码器;
    以及根据权利要求38所述的视频编码器。
  40. 一种计算机可读存储介质,其特征在于,用于存储计算机程序;
    所述计算机程序使得计算机执行如上述权利要求1至26或27至34任一项所述的方法。
  41. 一种码流,其特征在于,所述码流是通过如上述权利要求27至34任一项所述的方法生成的。
PCT/CN2021/122473 2021-09-30 2021-09-30 视频编解码方法、编码器、解码器及存储介质 WO2023050433A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2021/122473 WO2023050433A1 (zh) 2021-09-30 2021-09-30 视频编解码方法、编码器、解码器及存储介质
CN202180102730.0A CN118020306A (zh) 2021-09-30 2021-09-30 视频编解码方法、编码器、解码器及存储介质

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/122473 WO2023050433A1 (zh) 2021-09-30 2021-09-30 视频编解码方法、编码器、解码器及存储介质

Publications (1)

Publication Number Publication Date
WO2023050433A1 true WO2023050433A1 (zh) 2023-04-06

Family

ID=85781225

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/122473 WO2023050433A1 (zh) 2021-09-30 2021-09-30 视频编解码方法、编码器、解码器及存储介质

Country Status (2)

Country Link
CN (1) CN118020306A (zh)
WO (1) WO2023050433A1 (zh)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106529589A (zh) * 2016-11-03 2017-03-22 温州大学 采用降噪堆叠自动编码器网络的视觉目标检测方法
CN110334738A (zh) * 2019-06-05 2019-10-15 大连理工大学 用于图像识别的多分类网络的方法
CN112037225A (zh) * 2020-08-20 2020-12-04 江南大学 一种基于卷积神经的海洋船舶图像分割方法
CN112257858A (zh) * 2020-09-21 2021-01-22 华为技术有限公司 一种模型压缩方法及装置
WO2021050007A1 (en) * 2019-09-11 2021-03-18 Nanyang Technological University Network-based visual analysis
CN112587129A (zh) * 2020-12-01 2021-04-02 上海影谱科技有限公司 一种人体动作识别方法及装置
EP3859606A1 (en) * 2020-01-30 2021-08-04 Fujitsu Limited Training program, training method, and information processing apparatus

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106529589A (zh) * 2016-11-03 2017-03-22 温州大学 采用降噪堆叠自动编码器网络的视觉目标检测方法
CN110334738A (zh) * 2019-06-05 2019-10-15 大连理工大学 用于图像识别的多分类网络的方法
WO2021050007A1 (en) * 2019-09-11 2021-03-18 Nanyang Technological University Network-based visual analysis
EP3859606A1 (en) * 2020-01-30 2021-08-04 Fujitsu Limited Training program, training method, and information processing apparatus
CN112037225A (zh) * 2020-08-20 2020-12-04 江南大学 一种基于卷积神经的海洋船舶图像分割方法
CN112257858A (zh) * 2020-09-21 2021-01-22 华为技术有限公司 一种模型压缩方法及装置
CN112587129A (zh) * 2020-12-01 2021-04-02 上海影谱科技有限公司 一种人体动作识别方法及装置

Also Published As

Publication number Publication date
CN118020306A (zh) 2024-05-10

Similar Documents

Publication Publication Date Title
US10623775B1 (en) End-to-end video and image compression
US11924445B2 (en) Instance-adaptive image and video compression using machine learning systems
WO2019001108A1 (zh) 视频处理的方法和装置
US11544606B2 (en) Machine learning based video compression
KR20240012374A (ko) 기계 학습 시스템들을 사용한 암시적 이미지 및 비디오 압축
TWI806199B (zh) 特徵圖資訊的指示方法,設備以及電腦程式
WO2022155974A1 (zh) 视频编解码以及模型训练方法与装置
JP2023512570A (ja) 画像処理方法および関連装置
KR20240054975A (ko) 기계 학습 시스템들을 사용하여 네트워크 파라미터 서브공간에서의 인스턴스 적응적 이미지 및 비디오 압축
CN115868161A (zh) 基于强化学习的速率控制
TWI826160B (zh) 圖像編解碼方法和裝置
WO2023193629A1 (zh) 区域增强层的编解码方法和装置
TW202348029A (zh) 使用限幅輸入數據操作神經網路
WO2023050433A1 (zh) 视频编解码方法、编码器、解码器及存储介质
WO2022194137A1 (zh) 视频图像的编解码方法及相关设备
KR20200044668A (ko) Ai 부호화 장치 및 그 동작방법, 및 ai 복호화 장치 및 그 동작방법
TW202337211A (zh) 條件圖像壓縮
CN114501031B (zh) 一种压缩编码、解压缩方法以及装置
WO2023206420A1 (zh) 视频编解码方法、装置、设备、系统及存储介质
WO2023165487A1 (zh) 特征域光流确定方法及相关设备
WO2023169303A1 (zh) 编解码方法、装置、设备、存储介质及计算机程序产品
US20240015314A1 (en) Method and apparatus for encoding or decoding a picture using a neural network
US20240121398A1 (en) Diffusion-based data compression
TW202345034A (zh) 使用條件權重操作神經網路
JP2024519791A (ja) 機械学習システムを使用する暗黙的画像およびビデオ圧縮

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21958999

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021958999

Country of ref document: EP

Effective date: 20240311