WO2023098688A1 - Image encoding and decoding method and device - Google Patents

Image encoding and decoding method and device

Info

Publication number
WO2023098688A1
Authority
WO
WIPO (PCT)
Prior art keywords
image feature
point
feature
nonlinear
image
Prior art date
Application number
PCT/CN2022/135204
Other languages
English (en)
French (fr)
Inventor
郭天生
王晶
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2023098688A1

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/0455 - Auto-encoder networks; Encoder-decoder networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00 - Image coding
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00 - Image coding
    • G06T9/002 - Image coding using neural networks
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/136 - Incoming video signal characteristics or properties
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/90 - Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • H04N19/91 - Entropy coding, e.g. variable length coding [VLC] or arithmetic coding

Definitions

  • the present application relates to the technical field of image processing, and in particular to an image encoding and decoding method and device.
  • Some researchers have designed end-to-end deep learning image/video compression algorithms in which the encoding network, entropy estimation network, entropy encoding network, entropy decoding network, decoding network and other modules are optimized jointly as a whole. The encoding network and the decoding network may also be called the transform module and the inverse transform module, and generally consist of convolutional layers and nonlinear transform units.
  • the nonlinear transformation unit is one of the basic components of the image/video compression network, and its nonlinear characteristics directly affect the rate-distortion performance of the compression algorithm. Therefore, designing a more efficient nonlinear transform unit is the key to further improving the rate-distortion performance in image/video compression algorithms.
  • Embodiments of the present application provide an image encoding and decoding method and device, which can realize efficient nonlinear transformation processing in an encoding/decoding network, and further improve the rate-distortion performance in an image/video compression algorithm.
  • the embodiment of the present application provides an image coding method, including: acquiring a first image feature to be processed; performing nonlinear transformation processing on the first image feature to obtain a processed image feature, where the nonlinear transformation processing includes, in sequence, a first nonlinear operation, convolution processing and a point-by-point multiplication operation; and encoding according to the processed image feature to obtain a code stream.
  • the first image feature is obtained by the encoding end after obtaining the image to be processed and converting it from the image domain to the feature domain.
  • the conversion here may include, but is not limited to: 1. Convolution processing, which uses convolutional layers to extract features, has a local receptive field, and uses a weight-sharing mechanism (that is, each filter slides over the input features). 2. Feature extraction with a multi-layer perceptron (MLP) or fully connected layer, which has a global receptive field and does not share weights. 3. Transformer processing, which includes matrix multiplication, MLP and normalization, has a global receptive field and a strong ability to capture long-distance dependencies.
  • the first nonlinear operation is performed on each feature value in the first image feature to obtain a second image feature; convolution processing is performed on the second image feature to obtain a third image feature, where the multiple feature values of the third image feature correspond to the multiple feature values of the first image feature; the corresponding feature values of the first image feature and the third image feature are multiplied point by point to obtain the processed image feature.
  • the first nonlinear operation is an operation performed on each eigenvalue in the above-mentioned first image feature, and may include absolute value operation, linear rectification function (Rectified Linear Unit, ReLU) series, Sigmoid, Tanh, piecewise linear (piecewise linear, PWL) operations, etc., where the linear rectification function is also called a modified linear unit.
  • the first image feature is converted into the second image feature, and both can be expressed in matrix form. Since the first nonlinear operation acts on each feature value of the first image feature, each feature value of the first image feature has a corresponding feature value in the second image feature, so the matrix corresponding to the second image feature has the same size as the matrix corresponding to the first image feature, and the feature values (matrix element values) at the same positions correspond to each other. For example, if the first image feature is expressed as a 3×3 matrix, the second image feature can also be expressed as a 3×3 matrix; however, because the first nonlinear operation has been applied, the feature values of the two image features, and correspondingly the element values of the two matrices, are not exactly the same.
  • the third image feature can be regarded as the local response (that is, the correction value) of the second image feature, i.e. the result of applying convolution processing to the second image feature. Because the receptive field of the convolution processing is limited, the response value at each position of the output feature depends only on input feature values at neighboring positions, which is why it is called a local response.
  • under the attention mechanism, among all the feature values in the first image feature some are important and some are redundant; the output of the convolution processing can serve as a weight for each feature value in the image feature, correcting the original feature values, highlighting the important ones and suppressing the redundant ones.
  • the point-by-point multiplication operation uses the aforementioned local information to correct the value of each feature value of the first image feature. It does not require the convolution parameters to be positive, which removes a restriction on the value space of the convolution parameters; better convolution parameters can therefore be found in a wider value space, achieving better image compression performance (see the sketch below).
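The data flow just described can be sketched as follows. This is a minimal illustration only, assuming PyTorch, a depth-preserving 3×3 convolution and absolute value as the first nonlinear operation; the class and parameter names are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

class NonlinearTransformUnit(nn.Module):
    """Sketch of: first nonlinear op -> convolution -> point-by-point multiplication."""
    def __init__(self, channels: int = 192, kernel_size: int = 3):
        super().__init__()
        # Local response: receptive field limited to kernel_size x kernel_size.
        self.conv = nn.Conv2d(channels, channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = torch.abs(x)   # first nonlinear operation, applied to each feature value
        w = self.conv(a)   # third image feature: local "attention" weights
        return x * w       # point-by-point multiplication corrects each value

# Usage: a feature map with batch 1, 192 channels, 16x16 spatial size.
if __name__ == "__main__":
    unit = NonlinearTransformUnit(192)
    y = unit(torch.randn(1, 192, 16, 16))
    print(y.shape)  # torch.Size([1, 192, 16, 16]) -- same shape as the input
```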
  • the nonlinear transformation processing further includes a point-by-point addition operation after the point-by-point multiplication operation.
  • the first nonlinear operation is performed on each feature value in the first image feature to obtain a second image feature; convolution processing is performed on the second image feature to obtain a third image feature, where the multiple feature values of the third image feature correspond to the multiple feature values of the first image feature; the corresponding feature values of the first image feature and the third image feature are multiplied point by point to obtain a fourth image feature, where the multiple feature values of the fourth image feature correspond to the multiple feature values of the first image feature; and a point-by-point addition operation is performed on the corresponding feature values of the first image feature and the fourth image feature to obtain the processed image feature.
  • the point-by-point addition operation forms an additive residual structure, which makes a codec network that uses the above processing converge more easily during training.
  • the convolution processing conv1(x) is similar to the convolution processing conv2(x); the difference is that the convolution bias parameter in conv2(x) is increased by 1. The conversion between the above two implementations can therefore be achieved by fine-tuning the bias parameter of the convolution: if the nonlinear transformation unit includes the convolution processing conv1(x) followed by a point-by-point addition operation, the two can be fused into the convolution processing conv2(x), so that the point-by-point addition operation is omitted (a numerical check is sketched below).
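The fusion described above can be checked numerically. In this illustrative sketch, conv2 is built from conv1 by adding 1 to every bias value, so multiplying by conv2's output reproduces the multiply-then-add residual form; the names and sizes are assumptions for the example.

```python
import copy
import torch
import torch.nn as nn

conv1 = nn.Conv2d(8, 8, 3, padding=1)
conv2 = copy.deepcopy(conv1)
with torch.no_grad():
    conv2.bias += 1.0              # only difference: bias parameter increased by 1

x = torch.randn(1, 8, 16, 16)      # first image feature
a = torch.abs(x)                   # first nonlinear operation

with_residual = x * conv1(a) + x   # conv1 + separate point-by-point addition
fused = x * conv2(a)               # conv2 alone, residual fused into the bias

print(torch.allclose(with_residual, fused, atol=1e-6))  # True
```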
  • the encoding side can continue to perform convolution processing on them, or perform further nonlinear transformation processing on the output features of the convolution processing, and then encode the features resulting from the previous processing to obtain the code stream.
  • each input feature value only needs its own value to obtain the output of the first nonlinear operation, without considering the influence of surrounding feature values; the original feature values are then corrected, important feature values are highlighted and redundant ones are suppressed, and the value of each feature value of the first image feature is corrected without restricting the convolution parameters. In this way, efficient nonlinear transformation processing is achieved in the encoding network, further improving the rate-distortion performance of image/video compression algorithms.
  • the first nonlinear operation is an operation carried out on each feature value in the above-mentioned first image feature, and may adopt a piecewise linear mapping, which can be taking the absolute value of the input feature value, a rectified linear unit (ReLU), a leaky rectified linear unit (LeakyReLU), or a piecewise linear (PWL) operation.
  • ReLU is a piecewise linear mapping method.
  • the output of eigenvalues less than 0 is 0, and the eigenvalues greater than or equal to 0 are output identically;
  • LeakyReLU is a piecewise linear mapping method.
  • the input eigenvalues less than 0 are scaled with a preset weight, usually 0.01;
  • the PWL operation is also a piecewise linear mapping, and it can have more segments; see the Examples section below for its specific definition.
  • the nonlinear operation may also include other methods, such as piecewise nonlinear operation, Tanh, and Sigmoid, which are not specifically limited in this embodiment of the present application.
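As a rough illustration of the per-element candidates listed above (absolute value, ReLU, LeakyReLU and a toy PWL written as a sum of shifted ReLUs; a real PWL may use more segments and trained breakpoints and slopes):

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

abs_out   = torch.abs(x)    # absolute value
relu_out  = torch.relu(x)   # ReLU: negatives -> 0, non-negatives unchanged
leaky_out = torch.nn.functional.leaky_relu(x, negative_slope=0.01)  # scale negatives by 0.01

def pwl(x, base_slope, breakpoints, slope_changes, bias=0.0):
    """Toy piecewise linear map written as a sum of shifted ReLUs:
    f(x) = bias + base_slope*x + sum_i slope_changes[i] * relu(x - breakpoints[i])."""
    y = bias + base_slope * x
    for b, d in zip(breakpoints, slope_changes):
        y = y + d * torch.relu(x - b)
    return y

# Two-segment example: slope 0.1 below 0 and slope 1.0 above 0 (LeakyReLU-like).
pwl_out = pwl(x, base_slope=0.1, breakpoints=[0.0], slope_changes=[0.9])
print(abs_out, relu_out, leaky_out, pwl_out, sep="\n")
```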
  • the above-mentioned nonlinear transformation processing further includes a second nonlinear operation between the above-mentioned convolution processing and the point-by-point multiplication operation, and the second nonlinear operation is the same as or different from the first nonlinear operation.
  • for example, the first nonlinear operation can be an absolute value operation, and the second nonlinear operation can also be an absolute value operation, another piecewise linear mapping, or another nonlinear operation;
  • the first nonlinear operation can be ReLU, and the second nonlinear operation can also be ReLU, Sigmoid, or another nonlinear operation;
  • the first nonlinear operation can be LeakyReLU, and the second nonlinear operation can also be LeakyReLU, Tanh, or another nonlinear operation;
  • the first nonlinear operation can be PWL, and the second nonlinear operation can also be PWL or another nonlinear operation.
  • the piecewise linear mapping can use different numbers of segments; the mapping slope on each segment can be determined by training or specified directly; each channel of the input feature map can use a different piecewise linear function, or all channels can use the same piecewise linear function, or several channels can share the same piecewise linear function.
  • the residual structure is no longer fused with the convolution, but can instead be fused with the piecewise linear function, that is, +1 is added to the output of the original piecewise linear function to form a new piecewise linear function (see the sketch below).
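A small illustrative check of this alternative fusion, with a toy two-segment piecewise linear map standing in for the actual second nonlinear operation: adding 1 to the output of the piecewise linear function makes the separate point-by-point addition unnecessary.

```python
import torch

def pwl(x):
    # Toy second nonlinear operation: LeakyReLU-like two-segment piecewise linear map.
    return torch.where(x >= 0, x, 0.1 * x)

def pwl_plus_one(x):
    # Same map with the residual "+1" fused into its output.
    return pwl(x) + 1.0

x = torch.randn(1, 8, 16, 16)   # first image feature
c = torch.randn(1, 8, 16, 16)   # stands in for the convolution output

residual_form = x * pwl(c) + x    # multiplication, then separate point-by-point addition
fused_form = x * pwl_plus_one(c)  # addition fused into the piecewise linear function

print(torch.allclose(residual_form, fused_form))  # True
```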
  • a nonlinear transformation unit for the training stage can also be constructed; the nonlinear transformation unit in the training stage includes a nonlinear operation layer, a convolution processing layer, a point-by-point multiplication operation layer, and a point-by-point addition operation layer;
  • a trained nonlinear transformation unit is obtained through training according to pre-acquired training data, and the trained nonlinear transformation unit is used to implement the nonlinear transformation process.
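A minimal sketch of such a training-stage nonlinear transformation unit, with the first nonlinear operation layer, convolution layer, point-by-point multiplication layer and point-by-point addition layer composed in one module. The module, shapes and the toy training step are illustrative assumptions; in practice the unit would be trained jointly with the rest of the encoding/decoding network under a rate-distortion loss.

```python
import torch
import torch.nn as nn

class TrainableNonlinearTransformUnit(nn.Module):
    """Training-stage unit: x -> |x| -> conv -> (x * conv(|x|)) + x."""
    def __init__(self, channels: int = 192):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.conv(torch.abs(x))   # nonlinear operation layer + convolution layer
        return x * attn + x              # point-by-point multiplication + addition layers

# One illustrative training step on random data (stands in for real training data).
unit = TrainableNonlinearTransformUnit(16)
opt = torch.optim.Adam(unit.parameters(), lr=1e-3)
x = torch.randn(4, 16, 8, 8)
target = torch.randn(4, 16, 8, 8)
loss = torch.nn.functional.mse_loss(unit(x), target)
opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))
```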
  • the embodiment of the present application provides an image decoding method, including: acquiring a first image feature to be processed; performing nonlinear transformation processing on the first image feature to obtain a processed image feature, where the nonlinear transformation processing includes, in sequence, a first nonlinear operation, convolution processing and a point-by-point multiplication operation; and obtaining a reconstructed image according to the processed image feature.
  • the performing nonlinear transformation processing on the first image feature to obtain the processed image feature includes: performing the first nonlinear operation on each feature value in the first image feature to obtain a second image feature; performing the convolution processing on the second image feature to obtain a third image feature, where multiple feature values in the third image feature correspond to multiple feature values in the first image feature; and performing the point-by-point multiplication operation on the corresponding feature values of the first image feature and the third image feature to obtain the processed image feature.
  • the nonlinear transformation processing further includes a point-by-point addition operation.
  • the performing nonlinear transformation processing on the first image feature to obtain the processed image feature includes: performing the first nonlinear operation on each feature value in the first image feature to obtain a second image feature; performing the convolution processing on the second image feature to obtain a third image feature, where multiple feature values in the third image feature correspond to multiple feature values in the first image feature; performing the point-by-point multiplication operation on the corresponding feature values of the first image feature and the third image feature to obtain a fourth image feature; and performing the point-by-point addition operation on the corresponding feature values of the first image feature and the fourth image feature to obtain the processed image feature.
  • the nonlinear operation includes piecewise linear mapping, such as ReLU, LeakyReLU, PWL, Abs, and the like.
  • the nonlinear operation includes a continuous function, such as Tanh, Sigmoid, and the like.
  • the nonlinear operation includes a piecewise nonlinear operation.
  • the embodiment of the present application provides an encoding device, including: an acquisition module, configured to acquire the first image feature to be processed; a transformation module, configured to perform nonlinear transformation processing on the first image feature to obtain the processed image features, the nonlinear transformation processing sequentially includes a first nonlinear operation, convolution processing, and point-by-point multiplication operations; an encoding module is used to encode according to the processed image features to obtain a code stream.
  • the transformation module is specifically configured to: perform the first nonlinear operation on each feature value in the first image feature to obtain a second image feature; perform the convolution processing on the second image feature to obtain a third image feature, where multiple feature values in the third image feature correspond to multiple feature values in the first image feature; and perform the point-by-point multiplication operation on the corresponding feature values of the first image feature and the third image feature to obtain the processed image feature.
  • the nonlinear transformation processing further includes a point-by-point addition operation after the point-by-point multiplication operation.
  • the transformation module is specifically configured to: perform the first nonlinear operation on each feature value in the first image feature to obtain a second image feature; perform the convolution processing on the second image feature to obtain a third image feature, where multiple feature values in the third image feature correspond to multiple feature values in the first image feature; perform the point-by-point multiplication operation on the corresponding feature values of the first image feature and the third image feature to obtain a fourth image feature, where multiple feature values in the fourth image feature correspond to multiple feature values in the first image feature; and perform the point-by-point addition operation on the corresponding feature values of the first image feature and the fourth image feature to obtain the processed image feature.
  • the nonlinear transformation processing further includes a second nonlinear operation between the convolution processing and the point-by-point multiplication operation, and the second nonlinear operation is the same as or different from the first nonlinear operation.
  • for example, the first nonlinear operation may be an absolute value operation, and the second nonlinear operation may also be an absolute value operation, another piecewise linear mapping, or another nonlinear operation.
  • the nonlinear operation includes piecewise linear mapping, such as ReLU, LeakyReLU, PWL, Abs, and the like.
  • the nonlinear operation includes a continuous function, such as Tanh, Sigmoid, and the like.
  • the nonlinear operation includes a piecewise nonlinear operation.
  • a training module configured to construct a nonlinear transformation unit in the training phase, where the nonlinear transformation unit in the training phase includes a first nonlinear operation layer, a convolution processing layer, a point-by-point multiplication operation layer and a point-by-point addition operation layer; a trained nonlinear transformation unit is obtained by training according to pre-acquired training data, and the trained nonlinear transformation unit is used to realize the nonlinear transformation processing.
  • the embodiment of the present application provides a decoding device, including: an acquisition module, configured to acquire a first image feature to be processed; a transformation module, configured to perform nonlinear transformation processing on the first image feature to obtain a processed image feature, where the nonlinear transformation processing includes the first nonlinear operation, convolution processing and a point-by-point multiplication operation; and a reconstruction module, configured to obtain a reconstructed image according to the processed image feature.
  • the transformation module is specifically configured to: perform the first nonlinear operation on each feature value in the first image feature to obtain a second image feature; perform the convolution processing on the second image feature to obtain a third image feature, where multiple feature values in the third image feature correspond to multiple feature values in the first image feature; and perform the point-by-point multiplication operation on the corresponding feature values of the first image feature and the third image feature to obtain the processed image feature.
  • the nonlinear transformation processing further includes a point-by-point addition operation after the point-by-point multiplication operation.
  • the transformation module is specifically configured to: perform the first nonlinear operation on each feature value in the first image feature to obtain a second image feature; perform the convolution processing on the second image feature to obtain a third image feature, where multiple feature values in the third image feature correspond to multiple feature values in the first image feature; perform the point-by-point multiplication operation on the corresponding feature values of the first image feature and the third image feature to obtain a fourth image feature, where multiple feature values in the fourth image feature correspond to multiple feature values in the first image feature; and perform the point-by-point addition operation on the corresponding feature values of the first image feature and the fourth image feature to obtain the processed image feature.
  • the nonlinear transformation processing further includes a second nonlinear operation between the convolution processing and the point-by-point multiplication operation, and the second nonlinear operation is the same as or different from the first nonlinear operation.
  • the nonlinear operation includes piecewise linear mapping, such as ReLU, LeakyReLU, PWL, Abs, and the like.
  • the nonlinear operation includes a continuous function, such as Tanh, Sigmoid, and the like.
  • the nonlinear operation includes a piecewise nonlinear operation.
  • a training module configured to construct a nonlinear transformation unit in the training phase, where the nonlinear transformation unit in the training phase includes a first nonlinear operation layer, a convolution processing layer, a point-by-point multiplication operation layer and a point-by-point addition operation layer; a trained nonlinear transformation unit is obtained by training according to pre-acquired training data, and the trained nonlinear transformation unit is used to realize the nonlinear transformation processing.
  • the embodiment of the present application provides an encoder, including: one or more processors; and a non-transitory computer-readable storage medium, coupled to the processors and storing a program executed by the processors, where the program, when executed by the processors, causes the encoder to perform the method described in any one of the above first aspects.
  • the embodiment of the present application provides a decoder, including: one or more processors; a non-transitory computer-readable storage medium, coupled to the processor and storing a program executed by the processor, wherein The program, when executed by the processor, causes the decoder to perform the method described in any one of the above second aspects.
  • the embodiment of the present application provides a computer program product including program code, which is used to execute the method described in any one of the first to second aspects above when the program code is executed on a computer or a processor .
  • the embodiments of the present application provide a computer-readable storage medium, including instructions, which, when run on a computer, cause the computer to execute the method described in any one of the first to second aspects above.
  • an embodiment of the present application provides a code stream, which is generated by a processor executing the method described in any one of the first aspect.
  • the embodiment of the present application provides a device for storing code streams, the device comprising: a receiver and at least one storage medium, where the receiver is configured to receive a code stream and the at least one storage medium is configured to store the code stream; the code stream is generated according to the method described in any one of the first aspects.
  • the embodiment of the present application provides a device for transmitting a code stream, the device including: a transmitter and at least one storage medium, where the at least one storage medium is configured to store a code stream generated by a processor according to the method of any one of the first aspects, and the transmitter is configured to send the code stream to another electronic device.
  • the embodiment of the present application provides a system for distributing code streams.
  • the system includes: at least one storage medium and a streaming media device, where the at least one storage medium is configured to store at least one code stream, the at least one code stream including a code stream generated according to any one of the implementations of the first aspect; the streaming media device is configured to obtain a target code stream from the at least one storage medium and send the target code stream to an end-side device, where the streaming media device includes a content server or a content distribution server.
  • the embodiment of the present application provides a system for distributing code streams, the system comprising: a communication interface, configured to receive a user's request for acquiring a target code stream; and a processor, configured to determine, in response to the user request, the storage location of the target code stream; the communication interface is further configured to send the storage location of the target code stream to the user, so that the user obtains the target code stream from that storage location, where the target code stream is generated by a processor executing the method described in any one of the first aspects.
  • FIG. 1A is a schematic block diagram of an exemplary decoding system 10
  • FIG. 1B is an illustrative diagram of an example of a video coding system 40
  • FIG. 2 is a schematic diagram of a video decoding device 400 provided by an embodiment of the present invention.
  • FIG. 3 is a simplified block diagram of an apparatus 500 provided by an exemplary embodiment
  • Fig. 4 is an example diagram of an end-to-end deep learning image codec framework
  • Fig. 5 is an example diagram of an end-to-end deep learning video codec framework
  • FIG. 6 is an example diagram of an application scenario of an embodiment of the present application.
  • FIG. 7 is an example diagram of an application scenario of an embodiment of the present application.
  • FIG. 8 is an example diagram of an application scenario of an embodiment of the present application.
  • FIG. 9 is a flowchart of a process 900 of an image encoding method according to an embodiment of the present application.
  • Figure 10a is a structural diagram of a nonlinear transformation unit with a local attention mechanism
  • Fig. 10b is a structural diagram of a residual nonlinear transformation unit with a local attention mechanism
  • Figure 10c is a structural diagram of a residual nonlinear transformation unit with an attention mechanism
  • Figure 10d is a structural diagram of a nonlinear transformation unit with an attention mechanism
  • Figure 11 is a schematic diagram of the PWL function
  • Fig. 12 is a schematic diagram of convolution processing
  • Fig. 13 is a structural diagram of an encoding network
  • FIG. 14 is a flowchart of a process 1300 of an image decoding method according to an embodiment of the present application.
  • Figure 15 is a structural diagram of the decoding network
  • Figure 16a is an exemplary structural diagram of ResAU
  • Figure 16b is an exemplary structural diagram of ResAU
  • Figure 16c is an exemplary structural diagram of ResAU
  • Figure 16d is an exemplary structural diagram of ResAU
  • Figure 16e is an exemplary structural diagram of ResAU
  • Figure 17a shows the overall performance of ResAU on the 24 images of the Kodak test set
  • Figure 17b shows the overall performance of ResAU on the 24 images of the Kodak test set
  • FIG. 18 is a schematic structural diagram of an exemplary encoding device 1700 according to an embodiment of the present application.
  • FIG. 19 is a schematic structural diagram of an exemplary decoding apparatus 1800 according to an embodiment of the present application.
  • At least one (item) means one or more, and “multiple” means two or more.
  • “And/or” is used to describe the association relationship of associated objects, indicating that there can be three types of relationships, for example, “A and/or B” can mean: only A exists, only B exists, and A and B exist at the same time , where A and B can be singular or plural.
  • the character “/” generally indicates that the contextual objects are an “or” relationship.
  • "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single items or plural items.
  • At least one item (piece) of a, b or c can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c can each be single or multiple.
  • Bit rate: in image compression, the average code length required to encode a unit pixel.
  • Rate-distortion performance: an index used to measure the performance of a compression algorithm, which jointly considers the bit rate and the degree of distortion of the decoded image.
  • Attention mechanism: a means of screening high-value information from a large amount of information using limited attention resources.
  • a neural network can be made to focus more on relevant parts of the input and less on irrelevant parts.
  • Nonlinear transformation unit: a network unit that contains nonlinear operations (such as ReLU, Sigmoid, Tanh or PWL operations) and whose overall computation does not satisfy the linearity property.
  • Neural network (neural network, NN) is a machine learning model.
  • a neural network can be composed of neural units.
  • a neural unit can refer to a computing unit that takes inputs x_s and an intercept of 1 as input.
  • the output of the computing unit can be: h_{W,b}(x) = f(Σ_s W_s·x_s + b), where W_s is the weight of x_s and b is the bias of the neural unit.
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer.
  • the activation function can be a nonlinear function such as ReLU.
  • a neural network is a network formed by connecting many of the above-mentioned single neural units, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the characteristics of the local receptive field.
  • the local receptive field can be an area composed of several neural units.
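For example, a single neural unit with a ReLU activation can be sketched as follows (illustrative values):

```python
import numpy as np

def neural_unit(xs, Ws, b):
    """Output of one neural unit: f(sum_s Ws[s]*xs[s] + b), with f = ReLU."""
    z = np.dot(Ws, xs) + b
    return np.maximum(z, 0.0)

print(neural_unit(np.array([0.5, -1.0, 2.0]),
                  np.array([0.2, 0.4, 0.1]),
                  b=0.05))
# ReLU(0.5*0.2 - 1.0*0.4 + 2.0*0.1 + 0.05) = ReLU(-0.05) = 0.0
```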
  • Multi-layer perceptron (MLP)
  • MLP is a simple deep neural network (deep neural network, DNN) (different layers are fully connected), also known as a multi-layer neural network, which can be understood as a neural network with many hidden layers.
  • there is no particular metric for how many hidden layers count as "many".
  • the layers of a DNN can be divided into three categories: the input layer, hidden layers, and the output layer.
  • the first layer is the input layer
  • the last layer is the output layer
  • the layers in the middle are all hidden layers.
  • the layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
  • although a DNN looks complicated, the work done by each layer is actually not complicated.
  • the coefficient from the kth neuron of layer L-1 to the jth neuron of layer L is defined as W_jk^L. It should be noted that the input layer has no W parameters.
  • more hidden layers make the network more capable of describing complex situations in the real world. Theoretically speaking, a model with more parameters is more complex and has a greater "capacity", which means it can complete more complex learning tasks.
  • Training the deep neural network is the process of learning the weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (the weight matrix formed by the vector W of many layers).
  • CNN Convolutional neural network
  • a convolutional neural network consists of a feature extractor consisting of convolutional and pooling layers. The feature extractor can be seen as a filter, and the convolution process can be seen as using a trainable filter to convolve with an input image or convolution feature map.
  • the convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network.
  • the convolution layer can include many convolution operators, which are also called kernels, and their role in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • the convolution operator can essentially be a weight matrix, which is usually pre-defined. During the convolution operation on an image, the weight matrix is usually slid along the horizontal direction of the input image one pixel at a time (or two pixels at a time, depending on the value of the stride) to extract specific features from the image.
  • the size of the weight matrix should be related to the size of the image.
  • the depth dimension of the weight matrix is the same as the depth dimension of the input image.
  • the weight matrix will extend to the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolution output with a single depth dimension, but in most cases, instead of a single weight matrix, multiple weight matrices of the same size (rows × columns), i.e. multiple matrices of the same shape, are applied.
  • the output of each weight matrix is stacked to form the depth dimension of the convolution image, where the dimension can be understood as determined by the "multiple" mentioned above. Different weight matrices can be used to extract different features in the image.
  • one weight matrix is used to extract image edge information
  • another weight matrix is used to extract specific colors of the image
  • another weight matrix is used to blur unwanted noise in the image, and so on.
  • the multiple weight matrices have the same size (rows × columns), so the feature maps extracted by them are also of the same size, and the extracted feature maps of the same size are then combined to form the output of the convolution operation.
  • the weight values in these weight matrices need to be obtained through a lot of training in practical applications, and each weight matrix formed by the weight values obtained through training can be used to extract information from the input image, so that the convolutional neural network can make correct predictions.
  • the initial convolutional layers often extract more general features, which can also be called low-level features; as the depth of the convolutional neural network increases, the features extracted by the later convolutional layers become more and more complex, such as high-level semantic features, and features with higher semantics are more suitable for the problem to be solved.
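As a small illustration of the point above, applying several kernels of the same size to a 3-channel input stacks their responses into the depth dimension of the convolution output (sizes are illustrative):

```python
import torch
import torch.nn as nn

image = torch.randn(1, 3, 32, 32)                # RGB input: depth 3
conv = nn.Conv2d(in_channels=3, out_channels=8,  # 8 weight matrices, each 3x3x3
                 kernel_size=3, stride=1, padding=1)
features = conv(image)
print(features.shape)  # torch.Size([1, 8, 32, 32]): output depth = number of kernels
```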
  • a pooling layer is often added after a convolutional layer: it can be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers.
  • the sole purpose of pooling layers is to reduce the spatial size of the image.
  • the pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling an input image to obtain an image of a smaller size.
  • the average pooling operator can calculate the pixel values in the image within a specific range to generate an average value as the result of average pooling.
  • the maximum pooling operator can take the pixel with the largest value within a specific range as the result of maximum pooling.
  • the operators in the pooling layer should also be related to the size of the image.
  • the size of the image output after being processed by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
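For instance (illustrative sizes), a 2×2 pooling operator halves the spatial size, and each output value is the average or the maximum of the corresponding 2×2 sub-region:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)
avg = nn.AvgPool2d(kernel_size=2)(x)   # each output value = mean of a 2x2 region
mx  = nn.MaxPool2d(kernel_size=2)(x)   # each output value = max of a 2x2 region
print(avg.shape, mx.shape)             # both torch.Size([1, 8, 16, 16])
```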
  • the convolutional neural network After being processed by the convolutional layer/pooling layer, the convolutional neural network is not enough to output the required output information. Because as mentioned earlier, the convolutional layer/pooling layer only extracts features and reduces the parameters brought by the input image. However, in order to generate the final output information (required class information or other relevant information), the convolutional neural network needs to use the neural network layer to generate an output of one or a set of required classes. Therefore, the neural network layer can include multiple hidden layers, and the parameters contained in the multi-layer hidden layers can be pre-trained according to the relevant training data of the specific task type. For example, the task type can include image recognition, Image classification, image super-resolution reconstruction and more.
  • the output layer of the entire convolutional neural network is also included.
  • This output layer has a loss function similar to the classification cross entropy, which is specifically used to calculate the prediction error.
  • Recurrent neural networks are used to process sequence data.
  • in a traditional neural network model, the layers are fully connected, while the nodes within each layer are not connected to each other.
  • although this ordinary neural network solves many problems, it is still powerless for many others. For example, to predict the next word in a sentence, you generally need to use the previous words, because the words in a sentence are not independent of one another. RNN is called a recurrent neural network because the current output of a sequence is also related to the previous outputs.
  • RNN can process sequence data of any length.
  • the training of RNN is the same as that of traditional CNN or DNN.
  • the error backpropagation algorithm is also used, but with a difference: if the RNN is unrolled, parameters such as W are shared across steps, whereas this is not the case in the traditional neural networks described above.
  • the output of each step depends not only on the network at the current step, but also on the network states of several previous steps. This learning algorithm is called back-propagation through time (BPTT).
  • during training, a convolutional neural network can use the back propagation (BP) algorithm to correct the parameter values of the initial super-resolution model so that its reconstruction error loss becomes smaller and smaller. Specifically, the input signal is passed forward until the output produces an error loss, and the parameters of the initial super-resolution model are updated by back-propagating the error loss information, so that the error loss converges.
  • the backpropagation algorithm is a backpropagation movement dominated by error loss, aiming to obtain the optimal parameters of the super-resolution model, such as the weight matrix.
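A rough sketch of this forward/backward loop, with a small convolutional model standing in for the super-resolution model and a plain MSE loss standing in for the reconstruction error loss (all names and sizes are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(8, 3, 3, padding=1))
opt = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(2, 3, 16, 16)          # input signal
target = torch.randn(2, 3, 16, 16)     # reference output

for _ in range(5):
    pred = model(x)                                # forward pass until the output
    loss = nn.functional.mse_loss(pred, target)    # error loss
    opt.zero_grad()
    loss.backward()                    # back-propagate the error loss information
    opt.step()                         # update parameters so the loss converges
print(float(loss))
```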
  • a generative adversarial network (GAN) includes at least two modules: one is a generative model, and the other is a discriminative model; the two modules learn from each other through a game to produce better output.
  • Both the generative model and the discriminative model can be neural networks, specifically deep neural networks or convolutional neural networks.
  • the basic principle of GAN is as follows: taking a GAN that generates pictures as an example, suppose there are two networks, G (generator) and D (discriminator), where G is a network that generates pictures: it receives a random noise z and generates a picture from this noise, denoted G(z); D is a discriminative network used to determine whether a picture is "real".
  • its input parameter is x, where x represents a picture, and the output D(x) represents the probability that x is a real picture: if it is 1, x is a real picture with 100% certainty; if it is 0, x cannot be a real picture.
  • the goal of the generative network G is to generate pictures that look as real as possible in order to deceive the discriminative network D, while the goal of the discriminative network D is to distinguish the pictures generated by G from real pictures as well as possible. In this way, G and D constitute a dynamic "game" process, which is the "adversarial" part of the "generative adversarial network" (see the sketch below).
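A minimal sketch of this game, on flat vectors rather than pictures and with illustrative layer sizes: D outputs the probability that its input is real, the discriminator step pushes D(real) toward 1 and D(G(z)) toward 0, and the generator step pushes D(G(z)) toward 1.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 64))               # noise z -> fake sample
D = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())  # sample -> P(real)
bce = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)

real = torch.randn(8, 64)   # stands in for real pictures
z = torch.randn(8, 16)

# Discriminator step: push D(real) -> 1 and D(G(z)) -> 0.
d_loss = bce(D(real), torch.ones(8, 1)) + bce(D(G(z).detach()), torch.zeros(8, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: push D(G(z)) -> 1, i.e. fool the discriminator.
g_loss = bce(D(G(z)), torch.ones(8, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
print(float(d_loss), float(g_loss))
```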
  • FIG. 1A is a schematic block diagram of an exemplary decoding system 10 , such as a video decoding system 10 (or simply referred to as the decoding system 10 ), which may utilize the techniques of the present application.
  • Video encoder 20 (or simply encoder 20) and video decoder 30 (or simply decoder 30) in video coding system 10 represent devices, etc. that may be used to perform techniques according to various examples described in this application. .
  • a decoding system 10 includes a source device 12 for providing encoded image data 21 such as encoded images to a destination device 14 for decoding the encoded image data 21 .
  • the source device 12 includes an encoder 20 , and optionally, an image source 16 , a preprocessor (or a preprocessing unit) 18 such as an image preprocessor, and a communication interface (or a communication unit) 22 .
  • Image source 16 may include or be any type of image capture device for capturing real-world images, and/or any type of image generation device, such as a computer graphics processor for generating computer-generated images (e.g., screen content, virtual reality (VR) images), or any type of device for acquiring and/or providing real-world images or computer-generated images, and/or any combination thereof (e.g., augmented reality (AR) images).
  • the image source may be any type of memory or storage that stores any of the above images.
  • the image (or image data) 17 may also be referred to as an original image (or original image data) 17 .
  • the preprocessor 18 is used to receive (original) image data 17, and preprocess the image data 17 to obtain a preprocessed image (or preprocessed image data) 19.
  • preprocessing performed by preprocessor 18 may include cropping, color format conversion (eg, from RGB to YCbCr), color grading, or denoising. It can be understood that the preprocessing unit 18 can be an optional component.
  • a video encoder (or encoder) 20 is used to receive preprocessed image data 19 and provide encoded image data 21 (further described below with reference to FIG. 2 etc.).
  • the communication interface 22 in the source device 12 may be used to receive the encoded image data 21 and send the encoded image data 21 (or any other processed version) via the communication channel 13 to another device such as the destination device 14 or any other device for storage Or rebuild directly.
  • the destination device 14 includes a decoder 30 , and may also optionally include a communication interface (or communication unit) 28 , a post-processor (or post-processing unit) 32 and a display device 34 .
  • the communication interface 28 in the destination device 14 is used to receive the coded image data 21 (or any other processed version) directly from the source device 12 or from any other source device such as a storage device, for example, the storage device is a coded image data storage device, And the coded image data 21 is supplied to the decoder 30 .
  • the communication interface 22 and the communication interface 28 can be used to send or receive the encoded image data (or encoded data) 21 through a direct communication link between the source device 12 and the destination device 14, such as a direct wired or wireless connection, or through any type of network, such as a wired network, a wireless network, any combination thereof, or any type of private or public network or any combination thereof.
  • the communication interface 22 can be used to encapsulate the encoded image data 21 into a suitable format such as a message, and/or use any type of transmission encoding or processing to process the encoded image data, so that it can be transmitted over a communication link or communication network on the transmission.
  • the communication interface 28 corresponds to the communication interface 22, eg, can be used to receive the transmission data and process the transmission data using any type of corresponding transmission decoding or processing and/or decapsulation to obtain the encoded image data 21 .
  • Both the communication interface 22 and the communication interface 28 can be configured as a one-way communication interface as indicated by an arrow pointing from the source device 12 to the corresponding communication channel 13 of the destination device 14 in FIG. 1A , or a two-way communication interface, and can be used to send and receive messages etc., to establish the connection, confirm and exchange any other information related to the communication link and/or data transmission such as encoded image data transmission, etc.
  • the video decoder (or decoder) 30 is used to receive encoded image data 21 and provide decoded image data (or decoded image data) 31 (which will be further described below with reference to FIG. 3 , etc.).
  • the post-processor 32 is used to perform post-processing on decoded image data 31 (also referred to as reconstructed image data) such as a decoded image to obtain post-processed image data 33 such as a post-processed image.
  • Post-processing performed by post-processing unit 32 may include, for example, color format conversion (e.g., from YCbCr to RGB), color grading, cropping, or resampling, or any other processing for producing decoded image data 31 for display by a display device 34 or the like. .
  • the display device 34 is used to receive the post-processed image data 33 to display the image to a user or viewer or the like.
  • Display device 34 may be or include any type of display for representing the reconstructed image, eg, an integrated or external display screen or display.
  • the display screen may include a liquid crystal display (LCD), an organic light emitting diode (OLED) display, a plasma display, a projector, a micro LED display, a liquid crystal on silicon (LCoS) display, a digital light processor (DLP), or any other type of display.
  • the decoding system 10 also includes a training engine 25, which is used to train the encoder 20 or the decoder 30 to realize the conversion between the image domain and the feature domain.
  • the training data can be stored in a database (not shown), and the training engine 25 can train the encoding/decoding network based on the training data. It should be noted that the embodiment of the present application does not limit the source of the training data, for example, the training data may be obtained from the cloud or other places for model training.
  • although FIG. 1A shows the source device 12 and the destination device 14 as independent devices, device embodiments may also include both the source device 12 and the destination device 14, or the functionality of both, that is, the source device 12 or corresponding functionality and the destination device 14 or corresponding functionality at the same time. In such embodiments, the source device 12 or corresponding functionality and the destination device 14 or corresponding functionality may be implemented using the same hardware and/or software, or by separate hardware and/or software, or any combination thereof.
  • Encoder 20 (e.g., video encoder 20) or decoder 30 (e.g., video decoder 30), or both, may be implemented by processing circuitry, such as one or more microprocessors, digital signal processors (DSP), application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), discrete logic, hardware, dedicated video coding processors, or any combination thereof.
  • Encoder 20 may be implemented by processing circuitry 46 to include the various modules discussed with reference to encoder 20 of FIG. 2 and/or any other encoder system or subsystem described herein.
  • Decoder 30 may be implemented by processing circuitry 46 to include the various modules discussed with reference to decoder 30 of FIG. 3 and/or any other decoder system or subsystem described herein.
  • the processing circuitry 46 may be used to perform various operations discussed below. As shown in FIG. 8, if part of the technology is implemented in software, the device can store the software instructions in a suitable non-transitory computer-readable storage medium and use one or more processors to execute the instructions in hardware, thereby performing the technology of the present application.
  • either of the video encoder 20 and the video decoder 30 may be integrated in a single device as part of a combined codec (encoder/decoder, CODEC), as shown in FIG. 1B.
  • Source device 12 and destination device 14 may comprise any of a variety of devices, including any type of handheld or stationary device, such as a notebook or laptop computer, cell phone, smartphone, tablet or tablet computer, camera, desktop computer, set-top box, television, display device, digital media player, video game console, video streaming device (such as a content service server or content distribution server), broadcast receiving device, broadcast transmitting device, and the like, and may use no operating system or any type of operating system.
  • source device 12 and destination device 14 may be equipped with components for wireless communication. Accordingly, source device 12 and destination device 14 may be wireless communication devices.
  • the video coding system 10 shown in FIG. 1A is merely exemplary, and the techniques provided herein may be applicable to video coding settings (e.g., video encoding or video decoding) that do not necessarily include any data communication between the encoding device and the decoding device.
  • data is retrieved from local storage, sent over a network, and so on.
  • a video encoding device may encode and store data into memory, and/or a video decoding device may retrieve and decode data from memory.
  • encoding and decoding are performed by devices that do not communicate with each other but simply encode data to memory and/or retrieve and decode data from memory.
  • FIG. 1B is an illustrative diagram of an example of video coding system 40 .
  • the video decoding system 40 may include an imaging device 41, a video encoder 20, a video decoder 30 (and/or a video encoder/decoder implemented by a processing circuit 46), an antenna 42, one or more processors 43, one or more memories 44, and/or a display device 45.
  • imaging device 41 , antenna 42 , processing circuit 46 , video encoder 20 , video decoder 30 , processor 43 , memory storage 44 and/or display device 45 are capable of communicating with each other.
  • the video coding system 40 may include only the video encoder 20 or only the video decoder 30 .
  • antenna 42 may be used to transmit or receive an encoded bitstream of video data.
  • display device 45 may be used to present video data.
  • the processing circuit 46 may include application-specific integrated circuit (ASIC) logic, a graphics processor, a general-purpose processor, and the like.
  • the video decoding system 40 may also include an optional processor 43, and the optional processor 43 may similarly include application-specific integrated circuit (ASIC) logic, a graphics processor, a general-purpose processor, and the like.
  • the memory storage 44 can be any type of memory, such as volatile memory (for example, static random access memory (SRAM), dynamic random access memory (DRAM), etc.) or non-volatile memory (for example, flash memory, etc.) and the like.
  • memory storage 44 may be implemented by cache memory.
  • processing circuitry 46 may include memory (eg, cache, etc.) for implementing an image buffer or the like.
  • video encoder 20 implemented by logic circuitry may include an image buffer (eg, implemented by processing circuitry 46 or memory storage 44 ) and a graphics processing unit (eg, implemented by processing circuitry 46 ).
  • a graphics processing unit may be communicatively coupled to the image buffer.
  • Graphics processing unit may include video encoder 20 implemented by processing circuitry 46 to implement the various modules discussed with reference to FIG. 2 and/or any other encoder system or subsystem described herein.
  • Logic circuits may be used to perform the various operations discussed herein.
  • video decoder 30 may be implemented by processing circuitry 46 in a similar manner to implement the various modules discussed with reference to video decoder 30 of FIG. 3 and/or any other decoder system or subsystem described herein.
  • logic circuit implemented video decoder 30 may include an image buffer (implemented by processing circuit 46 or memory storage 44 ) and a graphics processing unit (eg, implemented by processing circuit 46 ).
  • a graphics processing unit may be communicatively coupled to the image buffer.
  • Graphics processing unit may include video decoder 30 implemented by processing circuitry 46 to implement the various modules discussed with reference to FIG. 3 and/or any other decoder system or subsystem described herein.
  • antenna 42 may be used to receive an encoded bitstream of video data.
  • an encoded bitstream may contain data related to encoded video frames, indicators, index values, mode selection data, etc., as discussed herein, such as data related to encoding partitions (e.g., transform coefficients or quantized transform coefficients , (as discussed) an optional indicator, and/or data defining an encoding split).
  • Video coding system 40 may also include video decoder 30 coupled to antenna 42 and used to decode the encoded bitstream.
  • a display device 45 is used to present video frames.
  • the video decoder 30 may be used to perform a reverse process.
  • the video decoder 30 may be configured to receive and parse such syntax elements and decode the associated video data accordingly.
  • video encoder 20 may entropy encode the syntax elements into an encoded video bitstream.
  • video decoder 30 may parse such syntax elements and decode the related video data accordingly.
  • VVC Versatile Video Coding
  • VCEG Video Coding Experts Group
  • MPEG Motion Picture Experts Group
  • HEVC High-Efficiency Video Coding
  • JCT-VC Joint Collaboration Team on Video Coding
  • FIG. 2 is a schematic diagram of a video decoding device 400 provided by an embodiment of the present invention.
  • the video coding apparatus 400 is suitable for implementing the disclosed embodiments described herein.
  • the video decoding device 400 may be a decoder, such as the video decoder 30 in FIG. 1A , or an encoder, such as the video encoder 20 in FIG. 1A .
  • the video decoding device 400 includes: an ingress port 410 (or input port 410) and a receiving unit (receiver unit, Rx) 420 for receiving data; a processor, logic unit, or central processing unit (CPU) 430 for processing data;
  • the processor 430 here may be a neural network processor 430; a sending unit (transmitter unit, Tx) 440 and an egress port 450 (or output port 450) for transmitting data; and a memory 460.
  • the video decoding device 400 may also include optical-to-electrical (OE) components and electrical-to-optical (EO) components coupled to the input port 410, the receiving unit 420, the sending unit 440, and the output port 450, serving as the egress or ingress for optical or electrical signals.
  • the processor 430 is realized by hardware and software.
  • Processor 430 may be implemented as one or more processor chips, cores (eg, multi-core processors), FPGAs, ASICs, and DSPs.
  • the processor 430 is in communication with the ingress port 410 , the receiving unit 420 , the transmitting unit 440 , the egress port 450 and the memory 460 .
  • the processor 430 includes a decoding module 470 (eg, a neural network NN based decoding module 470 ).
  • the decoding module 470 implements the embodiments disclosed above. For example, the decode module 470 performs, processes, prepares, or provides for various encoding operations.
  • decoding module 470 is implemented as instructions stored in memory 460 and executed by processor 430 .
  • Memory 460 including one or more magnetic disks, tape drives, and solid-state drives, may be used as an overflow data storage device for storing programs when such programs are selected for execution, and for storing instructions and data that are read during execution of the programs.
  • Memory 460 may be volatile and/or non-volatile, and may be read-only memory (ROM), random access memory (RAM), ternary content-addressable memory (TCAM), and/or static random-access memory (SRAM).
  • Fig. 3 is a simplified block diagram of an apparatus 500 provided by an exemplary embodiment, and the apparatus 500 may be used as either or both of the source device 12 and the destination device 14 in Fig. 1A.
  • Processor 502 in device 500 may be a central processing unit.
  • processor 502 may be any other type of device or devices, existing or to be developed in the future, capable of manipulating or processing information. While the disclosed implementations can be implemented using a single processor, such as processor 502 as shown, it is faster and more efficient to use more than one processor.
  • memory 504 in apparatus 500 may be a read only memory (ROM) device or a random access memory (RAM) device. Any other suitable type of storage device may be used as memory 504 .
  • Memory 504 may include code and data 506 accessed by processor 502 via bus 512 .
  • Memory 504 may also include an operating system 508 and application programs 510, including at least one program that allows processor 502 to perform the methods described herein.
  • application programs 510 may include applications 1 through N, and also include a video coding application that performs the methods described herein.
  • Apparatus 500 may also include one or more output devices, such as display 518 .
  • display 518 may be a touch-sensitive display that combines the display with touch-sensitive elements that may be used to sense touch input.
  • Display 518 may be coupled to processor 502 via bus 512 .
  • Although bus 512 in device 500 is described herein as a single bus, bus 512 may include multiple buses. Additionally, secondary storage may be directly coupled to the other components of device 500 or accessed over a network, and may comprise a single integrated unit such as a memory card or multiple units such as multiple memory cards. Accordingly, apparatus 500 may have a wide variety of configurations.
  • Figure 4 is an example diagram of an end-to-end deep learning image encoding and decoding framework.
  • the image encoding and decoding framework includes, at the encoding end, an encoding network (Encoder), a quantization module, and an entropy encoding network, and, at the decoding end, an entropy decoding network, a decoding network (Decoder), and an entropy estimation network.
  • the original image is transformed from the image domain to the feature domain through the processing of the encoding network, and the transformed image features are encoded into a code stream to be transmitted or stored after being processed by the quantization module and the entropy encoding network.
  • the code stream is decoded into image features through the processing of the entropy decoding network, and the image features are transformed from the feature domain to the image domain through the processing of the decoding network, thereby obtaining the reconstructed image.
  • the entropy estimation network obtains the estimated probability value of each feature element according to the image feature estimation, which is used for the processing of the entropy encoding network and the entropy decoding network.
  • both the encoding network (Encoder) and the decoding network (Decoder) have nonlinear transformation units.
  • Figure 5 is an example diagram of an end-to-end deep learning video codec framework, as shown in Figure 5, the video codec framework includes a prediction module (predict model) and a residual compression (residual compress) module,
  • the prediction module uses the reconstructed image of the previous frame to predict the current frame to obtain the predicted image.
  • the residual compression module, on the one hand, compresses the residual between the original image of the current frame and the predicted image, and on the other hand, decompresses it to obtain the reconstructed residual; the reconstructed residual and the predicted image are summed to obtain the reconstructed image of the current frame.
  • Both the encoding sub-network and the decoding sub-network in the prediction module and the residual compression module have nonlinear transformation units.
  • both the prediction model (predict model) and the residual compression (residual compress) module have nonlinear transformation units.
  • FIG. 6 is an example diagram of an application scenario according to an embodiment of the present application.
  • the application scenario may be services involving image/video collection, storage or transmission in terminals, cloud servers, and video surveillance, for example, terminal photography/ Recording, photo album, cloud album, video surveillance, etc.
  • Camera collects images/videos.
  • the artificial intelligence (AI) image/video coding network extracts image features from images/videos to obtain image features with low redundancy, and then compresses them based on image features to obtain code stream/image files.
  • the AI image/video decoding network decompresses the code stream/image file to obtain image features, and then performs inverse feature extraction on image features to obtain a reconstructed image/video.
  • the storage/transmission module stores the compressed code stream/image file for different services (for example, terminal photography, video monitoring, cloud server, etc.) or transmits (for example, cloud service, live broadcast technology, etc.).
  • Fig. 7 is an example diagram of an application scenario according to an embodiment of the present application.
  • the application scenario may be services involving image/video collection, storage or transmission in terminals and video surveillance, for example, terminal photo albums, video surveillance, Live etc.
  • Encoding end: the encoding network transforms images/videos into image features with lower redundancy; the encoding network usually contains nonlinear transformation units and has nonlinear characteristics.
  • the entropy estimation network is responsible for calculating the encoding probability of each data in the image features.
  • the entropy coding network performs lossless coding on the image features according to the probability corresponding to each data to obtain the code stream/image file, which further reduces the amount of data transmission during the image compression process.
  • the entropy decoding network performs lossless decoding on the code stream/image file according to the probability corresponding to each data to obtain the reconstructed image features.
  • the decoding network inversely transforms the image features output by entropy decoding and parses them into images/videos. Corresponding to the encoding network, it usually contains a nonlinear transformation unit and has nonlinear characteristics.
  • the saving module saves the code stream/image file to the corresponding storage location of the terminal.
  • the loading module loads the code stream/image file from the corresponding storage location of the terminal, and inputs it to the entropy decoding network.
  • Fig. 8 is an example diagram of an application scenario according to an embodiment of the present application.
  • the application scenario may be a service involving image/video collection, storage or transmission in the cloud or video surveillance, for example, cloud photo album, video surveillance, Live etc.
  • Encoding end: acquire images/videos locally, perform JPEG encoding to obtain compressed images/videos, and then send the compressed images/videos to the cloud.
  • Cloud: perform JPEG decoding on the compressed images/videos to obtain the images/videos, then compress the images/videos to obtain the code stream/image files and store them.
  • Decoding end: when the local side needs to obtain images/videos from the cloud, the cloud decompresses the code stream/image files to obtain the images/videos, JPEG-encodes them to obtain compressed images/videos, and sends the compressed images/videos to the local side; the local side then performs JPEG decoding on the compressed images/videos to obtain the images/videos.
  • an embodiment of the present application provides an image encoding/decoding method to implement efficient nonlinear transformation processing and improve rate-distortion performance in image/video compression algorithms.
  • FIG. 9 is a flowchart of a process 900 of an image encoding method according to an embodiment of the present application.
  • the process 900 may be executed by the encoding end in the foregoing embodiments.
  • the process 900 is described as a series of steps or operations. It should be understood that the process 900 may be performed in various orders and/or concurrently, and is not limited to the order of execution shown in FIG. 9 .
  • Process 900 includes the following steps:
  • Step 901. Acquire the features of the first image to be processed.
  • the first image feature is obtained by the encoder after obtaining the image to be processed and converting it from the image domain to the feature domain.
  • the conversion here may include: 1. convolution processing, which uses convolutional layers to extract features, with a local receptive field and a weight-sharing mechanism (that is, each filter slides over the input features); 2. MLP or fully connected layer processing, which extracts features with a global receptive field and without weight sharing; 3. Transformer processing, which includes matrix multiplication, MLP, and normalization processing, and has a global receptive field and a strong ability to capture long-distance dependencies.
  • the first image feature can be expressed as a two-dimensional matrix (L×C, where L represents the length and C represents the channel) or a three-dimensional matrix (C×H×W, where C represents the number of channels, H represents the height, and W represents the width).
  • the specific form is associated with the aforementioned conversion method, for example, the first image feature extracted by convolution processing or MLP generally corresponds to a three-dimensional matrix, while the first image feature obtained by transformer processing generally corresponds to a two-dimensional matrix .
  • the first image features are represented as a two-dimensional matrix:
  • the two-dimensional matrix A is a 3×3 matrix containing 9 elements, and each element a(i,j) corresponds to a feature value of the first image feature, where i represents the position along the length corresponding to the element a(i,j) and j represents the channel where the element a(i,j) is located.
  • the first image feature is represented as a three-dimensional matrix:
  • the three-dimensional matrix B is a 3×3×2 matrix containing 18 elements; each element a(i,j,l) corresponds to a feature value of the first image feature, where i represents the row where the element a(i,j,l) is located, j represents the column where the element a(i,j,l) is located, and l represents the channel where the element a(i,j,l) is located.
  • the embodiment of the present application does not specifically limit the manner of acquiring the first image feature.
  • the image to be processed may be a picture, or an image frame in a video, or an image block obtained by dividing the aforementioned picture or image frame, which is not specifically limited.
  • Step 902 Perform nonlinear transformation processing on the first image features to obtain processed image features.
  • In one implementation, the first nonlinear operation is performed on each feature value in the first image feature to obtain the second image feature; the second image feature is subjected to convolution processing to obtain the third image feature, and the multiple feature values in the third image feature correspond to the multiple feature values in the first image feature; the corresponding multiple feature values in the first image feature and the third image feature are multiplied point by point to obtain the processed image feature.
  • Figure 10a is a structural diagram of a nonlinear transformation unit with an attention mechanism. As shown in Figure 10a, in the embodiment of the present application, the nonlinear transformation unit is used to implement the above-mentioned nonlinear transformation processing, which includes the first nonlinear operation, convolution processing, and the point-by-point multiplication operation.
  • the first nonlinear operation is an operation performed on each feature value of the above-mentioned first image feature, and may include the absolute value operation, the ReLU series, Sigmoid, Tanh, the PWL operation, and the like. Among them:
  • Taking the absolute value refers to taking the absolute value of the input feature values. It can be expressed by the following formula: y = |x|, where x is the input feature value and y is the output feature value.
  • a piecewise linear map includes a rectified linear unit (ReLU), or a leaky rectified linear unit (LeakyReLU), or a PWL operation.
  • ReLU is a piecewise linear mapping method. For the input feature values, feature values less than 0 are output as 0, and feature values greater than or equal to 0 are output unchanged. It can be expressed by the following formula: y = max(0, x).
  • LeakyReLU is a piecewise linear mapping method. Based on ReLU, input feature values less than 0 are scaled with a preset weight. It can be expressed by the following formula: y = x for x ≥ 0, and y = a·x for x < 0, where a represents a preset value, usually set to 0.01.
  • Sigmoid can be expressed as the following operation: y = 1/(1 + e^(−x)).
  • Tanh can be expressed as the following operation: y = (e^x − e^(−x))/(e^x + e^(−x)).
  • PWL can be expressed as the following operation: the interval [B_L, B_R] is evenly divided into N segments of width d = (B_R − B_L)/N; for an input x falling into the i-th segment (i = 0, 1, …, N−1), the output is y = Y_P(i) + (Y_P(i+1) − Y_P(i))·(x − B_L − i·d)/d; for x < B_L, y = Y_P(0) + K_L·(x − B_L); and for x > B_R, y = Y_P(N) + K_R·(x − B_R). (1)
  • formula (1) can be used for different granularities (by layer or even by channel).
  • B_L, B_R, Y_P, K_L, and K_R are channel-by-channel parameters.
  • the same hyperparameter N can be used in all PWLs.
  • PWL can approximate any continuous bounded scalar function
  • PWL changes continuously with parameters (except hyperparameter N), which is very conducive to gradient-based optimization
  • flexibility is concentrated in the bounded area, so the learnable parameters can be utilized to the maximum; due to the uniform segmentation, PWL is computationally efficient, especially in inference.
  • FIG 11 is a schematic diagram of the PWL function.
  • the number of segments N is a hyperparameter that affects the fitting ability of the PWL function. The larger the number of segments, the higher the degrees of freedom and the larger the model capacity.
  • the left and right boundaries B_L and B_R define the main active area of interest for the PWL. [B_L, B_R] is evenly divided into N segments, and N+1 dividing points are obtained. Each dividing point has a corresponding y-axis coordinate value Y_P, and these coordinates determine the shape of the PWL curve. For the area outside [B_L, B_R], two slopes K_L and K_R can be used to control the shape of the curve outside the boundaries.
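  • A minimal Python sketch of the PWL operation described above is given below for illustration; the function and parameter names (b_l, b_r, y_p, k_l, k_r) are assumptions that mirror the boundaries B_L and B_R, the breakpoint ordinates Y_P, and the outer slopes K_L and K_R, not names used by the application.

```python
import torch

def pwl(x, b_l, b_r, y_p, k_l, k_r):
    """Piecewise linear map with N uniform segments on [b_l, b_r].

    y_p: tensor of N+1 breakpoint ordinates; k_l/k_r: slopes used outside [b_l, b_r].
    """
    n = y_p.numel() - 1                        # number of segments N
    step = (b_r - b_l) / n                     # uniform segment width
    # index of the segment each input falls into, clamped to the bounded area
    idx = torch.clamp(((x - b_l) / step).floor().long(), 0, n - 1)
    x0 = b_l + idx.to(x.dtype) * step          # left breakpoint of that segment
    slope = (y_p[idx + 1] - y_p[idx]) / step   # per-segment slope
    y = y_p[idx] + slope * (x - x0)            # linear interpolation inside [b_l, b_r]
    y = torch.where(x < b_l, y_p[0] + k_l * (x - b_l), y)    # left outer region
    y = torch.where(x > b_r, y_p[-1] + k_r * (x - b_r), y)   # right outer region
    return y
```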
  • piecewise linear mapping may also use other deformation methods of ReLU, or other new first nonlinear operation methods, which are not specifically limited in this embodiment of the present application.
  • After the first nonlinear operation, the first image feature is converted into the second image feature, and the second image feature can also be expressed in the form of a matrix like the first image feature. Because the first nonlinear operation is performed on each feature value of the first image feature, each feature value of the first image feature has a corresponding feature value in the second image feature, so the matrix corresponding to the second image feature and the matrix corresponding to the first image feature have the same size.
  • For example, if the first image feature is expressed as a 3×3 matrix like the above matrix A, the second image feature can also be expressed as a 3×3 matrix; but since the second image feature has undergone the first nonlinear operation, the feature values in the first image feature and the second image feature are not exactly the same, and correspondingly, the element values in the two corresponding matrices are not exactly the same either.
  • the third image feature can be considered as the local response (that is, the correction value) of the second image feature; that is, the third image feature is obtained by applying convolution processing to the second image feature. Because the receptive field of convolution processing is limited, the response value at each position of the image feature output by the convolution processing is only related to the input feature values at positions adjacent to that position, so it is called a local response. Convolution processing can be expressed as the following formula: conv(x) = ω ∗ x + b, where ω represents the weight of the convolutional layer and b represents the bias parameter of the convolutional layer.
  • FIG. 12 is a schematic diagram of convolution processing. As shown in Figure 12, a 1*2 matrix is input, and a 1*4 matrix is output through the convolution layer.
  • the convolution layer includes two filters, one is 2 *50 matrix W1, and the other is 50*4 matrix W2. First, the input matrix is convolved by the matrix W1 to obtain a 1*50 matrix, and then the matrix is convoluted by the matrix W2 to obtain a 1*4 output matrix.
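  • For the shapes described for Figure 12, the processing can be traced with the following illustrative snippet (assuming that, for a 1*2 input, the two filters act as plain matrix multiplications; all names are illustrative):

```python
import torch

x = torch.randn(1, 2)     # input: 1*2 matrix
w1 = torch.randn(2, 50)   # first filter: 2*50 matrix W1
w2 = torch.randn(50, 4)   # second filter: 50*4 matrix W2

h = x @ w1                # intermediate result: 1*50 matrix
y = h @ w2                # output: 1*4 matrix
print(h.shape, y.shape)   # torch.Size([1, 50]) torch.Size([1, 4])
```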
  • The point-by-point multiplication operation can be expressed as the following formula: c(i,j) = a(i,j) × b(i,j), where (i,j) represents the index of a feature value within its image feature, a(i,j) represents a feature value of the first image feature, b(i,j) represents a feature value of the third image feature, and c(i,j) represents a feature value of the processed image feature.
  • the dimensions of the matrix corresponding to the processed image feature and the matrix corresponding to the first image feature are also the same.
  • the correspondence between multiple feature values in two image features may mean that, after the two image features are respectively expressed as matrices, the elements at the same position in the two matrices have an operational relationship and thus correspond to each other.
  • For example, if both the first image feature and the third image feature are expressed in the form of the above-mentioned matrix A, the elements at position a(0,2) in the two matrices correspond to each other.
  • the convolution processing may also produce a third image feature whose size differs from that of the first image feature; this depends on the internal structure of the convolution processing, especially the length, width, and number of channels of the filters used. If the sizes of the third image feature and the first image feature are different, the elements in the matrix corresponding to the third image feature and the elements in the matrix corresponding to the first image feature are not in one-to-one correspondence; in this case, multiple elements in the matrix corresponding to the first image feature may be multiplied by the same element in the matrix corresponding to the third image feature.
  • For example, if the number of channels of the matrix corresponding to the first image feature is 3 and the number of channels of the matrix corresponding to the third image feature is 1, each element in the matrix corresponding to the third image feature can be separately multiplied by the elements at the same position in each channel of the matrix corresponding to the first image feature.
  • This embodiment of the present application does not specifically limit it.
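  • The channel-broadcast case described above can be illustrated with a short snippet (a sketch of PyTorch broadcasting behaviour, used here only to illustrate the correspondence, not a statement about the application's implementation):

```python
import torch

a = torch.randn(3, 4, 4)   # first image feature: 3 channels
c = torch.randn(1, 4, 4)   # third image feature: 1 channel
out = a * c                # the single-channel map multiplies the same position in every channel
print(out.shape)           # torch.Size([3, 4, 4])
```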
  • The nonlinear transformation processing of Figure 10a can be expressed as the following formula: y = x ⊙ conv(f(x)), where x represents the input feature value, y represents the output feature value, f(·) represents the first nonlinear operation, ⊙ represents the point-by-point multiplication operation, and the convolution processing conv(·) is parameterized by the weight of the convolutional layer and the bias parameter of the convolutional layer.
  • In this way, a local attention mechanism is realized. "Local" means that the first nonlinear operation is performed point by point: each input feature value needs only its own value to obtain its output value, without considering the influence of surrounding feature values. The attention mechanism means that, among all the feature values in the first image feature, some are important and some are redundant; the output of the convolution processing can serve as a weight for each feature value in the image feature, correcting the original feature values, highlighting the important feature values, and suppressing the redundant ones. The point-by-point multiplication operation uses the aforementioned local information to correct the value of each feature value in the first image feature, avoiding restrictions on the convolution parameters.
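  • A minimal PyTorch sketch of the nonlinear transformation unit of Figure 10a (first nonlinear operation, convolution processing, point-by-point multiplication) is given below; the absolute-value operation and the 1×1 convolution are illustrative choices, and the class name is an assumption.

```python
import torch
import torch.nn as nn

class AttentionNonlinearUnit(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # convolution whose output acts as a per-position, per-channel correction weight
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.conv(torch.abs(x))   # first nonlinear operation, then convolution processing
        return x * attn                  # point-by-point multiplication with the original input
```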
  • In another implementation, the first nonlinear operation is performed on each feature value in the first image feature to obtain the second image feature; the second image feature is subjected to convolution processing to obtain the third image feature, and the multiple feature values in the third image feature correspond to the multiple feature values in the first image feature; a point-by-point multiplication operation is performed on the corresponding multiple feature values in the first image feature and the third image feature to obtain the fourth image feature, and the multiple feature values in the fourth image feature correspond to the multiple feature values in the first image feature; the corresponding multiple feature values in the first image feature and the fourth image feature are added point by point to obtain the processed image feature.
  • Figure 10b is a structural diagram of a residual nonlinear transformation unit with an attention mechanism.
  • the nonlinear transformation unit is used to implement the above-mentioned nonlinear transformation processing, which includes the first nonlinear operation , convolution processing, point-by-point multiplication and point-by-point addition.
  • the first nonlinear operation, convolution processing, and point-by-point multiplication operations can refer to the description in the previous implementation mode, and will not be repeated here.
  • In the embodiment of the present application, the corresponding multiple feature values in the first image feature and the fourth image feature are added point by point; that is, the initial input of the nonlinear transformation unit and the output of the point-by-point multiplication operation are added point by point, which can be expressed by the following formula:
  • sum(i,j) = a(i,j) + c(i,j), where (i,j) represents the index of a feature value within its image feature, a(i,j) represents a feature value of the first image feature, c(i,j) represents a feature value of the fourth image feature, and sum(i,j) represents a feature value of the processed image feature.
  • the point-by-point addition operation forms a residual structure by addition, which can make the codec network using the above processing converge more easily during training.
  • The nonlinear transformation processing of Figure 10b can be expressed as the following formula: y = x + x ⊙ conv(f(x)), where x represents the input feature value, y represents the output feature value, f(·) represents the first nonlinear operation, ⊙ represents the point-by-point multiplication operation, and the convolution processing conv(·) is parameterized by the weight of the convolutional layer and the bias parameter of the convolutional layer.
  • the convolution processing conv1(x) is similar to the convolution processing conv2(x); the difference is that the convolution parameter in the convolution processing conv2(x) is increased by 1.
  • the conversion between the above two implementations can be realized by fine-tuning the convolution parameters in the convolution processing; that is, if the nonlinear transformation unit does not include the point-by-point addition operation, the convolution processing conv1(x) can be used, and if the nonlinear transformation unit contains the point-by-point addition operation, the convolution processing conv2(x) can be used.
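  • A hedged sketch of the residual variant of Figure 10b follows, together with the algebraic identity behind the conversion mentioned above (x + x·conv(f(x)) = x·(conv(f(x)) + 1), so the point-by-point addition can be absorbed by shifting the convolution output by 1); the names are illustrative.

```python
import torch
import torch.nn as nn

class ResidualAttentionUnit(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        attn = self.conv(torch.abs(x))   # first nonlinear operation + convolution processing
        return x + x * attn              # point-by-point multiplication, then point-by-point addition

# Equivalent non-residual form obtained by shifting the convolution output by 1:
#   y = x * (conv(|x|) + 1)
```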
  • On the basis of the first implementation above, a second nonlinear operation may be performed on the third image feature, and then the output processed third image feature and the first image feature are subjected to the point-by-point multiplication operation; or, on the basis of the second implementation above, a second nonlinear operation may be performed on the third image feature, and then the output processed third image feature and the first image feature are subjected to the point-by-point multiplication operation. That is, a second nonlinear operation is added to the nonlinear transformation processing; the input of the second nonlinear operation is the output of the convolution processing in the nonlinear transformation processing, and the output of the second nonlinear operation serves as an input of the point-by-point multiplication operation.
  • Figure 10c is a structural diagram of a nonlinear transformation unit with an attention mechanism. As shown in Figure 10c, in the embodiment of the present application, the nonlinear transformation unit is used to implement the above-mentioned nonlinear transformation processing, which includes the first nonlinear operation, convolution processing, the second nonlinear operation, and the point-by-point multiplication operation.
  • the first nonlinear operation, the convolution processing and the point-by-point multiplication operation may refer to the description of the embodiment shown in FIG. 10 a , which will not be repeated here.
  • the second nonlinear operation may use the same operation method as the first nonlinear operation, or may use a different operation method, which may include taking absolute value, ReLU, LeakyReLU, etc., which is not specifically limited in this embodiment of the present application.
  • The nonlinear transformation processing of Figure 10c can be expressed as the following formula: y = x ⊙ g(conv(f(x))), where x represents the input feature value, y represents the output feature value, f(·) represents the first nonlinear operation, g(·) represents the second nonlinear operation, ⊙ represents the point-by-point multiplication operation, and the convolution processing conv(·) is parameterized by the weight of the convolutional layer and the bias parameter of the convolutional layer.
  • Figure 10d is a structural diagram of a residual nonlinear transformation unit with an attention mechanism. As shown in Figure 10d, the nonlinear transformation unit is used to realize the above-mentioned nonlinear transformation processing, which includes the first nonlinear operation, convolution processing, the second nonlinear operation, the point-by-point multiplication operation, and the point-by-point addition operation.
  • the second nonlinear operation may use the same operation method as the first nonlinear operation, or may use a different operation method, which may include taking absolute value, ReLU, LeakyReLU, etc., which is not specifically limited in this embodiment of the present application.
  • the piecewise linear mapping can use different numbers of segments; the mapping slope on each segment can be determined by training or specified directly; each channel of the input feature image can use a different piecewise linear function, or all channels can use the same piecewise linear function, or several channels can share the same piecewise linear function.
  • In addition, the residual structure is no longer fused with the convolution, but can be fused with the piecewise linear function; that is, 1 is added to the output of the original piecewise linear function to form a new piecewise linear function.
  • The nonlinear transformation processing of Figure 10d can be expressed as the following formula: y = x + x ⊙ g(conv(f(x))), where x represents the input feature value, y represents the output feature value, f(·) represents the first nonlinear operation, g(·) represents the second nonlinear operation, ⊙ represents the point-by-point multiplication operation, and the convolution processing conv(·) is parameterized by the weight of the convolutional layer and the bias parameter of the convolutional layer.
  • Step 903 Perform encoding according to the processed image features to obtain a code stream.
  • After obtaining the processed image features, the encoding side can continue to perform convolution processing on them, or perform nonlinear transformation processing on the output of the convolution processing, and then perform entropy coding on the result of the foregoing processing to obtain the code stream; the entropy coding can be realized by using the entropy coding network in the embodiments shown in Fig. 4 to Fig. 8, which will not be repeated here.
  • Other encoding methods may also be used to encode the result of the foregoing processing to obtain a code stream, which is not limited in this application.
  • Figure 13 is a structural diagram of the encoding network.
  • the encoding network includes 4 convolutional layers (conv) and 3 nonlinear transformation units. Convolution processing, nonlinear transformation processing, convolution processing, nonlinear transformation processing, convolution processing, nonlinear transformation processing, and convolution processing are performed sequentially to obtain output image features.
  • the nonlinear transformation unit may adopt the structure of the embodiment shown in Fig. 10a or Fig. 10b; the output image features are then entropy encoded, which will not be repeated here.
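  • An illustrative sketch of the encoding network of Figure 13 follows: four strided convolutions interleaved with three nonlinear transformation units (reusing the AttentionNonlinearUnit sketched earlier). The channel count, kernel size, and stride are assumptions for illustration only.

```python
import torch.nn as nn

def build_encoder(in_ch: int = 3, feat_ch: int = 128) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, feat_ch, kernel_size=5, stride=2, padding=2),
        AttentionNonlinearUnit(feat_ch),   # nonlinear transformation unit (Fig. 10a/10b)
        nn.Conv2d(feat_ch, feat_ch, kernel_size=5, stride=2, padding=2),
        AttentionNonlinearUnit(feat_ch),
        nn.Conv2d(feat_ch, feat_ch, kernel_size=5, stride=2, padding=2),
        AttentionNonlinearUnit(feat_ch),
        nn.Conv2d(feat_ch, feat_ch, kernel_size=5, stride=2, padding=2),
    )
```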
  • In this way, each input feature value needs only its own value to obtain its output value, without considering the influence of surrounding feature values; the original feature values are corrected, important feature values are highlighted, redundant feature values are suppressed, and the value of each feature value in the first image feature is corrected while avoiding restrictions on the convolution parameters, so that efficient nonlinear transformation processing is achieved in the encoding network, further improving the rate-distortion performance in image/video compression algorithms.
  • FIG. 14 is a flowchart of a process 1300 of an image decoding method according to an embodiment of the present application.
  • the process 1300 may be executed by the decoding end in the foregoing embodiments.
  • the process 1300 is described as a series of steps or operations. It should be understood that the process 1300 may be performed in various orders and/or concurrently, and is not limited to the order of execution shown in FIG. 14 .
  • Process 1300 includes the following steps:
  • Step 1301. Acquire the features of the first image to be processed.
  • the decoding end corresponds to the encoding end
  • the first image feature may be obtained after the decoding end performs entropy decoding on the code stream (the entropy decoding can be implemented by using the entropy decoding network in the embodiments shown in Figures 4 to 8, which will not be repeated here), followed by convolution processing, deconvolution processing, transposed convolution processing, interpolation plus convolution processing, or Transformer processing. It should be understood that after the above-mentioned processing, the size of the output first image feature is restored (mirror-symmetrical to the encoding end); the size of the input image feature may also change, the number of channels may change, etc., which is not specifically limited.
  • the foregoing processing is opposite to the transformation in step 901 in the embodiment shown in FIG. 9 .
  • the first image feature may also be represented in the form of a two-dimensional matrix or a three-dimensional matrix.
  • the principle refer to the description in step 901, and details will not be repeated here.
  • Step 1302 Perform nonlinear transformation processing on the first image feature to obtain processed image feature.
  • In one implementation, the first nonlinear operation is performed on each feature value in the first image feature to obtain the second image feature; the second image feature is subjected to convolution processing to obtain the third image feature, and the multiple feature values in the third image feature correspond to the multiple feature values in the first image feature; the corresponding multiple feature values in the first image feature and the third image feature are multiplied point by point to obtain the processed image feature.
  • In another implementation, the first nonlinear operation is performed on each feature value in the first image feature to obtain the second image feature; the second image feature is subjected to convolution processing to obtain the third image feature, and the multiple feature values in the third image feature correspond to the multiple feature values in the first image feature; a point-by-point multiplication operation is performed on the corresponding multiple feature values in the first image feature and the third image feature to obtain the fourth image feature, and the multiple feature values in the fourth image feature correspond to the multiple feature values in the first image feature; the corresponding multiple feature values in the first image feature and the fourth image feature are added point by point to obtain the processed image feature.
  • Step 1303. Obtain a reconstructed image according to the processed image features.
  • After obtaining the processed image features, the decoding side can also continue to perform convolution processing on them, or perform nonlinear transformation processing on the output of the convolution processing, thereby converting them from the feature domain to the image domain to obtain the reconstructed image.
  • Figure 15 is a structural diagram of the decoding network.
  • the decoding network includes 4 deconvolution layers (Deconv) and 3 nonlinear transformation units, and the deconvolution layers and nonlinear transformation units are arranged alternately; that is, the input image features are sequentially subjected to deconvolution processing, nonlinear transformation processing, deconvolution processing, nonlinear transformation processing, deconvolution processing, nonlinear transformation processing, and deconvolution processing to obtain the reconstructed image.
  • the nonlinear transformation unit may adopt the structure of the embodiment shown in Fig. 10a or Fig. 10b; the input image features are obtained through entropy decoding, which will not be repeated here.
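  • An illustrative sketch of the decoding network of Figure 15 follows: four transposed convolutions (deconvolution layers) interleaved with three nonlinear transformation units (again reusing the AttentionNonlinearUnit sketched earlier); the layer hyperparameters are assumptions.

```python
import torch.nn as nn

def build_decoder(feat_ch: int = 128, out_ch: int = 3) -> nn.Sequential:
    def deconv(cin: int, cout: int) -> nn.ConvTranspose2d:
        # transposed convolution that exactly doubles the spatial size
        return nn.ConvTranspose2d(cin, cout, kernel_size=5, stride=2,
                                  padding=2, output_padding=1)
    return nn.Sequential(
        deconv(feat_ch, feat_ch),
        AttentionNonlinearUnit(feat_ch),   # nonlinear transformation unit (Fig. 10a/10b)
        deconv(feat_ch, feat_ch),
        AttentionNonlinearUnit(feat_ch),
        deconv(feat_ch, feat_ch),
        AttentionNonlinearUnit(feat_ch),
        deconv(feat_ch, out_ch),
    )
```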
  • In this way, each input feature value needs only its own value to obtain its output value, without considering the influence of surrounding feature values; the original feature values are corrected, important feature values are highlighted, redundant feature values are suppressed, and the value of each feature value in the first image feature is corrected while avoiding restrictions on the convolution parameters, so that efficient nonlinear transformation processing is achieved in the decoding network, further improving the rate-distortion performance in image/video compression algorithms.
  • the embodiment of the present application also provides a training method for the encoding/decoding network, which may include: first constructing an end-to-end encoding/decoding network, including an encoder Encoder, a decoder Decoder, and an entropy estimation unit.
  • the codec network is regarded as a whole, and the training data (images or videos) are input to the encoder to obtain feature data; on the one hand, the entropy estimation unit calculates the coding bit-rate overhead of the feature data to obtain the bit-rate loss; on the other hand, the feature data are passed through the decoding end to output reconstruction data, and the reconstruction data and the input data are used to calculate the degree of distortion to obtain the distortion loss.
  • the backpropagation algorithm updates the learnable parameters in the model through the weighted loss composed of the bit rate loss and the distortion loss. After training, the parameters of all sub-modules in the model are fixed.
  • After training, the encoder (Encoder) and the entropy estimation unit are deployed at the encoding end for encoding the data to be encoded into a code stream file; the decoder (Decoder) and the entropy estimation unit are deployed at the decoding end for reconstructing the data from the code stream file.
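  • A hedged sketch of one training step of the procedure described above is given below; `encoder`, `decoder`, `entropy_estimator`, and the trade-off weight `lmbda` are placeholders, and mean squared error is used only as an example distortion measure.

```python
import torch

def train_step(encoder, decoder, entropy_estimator, optimizer, image, lmbda=0.01):
    features = encoder(image)                           # image domain -> feature domain
    rate_loss = entropy_estimator(features).mean()      # estimated coding bit-rate overhead
    recon = decoder(features)                           # feature domain -> reconstructed image
    distortion_loss = torch.mean((recon - image) ** 2)  # degree of distortion (here: MSE)
    loss = rate_loss + lmbda * distortion_loss          # weighted loss of rate and distortion
    optimizer.zero_grad()
    loss.backward()                                     # backpropagation updates learnable parameters
    optimizer.step()
    return float(loss)
```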
  • ResAU nonlinear transformation unit
  • FIG. 16a is an exemplary structure diagram of ResAU, as shown in Figure 16a, ResAU adopts the structure shown in Figure 10b, including the first nonlinear operation, convolution processing, point-by-point multiplication and point-by-point addition, Wherein the first nonlinear operation adopts absolute value (abs).
  • the ResAU can be applied to the encoding network shown in FIG. 13 or the decoding network shown in FIG. 15 .
  • the ResAU of this embodiment can be expressed as the following formula: y = x + x ⊙ conv(|x|), where x is the input feature, conv(·) is the convolution processing, and ⊙ denotes the point-by-point multiplication operation.
  • the compression performance test is performed on the ResAU shown in Fig. 16a.
  • Test set: the Kodak test set, containing 24 portable network graphics (PNG) images with a resolution of 768×512 or 512×768.
  • ResAU shown in Figure 16a is used in the encoder/decoder network structure with mixed Gaussian super-prior entropy estimation.
  • Figure 17a shows the overall performance of ResAU on the 24 images of the Kodak test set.
  • the RD performance of the GMM network using ResAU is better. More specifically, under the same decoding and reconstruction quality, using ResAU under the GMM encoding/decoding network can save about 12% of the encoding bit rate overhead compared with using ReLU, and can save about 5% of the encoding bit rate overhead compared with using GDN.
  • FIG. 16b is an exemplary structure diagram of ResAU, as shown in Figure 16b, ResAU adopts the structure shown in Figure 10b, including the first nonlinear operation, convolution processing, point-by-point multiplication and point-by-point addition, Wherein the first nonlinear operation adopts a ReLU operation.
  • the ResAU can be applied to the encoding network shown in FIG. 13 or the decoding network shown in FIG. 15 .
  • the ResAU of this embodiment can be expressed as the following formula: y = x + x ⊙ conv(ReLU(x)), where x is the input feature, conv(·) is the convolution processing, and ⊙ denotes the point-by-point multiplication operation.
  • the compression performance test is performed on the ResAU shown in Fig. 16b.
  • Test set 24 Kodak test images.
  • ResAU shown in Figure 16b is used in the encoding/decoding network structure of mixed Gaussian super prior entropy estimation.
  • the nonlinear operations in ResAU were replaced by ReLU and LeakyReLU respectively.
  • the comparative experiments included the Identity scheme without nonlinear operations and the ResAU scheme for taking absolute values described in Example 1.
  • Figure 17b shows the overall performance of ResAU on the 24 images of the Kodak test set. It can be seen from the RD curve that compared with the Identity scheme that does not use nonlinear operations, the use of point-by-point nonlinear operations can greatly improve the rate-distortion performance of the image compression network. The effect of the nonlinear operation of the ReLU class is slightly better than that of the absolute value of the nonlinear operation.
  • FIG. 16c is an exemplary structure diagram of ResAU, as shown in Figure 16c, ResAU adopts the structure shown in Figure 10d, including the first nonlinear operation, convolution processing, second nonlinear operation, point-by-point multiplication and A point-by-point addition operation, wherein both the first nonlinear operation and the second nonlinear operation use PWL.
  • the ResAU can be applied to the encoding network shown in FIG. 13 or the decoding network shown in FIG. 15 .
  • The ResAU introduces nonlinear characteristics through the piecewise linear function PWL; it realizes the local attention mechanism through convolution, the piecewise linear function PWL, and the point-by-point multiplication operation, and uses local information to correct the response of each channel at each position on the feature map; the point-by-point multiplication operation can avoid the problem of the limited value space of learnable parameters in GDN; in addition, the end-to-end residual connection can make the network easier to converge during training.
  • The ResAU of this embodiment can be expressed as the following formula: y = x + x ⊙ PWL2(conv(PWL1(x))) = x + x ⊙ PWL2(ω ∗ PWL1(x) + b), where PWL1 and PWL2 denote the first and second PWL operations, ω is the weight of the convolutional layer, and b is the bias parameter of the convolutional layer.
  • the first PWL operation is a piecewise linear function that can provide nonlinear characteristics for the overall transformation. Under the action of this operation, each value of the input data will be calculated according to the value range where it is located using different mapping relationships to obtain the output value.
  • feature maps of different channel dimensions can be operated with the same or different piecewise linear mapping functions. Its parameters can be preset values, or can be learned through training.
  • Convolution processing In ResAU, the input of convolution processing is the output of nonlinear operation.
  • the input tensor is processed by convolution to obtain a tensor of constant size, and the output tensor can be considered as the local response of the input tensor.
  • Second PWL operation: this function is a piecewise linear function, which can scale and map the output of the convolution and also provide nonlinear characteristics for the overall transformation. Under the action of this operation, each value of the input data is calculated according to the value range in which it is located, using different mapping relationships, to obtain the output value.
  • feature maps of different channel dimensions can be operated with the same or different piecewise linear mapping functions. Its parameters can be preset values, or can be learned through training.
  • Point-by-point multiplication operation In ResAU, the input of the point-by-point multiplication operation is the original input of the unit and the output of the convolution process. The size of the two input tensors is the same, the multiplication operation of the corresponding position data is performed, and the size of the output tensor is also the same as the size of the input tensor.
  • the combination of convolution processing and point-wise multiplication operation implements a local attention mechanism, which uses the local information of input features to correct the response of each position on the feature map. At the same time, the point-by-point multiplication operation avoids the problem of limited value space of learnable parameters existing in the mainstream nonlinear unit GDN currently used in image and video compression networks.
  • Point-by-point addition operation: in the ResAU, the input of the point-by-point addition operation is the original input of the unit and the output of the point-by-point multiplication operation. This operation forms a residual structure by addition, which can make it easier for the encoder-decoder network using this nonlinear unit to converge during training.
  • the residual attention nonlinear unit using the piecewise linear function PWL described in this embodiment can be used in end-to-end image compression networks and video compression networks based on deep neural networks. More specifically, it is generally used in the encoding module (Encoder) and decoding module (Decoder) of the end-to-end image compression network, and in the encoding module and decoding module of the prediction subnetwork and residual compression subnetwork of the end-to-end video compression network.
  • FIG. 16d is an exemplary structural diagram of ResAU.
  • ResAU adopts the structure shown in FIG. 10d, including the first nonlinear operation, convolution processing, the second nonlinear operation, point-by-point Multiplication operation and point-by-point addition operation, wherein the first nonlinear operation uses LeakyReLU, and the second nonlinear operation uses tanh.
  • the convolution processing may be conv1×1, which means that the size of the convolution kernel (or convolution operator) is 1×1.
  • the ResAU can be applied to the encoding network shown in FIG. 13 or the decoding network shown in FIG. 15 .
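  • A minimal sketch of the Figure 16d variant follows (LeakyReLU as the first nonlinear operation, a 1×1 convolution, Tanh as the second nonlinear operation, then point-by-point multiplication and addition); the class name is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResAULeakyReluTanh(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)  # conv1x1

    def forward(self, x):
        attn = torch.tanh(self.conv(F.leaky_relu(x)))  # first nonlinearity, conv, second nonlinearity
        return x + x * attn                            # point-by-point multiply, then point-by-point add
```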
  • FIG. 16e is an exemplary structure diagram of ResAU, as shown in Figure 16e, ResAU adopts the structure shown in Figure 10c, including the first nonlinear operation, convolution processing, second nonlinear operation and point-by-point multiplication operation, Wherein the first nonlinear operation and the second nonlinear operation both use PWL.
  • the ResAU can be applied to the encoding network shown in FIG. 13 or the decoding network shown in FIG. 15 .
  • The ResAU of this embodiment can be expressed as the following formula: y = x ⊙ PWL2(conv(PWL1(x))) = x ⊙ PWL2(ω ∗ PWL1(x) + b), where PWL1 and PWL2 denote the first and second PWL operations, ω is the weight of the convolutional layer, and b is the bias parameter of the convolutional layer.
  • the first PWL operation is a piecewise linear function that can provide nonlinear characteristics for the overall transformation. Under the action of this operation, each value of the input data will be calculated according to the value range where it is located using different mapping relationships to obtain the output value.
  • feature maps of different channel dimensions can be operated with the same or different piecewise linear mapping functions. Its parameters can be preset values, or can be learned through training.
  • Convolution processing In ResAU, the input of convolution processing is the output of nonlinear operation.
  • the input tensor is processed by convolution to obtain a tensor of constant size, and the output tensor can be considered as the local response of the input tensor.
  • Second PWL operation: this function is a piecewise linear function, which can scale and map the output of the convolution and also provide nonlinear characteristics for the overall transformation. Under the action of this operation, each value of the input data is calculated according to the value range in which it is located, using different mapping relationships, to obtain the output value.
  • feature maps of different channel dimensions can be operated with the same or different piecewise linear mapping functions. Its parameters can be preset values, or can be learned through training.
  • Point-by-point multiplication operation In ResAU, the input of the point-by-point multiplication operation is the original input of the unit and the output of the convolution process. The size of the two input tensors is the same, the multiplication operation of the corresponding position data is performed, and the size of the output tensor is also the same as the size of the input tensor.
  • the combination of convolution processing and point-wise multiplication operation implements a local attention mechanism, which uses the local information of input features to correct the response of each position on the feature map. At the same time, the point-by-point multiplication operation avoids the problem of limited value space of learnable parameters existing in the mainstream nonlinear unit GDN currently used in image and video compression networks.
  • the non-residual-structured attention nonlinear unit using piecewise linear functions described in this embodiment can be used in end-to-end image compression networks and video compression networks based on deep neural networks. More specifically, it is generally used in the encoding module (Encoder) and decoding module (Decoder) of the end-to-end image compression network, and in the encoding module and decoding module of the prediction subnetwork and residual compression subnetwork of the end-to-end video compression network. Moreover, the non-residual attention nonlinear unit using the piecewise linear function described in this embodiment can be obtained by converting the residual attention nonlinear unit using the piecewise linear function.
  • the second PWL operation in the residual attention nonlinear unit using the piecewise linear function can be combined with the point-wise addition operation (that is, the output of the second PWL operation is increased by 1 as a whole to form a new PWL function, and the point-wise addition operation is removed), to obtain a corresponding attention nonlinear unit using a piecewise linear function without a residual structure, as sketched below.
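  • A minimal sketch of this conversion, assuming pwl1 and pwl2 are piecewise linear functions (for example, the pwl() sketch given earlier) and conv is a convolution layer; shifting the second PWL's output by 1 absorbs the point-by-point addition.

```python
def resau_residual(x, pwl1, conv, pwl2):
    return x + x * pwl2(conv(pwl1(x)))        # residual form (Fig. 10d)

def resau_fused(x, pwl1, conv, pwl2):
    return x * (pwl2(conv(pwl1(x))) + 1.0)    # equivalent non-residual form (Fig. 10c)
```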
  • When the preset value of the first PWL operation in this embodiment is the LeakyReLU function, the number of segments of the second PWL operation is 6, and the channel-dimension grouping granularity is 1 (that is, the features of all channels use different piecewise linear functions when calculating the output value), the compression rate-distortion performance is improved by 0.506% compared with the scheme that uses Tanh to realize the nonlinear operation after the convolution operation.
  • the solution of this embodiment can save calculation time and power consumption in the point-by-point addition operation.
  • the compression rate distortion performance is reduced by 0.488% compared with the scheme of using Tanh to realize the nonlinear operation after the convolution operation.
  • the solution of this embodiment can save calculation time and power consumption in the point-by-point addition operation.
  • When the preset value of the first PWL operation in this embodiment is the LeakyReLU function, the number of segments of the second PWL operation is 6, and the channel dimensions are not grouped (that is, the features of all channels use the same piecewise linear function to calculate the output value), the compression rate-distortion performance is reduced by 0.659% compared with the scheme that uses Tanh to realize the nonlinear operation after the convolution operation. Moreover, the solution of this embodiment can save calculation time and power consumption in the point-by-point addition operation.
  • FIG. 18 is a schematic structural diagram of an exemplary encoding device 1700 according to an embodiment of the present application. As shown in FIG. 18 , the device 1700 in this embodiment may be applied to an encoding end.
  • the apparatus 1700 may include: an acquisition module 1701, a transformation module 1702, an encoding module 1703, and a training module 1704. Among them:
  • the obtaining module 1701 is used to obtain the first image feature to be processed; the transformation module 1702 is used to perform nonlinear transformation processing on the first image characteristic to obtain the processed image characteristic, and the nonlinear transformation processing includes the first Non-linear operation, convolution processing and point-by-point multiplication operation; encoding module 1703, configured to perform encoding according to the processed image features to obtain a code stream.
  • the transformation module 1702 is specifically configured to perform the first nonlinear operation on each feature value in the first image feature to obtain the second image feature; Image features are subjected to the convolution process to obtain a third image feature, and a plurality of feature values in the third image feature correspond to a plurality of feature values in the first image feature; the first image feature and the Performing the point-by-point multiplication operation on the corresponding plurality of feature values in the third image feature to obtain the processed image feature.
  • the nonlinear transformation processing further includes a point-by-point addition operation after the point-by-point multiplication operation.
  • the transformation module 1702 is specifically configured to perform the first nonlinear operation on each feature value in the first image feature to obtain the second image feature; Image features are subjected to the convolution process to obtain a third image feature, and a plurality of feature values in the third image feature correspond to a plurality of feature values in the first image feature; the first image feature and the Performing the point-by-point multiplication operation on corresponding multiple feature values in the third image feature to obtain a fourth image feature, the multiple feature values in the fourth image feature and the multiple feature values in the first image feature corresponding to multiple feature values; performing the point-by-point addition operation on multiple feature values corresponding to the first image feature and the fourth image feature to obtain the processed image feature.
  • the nonlinear transformation processing further includes a second nonlinear operation between the convolution processing and the point-by-point multiplication operation, the second nonlinear operation and the The first nonlinear operations are the same or different.
  • the nonlinear operation includes piecewise linear mapping, such as ReLU, LeakyReLU, PWL, Abs, and the like.
  • the nonlinear operation includes a continuous function, such as Tanh, Sigmoid, and the like.
  • the nonlinear operation includes segmented nonlinear operation.
  • a training module 1704, configured to construct a nonlinear transformation unit in the training phase, where the nonlinear transformation unit in the training phase includes a first nonlinear operation layer, a convolution processing layer, a point-by-point multiplication operation layer, and a point-by-point addition operation layer; a trained nonlinear transformation unit is obtained by training according to pre-acquired training data, and the trained nonlinear transformation unit is used to realize the nonlinear transformation processing.
  • The device of this embodiment can be used to execute the technical solution of the method embodiment shown in FIG. 9; its implementation principle and technical effect are similar and are not repeated here.
  • FIG. 19 is a schematic structural diagram of an exemplary decoding apparatus 1800 according to an embodiment of the present application. As shown in FIG. 19 , the apparatus 1800 of this embodiment may be applied to a decoding end.
  • The apparatus 1800 may include: an acquisition module 1801, a transformation module 1802, a reconstruction module 1803 and a training module 1804. Specifically:
  • The obtaining module 1801 is configured to obtain a first image feature to be processed; the transformation module 1802 is configured to perform nonlinear transformation processing on the first image feature to obtain a processed image feature, where the nonlinear transformation processing includes a first nonlinear operation, convolution processing and a point-by-point multiplication operation; and the reconstruction module 1803 is configured to obtain a reconstructed image according to the processed image feature.
  • In a possible implementation, the transformation module 1802 is specifically configured to: perform the first nonlinear operation on each feature value in the first image feature to obtain a second image feature; perform the convolution processing on the second image feature to obtain a third image feature, where a plurality of feature values in the third image feature correspond to a plurality of feature values in the first image feature; and perform the point-by-point multiplication operation on the corresponding feature values of the first image feature and the third image feature to obtain the processed image feature.
  • the nonlinear transformation processing further includes a point-by-point addition operation after the point-by-point multiplication operation.
  • In a possible implementation, the transformation module 1802 is specifically configured to: perform the first nonlinear operation on each feature value in the first image feature to obtain a second image feature; perform the convolution processing on the second image feature to obtain a third image feature, where a plurality of feature values in the third image feature correspond to a plurality of feature values in the first image feature; perform the point-by-point multiplication operation on the corresponding feature values of the first image feature and the third image feature to obtain a fourth image feature, where a plurality of feature values in the fourth image feature correspond to a plurality of feature values in the first image feature; and perform the point-by-point addition operation on the corresponding feature values of the first image feature and the fourth image feature to obtain the processed image feature.
  • In a possible implementation, the nonlinear transformation processing further includes a second nonlinear operation between the convolution processing and the point-by-point multiplication operation, and the second nonlinear operation and the first nonlinear operation are the same or different.
  • the nonlinear operation includes piecewise linear mapping, such as ReLU, LeakyReLU, PWL, Abs, and the like.
  • the nonlinear operation includes a continuous function, such as Tanh, Sigmoid, and the like.
  • In another possible implementation, the nonlinear operation includes a piecewise nonlinear operation.
  • The training module 1804 is configured to construct a nonlinear transformation unit for the training phase, where the nonlinear transformation unit in the training phase includes a first nonlinear operation layer, a convolution processing layer, a point-by-point multiplication operation layer and a point-by-point addition operation layer, and to obtain a trained nonlinear transformation unit through training on pre-acquired training data; the trained nonlinear transformation unit is used to implement the nonlinear transformation processing.
  • The device of this embodiment can be used to execute the technical solution of the method embodiment shown in FIG. 14; its implementation principle and technical effect are similar and are not repeated here.
  • An embodiment of the present application provides a code stream, and the code stream is generated by a processor executing any encoding method in the foregoing embodiments.
  • An embodiment of the present application provides a device for storing a code stream, including a receiver and at least one storage medium. The receiver is configured to receive the code stream; the at least one storage medium is configured to store the code stream; and the code stream is generated by any encoding method in the foregoing embodiments.
  • An embodiment of the present application provides a device for transmitting a code stream. The device includes a transmitter and at least one storage medium. The at least one storage medium is configured to store the code stream, where the code stream is generated by a processor executing any encoding method in the foregoing embodiments; the transmitter is configured to send the code stream to another electronic device.
  • Optionally, the device for transmitting a code stream further includes a receiver and a processor. The receiver is configured to receive a user request, and the processor is configured to, in response to the user request, select a target code stream from the storage medium and instruct the transmitter to send the target code stream.
  • An embodiment of the present application provides a system for distributing code streams. The system includes at least one storage medium and a streaming media device. The at least one storage medium is configured to store at least one code stream, where the at least one code stream includes a code stream generated by any of the foregoing implementations; the streaming media device is configured to obtain a target code stream from the at least one storage medium and send the target code stream to an end-side device, where the streaming media device includes a content server or a content distribution server.
  • An embodiment of the present application provides a system for distributing code streams.
  • The system includes: a communication interface, configured to receive a user's request for acquiring a target code stream; and a processor, configured to determine a storage location of the target code stream in response to the user request. The communication interface is further configured to send the storage location of the target code stream to the user, so that the user obtains the target code stream from that storage location, where the target code stream is generated by a processor executing any encoding method in the foregoing embodiments. A minimal sketch of this request flow is given below.
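The sketch below is a hypothetical illustration of the "return the storage location, let the user fetch the stream" flow; the dictionary STREAM_LOCATIONS, the function name and the example URL are invented for the example and are not part of the embodiments.

```python
# Hypothetical in-memory index; a real deployment would query a content server
# or content distribution server for the storage location.
STREAM_LOCATIONS = {
    "stream_001": "https://example.com/store/stream_001.bin",
}

def handle_stream_request(stream_id: str) -> str:
    """Return the storage location of the requested target code stream.

    The caller (the user's client) then fetches the code stream from this
    location itself, as described for the distribution system above.
    """
    location = STREAM_LOCATIONS.get(stream_id)
    if location is None:
        raise KeyError(f"unknown code stream: {stream_id}")
    return location
```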
  • In an implementation process, the steps of the foregoing method embodiments may be completed by an integrated logic circuit of hardware in the processor or by instructions in the form of software.
  • The processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • The steps of the methods disclosed in the embodiments of the present application may be directly embodied as being executed by a hardware coding processor, or executed by a combination of hardware and software modules in the coding processor.
  • The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
  • The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above methods in combination with its hardware.
  • the memories mentioned in the above embodiments may be volatile memories or nonvolatile memories, or may include both volatile and nonvolatile memories.
  • The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (programmable ROM, PROM), an erasable programmable read-only memory (erasable PROM, EPROM), an electrically erasable programmable read-only memory (electrically EPROM, EEPROM) or a flash memory.
  • Volatile memory can be random access memory (RAM), which acts as external cache memory.
  • By way of example but not limitation, many forms of RAM are available, such as static random access memory (static RAM, SRAM), dynamic random access memory (dynamic RAM, DRAM), synchronous dynamic random access memory (synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), synchlink dynamic random access memory (synchlink DRAM, SLDRAM) and direct rambus random access memory (direct rambus RAM, DR RAM).
  • the disclosed systems, devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • The division of the units is only a logical function division; in actual implementation, there may be other division manners. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium.
  • Based on such an understanding, the technical solutions of the embodiments of the present application essentially, or the part contributing to the prior art, or a part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for enabling a computer device (a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application.
  • Alternatively, the computer software product may be transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center in a wired manner (for example, coaxial cable, optical fiber or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio or microwave).
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center integrated with one or more available media.
  • The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD), a read-only memory (ROM) or a random access memory (RAM)).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The present application provides an image encoding and decoding method and apparatus. The image encoding method of the present application includes: obtaining a first image feature to be processed; performing nonlinear transformation processing on the first image feature to obtain a processed image feature, where the nonlinear transformation processing sequentially includes a first nonlinear operation, convolution processing and a point-by-point multiplication operation; and performing encoding according to the processed image feature to obtain a code stream. The present application can avoid restricting the convolution parameters, thereby achieving efficient nonlinear transformation processing in the encoding/decoding network and further improving the rate-distortion performance of image/video compression algorithms.

Description

图像编解码方法和装置
本申请要求于2021年12月3日提交中国专利局、申请号为202111470979.5、申请名称为“图像编解码方法和装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及图像处理技术领域,尤其涉及图像编解码方法和装置。
背景技术
随着卷积神经网络(convolution neural network,CNN)在图像识别、目标检测等计算机视觉任务中的表现远超传统算法,越来越多的研究者开始探索基于深度学习的图像/视频压缩方法。一些研究者设计出端到端的深度学习图像/视频压缩算法,例如,将编码网络、熵估计网络、熵编码网络、熵解码网络、解码网络等模块作为一个整体同时优化,其中,编码网络和解码网络也可称为变换模块和逆变换模块,一般由卷积层和非线性变换单元组成。
非线性变换单元是图像/视频压缩网络的基础组件之一,其非线性特性的优劣直接影响了压缩算法的率失真性能。因此设计更高效的非线性变换单元是进一步提升图像/视频压缩算法中的率失真性能的关键。
发明内容
本申请实施例提供一种图像编解码方法和装置,可以实现编/解码网络中的高效的非线性变换处理,进一步提升图像/视频压缩算法中的率失真性能。
第一方面,本申请实施例提供一种图像编码方法,包括:获取待处理的第一图像特征;对所述第一图像特征进行非线性变换处理得到经处理的图像特征,所述非线性变换处理依次包括第一非线性运算、卷积处理和逐点相乘运算;根据所述经处理的图像特征进行编码得到码流。
第一图像特征是编码端在获取到待处理图像后,对其进行从图像域到特征域的转换后得到的,这里的转换可以包括但不限于:1、卷积处理,使用卷积层来进行提取特征,具有局部感受野,权重共享机制(即每一个滤波器滑动处理输入特征)。2、使用多层感知机(multi-layer perceptron,MLP)或全连接层提取特征,具有全局感受野特性,权重不共享。3、变换器(Transformer)处理,其中包括矩阵相乘、MLP和归一化处理,具有全局感受野特性,捕捉长距离依赖能力强。
在一种可能的实现方式中,对第一图像特征中的每个特征值进行非线性运算得到第二图像特征;对第二图像特征进行卷积处理得到第三图像特征,第三图像特征中的多个特征值和第一图像特征中的多个特征值相对应;将第一图像特征和第三图像特征中相对应的多个特征值进行逐点相乘运算得到经处理的图像特征。
第一非线性运算是针对上述第一图像特征中的每个特征值进行的运算,可以包括取绝对值运算、线性整流函数(Rectified Linear Unit,ReLU)系列、Sigmoid、Tanh、分段线性 (piecewise linear,PWL)运算等,其中线性整流函数又称为修正线性单元。
经过第一非线性运算,第一图像特征转换成第二图像特征,第二图像特征和第一图像特征可以表示成矩阵的形式,由于第一非线性运算是针对第一图像特征中的每个特征值的,因此第一图像特征的每个特征值在第二图像特征值中均有一个特征值与其相对应,因此第二图像特征对应的矩阵和第一图像特征对应的矩阵的尺寸相同,且相同位置的特征值(矩阵元素的值)相对应,例如,第一图像特征表示成3×3的矩阵,那么第二图像特征也可以表示成3×3的矩阵,但由于第一图像特征和第二图像特征经过了第一非线性运算,因此第一图像特征和第二图像特征中的特征值不完全相同,相应的,二者分别对应的矩阵中的元素值也不完全相同。
对第二图像特征进行卷积处理,输出第三图像特征,第三图像特征可以认为是第二图像特征的局部响应(即校正后的值),即第三图像特征是第二图像特征经过卷积处理得到的响应信号,由于卷积处理的感受野是有限的,经卷积处理输出的图像特征中的每个位置的响应值只和与该位置相邻位置的输入特征值有关,因此称为局部响应。
由此可见,经过上述非线性运算、卷积处理和逐点相乘运算实现了局部注意力机制,局部是指非线性运算是逐点进行的,输入的各个特征值只需要自身的特性即可得到输出值,不用考虑周边特征值的影响;注意力机制是指,第一图像特征中的所有特征值,有的是重要的,有的是冗余的,卷积处理的输出可以是图像特征中的各个特征值的权重,可以对原特征值进行修正,把重要的特征值突出,把冗余的特征值抑制;逐点相乘运算利用前述局部信息校正第一图像特征上每个特征值的值,不必要求卷积参数必须是正数,避免了对卷积参数的取值空间的限定,可以在更广阔的取值空间中得到更优的卷积参数,从而取得对图像更好的压缩性能。
在一种可能的实现方式中,非线性变换处理还包括在逐点相乘运算之后的逐点相加运算。
在一种可能的实现方式中,对第一图像特征中的每个特征值进行非线性运算得到第二图像特征;对第二图像特征进行卷积处理得到第三图像特征,第三图像特征中的多个特征值和第一图像特征中的多个特征值相对应;将第一图像特征和第三图像特征中相对应的多个特征值进行逐点相乘运算得到第四图像特征,第四图像特征中的多个特征值和第一图像特征中的多个特征值相对应;将第一图像特征和第四图像特征中相对应的多个特征值进行逐点相加运算得到经处理的图像特征。
逐点相加运算是一个首位相加的残差结构,可以使使用上述处理过程的编解码网络在训练时更容易收敛。卷积处理conv1(x)和卷积处理conv2(x)相似,区别在于卷积处理conv2(x)中的卷积的偏置参数β参数多加了1。这样可以通过对卷积处理中的卷积中的偏置参数β的微调,实现上述两种实施方式之间的转换,即如果非线性变换单元包含卷积处理conv1(x)和逐点相加运算,则可以把两者融合成为卷积处理conv2(x),从而省略逐点相加运算。得到经处理的图像特征后,编码侧可以继续对其进行卷积处理,也可以在卷积处理之后再次对卷积处理输出的特征进行非线性变换处理,然后再对前述处理的结果特征进行编码以得到码流。
本申请实施例,通过对编码网络中的非线性变换处理进行改变,可以使输入的各个特征值只需要自身的特性即可得到输出值,不用考虑周边特征值的影响,并对原特征值进行 修正,把重要的特征值突出,把冗余的特征值抑制,此外还可以校正第一图像特征上每个特征值的值,避免了对卷积参数的限定,从而实现编码网络中的高效的非线性变换处理,进一步提升图像/视频压缩算法的率失真性能。
非线性运算是针对上述第一图像特征中的每个特征值进行的运算,可以包括采用分段线性映射方法,该方法可以是取绝对值(针对输入的特征值取其绝对值),也可以是修正线性单元(ReLU)或者泄漏修正线性单元(LeakyReLU),也可以是分段线性(PWL)运算。其中,ReLU是一种分段线性映射方法,针对输入的特征值,小于0的特征值输出为0,大于或等于0的特征值恒等输出;LeakyReLU是一种分段线性映射方法,其在ReLU的基础上,将小于0的输入特征值用预先设定的权重进行缩放,权重通常为0.01;PWL运算也是一种分段线性映射方法,PWL运算中的分段数可以更多,其具体定义请参照下文实施例部分。除此以外,非线性运算还可以包括其它方法,例如分段非线性运算、Tanh、Sigmoid,本申请实施例对此不做具体限定。
在一种可能的实现方式中,上述非线性变换处理还包括在上述卷积处理和逐点相乘运算之间的第二非线性运算,第二非线性运算和第一非线性运算相同或不同。例如,第一非线性运算可以是取绝对值运算,第二非线性运算可以仍然是取绝对值运算,第二非线性运算也可以是其他分段线性映射方法或其他非线性运算;或者,第一非线性运算可以是ReLU,第二非线性运算可以仍然是ReLU,第二非线性运算也可以是Sigmoid或其他非线性运算;或者,第一非线性运算可以是LeakyReLU,第二非线性运算可以仍然是LeakyReLU,第二非线性运算也可以是Tanh或其他非线性运算;或者,第一非线性运算可以是PWL,第二非线性运算可以仍然是PWL或其他非线性运算。当第二非线性运算使用PWL实现时,分段线性映射可以采用不同的分段数量;在各分段上的映射斜率可以通过训练的方式或者直接指定的方式确定;对输入特征图像的每个通道可以采用不同的分段线性函数,或者对所有通道都是用同一个分段线性函数,或者对若干个通道采用同一个分段线性函数进行处理。在这种实现方式下,模型训练完毕之后,残差结构不再与卷积融合,而是可与分段线性函数融合,即在原来的分段线性函数的输出上+1,组合成一个新的分段线性函数。
在一种可能的实现方式中,还可以构建训练阶段的非线性变换单元,所述训练阶段的非线性变换单元包括非线性运算层、卷积处理层、逐点相乘运算层和逐点相加运算层;
根据预先获取的训练数据训练得到经训练的非线性变换单元,所述经训练的非线性变换单元用于实现所述非线性变换处理。
第二方面,本申请实施例提供一种图像解码方法,包括:获取待处理的第一图像特征;对所述第一图像特征进行非线性变换处理得到经处理的图像特征,所述非线性变换处理依次包括第一非线性运算、卷积处理和逐点相乘运算;根据所述经处理的图像特征获取重建图像。
在一种可能的实现方式中,所述对所述第一图像特征进行非线性变换处理得到经处理的图像特征,包括:对所述第一图像特征中的每个特征值进行所述非线性运算得到第二图像特征;对所述第二图像特征进行卷积处理得到第三图像特征,所述第三图像特征中的多个特征值和所述第一图像特征中的多个特征值相对应;将所述第一图像特征和所述第三图像特征中相对应的多个特征值进行所述逐点相乘运算得到所述经处理的图像特征。
在一种可能的实现方式中,所述非线性变换处理还包括逐点相加运算。
在一种可能的实现方式中,所述对所述第一图像特征进行非线性变换处理得到经处理的图像特征,包括:对所述第一图像特征中的每个特征值进行所述非线性运算得到第二图像特征;对所述第二图像特征进行卷积处理得到第三图像特征,所述第三图像特征中的多个特征值和所述第一图像特征中的多个特征值相对应;将所述第一图像特征和所述第三图像特征中相对应的多个特征值进行所述逐点相乘运算得到第四图像特征,所述第四图像特征中的多个特征值和所述第一图像特征中的多个特征值相对应;将所述第一图像特征和所述第四图像特征中相对应的多个特征值进行所述逐点相加运算得到所述经处理的图像特征。
在一种可能的实现方式中,所述非线性运算包括分段线性映射,例如ReLU,LeakyReLU,PWL,Abs等。在另一种可能的实现方式中,所述非线性运算包括连续函数,例如Tanh,Sigmoid等。在另一种可能的实现方式中,所述非线性运算包括分段的非线性运算。
上述第二方面及其可能的实现方式中提供的图像解码方法的技术效果可以参考上述第一方面及其对应的可能的实现方式中提供的图像编码方法的技术效果,此处不再赘述。
第三方面,本申请实施例提供一种编码装置,包括:获取模块,用于获取待处理的第一图像特征;变换模块,用于对所述第一图像特征进行非线性变换处理得到经处理的图像特征,所述非线性变换处理依次包括第一非线性运算、卷积处理和逐点相乘运算;编码模块,用于根据所述经处理的图像特征进行编码得到码流。
在一种可能的实现方式中,所述变换模块,具体用于对所述第一图像特征中的每个特征值进行所述第一非线性运算得到第二图像特征;对所述第二图像特征进行所述卷积处理得到第三图像特征,所述第三图像特征中的多个特征值和所述第一图像特征中的多个特征值相对应;将所述第一图像特征和所述第三图像特征中相对应的多个特征值进行所述逐点相乘运算得到所述经处理的图像特征。
在一种可能的实现方式中,所述非线性变换处理还包括在所述逐点相乘运算之后的逐点相加运算。
在一种可能的实现方式中,所述变换模块,具体用于对所述第一图像特征中的每个特征值进行所述第一非线性运算得到第二图像特征;对所述第二图像特征进行所述卷积处理得到第三图像特征,所述第三图像特征中的多个特征值和所述第一图像特征中的多个特征值相对应;将所述第一图像特征和所述第三图像特征中相对应的多个特征值进行所述逐点相乘运算得到第四图像特征,所述第四图像特征中的多个特征值和所述第一图像特征中的多个特征值相对应;将所述第一图像特征和所述第四图像特征中相对应的多个特征值进行所述逐点相加运算得到所述经处理的图像特征。
在一种可能的实现方式中,所述非线性变换处理还包括在所述卷积处理和所述逐点相乘运算之间的第二非线性运算,所述第二非线性运算和所述第一非线性运算相同或不同。例如,第一非线性运算可以是取绝对值运算,第二非线性运算可以仍然是取绝对值运算,第二非线性运算也可以说分段线性映射或其他非线性运算。
在一种可能的实现方式中,所述非线性运算包括分段线性映射,例如ReLU,LeakyReLU,PWL,Abs等。在另一种可能的实现方式中,所述非线性运算包括连续函数,例如Tanh,Sigmoid等。在另一种可能的实现方式中,所述非线性运算包括分段的非线性 运算。
在一种可能的实现方式中,还包括:训练模块,用于构建训练阶段的非线性变换单元,所述训练阶段的非线性变换单元包括第一非线性运算层、卷积处理层、逐点相乘运算层和逐点相加运算层;根据预先获取的训练数据训练得到经训练的非线性变换单元,所述经训练的非线性变换单元用于实现所述非线性变换处理。
第四方面,本申请实施例提供一种解码装置,包括:获取模块,用于获取待处理的第一图像特征;变换模块,用于对所述第一图像特征进行非线性变换处理得到经处理的图像特征,所述非线性变换处理包括第一非线性运算、卷积处理和逐点相乘运算;重建模块,用于根据所述经处理的图像特征获取重建图像。
在一种可能的实现方式中,所述变换模块,具体用于对所述第一图像特征中的每个特征值进行所述第一非线性运算得到第二图像特征;对所述第二图像特征进行所述卷积处理得到第三图像特征,所述第三图像特征中的多个特征值和所述第一图像特征中的多个特征值相对应;将所述第一图像特征和所述第三图像特征中相对应的多个特征值进行所述逐点相乘运算得到所述经处理的图像特征。
在一种可能的实现方式中,所述非线性变换处理还包括在所述逐点相乘运算之后的逐点相加运算。
在一种可能的实现方式中,所述变换模块,具体用于对所述第一图像特征中的每个特征值进行所述第一非线性运算得到第二图像特征;对所述第二图像特征进行所述卷积处理得到第三图像特征,所述第三图像特征中的多个特征值和所述第一图像特征中的多个特征值相对应;将所述第一图像特征和所述第三图像特征中相对应的多个特征值进行所述逐点相乘运算得到第四图像特征,所述第四图像特征中的多个特征值和所述第一图像特征中的多个特征值相对应;将所述第一图像特征和所述第四图像特征中相对应的多个特征值进行所述逐点相加运算得到所述经处理的图像特征。
在一种可能的实现方式中,所述非线性变换处理还包括在所述卷积处理和所述逐点相乘运算之间的第二非线性运算,所述第二非线性运算和所述第一非线性运算相同或不同。
在一种可能的实现方式中,所述非线性运算包括分段线性映射,例如ReLU,LeakyReLU,PWL,Abs等。在另一种可能的实现方式中,所述非线性运算包括连续函数,例如Tanh,Sigmoid等。在另一种可能的实现方式中,所述非线性运算包括分段的非线性运算。
在一种可能的实现方式中,还包括:训练模块,用于构建训练阶段的非线性变换单元,所述训练阶段的非线性变换单元包括第一非线性运算层、卷积处理层、逐点相乘运算层和逐点相加运算层;根据预先获取的训练数据训练得到经训练的非线性变换单元,所述经训练的非线性变换单元用于实现所述非线性变换处理。
第五方面,本申请实施例提供一种编码器,包括:一个或多个处理器;非瞬时性计算机可读存储介质,耦合到所述处理器并存储由所述处理器执行的程序,其中所述程序在由所述处理器执行时,使得所述解码器执行上述第一方面中任一项所述的方法。
第六方面,本申请实施例提供一种解码器,包括:一个或多个处理器;非瞬时性计算机可读存储介质,耦合到所述处理器并存储由所述处理器执行的程序,其中所述程序在由所述处理器执行时,使得所述解码器执行上述第二方面中任一项所述的方法。
第七方面,本申请实施例提供一种包括程序代码的计算机程序产品,当所述程序代码在计算机或处理器上执行时,用于执行上述第一至二方面中任一项所述的方法。
第八方面,本申请实施例提供一种计算机可读存储介质,包括指令,当其在计算机上运行时,使得计算机执行上述第一至二方面中任一项所述的方法。
第九方面,本申请实施例提供一种码流,该码流由处理器执行第一方面任一项所述的方法所生成。
第十方面,本申请实施例提供一种存储码流的装置,该装置包括:接收器和至少一个存储介质,接收器用于接收码流;至少一个存储介质用于存储码流;码流为根据第一方面任一项所述的方法生成的码流。
第十一方面,本申请实施例提供一种传输码流的装置,该装置包括:发送器和至少一个存储介质,至少一个存储介质用于存储码流,码流包括处理器根据第一方面的任一项所述的方法生成的码流;发送器用于将码流发送给其他电子设备。
第十二方面,本申请实施例提供一种分发码流的系统,该系统包括:至少一个存储介质和流媒体设备,至少一个存储介质用于存储至少一个码流,至少一个码流包括根据第一方面的任意一种实现方式生成的码流;流媒体设备用于从至少一个存储介质中获取目标码流,并将目标码流发送给端侧设备,其中,流媒体设备包括内容服务器或内容分发服务器。
第十三方面,本申请实施例提供一种分发码流的系统,该系统包括:通信接口,用于接收用户获取目标码流的请求;处理器,用于响应于用户请求,确定目标码流的存储位置;通信接口还用于将目标码流的存储位置发送给用户,以使得用户从目标码流的存储位置获取目标码流,其中,目标码流由处理器执行第一方面任一项所述的方法所生成。
附图说明
图1A为示例性译码系统10的示意性框图;
图1B是视频译码系统40的实例的说明图;
图2为本发明实施例提供的视频译码设备400的示意图;
图3为示例性实施例提供的装置500的简化框图;
图4为端到端的深度学习图像编解码框架的示例图;
图5为端到端的深度学习视频编解码框架的示例图;
图6为本申请实施例的应用场景的示例图;
图7为本申请实施例的应用场景的示例图;
图8为本申请实施例的应用场景的示例图;
图9为本申请实施例的图像编码方法的过程900的流程图;
图10a为具备局部注意力机制的非线性变换单元的结构图;
图10b为具备局部注意力机制的残差非线性变换单元的结构图;
图10c为具备注意力机制的残差非线性变换单元的结构图;
图10d为具备注意力机制的非线性变换单元的结构图;
图11为PWL函数示意图;
图12为卷积处理的一个示意图;
图13为编码网络的结构图;
图14为本申请实施例的图像解码方法的过程1300的流程图;
图15为解码网络的结构图;
图16a为ResAU的一个示例性的结构图;
图16b为ResAU的一个示例性的结构图;
图16c为ResAU的一个示例性的结构图;
图16d为ResAU的一个示例性的结构图;
图16e为ResAU的一个示例性的结构图;
图17a为ResAU在Kodak测试集的24张图像上的整体表现;
图17b为ResAU在Kodak测试集的24张图像上的整体表现;
图18为本申请实施例编码装置1700的一个示例性的结构示意图;
图19为本申请实施例解码装置1800的一个示例性的结构示意图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合本申请中的附图,对本申请中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
本申请的说明书实施例和权利要求书及附图中的术语“第一”、“第二”等仅用于区分描述的目的,而不能理解为指示或暗示相对重要性,也不能理解为指示或暗示顺序。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或单元。方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。
应当理解,在本申请中,“至少一个(项)”是指一个或者多个,“多个”是指两个或两个以上。“和/或”,用于描述关联对象的关联关系,表示可以存在三种关系,例如,“A和/或B”可以表示:只存在A,只存在B以及同时存在A和B三种情况,其中A,B可以是单数或者复数。字符“/”一般表示前后关联对象是一种“或”的关系。“以下至少一项(个)”或其类似表达,是指这些项中的任意组合,包括单项(个)或复数项(个)的任意组合。例如,a,b或c中的至少一项(个),可以表示:a,b,c,“a和b”,“a和c”,“b和c”,或“a和b和c”,其中a,b,c可以是单个,也可以是多个。
以下是本申请实施例涉及到的术语说明:
1、码率:在图像压缩中,编码单位像素所需要的平均编码长度。
2、率失真性能:用于衡量压缩算法性能的指标,综合考虑了码率和解码图像失真度两项数据。
3、注意力机制:利用有限的注意力资源从大量信息中筛选出高价值信息的手段。可以使神经网络更多关注输入中的相关部分,更少关注不相关的部分。
4、残差结构:神经网络中的一种常用连接结构,计算方式可表示为H(x)=x+f(x)。该结构可防止网络深度增加时可能出现的梯度消失和梯度爆炸问题。
5、非线性变换单元:包含非线性运算(例如,ReLU、Sigmoid、Tanh、PWL等运算)的网络单元,该单元的整体计算方式不符合线性特性。
由于本申请实施例涉及神经网络的应用,为了便于理解,下面先对本申请实施例所使用到的相关名词或术语进行解释说明:
1、神经网络
神经网络(neural network,NN)是机器学习模型,神经网络可以是由神经单元组成的,神经单元可以是指以xs和截距1为输入的运算单元,该运算单元的输出可以为:
h_{W,b}(x) = f(W^T x) = f(∑_{s=1}^{n} W_s·x_s + b)
其中,s=1、2、……n,n为大于1的自然数,Ws为xs的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入。激活函数可以是ReLU等非线性函数。神经网络是将许多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部感受野(local receptive field)相连,来提取局部感受野的特征,局部感受野可以是由若干个神经单元组成的区域。
2、多层感知器(multi-layer perception,MLP)
MLP是一种简单的深度神经网络(deep neural network,DNN)(不同层之间是全连接的),也称多层神经网络,可以理解为具有很多层隐含层的神经网络,这里的“很多”并没有特别的度量标准。从DNN按不同层的位置划分,DNN内部的神经网络可以分为三类:输入层,隐含层,输出层。一般来说第一层是输入层,最后一层是输出层,中间的层数都是隐含层。层与层之间是全连接的,也就是说,第i层的任意一个神经元一定与第i+1层的任意一个神经元相连。虽然DNN看起来很复杂,但是就每一层的工作来说,其实并不复杂,简单来说就是如下线性关系表达式:
y = α(W x + b),其中,x 是输入向量,y 是输出向量,b 是偏移向量,W是权重矩阵(也称系数),α()是激活函数。每一层仅仅是对输入向量 x 经过如此简单的操作得到输出向量 y。由于DNN层数多,则系数W和偏移向量 b 的数量也就很多了。这些参数在DNN中的定义如下所述:以系数W为例:假设在一个三层的DNN中,第二层的第4个神经元到第三层的第2个神经元的线性系数定义为 W^3_{24},上标3代表系数W所在的层数,而下标对应的是输出的第三层索引2和输入的第二层索引4。总结就是:第L-1层的第k个神经元到第L层的第j个神经元的系数定义为 W^L_{jk}。需要注意的是,输入层是没有W参数的。在深度神经网络中,更多的隐含层让网络更能够刻画现实世界中的复杂情形。理论上而言,参数越多的模型复杂度越高,"容量"也就越大,也就意味着它能完成更复杂的学习任务。训练深度神经网络的也就是学习权重矩阵的过程,其最终目的是得到训练好的深度神经网络的所有层的权重矩阵(由很多层的向量W形成的权重矩阵)。
3、卷积神经网络
卷积神经网络(convolutional neuron network,CNN)是一种带有卷积结构的深度神经网络,是一种深度学习(deep learning)架构,深度学习架构是指通过机器学习的算法,在不同的抽象层级上进行多个层次的学习。作为一种深度学习架构,CNN是一种前馈(feed-forward)人工神经网络,该前馈人工神经网络中的各个神经元可以对输入其中的图像作出响应。卷积神经网络包含了一个由卷积层和池化层构成的特征抽取器。该特征抽取器可以看作是滤波器,卷积过程可以看作是使用一个可训练的滤波器与一个输入的图像或者卷积 特征平面(feature map)做卷积。
卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。卷积层可以包括很多个卷积算子,卷积算子也称为核,其在图像处理中的作用相当于一个从输入图像矩阵中提取特定信息的过滤器,卷积算子本质上可以是一个权重矩阵,这个权重矩阵通常被预先定义,在对图像进行卷积操作的过程中,权重矩阵通常在输入图像上沿着水平方向一个像素接着一个像素(或两个像素接着两个像素……这取决于步长stride的取值)的进行处理,从而完成从图像中提取特定特征的工作。该权重矩阵的大小应该与图像的大小相关,需要注意的是,权重矩阵的纵深维度(depth dimension)和输入图像的纵深维度是相同的,在进行卷积运算的过程中,权重矩阵会延伸到输入图像的整个深度。因此,和一个单一的权重矩阵进行卷积会产生一个单一纵深维度的卷积化输出,但是大多数情况下不使用单一权重矩阵,而是应用多个尺寸(行×列)相同的权重矩阵,即多个同型矩阵。每个权重矩阵的输出被堆叠起来形成卷积图像的纵深维度,这里的维度可以理解为由上面所述的“多个”来决定。不同的权重矩阵可以用来提取图像中不同的特征,例如一个权重矩阵用来提取图像边缘信息,另一个权重矩阵用来提取图像的特定颜色,又一个权重矩阵用来对图像中不需要的噪点进行模糊化等。该多个权重矩阵尺寸(行×列)相同,经过该多个尺寸相同的权重矩阵提取后的特征图的尺寸也相同,再将提取到的多个尺寸相同的特征图合并形成卷积运算的输出。这些权重矩阵中的权重值在实际应用中需要经过大量的训练得到,通过训练得到的权重值形成的各个权重矩阵可以用来从输入图像中提取信息,从而使得卷积神经网络进行正确的预测。当卷积神经网络有多个卷积层的时候,初始的卷积层往往提取较多的一般特征,该一般特征也可以称之为低级别的特征;随着卷积神经网络深度的加深,越往后的卷积层提取到的特征越来越复杂,比如高级别的语义之类的特征,语义越高的特征越适用于待解决的问题。
由于常常需要减少训练参数的数量,因此卷积层之后常常需要周期性的引入池化层,可以是一层卷积层后面跟一层池化层,也可以是多层卷积层后面接一层或多层池化层。在图像处理过程中,池化层的唯一目的就是减少图像的空间大小。池化层可以包括平均池化算子和/或最大池化算子,以用于对输入图像进行采样得到较小尺寸的图像。平均池化算子可以在特定范围内对图像中的像素值进行计算产生平均值作为平均池化的结果。最大池化算子可以在特定范围内取该范围内值最大的像素作为最大池化的结果。另外,就像卷积层中用权重矩阵的大小应该与图像尺寸相关一样,池化层中的运算符也应该与图像的大小相关。通过池化层处理后输出的图像尺寸可以小于输入池化层的图像的尺寸,池化层输出的图像中每个像素点表示输入池化层的图像的对应子区域的平均值或最大值。
在经过卷积层/池化层的处理后,卷积神经网络还不足以输出所需要的输出信息。因为如前所述,卷积层/池化层只会提取特征,并减少输入图像带来的参数。然而为了生成最终的输出信息(所需要的类信息或其他相关信息),卷积神经网络需要利用神经网络层来生成一个或者一组所需要的类的数量的输出。因此,在神经网络层中可以包括多层隐含层,该多层隐含层中所包含的参数可以根据具体的任务类型的相关训练数据进行预先训练得到,例如该任务类型可以包括图像识别,图像分类,图像超分辨率重建等等。
可选的,在神经网络层中的多层隐含层之后,还包括整个卷积神经网络的输出层,该输出层具有类似分类交叉熵的损失函数,具体用于计算预测误差,一旦整个卷积神经网络 的前向传播完成,反向传播就会开始更新前面提到的各层的权重值以及偏差,以减少卷积神经网络的损失,及卷积神经网络通过输出层输出的结果和理想结果之间的误差。
4、循环神经网络
循环神经网络(recurrent neural networks,RNN)是用来处理序列数据的。在传统的神经网络模型中,是从输入层到隐含层再到输出层,层与层之间是全连接的,而对于每一层层内之间的各个节点是无连接的。这种普通的神经网络虽然解决了很多难题,但是却仍然对很多问题却无能无力。例如,你要预测句子的下一个单词是什么,一般需要用到前面的单词,因为一个句子中前后单词并不是独立的。RNN之所以称为循环神经网路,即一个序列当前的输出与前面的输出也有关。具体的表现形式为网络会对前面的信息进行记忆并应用于当前输出的计算中,即隐含层本层之间的节点不再无连接而是有连接的,并且隐含层的输入不仅包括输入层的输出还包括上一时刻隐含层的输出。理论上,RNN能够对任何长度的序列数据进行处理。对于RNN的训练和对传统的CNN或DNN的训练一样。同样使用误差反向传播算法,不过有一点区别:即,如果将RNN进行网络展开,那么其中的参数,如W,是共享的;而如上举例上述的传统神经网络却不是这样。并且在使用梯度下降算法中,每一步的输出不仅依赖当前步的网络,还依赖前面若干步网络的状态。该学习算法称为基于时间的反向传播算法(Back propagation Through Time,BPTT)。
既然已经有了卷积神经网络,为什么还要循环神经网络?原因很简单,在卷积神经网络中,有一个前提假设是:元素之间是相互独立的,输入与输出也是独立的,比如猫和狗。但现实世界中,很多元素都是相互连接的,比如股票随时间的变化,再比如一个人说了:我喜欢旅游,其中最喜欢的地方是云南,以后有机会一定要去。这里填空,人类应该都知道是填“云南”。因为人类会根据上下文的内容进行推断,但如何让机器做到这一步?RNN就应运而生了。RNN旨在让机器像人一样拥有记忆的能力。因此,RNN的输出就需要依赖当前的输入信息和历史的记忆信息。
5、损失函数
在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断的调整,直到深度神经网络能够预测出真正想要的目标值或与真正想要的目标值非常接近的值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。
6、反向传播算法
卷积神经网络可以采用误差反向传播(back propagation,BP)算法在训练过程中修正初始的超分辨率模型中参数的大小,使得超分辨率模型的重建误差损失越来越小。具体地,前向传递输入信号直至输出会产生误差损失,通过反向传播误差损失信息来更新初始的超分辨率模型中参数,从而使误差损失收敛。反向传播算法是以误差损失为主导的反向传播 运动,旨在得到最优的超分辨率模型的参数,例如权重矩阵。
7、生成式对抗网络
生成式对抗网络(generative adversarial networks,GAN)是一种深度学习模型。该模型中至少包括两个模块:一个模块是生成模型(Generative Model),另一个模块是判别模型(Discriminative Model),通过这两个模块互相博弈学习,从而产生更好的输出。生成模型和判别模型都可以是神经网络,具体可以是深度神经网络,或者卷积神经网络。GAN的基本原理如下:以生成图片的GAN为例,假设有两个网络,G(Generator)和D(Discriminator),其中G是一个生成图片的网络,它接收一个随机的噪声z,通过这个噪声生成图片,记做G(z);D是一个判别网络,用于判别一张图片是不是“真实的”。它的输入参数是x,x代表一张图片,输出D(x)代表x为真实图片的概率,如果为1,就代表130%是真实的图片,如果为0,就代表不可能是真实的图片。在对该生成式对抗网络进行训练的过程中,生成网络G的目标就是尽可能生成真实的图片去欺骗判别网络D,而判别网络D的目标就是尽量把G生成的图片和真实的图片区分开来。这样,G和D就构成了一个动态的“博弈”过程,也即“生成式对抗网络”中的“对抗”。最后博弈的结果,在理想的状态下,G可以生成足以“以假乱真”的图片G(z),而D难以判定G生成的图片究竟是不是真实的,即D(G(z))=0.5。这样就得到了一个优异的生成模型G,它可以用来生成图片。
随着卷积神经网络(convolution neural network,CNN)在图像识别、目标检测等计算机视觉任务中的表现远超传统算法,越来越多的研究者开始探索基于深度学习的图像/视频压缩方法。一些研究者设计出端到端的深度学习图像/视频压缩算法,例如,将编码网络、熵估计网络、熵编码网络、熵解码网络、解码网络等模块作为一个整体同时优化,其中,编码网络和解码网络也可称为变换模块和逆变换模块,一般由卷积层和非线性变换单元组成。
图1A为示例性译码系统10的示意性框图,例如可以利用本申请技术的视频译码系统10(或简称为译码系统10)。视频译码系统10中的视频编码器20(或简称为编码器20)和视频解码器30(或简称为解码器30)代表可用于根据本申请中描述的各种示例执行各技术的设备等。
如图1A所示,译码系统10包括源设备12,源设备12用于将编码图像等编码图像数据21提供给用于对编码图像数据21进行解码的目的设备14。
源设备12包括编码器20,另外即可选地,可包括图像源16、图像预处理器等预处理器(或预处理单元)18、通信接口(或通信单元)22。
图像源16可包括或可以为任意类型的用于捕获现实世界图像等的图像捕获设备,和/或任意类型的图像生成设备,例如用于生成计算机动画图像的计算机图形处理器或任意类型的用于获取和/或提供现实世界图像、计算机生成图像(例如,屏幕内容、虚拟现实(virtual reality,VR)图像和/或其任意组合(例如增强现实(augmented reality,AR)图像)的设备。所述图像源可以为存储上述图像中的任意图像的任意类型的内存或存储器。
为了区分预处理器(或预处理单元)18执行的处理,图像(或图像数据)17也可称为原始图像(或原始图像数据)17。
预处理器18用于接收(原始)图像数据17,并对图像数据17进行预处理,得到预处 理图像(或预处理图像数据)19。例如,预处理器18执行的预处理可包括修剪、颜色格式转换(例如从RGB转换为YCbCr)、调色或去噪。可以理解的是,预处理单元18可以为可选组件。
视频编码器(或编码器)20用于接收预处理图像数据19并提供编码图像数据21(下面将根据图2等进一步描述)。
源设备12中的通信接口22可用于:接收编码图像数据21并通过通信信道13向目的设备14等另一设备或任何其它设备发送编码图像数据21(或其它任意处理后的版本),以便存储或直接重建。
目的设备14包括解码器30,另外即可选地,可包括通信接口(或通信单元)28、后处理器(或后处理单元)32和显示设备34。
目的设备14中的通信接口28用于直接从源设备12或从存储设备等任意其它源设备接收编码图像数据21(或其它任意处理后的版本),例如,存储设备为编码图像数据存储设备,并将编码图像数据21提供给解码器30。
通信接口22和通信接口28可用于通过源设备12与目的设备14之间的直连通信链路,例如直接有线或无线连接等,或者通过任意类型的网络,例如有线网络、无线网络或其任意组合、任意类型的私网和公网或其任意类型的组合,发送或接收编码图像数据(或编码数据)21。
例如,通信接口22可用于将编码图像数据21封装为报文等合适的格式,和/或使用任意类型的传输编码或处理来处理所述编码后的图像数据,以便在通信链路或通信网络上进行传输。
通信接口28与通信接口22对应,例如,可用于接收传输数据,并使用任意类型的对应传输解码或处理和/或解封装对传输数据进行处理,得到编码图像数据21。
通信接口22和通信接口28均可配置为如图1A中从源设备12指向目的设备14的对应通信信道13的箭头所指示的单向通信接口,或双向通信接口,并且可用于发送和接收消息等,以建立连接,确认并交换与通信链路和/或例如编码后的图像数据传输等数据传输相关的任何其它信息,等等。
视频解码器(或解码器)30用于接收编码图像数据21并提供解码图像数据(或解码图像数据)31(下面将根据图3等进一步描述)。
后处理器32用于对解码后的图像等解码图像数据31(也称为重建后的图像数据)进行后处理,得到后处理后的图像等后处理图像数据33。后处理单元32执行的后处理可以包括例如颜色格式转换(例如从YCbCr转换为RGB)、调色、修剪或重采样,或者用于产生供显示设备34等显示的解码图像数据31等任何其它处理。
显示设备34用于接收后处理图像数据33,以向用户或观看者等显示图像。显示设备34可以为或包括任意类型的用于表示重建后图像的显示器,例如,集成或外部显示屏或显示器。例如,显示屏可包括液晶显示器(liquid crystal display,LCD)、有机发光二极管(organic light emitting diode,OLED)显示器、等离子显示器、投影仪、微型LED显示器、硅基液晶显示器(liquid crystal on silicon,LCoS)、数字光处理器(digital light processor,DLP)或任意类型的其它显示屏。
译码系统10还包括训练引擎25,训练引擎25用于训练编码器20或解码器30以实 现图像域和特征域的转换。
本申请实施例中训练数据可以存入数据库(未示意)中,训练引擎25基于训练数据训练得到编/解码网络。需要说明的是,本申请实施例对于训练数据的来源不做限定,例如可以是从云端或其他地方获取训练数据进行模型训练。
尽管图1A示出了源设备12和目的设备14作为独立的设备,但设备实施例也可以同时包括源设备12和目的设备14或同时包括源设备12和目的设备14的功能,即同时包括源设备12或对应功能和目的设备14或对应功能。在这些实施例中,源设备12或对应功能和目的设备14或对应功能可以使用相同硬件和/或软件或通过单独的硬件和/或软件或其任意组合来实现。
根据描述,图1A所示的源设备12和/或目的设备14中的不同单元或功能的存在和(准确)划分可能根据实际设备和应用而有所不同,这对技术人员来说是显而易见的。
编码器20(例如视频编码器20)或解码器30(例如视频解码器30)或两者都可通过如图1B所示的处理电路实现,例如一个或多个微处理器、数字信号处理器(digital signal processor,DSP)、专用集成电路(application-specific integrated circuit,ASIC)、现场可编程门阵列(field-programmable gate array,FPGA)、离散逻辑、硬件、视频编码专用处理器或其任意组合。编码器20可以通过处理电路46实现,以包含参照图2编码器20论述的各种模块和/或本文描述的任何其它编码器系统或子系统。解码器30可以通过处理电路46实现,以包含参照图3解码器30论述的各种模块和/或本文描述的任何其它解码器系统或子系统。所述处理电路46可用于执行下文论述的各种操作。如图8所示,如果部分技术在软件中实施,则设备可以将软件的指令存储在合适的非瞬时性计算机可读存储介质中,并且使用一个或多个处理器在硬件中执行指令,从而执行本发明技术。视频编码器20和视频解码器30中的其中一个可作为组合编解码器(encoder/decoder,CODEC)的一部分集成在单个设备中,如图1B所示。
源设备12和目的设备14可包括各种设备中的任一种,包括任意类型的手持设备或固定设备,例如,笔记本电脑或膝上型电脑、手机、智能手机、平板或平板电脑、相机、台式计算机、机顶盒、电视机、显示设备、数字媒体播放器、视频游戏控制台、视频流设备(例如,内容业务服务器或内容分发服务器)、广播接收设备、广播发射设备,等等,并可以不使用或使用任意类型的操作系统。在一些情况下,源设备12和目的设备14可配备用于无线通信的组件。因此,源设备12和目的设备14可以是无线通信设备。
在一些情况下,图1A所示的视频译码系统10仅仅是示例性的,本申请提供的技术可适用于视频编码设置(例如,视频编码或视频解码),这些设置不一定包括编码设备与解码设备之间的任何数据通信。在其它示例中,数据从本地存储器中检索,通过网络发送,等等。视频编码设备可以对数据进行编码并将数据存储到存储器中,和/或视频解码设备可以从存储器中检索数据并对数据进行解码。在一些示例中,编码和解码由相互不通信而只是编码数据到存储器和/或从存储器中检索并解码数据的设备来执行。
图1B是视频译码系统40的实例的说明图。视频译码系统40可以包含成像设备41、视频编码器20、视频解码器30(和/或藉由处理电路46实施的视频编/解码器)、天线42、一个或多个处理器43、一个或多个内存存储器44和/或显示设备45。
如图1B所示,成像设备41、天线42、处理电路46、视频编码器20、视频解码器30、 处理器43、内存存储器44和/或显示设备45能够互相通信。在不同实例中,视频译码系统40可以只包含视频编码器20或只包含视频解码器30。
在一些实例中,天线42可以用于传输或接收视频数据的经编码比特流。另外,在一些实例中,显示设备45可以用于呈现视频数据。处理电路46可以包含专用集成电路(application-specific integrated circuit,ASIC)逻辑、图形处理器、通用处理器等。视频译码系统40也可以包含可选的处理器43,该可选处理器43类似地可以包含专用集成电路(application-specific integrated circuit,ASIC)逻辑、图形处理器、通用处理器等。另外,内存存储器44可以是任何类型的存储器,例如易失性存储器(例如,静态随机存取存储器(static random access memory,SRAM)、动态随机存储器(dynamic random access memory,DRAM)等)或非易失性存储器(例如,闪存等)等。在非限制性实例中,内存存储器44可以由超速缓存内存实施。在其它实例中,处理电路46可以包含存储器(例如,缓存等)用于实施图像缓冲器等。
在一些实例中,通过逻辑电路实施的视频编码器20可以包含(例如,通过处理电路46或内存存储器44实施的)图像缓冲器和(例如,通过处理电路46实施的)图形处理单元。图形处理单元可以通信耦合至图像缓冲器。图形处理单元可以包含通过处理电路46实施的视频编码器20,以实施参照图2和/或本文中所描述的任何其它编码器系统或子系统所论述的各种模块。逻辑电路可以用于执行本文所论述的各种操作。
在一些实例中,视频解码器30可以以类似方式通过处理电路46实施,以实施参照图3的视频解码器30和/或本文中所描述的任何其它解码器系统或子系统所论述的各种模块。在一些实例中,逻辑电路实施的视频解码器30可以包含(通过处理电路46或内存存储器44实施的)图像缓冲器和(例如,通过处理电路46实施的)图形处理单元。图形处理单元可以通信耦合至图像缓冲器。图形处理单元可以包含通过处理电路46实施的视频解码器30,以实施参照图3和/或本文中所描述的任何其它解码器系统或子系统所论述的各种模块。
在一些实例中,天线42可以用于接收视频数据的经编码比特流。如所论述,经编码比特流可以包含本文所论述的与编码视频帧相关的数据、指示符、索引值、模式选择数据等,例如与编码分割相关的数据(例如,变换系数或经量化变换系数,(如所论述的)可选指示符,和/或定义编码分割的数据)。视频译码系统40还可包含耦合至天线42并用于解码经编码比特流的视频解码器30。显示设备45用于呈现视频帧。
应理解,本申请实施例中对于参考视频编码器20所描述的实例,视频解码器30可以用于执行相反过程。关于信令语法元素,视频解码器30可以用于接收并解析这种语法元素,相应地解码相关视频数据。在一些例子中,视频编码器20可以将语法元素熵编码成经编码视频比特流。在此类实例中,视频解码器30可以解析这种语法元素,并相应地解码相关视频数据。
为便于描述,参考通用视频编码(Versatile video coding,VVC)参考软件或由ITU-T视频编码专家组(Video Coding Experts Group,VCEG)和ISO/IEC运动图像专家组(Motion Picture Experts Group,MPEG)的视频编码联合工作组(Joint Collaboration Team on Video Coding,JCT-VC)开发的高性能视频编码(High-Efficiency Video Coding,HEVC)描述本发明实施例。本领域普通技术人员理解本发明实施例不限于HEVC或VVC。
图2为本发明实施例提供的视频译码设备400的示意图。视频译码设备400适用于实现本文描述的公开实施例。在一个实施例中,视频译码设备400可以是解码器,例如图1A中的视频解码器30,也可以是编码器,例如图1A中的视频编码器20。
视频译码设备400包括:用于接收数据的入端口410(或输入端口410)和接收单元(receiver unit,Rx)420;用于处理数据的处理器、逻辑单元或中央处理器(central processing unit,CPU)430;例如,这里的处理器430可以是神经网络处理器430;用于传输数据的发送单元(transmitter unit,Tx)440和出端口450(或输出端口450);用于存储数据的存储器460。视频译码设备400还可包括耦合到入端口410、接收单元420、发送单元440和出端口450的光电(optical-to-electrical,OE)组件和电光(electrical-to-optical,EO)组件,用于光信号或电信号的出口或入口。
处理器430通过硬件和软件实现。处理器430可实现为一个或多个处理器芯片、核(例如,多核处理器)、FPGA、ASIC和DSP。处理器430与入端口410、接收单元420、发送单元440、出端口450和存储器460通信。处理器430包括译码模块470(例如,基于神经网络NN的译码模块470)。译码模块470实施上文所公开的实施例。例如,译码模块470执行、处理、准备或提供各种编码操作。因此,通过译码模块470为视频译码设备400的功能提供了实质性的改进,并且影响了视频译码设备400到不同状态的切换。或者,以存储在存储器460中并由处理器430执行的指令来实现译码模块470。
存储器460包括一个或多个磁盘、磁带机和固态硬盘,可以用作溢出数据存储设备,用于在选择执行程序时存储此类程序,并且存储在程序执行过程中读取的指令和数据。存储器460可以是易失性和/或非易失性的,可以是只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、三态内容寻址存储器(ternary content-addressable memory,TCAM)和/或静态随机存取存储器(static random-access memory,SRAM)。
图3为示例性实施例提供的装置500的简化框图,装置500可用作图1A中的源设备12和目的设备14中的任一个或两个。
装置500中的处理器502可以是中央处理器。或者,处理器502可以是现有的或今后将研发出的能够操控或处理信息的任何其它类型设备或多个设备。虽然可以使用如图所示的处理器502等单个处理器来实施已公开的实现方式,但使用一个以上的处理器速度更快和效率更高。
在一种实现方式中,装置500中的存储器504可以是只读存储器(ROM)设备或随机存取存储器(RAM)设备。任何其它合适类型的存储设备都可以用作存储器504。存储器504可以包括处理器502通过总线512访问的代码和数据506。存储器504还可包括操作系统508和应用程序510,应用程序510包括允许处理器502执行本文所述方法的至少一个程序。例如,应用程序510可以包括应用1至N,还包括执行本文所述方法的视频译码应用。
装置500还可以包括一个或多个输出设备,例如显示器518。在一个示例中,显示器518可以是将显示器与可用于感测触摸输入的触敏元件组合的触敏显示器。显示器518可以通过总线512耦合到处理器502。
虽然装置500中的总线512在本文中描述为单个总线,但是总线512可以包括多个总 线。此外,辅助储存器可以直接耦合到装置500的其它组件或通过网络访问,并且可以包括存储卡等单个集成单元或多个存储卡等多个单元。因此,装置500可以具有各种各样的配置。
图4为端到端的深度学习图像编解码框架的示例图,如图4所示,该图像编解码框架包括编码端:编码网络(Encoder)、量化模块、熵编码网络;解码端:熵解码网络、解码网络(Decoder),以及熵估计网络。
在编码端,原图经过编码网络的处理从图像域变换到特征域,变换后的图像特征经过量化模块和熵编码网络的处理被编码成待传输或待存储的码流。在解码端,码流经过熵解码网络的处理被解码成图像特征,图像特征经过解码网络的处理从特征域变换到图像域,从而得到重建图像。熵估计网络根据图像特征估计得到每个特征元素的估计概率值,以用于熵编码网络和熵解码网络的处理。
本实施例中,编码网络(Encoder)和解码网络(Decoder)中都具有非线性变换单元。
图5为端到端的深度学习视频编解码框架的示例图,如图5所示,该视频编解码框架包括预测模块(predict model)和残差压缩(residual compress)模块,
预测模块利用前一帧的重建图像对当前帧进行预测得到预测图像,残差压缩模块一方面对当前帧的原始图像和预测图像之间的残差进行压缩,另一方面解压缩得到重建残差,重建残差和预测图像求和得到当前帧的重建图像。预测模块和残差压缩模块中的编码子网络和解码子网络中均具有非线性变换单元。
本实施例中,预测模块(predict model)和残差压缩(residual compress)模块中都具有非线性变换单元。
图6为本申请实施例的应用场景的示例图,如图6所示,该应用场景可以是终端、云服务器、视频监控中涉及图像/视频采集、存储或传输的业务,例如,终端拍照/录像、相册、云相册、视频监控等。
编码端:摄像头(Camera)采集图像/视频。人工智能(artificial intelligence,AI)图像/视频编码网络对图像/视频进行从特征提取得到冗余度较低的图像特征,进而基于图像特征进行压缩得到码流/图像文件。
解码端:当需要输出图像/视频时,AI图像/视频解码网络对码流/图像文件进行解压缩得到图像特征,再对图像特征进行反特征提取得到重建的图像/视频。
存储/传输模块将压缩得到的码流/图像文件针对不同业务进行存储(例如,终端拍照、视频监控、云服务器等)或传输(例如,云服务,直播技术等)。
图7为本申请实施例的应用场景的示例图,如图7所示,该应用场景可以是终端、视频监控中涉及图像/视频采集、存储或传输的业务,例如,终端相册、视频监控、直播等。
编码端:编码网络将图像/视频变换成冗余度更低的图像特征,其通常包含非线性变换单元,具有非线性特性。熵估计网络负责计算图像特征中各个数据的编码概率。熵编码网络根据各个数据对应的概率,对图像特征进行无损编码得到码流/图像文件,进一步降低图像压缩过程中的数据传输量。
解码端:熵解码网络根据各个数据对应的概率,对码流/图像文件进行无损解码得到重建的图像特征。解码网络对熵解码输出的图像特征进行反变换,将其解析为图像/视频。和编码网络对应,其通常包含非线性变换单元,具有非线性特性。保存模块将码流/图像文件 保存至终端的对应存储位置。加载模块从终端的对应存储位置加载码流/图像文件,并输入到熵解码网络。
图8为本申请实施例的应用场景的示例图,如图8所示,该应用场景可以是云、视频监控中涉及图像/视频采集、存储或传输的业务,例如,云相册、视频监控、直播等。
编码端:本地获取图像/视频,对其进行图像(JPEG)编码得到压缩图像/视频,之后向云端发送压缩图像/视频。云端对压缩图像/视频进行JPEG解码得到图像/视频,之后对图像/视频进行压缩得到码流/图像文件并存储。
解码端:当本地需要从云端获取图像/视频时,云端对码流/图像文件进行解压缩得到图像/视频,之后对图像/视频进行JPEG编码,得到压缩图像/视频,向本地发送压缩图像/视频。本地对压缩图像/视频进行JPEG解码得到图像/视频。云端的结构以及各个模块的用途可以参考图7的结构以及各个模块的用途,本申请实施例在此不做赘述。
基于上述编/解码网络和应用场景,本申请实施例提供了一种图像编/解码方法,以实现高效的非线性变换处理,提升图像/视频压缩算法中的率失真性能。
图9为本申请实施例的图像编码方法的过程900的流程图。过程900可由上述实施例中的编码端执行。过程900描述为一系列的步骤或操作,应当理解的是,过程900可以以各种顺序执行和/或同时发生,不限于图9所示的执行顺序。过程900包括如下步骤:
步骤901、获取待处理的第一图像特征。
第一图像特征是编码端在获取到待处理图像后,对其进行从图像域到特征域的转换后得到的,这里的转换可以包括:1、卷积处理,使用卷积层来进行提取特征,具有局部感受野,权重共享机制(即每一个滤波器滑动处理输入特征)。2、使用MLP或全连接层提取特征,具有全局感受野特性,权重不共享。3、变换器(Transformer)处理,其中包括矩阵相乘、MLP和归一化处理,具有全局感受野特性,捕捉长距离依赖能力强。
第一图像特征可以表示成二维矩阵(L×C,L表示长度,C表示通道(channel))或三维矩阵(C×H×W,C表示通道数,H表示高度,W表示宽度)的形式,具体的形式与前述转换方式相关联,例如,采用卷积处理或者MLP提取得到的第一图像特征一般对应于三维矩阵,而采用变换器处理得到的第一图像特征一般对应于二维矩阵。
例如,第一图像特征表示为一个二维矩阵:
A = [ a(0,0) a(0,1) a(0,2); a(1,0) a(1,1) a(1,2); a(2,0) a(2,1) a(2,2) ]
该二维矩阵A是3×3的矩阵,包含9个元素,每个元素a(i,j)对应第一图像特征的一个特征值,其中,i表示元素a(i,j)对应的长度,j表示元素a(i,j)所在的通道。
又例如,第一图像特征表示为一个三维矩阵:
B = [ a(i,j,l) ],其中 i∈{0,1,2},j∈{0,1,2},l∈{0,1}
该三维矩阵B是3×3×2的矩阵,包含18个元素,每个元素a(i,j,l)对应第一图像特征的一个特征值,其中,i表示元素a(i,j,l)所在的行,j表示元素a(i,j,l)所在的列,l表示元素a(i,j,l)所在的通道。
需要说明的是,本申请实施例对第一图像特征的获取方式不做具体限定。
待处理图像可以是一张图片,也可以是视频中的一帧图像帧,还可以是前述图片或图像帧分割得到的图像块,对此不做具体限定。
步骤902、对第一图像特征进行非线性变换处理得到经处理的图像特征。
在一种可能的实现方式中,对第一图像特征中的每个特征值进行第一非线性运算得到第二图像特征;对第二图像特征进行卷积处理得到第三图像特征,第三图像特征中的多个特征值和第一图像特征中的多个特征值相对应;将第一图像特征和第三图像特征中相对应的多个特征值进行逐点相乘运算得到经处理的图像特征。
图10a为具备注意力机制的非线性变换单元的结构图,如图10a所示,本申请实施例中,非线性变换单元用于实现上述非线性变换处理,其包括第一非线性运算、卷积处理和逐点相乘运算。
第一非线性运算是针对上述第一图像特征中的每个特征值进行的运算,可以包括取绝对值运算、ReLU系列、Sigmoid、Tanh、PWL运算等。其中,
取绝对值是指针对输入的特征值取其绝对值。可以用如下公式表示:
f(x) = |x|
分段线性映射包括修正线性单元(ReLU),或者泄漏修正线性单元(LeakyReLU),或者PWL运算。其中,ReLU是一种分段线性映射方法,针对输入的特征值,小于0的特征值输出为0,大于或等于0的特征值恒等输出。可以用如下公式表示:
f(x) = max(0, x)
LeakyReLU是一种分段线性映射方法,其在ReLU的基础上,将小于0的输入特征值用预先设定的权重进行缩放,权重通常为0.01。可以用如下公式表示:
f(x) = x,x ≥ 0;f(x) = a·x,x < 0
其中,a表示预先设定值,通常设置为0.01。
Sigmoid可以表示为如下运算:
f(x) = 1 / (1 + e^{-x})
Tanh可以表示为如下运算:
f(x) = (e^x − e^{-x}) / (e^x + e^{-x})
PWL可以表示为如下运算:
PWL(x) = Y_P[0] + K_L·(x − B_L),x < B_L;PWL(x) = Y_P[idx] + K_idx·(x − B_idx),B_L ≤ x < B_R;PWL(x) = Y_P[N] + K_R·(x − B_R),x ≥ B_R   (1)
其中,N表示分段数;B L表示左边界;B R表示右边界;Y P表示N+1个分界点对应的y轴坐标值;K L表示最左斜率;K R表示最右斜率;idx表示x所属分段的索引号;B idx和K idx是该分段的相应左边界和斜率。
d表示分段长度,这些值的计算如下:
d = (B_R − B_L) / N   (2)
B_idx = B_L + idx·d   (3)
K_idx = (Y_P[idx+1] − Y_P[idx]) / d   (4)
需要说明的是,公式(1)可用于不同粒度(按层甚至按通道均可)。按通道应用时,B L,B R,Y P,K L,K R是逐通道参数。可选的,可以在所有的PWL中使用相同的超参N。
PWL的优点有:作为一个通用逼近器,PWL可以逼近任何连续的有界标量函数;PWL随参数(超参N除外)连续变化,非常利于基于梯度优化;灵活性集中在有界区域内,因此可以最大限度地利用可学习参数;由于分段均匀,PWL计算效率高,尤其是在推理方面。
图11为PWL函数示意图,如图11所示,分段数N是一个超参,影响PWL函数的拟合能力。分段数越大,自由度越高,模型容量也越大。左右边界B L和B R限定了PWL关注的主要有效区域。[B L,B R]被均匀划分为N个分段,得到N+1个分界点。每个分界点都有对应的y轴坐标值Y P,这些坐标就决定了PWL的曲线形状。[B L,B R]之外的区域,可以用两个斜率K L和K R来控制边界外的区域形状。
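As a rough illustration of the PWL mapping defined by formulas (1)–(4) above, the following NumPy sketch evaluates a piecewise linear function from its parameters B_L, B_R, Y_P (N+1 breakpoint values), K_L and K_R; the function name and the vectorized implementation details are assumptions made for the example, and per-channel parameterization is omitted.

```python
import numpy as np

def pwl(x, b_l, b_r, y_p, k_l, k_r):
    """Piecewise linear mapping with N uniform segments on [b_l, b_r].

    y_p holds the N+1 breakpoint values; k_l / k_r are the slopes used
    outside the left / right boundary. All parameters may be preset or learned.
    """
    x = np.asarray(x, dtype=np.float64)
    y_p = np.asarray(y_p, dtype=np.float64)
    n = len(y_p) - 1                                   # number of segments N
    d = (b_r - b_l) / n                                # segment length, formula (2)
    idx = np.clip(np.floor((x - b_l) / d), 0, n - 1).astype(int)
    b_idx = b_l + idx * d                              # segment left boundary, formula (3)
    k_idx = (y_p[idx + 1] - y_p[idx]) / d              # segment slope, formula (4)
    y = y_p[idx] + k_idx * (x - b_idx)                 # inside [b_l, b_r)
    y = np.where(x < b_l, y_p[0] + k_l * (x - b_l), y)   # left of B_L
    y = np.where(x >= b_r, y_p[-1] + k_r * (x - b_r), y) # right of B_R
    return y
```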
需要说明的是,分段线性映射还可以采用ReLU的其它变形方式,也可以采用其它新的第一非线性运算方式,本申请实施例对此均不作具体限定。
经过上述第一非线性运算,第一图像特征转换成第二图像特征,第二图像特征也可以如第一图像特征表示成矩阵的形式,由于第一非线性运算是针对第一图像特征中的每个特征值的,因此第一图像特征的每个特征值在第二图像特征值中均有一个特征值与其相对应,因此第二图像特征对应的矩阵和第一图像特征对应的矩阵的尺寸相同,且相同位置的特征值(矩阵元素的值)相对应,例如,第一图像特征表示成上述矩阵A那样的3×3的矩阵,那么第二图像特征也可以表示成3×3的矩阵,但由于第一图像特征和第二图像特征经过了第一非线性运算,因此第一图像特征和第二图像特征中的特征值不完全相同,相应的,二者分别对应的矩阵中的元素值也不完全相同。
对第二图像特征进行卷积处理,输出第三图像特征,第三图像特征可以认为是第二图像特征的局部响应(即校正后的值),即第三图像特征是第二图像特征经过卷积处理得到的响应信号,由于卷积处理的感受野是有限的,经卷积处理输出的图像特征中的每个位置的响应值只和与该位置相邻位置的输入特征值有关,因此称为局部响应。卷积处理可以表示为如下公式:
conv1(x)=β+∑γ×x
其中,γ表示卷积层的权重,β表示卷积层的偏置参数。
经过卷积处理,第三图像特征对应的矩阵和第一图像特征对应的矩阵的尺寸也相同。图12为卷积处理的一个示意图,如图12所示,输入一个1*2的矩阵,经过卷积层的处理输出1*4的矩阵,该卷积层包括两个滤波器,一个是2*50的矩阵W1,另一个是50*4的矩阵W2。先由矩阵W1对输入矩阵进行卷积运算,得到1*50的矩阵,然后由矩阵W2对该矩阵进行卷积运算得到1*4的输出矩阵。
将第一图像特征和第三图像特征中相对应的多个特征值进行逐点相乘运算,即为将非线性变换单元的初始输入和卷积处理的输出进行逐点相乘,可以表示为如下公式:
c(i,j)=a(i,j)×b(i,j)
其中,(i,j)表示特征值在其所在图像特征中的索引,a(i,j)表示第一图像特征的特征值, b(i,j)表示第二图像特征的特征值,c(i,j)表示经处理的图像特征的特征值。
由此可见,经处理的图像特征对应的矩阵和第一图像特征对应的矩阵的尺寸也相同。
需要说明的是,本申请实施例中,两个图形特征中的多个特征值相对应可以是指该两个图像特征分别表示为矩阵后,两个矩阵中的相同位置上的元素的值具有运算关系,该二者相对应。例如,第一图像特征和第三图像特征均表示成上述矩阵A的形式,同位于a 0,2的位置的元素相对应。
此外,卷积处理也可以得到与第一图像特尺寸不同的第三图像特征,这取决于卷积处理的内部结构,尤其是在进行卷积处理时的滤波器的长、宽以及通道数,如果第三图像特征和第一图像特征的尺寸不同,相应的第三图像特征对应的矩阵中的元素和第一图像特征对应的矩阵中的元素也不是一一对应的,此时可以考虑第一图像特征对应的矩阵中的多个元素与第三图像特征对应的矩阵中的同一个元素相乘。例如,第一图像特征对应的矩阵的通道数为3,而第三图像特征对应的矩阵的通道数为1,可以将该第三图像特征对应的矩阵中的元素与第一图像特征对应的矩阵中各个通道中的同一位置的元素分别相乘。本申请实施例对此不做具体限定。
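The channel-broadcast case described in this paragraph (a single-channel third image feature multiplied into every channel of the first image feature at the same spatial position) can be reproduced directly with array broadcasting; the shapes below are illustrative only.

```python
import numpy as np

# First image feature: 3 channels, 4x4 spatial size (values are arbitrary).
first = np.random.randn(3, 4, 4)
# Third image feature with a single channel, e.g. produced by a convolution
# whose number of output channels is 1.
third = np.random.randn(1, 4, 4)

# Broadcasting multiplies the single-channel weights into every channel of the
# first feature at the same position, as described above.
processed = first * third
assert processed.shape == (3, 4, 4)
```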
上述第一非线性运算、卷积处理和逐点相乘运算可以表示为如下公式(1):
y = x × (β + ∑ γ × φ(x))   (1)
其中,x表示输入的特征值,y表示输出的特征值,
φ(·)表示第一非线性运算,γ表示卷积层的权重,β表示卷积层的偏置参数。
由于上述公式中采用了相乘运算,相较于相关技术(GDN除法归一化)的卷积处理中,
x / √(β + ∑ γ × x²) 的除法运算,一方面,不必为了迁就开平方根必须是正数的限定,而要求卷积参数γ和β必须是正数,另一方面,不必为了迁就除法要求分母不能为0的限定,也对卷积参数γ和β构成了取值上的限定。
经过上述第一非线性运算、卷积处理和逐点相乘运算实现了局部注意力机制,局部是指第一非线性运算是逐点进行的,输入的各个特征值只需要自身的特性即可得到输出值,不用考虑周边特征值的影响;注意力机制是指,第一图像特征中的所有特征值,有的是重要的,有的是冗余的,卷积处理的输出可以是图像特征中的各个特征值的权重,可以对原特征值进行修正,把重要的特征值突出,把冗余的特征值抑制;逐点相乘运算利用前述局部信息校正第一图像特征上每个特征值的值,避免了对卷积参数的限定。
在一种可能的实现方式中,对第一图像特征中的每个特征值进行第一非线性运算得到第二图像特征;对第二图像特征进行卷积处理得到第三图像特征,第三图像特征中的多个特征值和第一图像特征中的多个特征值相对应;将第一图像特征和第三图像特征中相对应的多个特征值进行逐点相乘运算得到第四图像特征,第四图像特征中的多个特征值和第一图像特征中的多个特征值相对应;将第一图像特征和第四图像特征中相对应的多个特征值进行逐点相加运算得到经处理的图像特征。
图10b为具备注意力机制的残差非线性变换单元的结构图,如图10b所示,本申请实施例中,非线性变换单元用于实现上述非线性变换处理,其包括第一非线性运算、卷积处理、逐点相乘运算和逐点相加运算。
本申请实施例中,第一非线性运算、卷积处理和逐点相乘运算均可以参照上一种实施 方式中的描述,此处不再赘述。
得到第四图像特征后,再将第一图像特征和第四图像特征中相对应的多个特征值进行逐点相加运算,即为将非线性变换单元的初始输入和逐点相乘运算的输出进行逐点相加,可以表示为如下公式:
sum(i,j)=a(i,j)+c(i,j)
其中,(i,j)表示特征值在其所在图像特征中的索引,a(i,j)表示第一图像特征的特征值,c(i,j)表示第四图像特征的特征值,sum(i,j)表示经处理的图像特征的特征值。
逐点相加运算是一个首位相加的残差结构,可以使使用上述处理过程的编解码网络在训练时更容易收敛。
上述第一非线性运算、卷积处理、逐点相乘运算和逐点相加运算可以表示为如下公式(2):
y = x × (β + ∑ γ × φ(x)) + x   (2)
其中,x表示输入的特征值,y表示输出的特征值,
φ(·)表示第一非线性运算,γ表示卷积层的权重,β表示卷积层的偏置参数。
上述公式(2)经变形可以得到:
y = x × (β + 1 + ∑ γ × φ(x))
与公式(1)对比可以得到卷积处理可以表示为如下公式:
conv2(x)=β+1+∑γ×x
由此可见,卷积处理conv1(x)和卷积处理conv2(x)相似,区别在于卷积处理conv2(x)中的卷积参数多加了1。这样可以通过对卷积处理中的卷积参数β的微调,实现上述两种实施方式之间的转换,即如果非线性变换单元中不包含逐点相加运算,则可以采用卷积处理conv1(x);如果非线性变换单元中包含逐点相加运算,则可以采用卷积处理conv2(x)。
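The equivalence between conv1(x) followed by a point-by-point addition and conv2(x) alone amounts to adding 1 to the convolution bias. A minimal PyTorch sketch of this fusion is shown below; it assumes the convolution has a bias term and uses default stride and padding, which are illustrative assumptions.

```python
import torch
import torch.nn as nn

def fuse_residual_into_bias(conv: nn.Conv2d) -> nn.Conv2d:
    """Fold the point-by-point addition into the convolution bias.

    x * (beta + sum(gamma * phi(x))) + x equals x * (beta + 1 + sum(gamma * phi(x))),
    so adding 1 to the bias removes the separate residual addition.
    """
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      conv.kernel_size, bias=True)
    with torch.no_grad():
        fused.weight.copy_(conv.weight)
        fused.bias.copy_(conv.bias + 1.0)
    return fused
```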
在一种可能的实现方式中,在上述第一种实施方式的基础上,还可以对第三图像特征进行第二非线性运算,再对输出的经处理的第三图像特征和第一图像特征进行逐点相乘运算,或者,在上述第二种实施方式的基础上,还可以对第三图像特征进行第二非线性运算,再对输出的经处理的第三图像特征和第一图像特征进行逐点相乘运算。即在非线性变换处理中增加了第二非线性运算,该第二非线性运算的输入均是非线性变换处理中的卷积处理的输出,该第二非线性运算的输出均是作为逐点相乘运算的一个输入。
图10c为具备注意力机制的残差非线性变换单元的结构图,如图10c所示,本申请实施例中,非线性变换单元用于实现上述非线性变换处理,其包括第一非线性运算、卷积处理、第二非线性运算和逐点相乘运算。
其中,第一非线性运算、卷积处理和逐点相乘运算可以参照图10a所示实施例的描述,此处不再赘述。第二非线性运算可以与第一非线性运算采用相同的运算方法,也可以采用不同的运算方法,其可以包括取绝对值、ReLU、LeakyReLU等,本申请实施例对此不做具体限定。
上述第一非线性运算、卷积处理、第二非线性运算和逐点相乘运算可以表示为如下公式(3):
y = x × ψ(β + ∑ γ × φ(x))   (3)
其中,x表示输入的特征值,y表示输出的特征值,
φ(·)表示第一非线性运算,ψ(·)表示第二非线性运算,γ表示卷积层的权重,β表示卷积层的偏置参数。
图10d为具备注意力机制的非线性变换单元的结构图,如图10d所示,本申请实施例中,非线性变换单元用于实现上述非线性变换处理,其包括第一非线性运算、卷积处理、第二非线性运算、逐点相乘运算和逐点相加运算。
其中,第一非线性运算、卷积处理、逐点相乘运算和逐点相加运算可以参照图10b所示实施例的描述,此处不再赘述。第二非线性运算可以与第一非线性运算采用相同的运算方法,也可以采用不同的运算方法,其可以包括取绝对值、ReLU、LeakyReLU等,本申请实施例对此不做具体限定。
当第二非线性运算使用分段线性映射实现时,分段线性映射可以采用不同的分段数量;在各分段上的映射斜率可以通过训练的方式或者直接指定的方式确定;对输入特征图像的每个通道可以采用不同的分段线性函数,或者对所有通道都是用同一个分段线性函数,或者对若干个通道采用同一个分段线性函数进行处理。在这种实现方式下,模型训练完毕之后,残差结构不再与卷积融合,而是可与分段线性函数融合,即在原来的分段线性函数的输出上+1,组合成一个新的分段线性函数。
上述第一非线性运算、卷积处理、第二非线性运算、逐点相乘运算和逐点相加运算可以表示为如下公式(4):
y = x × ψ(β + ∑ γ × φ(x)) + x   (4)
其中,x表示输入的特征值,y表示输出的特征值,
φ(·)表示第一非线性运算,ψ(·)表示第二非线性运算,γ表示卷积层的权重,β表示卷积层的偏置参数。
步骤903、根据经处理的图像特征进行编码得到码流。
得到经处理的图像特征后,编码侧可以继续对其进行卷积处理,也可以在卷积处理之后再次对卷积处理的输出进行非线性变换处理,然后再对前述处理的结果进行熵编码以得到码流,熵编码可以采用图4~图8所示实施例中的熵编码网络实现,此处不再赘述。也可以采用其他的编码方式对前述处理的结果进行编码以得到码流,本申请对此不做限定。
图13为编码网络的结构图,如图13所示,该编码网络包括4个卷积层(conv)和3个非线性变换单元,卷积层和非线性变换单元交叉排列,即对输入图像依次进行卷积处理、非线性变换处理、卷积处理、非线性变换处理、卷积处理、非线性变换处理以及卷积处理得到输出图像特征。非线性变换单元可以采用图10a或图10b所示实施例的结构。而后输出图像特征再进行熵编码,此处不再赘述。
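A sketch of this interleaved structure (four convolution layers, three nonlinear transform units) in PyTorch is given below; the kernel sizes, strides and channel number are illustrative assumptions, and unit_factory stands for any nonlinear transform unit such as the one in Fig. 10a or Fig. 10b.

```python
import torch.nn as nn

def build_encoder(unit_factory, c_in=3, c_mid=192):
    """Sketch of the encoder in FIG. 13: four convolution layers interleaved
    with three nonlinear transform units; hyperparameters are assumed."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_mid, kernel_size=5, stride=2, padding=2),
        unit_factory(c_mid),
        nn.Conv2d(c_mid, c_mid, kernel_size=5, stride=2, padding=2),
        unit_factory(c_mid),
        nn.Conv2d(c_mid, c_mid, kernel_size=5, stride=2, padding=2),
        unit_factory(c_mid),
        nn.Conv2d(c_mid, c_mid, kernel_size=5, stride=2, padding=2),
    )

# Example: pass any nonlinear transform unit, e.g. the ResAU-style sketch above.
# encoder = build_encoder(lambda c: ResAUSketch(c))
```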
本申请实施例,通过对解码网络中的非线性变换处理进行改变,可以使输入的各个特征值只需要自身的特性即可得到输出值,不用考虑周边特征值的影响,并对原特征值进行修正,把重要的特征值突出,把冗余的特征值抑制,此外还可以校正第一图像特征上每个特征值的值,避免了对卷积参数的限定,从而实现编码网络中的高效的非线性变换处理,进一步提升图像/视频压缩算法中的率失真性能。
图14为本申请实施例的图像解码方法的过程1300的流程图。过程1300可由上述实施例中的解码端执行。过程1300描述为一系列的步骤或操作,应当理解的是,过程1300可以以各种顺序执行和/或同时发生,不限于图14所示的执行顺序。过程1300包括如下步骤:
步骤1301、获取待处理的第一图像特征。
解码端与编码端相对应,第一图像特征可以是解码端对码流进行熵解码后,熵解码可 以采用图4~图8所示实施例中的熵解码网络实现,此处不再赘述。之后再经过卷积处理、反卷积处理、转置卷积处理、插值+卷积处理、Transformer处理等方式得到的。应理解,经过前述处理,输出的第一图像特征的尺寸将复原(和编码端是镜像对称的),也可能使输入的图像特征发生尺寸变化、通道数量改变等,对此不做具体限定。前述处理与图9所示实施例中的步骤901中的转换方式相反。
同样的,第一图像特征也可以表示成二维矩阵或三维矩阵的形式,原理参照步骤901的描述,此处不再赘述。
步骤1302、对第一图像特征进行非线性变换处理得到经处理的图像特征。
在一种可能的实现方式中,对第一图像特征中的每个特征值进行第一非线性运算得到第二图像特征;对第二图像特征进行卷积处理得到第三图像特征,第三图像特征中的多个特征值和第一图像特征中的多个特征值相对应;将第一图像特征和第三图像特征中相对应的多个特征值进行逐点相乘运算得到经处理的图像特征。
本实施例可以参照图10a所示实施例,此处不再赘述。
在一种可能的实现方式中,对第一图像特征中的每个特征值进行第一非线性运算得到第二图像特征;对第二图像特征进行卷积处理得到第三图像特征,第三图像特征中的多个特征值和第一图像特征中的多个特征值相对应;将第一图像特征和第三图像特征中相对应的多个特征值进行逐点相乘运算得到第四图像特征,第四图像特征中的多个特征值和第一图像特征中的多个特征值相对应;将第一图像特征和第四图像特征中相对应的多个特征值进行逐点相加运算得到经处理的图像特征。
本实施例可以参照图10b所示实施例,此处不再赘述。
步骤1303、根据经处理的图像特征获取重建图像。
得到经处理的图像特征后,解码侧也可以继续对其进行卷积处理,也可以在卷积处理之后再次对卷积处理的输出进行非线性变换处理,从而将其从特征域转换到图像域,得到重建图像。
图15为解码网络的结构图,如图15所示,该解码网络包括4个反卷积层(Deconv)和3个非线性变换单元,反卷积层和非线性变换单元交叉排列,即对输入图像依次进行反卷积处理、非线性变换处理、反卷积处理、非线性变换处理、反卷积处理、非线性变换处理以及反卷积处理得到重建图像。非线性变换单元可以采用图10a或图10b所示实施例的结构。而后输出图像特征在进行熵解码,此处不再赘述。
本申请实施例,通过对解码网络中的非线性变换处理进行改变,可以使输入的各个特征值只需要自身的特性即可得到输出值,不用考虑周边特征值的影响,并对原特征值进行修正,把重要的特征值突出,把冗余的特征值抑制,此外还可以校正第一图像特征上每个特征值的值,避免了对卷积参数的限定,从而实现编码网络中的高效的非线性变换处理,进一步提升图像/视频压缩算法中的率失真性能。
需要说明的是,本申请实施例还提供了一种编/解码网络的训练方式,可以包括:首先构建端到端的编/解码网路,包含编码器Encoder、解码器Decoder、熵估计单元。训练时将编解码网路当作一个整体,训练数据(图像或视频)输入到编码器得到特征数据;特征数据一方面经过熵估计单元计算编码码率开销,得到码率loss;特征数据另一方面经过解码端输出重建数据,重建数据和输入数据计算失真程度,得到失真loss。反向传播算法通 过码率loss和失真loss组成的加权loss去更新模型中的可学习参数。训练完毕之后,模型中的所有子模块的参数被固定。
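The weighted rate-distortion training described here can be summarized as a single training step. In the sketch below, entropy_model, the λ weighting and the MSE distortion are illustrative assumptions; quantization and the detailed entropy-estimation structure are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def train_step(encoder, decoder, entropy_model, optimizer, x, lam=0.01):
    """One end-to-end training step with a weighted rate + distortion loss."""
    y = encoder(x)                                         # image -> feature
    p = entropy_model(y)                                   # per-element probability estimates
    rate_loss = (-torch.log2(p.clamp_min(1e-9))).mean()    # estimated coding cost
    x_hat = decoder(y)                                     # feature -> reconstruction
    distortion_loss = F.mse_loss(x_hat, x)                 # distortion between input and output
    loss = lam * rate_loss + distortion_loss               # weighted loss for backpropagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```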
此后,编码器Encoder和熵估计单元被划分到编码端,用于将待编码数据编码成码流文件;解码器和熵估计单元被划分到解码端,用于从码流文件中重建数据。
以下通过几个具体的实施例对上述图像编码方法进行说明,下文中将本申请实施例提供的非线性变换单元称作ResAU。
实施例一
图16a为ResAU的一个示例性的结构图,如图16a所示,ResAU采用图10b所示的结构,包括第一非线性运算、卷积处理、逐点相乘运算和逐点相加运算,其中第一非线性运算采用取绝对值(abs)。该ResAU可以应用于图13所示的编码网络或者图15所示的解码网络。
在公式(2)的基础上,本实施例的ResAU可以表示为如下公式:
y_i = x_i × (β_i + ∑_j γ_ij × |x_j|) + x_i
对图16a所示的ResAU进行压缩性能测试。
测试集:Kodak测试集,包含24张分辨率为768×512或512×768的便携式网络图形(portable network graphics,PNG)图像。
实验:在混合高斯超先验熵估计的编/解码网络结构中使用图16a所示的ResAU。
实验对应的性能表现:图17a为ResAU在Kodak测试集的24张图像上的整体表现。与当前主流使用的GDN非线性单元相比,在相同的计算复杂度基础上,使用ResAU的GMM网络RD性能更优。更具体地,在达到相同的解码重建质量下,在GMM编/解码网络下使用ResAU可比使用ReLU节省约12%的编码码率开销,可比使用GDN节省约5%的编码码率开销。
性能分析:在本申请实施例提供的ResAU中,通过取绝对值引入非线性特性是一种可行的方案。在基础的GMM结构上,使用ResAU能比使用GDN取得更好的率失真性能。
实施例二
图16b为ResAU的一个示例性的结构图,如图16b所示,ResAU采用图10b所示的结构,包括第一非线性运算、卷积处理、逐点相乘运算和逐点相加运算,其中第一非线性运算采用ReLU运算。该ResAU可以应用于图13所示的编码网络或者图15所示的解码网络。
在公式(2)的基础上,本实施例的ResAU可以表示为如下公式:
y_i = x_i × (β_i + ∑_j γ_ij × ReLU(x_j)) + x_i
对图16b所示的ResAU进行压缩性能测试。
测试集:24张Kodak测试图像。
实验:在混合高斯超先验熵估计的编/解码网络结构中使用图16b所示的ResAU。在本次实验中,ResAU中的非线性运算分别替换成了ReLU、LeakyReLU,对比实验包括无非线性运算的Identity方案以及实施例一所描述的取绝对值的ResAU方案。
实验对应的性能表现:图17b为ResAU在Kodak测试集的24张图像上的整体表现。从RD曲线可知,相比于不使用非线性运算的Identity方案,使用逐点非线性运算可以较大幅度提升图像压缩网络的率失真性能。而ReLU类的非线性运算效果会略优于取绝对值的非线性运算。
实施例三
图16c为ResAU的一个示例性的结构图,如图16c所示,ResAU采用图10d所示的结构,包括第一非线性运算、卷积处理、第二非线性运算、逐点相乘运算和逐点相加运算,其中第一非线性运算和第二非线性运算均采用PWL。该ResAU可以应用于图13所示的编码网络或者图15所示的解码网络。
ResAU通过分段线性函数PWL引入非线性特性;通过卷积、分段线性函数PWL和逐点相乘运算来实现局部注意力机制,利用局部信息校正特征图上各个位置各个通道上的响应;逐点相乘操作可以避免GDN中存在的可学习参数取值空间受限的问题;另外,首尾相加的残差形式的连接可使网络在训练时更容易收敛。
在公式(4)的基础上,本实施例的ResAU可以表示为如下公式:
y_i = x_i × ψ(β_i + ∑_j γ_ij × φ(x_j)) + x_i
其中,
φ(·)和ψ(·)
均为PWL函数,γ为卷积层的权重,β为卷积层的偏置参数,两者都是取值范围不受限制的可学习参数。
该结构包含的各个运算的功能如下:
第一个PWL运算:该函数为分段线性函数,可以为整体变换提供非线性特性。在该运算的作用下,输入数据的每个值都会依据其所在数值区间采用不用的映射关系计算得到输出值。对于一个输入特征,其种不同通道维度的特征图可以采用相同或者不同的分段线性映射函数进行运算。其参数可以为预设值,也可以通过训练学习得到。
卷积处理:在ResAU中,卷积处理的输入为非线性运算的输出。输入张量通过卷积处理得到尺寸不变的张量,输出张量可以认为是输入张量的局部响应。
第二个PWL运算:该函数为分段线性函数,可以对卷积的输出进行缩放映射,也为整体变换提供非线性特性。在该运算的作用下,输入数据的每个值都会依据其所在数值区间采用不用的映射关系计算得到输出值。对于一个输入特征,其种不同通道维度的特征图可以采用相同或者不同的分段线性映射函数进行运算。其参数可以为预设值,也可以通过训练学习得到。
逐点相乘运算:在ResAU中,逐点相乘运算的输入为该单元的原始输入和卷积处理的输出。其两项输入张量的尺寸相同,执行对应位置数据的相乘运算,输出张量的尺寸和输入张量的尺寸也相同。卷积处理和逐点相乘运算结合实现了局部注意力机制,利用输入特征的局部信息校正特征图上每个位置的响应。同时,逐点相乘运算避免了当前用于图像视频压缩网络中的主流非线性单元GDN存在的可学习参数取值空间受限的问题。
逐点相加运算:在ResAU中,逐点相加运算的输入为该单元的原始输入和逐点相乘运算的输出。该运算是一个首位相加的残差结构,可以使使用该非线性单元的编解码网络在训练时更容易收敛。
本实施例所描述的使用分段线性函数PWL的残差注意力非线性单元可用于基于深度 神经网络的端到端图像压缩网络和视频压缩网络之中。更具体地,一般用于端到端图像压缩网络中的编码模块(Encoder)和解码模块(Decoder)中,用于端到端视频压缩网络预测子网络和残差压缩子网络的编码模块和解码模块中。
在一个常用的基于超先验结构的端到端编解码网络上进行了实验:当本实施例中的第一个PWL运算预设值为Leaky ReLU函数、且第二个PWL运算分段数为6、通道维度分组粒度为1(即所有通道的特征使用不同的分段线性函数计算输出值)时,能够比采用卷积运算之后使用Tanh实现非线性运算的方案提升0.506%的压缩率失真性能;当本实施例中的第一个PWL运算预设值为Leaky ReLU函数、且第二个PWL运算分段数为6、通道维度分组粒度为8(即以每8个通道为单位对特征进行分组,同一组内的所有通道使用相同的分段线性函数计算输出值)时,比采用卷积运算之后使用Tanh实现非线性运算的方案降低0.488%的压缩率失真性能;当本实施例中的第一个PWL运算预设值为Leaky ReLU函数、且第二个PWL运算分段数为6、通道维度不分组(即所有通道的特征使用相同的分段线性函数计算输出值)时,比采用卷积运算之后使用Tanh实现非线性运算的方案降低0.659%的压缩率失真性能。
可选的,图16d为ResAU的一个示例性的结构图,如图16d所示,ResAU采用图10d所示的结构,包括第一非线性运算、卷积处理、第二非线性运算、逐点相乘运算和逐点相加运算,其中第一非线性运算采用LeakyReLU,第二非线性运算采用tanh。卷积处理可以是conv1×1,该conv1×1表示卷积核(或卷积算子)的尺寸为1×1。该ResAU可以应用于图13所示的编码网络或者图15所示的解码网络。
实施例四
图16e为ResAU的一个示例性的结构图,如图16e所示,ResAU采用图10c所示的结构,包括第一非线性运算、卷积处理、第二非线性运算和逐点相乘运算,其中第一非线性运算和第二非线性运算均采用PWL。该ResAU可以应用于图13所示的编码网络或者图15所示的解码网络。
在公式(3)的基础上,本实施例的ResAU可以表示为如下公式:
y_i = x_i × ψ(β_i + ∑_j γ_ij × φ(x_j))
其中,
φ(·)和ψ(·)
均为PWL函数,γ为卷积层的权重,β为卷积层的偏置参数,两者都是取值范围不受限制的可学习参数。
该结构包含的各个运算的功能如下:
第一个PWL运算:该函数为分段线性函数,可以为整体变换提供非线性特性。在该运算的作用下,输入数据的每个值都会依据其所在数值区间采用不用的映射关系计算得到输出值。对于一个输入特征,其种不同通道维度的特征图可以采用相同或者不同的分段线性映射函数进行运算。其参数可以为预设值,也可以通过训练学习得到。
卷积处理:在ResAU中,卷积处理的输入为非线性运算的输出。输入张量通过卷积处理得到尺寸不变的张量,输出张量可以认为是输入张量的局部响应。
第二个PWL运算:该函数为分段线性函数,可以对卷积的输出进行缩放映射,也为整体变换提供非线性特性。在该运算的作用下,输入数据的每个值都会依据其所在数值区间采用不用的映射关系计算得到输出值。对于一个输入特征,其种不同通道维度的特征图 可以采用相同或者不同的分段线性映射函数进行运算。其参数可以为预设值,也可以通过训练学习得到。
逐点相乘运算:在ResAU中,逐点相乘运算的输入为该单元的原始输入和卷积处理的输出。其两项输入张量的尺寸相同,执行对应位置数据的相乘运算,输出张量的尺寸和输入张量的尺寸也相同。卷积处理和逐点相乘运算结合实现了局部注意力机制,利用输入特征的局部信息校正特征图上每个位置的响应。同时,逐点相乘运算避免了当前用于图像视频压缩网络中的主流非线性单元GDN存在的可学习参数取值空间受限的问题。
本实施例所描述的使用分段线性函数的无残差结构的注意力非线性单元可用于基于深度神经网络的端到端图像压缩网络和视频压缩网络之中。更具体地,一般用于端到端图像压缩网络中的编码模块Encoder和解码模块Decoder中,用于端到端视频压缩网络预测子网络和残差压缩子网络的编码模块和解码模块中。并且,本实施例所描述的使用分段线性函数的无残差结构的注意力非线性单元可以通过使用分段线性函数的残差注意力非线性单元转换得到。在训练完成之后,可以将使用分段线性函数的残差注意力非线性单元中的第二个PWL运算和逐点相加运算合并(即将第二个PWL运算的输出整体加1,形成新的PWL函数,并去除逐点相加运算),得到一个对应的使用分段线性函数的无残差结构的注意力非线性单元。
在一个常用的基于超先验结构的端到端编解码网络上进行了实验,通过将实施例四对应的实验中的残差注意力非线性单元在训练完毕后转换为本实施例所描述的无残差结构的非线性单元,可以获得和实施例四对应方案相同的编解码效果。
当本实施例中的第一个PWL运算预设值为Leaky ReLU函数、且第二个PWL运算分段数为6、通道维度分组粒度为1(即所有通道的特征使用不同的分段线性函数计算输出值)时,能够比采用卷积运算之后使用Tanh实现非线性运算的方案提升0.506%的压缩率失真性能。并且,本实施例方案能够节省在逐点相加运算上的计算耗时和功耗。
当本实施例中的第一个PWL运算预设值为Leaky ReLU函数、且第二个PWL运算分段数为6、通道维度分组粒度为8(即以每8个通道为单位对特征进行分组,同一组内的所有通道使用相同的分段线性函数计算输出值)时,比采用卷积运算之后使用Tanh实现非线性运算的方案降低0.488%的压缩率失真性能。并且,本实施例方案能够节省在逐点相加运算上的计算耗时和功耗。
当本实施例中的第一个PWL运算预设值为Leaky ReLU函数、且第二个PWL运算分段数为6、通道维度不分组(即所有通道的特征使用相同的分段线性函数计算输出值)时,比采用卷积运算之后使用Tanh实现非线性运算的方案降低0.659%的压缩率失真性能。并且,本实施例方案能够节省在逐点相加运算上的计算耗时和功耗。
图18为本申请实施例编码装置1700的一个示例性的结构示意图,如图18所示,本实施例的装置1700可以应用于编码端。该装置1700可以包括:获取模块1701、变换模块1702、编码模块1703和训练模块1704。其中,
获取模块1701,用于获取待处理的第一图像特征;变换模块1702,用于对所述第一图像特征进行非线性变换处理得到经处理的图像特征,所述非线性变换处理依次包括第一非线性运算、卷积处理和逐点相乘运算;编码模块1703,用于根据所述经处理的图像特征进行编码得到码流。
在一种可能的实现方式中,所述变换模块1702,具体用于对所述第一图像特征中的每个特征值进行所述第一非线性运算得到第二图像特征;对所述第二图像特征进行所述卷积处理得到第三图像特征,所述第三图像特征中的多个特征值和所述第一图像特征中的多个特征值相对应;将所述第一图像特征和所述第三图像特征中相对应的多个特征值进行所述逐点相乘运算得到所述经处理的图像特征。
在一种可能的实现方式中,所述非线性变换处理还包括在所述逐点相乘运算之后的逐点相加运算。
在一种可能的实现方式中,所述变换模块1702,具体用于对所述第一图像特征中的每个特征值进行所述第一非线性运算得到第二图像特征;对所述第二图像特征进行所述卷积处理得到第三图像特征,所述第三图像特征中的多个特征值和所述第一图像特征中的多个特征值相对应;将所述第一图像特征和所述第三图像特征中相对应的多个特征值进行所述逐点相乘运算得到第四图像特征,所述第四图像特征中的多个特征值和所述第一图像特征中的多个特征值相对应;将所述第一图像特征和所述第四图像特征中相对应的多个特征值进行所述逐点相加运算得到所述经处理的图像特征。
在一种可能的实现方式中,所述非线性变换处理还包括在所述卷积处理和所述逐点相乘运算之间的第二非线性运算,所述第二非线性运算和所述第一非线性运算相同或不同。
在一种可能的实现方式中,所述非线性运算包括分段线性映射,例如ReLU,LeakyReLU,PWL,Abs等。在另一种可能的实现方式中,所述非线性运算包括连续函数,例如Tanh,Sigmoid等。在另一种可能的实现方式中,所述非线性运算包括分段的非线性运算。
在一种可能的实现方式中,还包括:训练模块1704,用于构建训练阶段的非线性变换单元,所述训练阶段的非线性变换单元包括第一非线性运算层、卷积处理层、逐点相乘运算层和逐点相加运算层;根据预先获取的训练数据训练得到经训练的非线性变换单元,所述经训练的非线性变换单元用于实现所述非线性变换处理。
本实施例的装置,可以用于执行图9所示方法实施例的技术方案,其实现原理和技术效果类似,此处不再赘述。
图19为本申请实施例解码装置1800的一个示例性的结构示意图,如图19所示,本实施例的装置1800可以应用于解码端。该装置1800可以包括:获取模块1801、变换模块1802、重建模块1803和训练模块1804。其中,
获取模块1801,用于获取待处理的第一图像特征;变换模块1802,用于对所述第一图像特征进行非线性变换处理得到经处理的图像特征,所述非线性变换处理依次包括第一非线性运算、卷积处理和逐点相乘运算;重建模块1803,用于根据所述经处理的图像特征获取重建图像。
在一种可能的实现方式中,所述变换模块1802,具体用于对所述第一图像特征中的每个特征值进行所述第一非线性运算得到第二图像特征;对所述第二图像特征进行卷积处理得到第三图像特征,所述第三图像特征中的多个特征值和所述第一图像特征中的多个特征值相对应;将所述第一图像特征和所述第三图像特征中相对应的多个特征值进行所述逐点相乘运算得到所述经处理的图像特征。
在一种可能的实现方式中,所述非线性变换处理还包括在所述逐点相乘运算之后的逐 点相加运算。
在一种可能的实现方式中,所述变换模块1802,具体用于对所述第一图像特征中的每个特征值进行所述第一非线性运算得到第二图像特征;对所述第二图像特征进行卷积处理得到第三图像特征,所述第三图像特征中的多个特征值和所述第一图像特征中的多个特征值相对应;将所述第一图像特征和所述第三图像特征中相对应的多个特征值进行所述逐点相乘运算得到第四图像特征,所述第四图像特征中的多个特征值和所述第一图像特征中的多个特征值相对应;将所述第一图像特征和所述第四图像特征中相对应的多个特征值进行所述逐点相加运算得到所述经处理的图像特征。
在一种可能的实现方式中,所述非线性变换处理还包括在所述卷积处理和所述逐点相乘运算之间的第二非线性运算,所述第二非线性运算和所述第一非线性运算相同或不同。
在一种可能的实现方式中,所述非线性运算包括分段线性映射,例如ReLU,LeakyReLU,PWL,Abs等。在另一种可能的实现方式中,所述非线性运算包括连续函数,例如Tanh,Sigmoid等。在另一种可能的实现方式中,所述非线性运算包括分段的非线性运算。
在一种可能的实现方式中,训练模块1804,用于构建训练阶段的非线性变换单元,所述训练阶段的非线性变换单元包括第一非线性运算层、卷积处理层、逐点相乘运算层和逐点相加运算层;根据预先获取的训练数据训练得到经训练的非线性变换单元,所述经训练的非线性变换单元用于实现所述非线性变换处理。
本实施例的装置,可以用于执行图14所示方法实施例的技术方案,其实现原理和技术效果类似,此处不再赘述。
本申请实施例提供一种码流,该码流由处理器执行上述实施例中任一种编码方法所生成。
本申请实施例提供一种存储码流的装置,该装置包括:接收器和至少一个存储介质,接收器用于接收码流;至少一个存储介质用于存储码流;码流为上述实施例中任一种编码方法所生成的码流。
本申请实施例提供一种传输码流的装置,该装置包括:发送器和至少一个存储介质,至少一个存储介质用于存储码流,码流由处理器执行上述实施例中任一种编码方法所生成;发送器用于将码流发送给其他电子设备。可选的,该传输码流的装置还包括接收器和处理器,接收器用于接收用户请求,处理器用于响应于用户请求,以从存储介质中选择目标码流,并指示发送器发送目标码流。
本申请实施例提供一种分发码流的系统,该系统包括:至少一个存储介质和流媒体设备,至少一个存储介质用于存储至少一个码流,至少一个码流包括根据第一方面的任意一种实现方式生成的码流;流媒体设备用于从至少一个存储介质中获取目标码流,并将目标码流发送给端侧设备,其中,流媒体设备包括内容服务器或内容分发服务器。
本申请实施例提供一种分发码流的系统,该系统包括:通信接口,用于接收用户获取目标码流的请求;处理器,用于响应于用户请求,确定目标码流的存储位置;通信接口还用于将目标码流的存储位置发送给用户,以使得用户从目标码流的存储位置获取目标码流,其中,目标码流由处理器执行上述实施例中任一种编码方法所生成。
在实现过程中,上述方法实施例的各步骤可以通过处理器中的硬件的集成逻辑电路或 者软件形式的指令完成。处理器可以是通用处理器、数字信号处理器(digital signal processor,DSP)、特定应用集成电路(application-specific integrated circuit,ASIC)、现场可编程门阵列(field programmable gate array,FPGA)或其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。本申请实施例公开的方法的步骤可以直接体现为硬件编码处理器执行完成,或者用编码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器,处理器读取存储器中的信息,结合其硬件完成上述方法的步骤。
上述各实施例中提及的存储器可以是易失性存储器或非易失性存储器,或可包括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(read-only memory,ROM)、可编程只读存储器(programmable ROM,PROM)、可擦除可编程只读存储器(erasable PROM,EPROM)、电可擦除可编程只读存储器(electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(random access memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(static RAM,SRAM)、动态随机存取存储器(dynamic RAM,DRAM)、同步动态随机存取存储器(synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(double data rate SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(synchlink DRAM,SLDRAM)和直接内存总线随机存取存储器(direct rambus RAM,DR RAM)。应注意,本文描述的系统和方法的存储器旨在包括但不限于这些和任意其它适合类型的存储器。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请实施例的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请实施例所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请实施例各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请实施例的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(个人计算机,服务器,或者网络设备等)执行本申请实施例各个实施例所述方法的全部或部分步骤。或者该计算机软件产品可以从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘Solid State Disk(SSD)、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM))等。
以上所述,仅为本申请实施例的具体实施方式,但本申请实施例的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请实施例揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请实施例的保护范围之内。因此,本申请实施例的保护范围应以所述权利要求的保护范围为准。

Claims (36)

  1. 一种图像编码方法,其特征在于,包括:
    获取待处理的第一图像特征;
    对所述第一图像特征进行非线性变换处理得到经处理的图像特征,所述非线性变换处理依次包括第一非线性运算、卷积处理和逐点相乘运算;
    根据所述经处理的图像特征进行编码得到码流。
  2. 根据权利要求1所述的方法,其特征在于,所述对所述第一图像特征进行非线性变换处理得到经处理的图像特征,包括:
    对所述第一图像特征中的每个特征值进行所述第一非线性运算得到第二图像特征;
    对所述第二图像特征进行所述卷积处理得到第三图像特征,所述第三图像特征中的多个特征值和所述第一图像特征中的多个特征值相对应;
    将所述第一图像特征和所述第三图像特征中相对应的多个特征值进行所述逐点相乘运算得到所述经处理的图像特征。
  3. 根据权利要求1所述的方法,其特征在于,所述非线性变换处理还包括在所述逐点相乘运算之后的逐点相加运算。
  4. 根据权利要求3所述的方法,其特征在于,所述对所述第一图像特征进行非线性变换处理得到经处理的图像特征,包括:
    对所述第一图像特征中的每个特征值进行所述第一非线性运算得到第二图像特征;
    对所述第二图像特征进行所述卷积处理得到第三图像特征,所述第三图像特征中的多个特征值和所述第一图像特征中的多个特征值相对应;
    将所述第一图像特征和所述第三图像特征中相对应的多个特征值进行所述逐点相乘运算得到第四图像特征,所述第四图像特征中的多个特征值和所述第一图像特征中的多个特征值相对应;
    将所述第一图像特征和所述第四图像特征中相对应的多个特征值进行所述逐点相加运算得到所述经处理的图像特征。
  5. 根据权利要求1-4中任一项所述的方法,其特征在于,所述非线性变换处理还包括在所述卷积处理和所述逐点相乘运算之间的第二非线性运算,所述第二非线性运算和所述第一非线性运算相同或不同。
  6. 根据权利要求1-5中任一项所述的方法,其特征在于,所述第一非线性运算包括Sigmoid、Tanh或者分段线性映射。
  7. 根据权利要求5或6所述的方法,其特征在于,所述第二非线性运算包括Sigmoid、Tanh或者分段线性映射。
  8. 根据权利要求1-7中任一项所述的方法,其特征在于,所述方法还包括:
    构建训练阶段的非线性变换单元,所述训练阶段的非线性变换单元包括第一非线性运算层、卷积处理层、逐点相乘运算层和逐点相加运算层;
    根据预先获取的训练数据训练得到经训练的非线性变换单元,所述经训练的非线性变换单元用于实现所述非线性变换处理。
  9. An image decoding method, comprising:
    obtaining a first image feature to be processed;
    performing nonlinear transform processing on the first image feature to obtain a processed image feature, wherein the nonlinear transform processing sequentially comprises a first nonlinear operation, convolution processing, and a point-wise multiplication operation; and
    obtaining a reconstructed image according to the processed image feature.
  10. The method according to claim 9, wherein the performing nonlinear transform processing on the first image feature to obtain a processed image feature comprises:
    performing the first nonlinear operation on each feature value in the first image feature to obtain a second image feature;
    performing the convolution processing on the second image feature to obtain a third image feature, wherein a plurality of feature values in the third image feature correspond to a plurality of feature values in the first image feature; and
    performing the point-wise multiplication operation on the corresponding feature values in the first image feature and the third image feature to obtain the processed image feature.
  11. The method according to claim 9, wherein the nonlinear transform processing further comprises a point-wise addition operation after the point-wise multiplication operation.
  12. The method according to claim 11, wherein the performing nonlinear transform processing on the first image feature to obtain a processed image feature comprises:
    performing the first nonlinear operation on each feature value in the first image feature to obtain a second image feature;
    performing the convolution processing on the second image feature to obtain a third image feature, wherein a plurality of feature values in the third image feature correspond to a plurality of feature values in the first image feature;
    performing the point-wise multiplication operation on the corresponding feature values in the first image feature and the third image feature to obtain a fourth image feature, wherein a plurality of feature values in the fourth image feature correspond to a plurality of feature values in the first image feature; and
    performing the point-wise addition operation on the corresponding feature values in the first image feature and the fourth image feature to obtain the processed image feature.
  13. The method according to any one of claims 9 to 12, wherein the nonlinear transform processing further comprises a second nonlinear operation between the convolution processing and the point-wise multiplication operation, and the second nonlinear operation is the same as or different from the first nonlinear operation.
  14. The method according to any one of claims 9 to 13, wherein the first nonlinear operation comprises Sigmoid, Tanh, or piecewise linear mapping.
  15. The method according to claim 13 or 14, wherein the second nonlinear operation comprises Sigmoid, Tanh, or piecewise linear mapping.
  16. The method according to any one of claims 9 to 15, further comprising:
    constructing a nonlinear transform unit for a training phase, wherein the nonlinear transform unit for the training phase comprises a first nonlinear operation layer, a convolution processing layer, a point-wise multiplication operation layer, and a point-wise addition operation layer; and
    training the nonlinear transform unit based on pre-obtained training data to obtain a trained nonlinear transform unit, wherein the trained nonlinear transform unit is configured to implement the nonlinear transform processing.
  17. An encoding apparatus, comprising:
    an obtaining module, configured to obtain a first image feature to be processed;
    a transform module, configured to perform nonlinear transform processing on the first image feature to obtain a processed image feature, wherein the nonlinear transform processing sequentially comprises a first nonlinear operation, convolution processing, and a point-wise multiplication operation; and
    an encoding module, configured to perform encoding according to the processed image feature to obtain a bitstream.
  18. The apparatus according to claim 17, wherein the transform module is specifically configured to: perform the first nonlinear operation on each feature value in the first image feature to obtain a second image feature; perform the convolution processing on the second image feature to obtain a third image feature, wherein a plurality of feature values in the third image feature correspond to a plurality of feature values in the first image feature; and perform the point-wise multiplication operation on the corresponding feature values in the first image feature and the third image feature to obtain the processed image feature.
  19. The apparatus according to claim 17, wherein the nonlinear transform processing further comprises a point-wise addition operation after the point-wise multiplication operation.
  20. The apparatus according to claim 19, wherein the transform module is specifically configured to: perform the first nonlinear operation on each feature value in the first image feature to obtain a second image feature; perform the convolution processing on the second image feature to obtain a third image feature, wherein a plurality of feature values in the third image feature correspond to a plurality of feature values in the first image feature; perform the point-wise multiplication operation on the corresponding feature values in the first image feature and the third image feature to obtain a fourth image feature, wherein a plurality of feature values in the fourth image feature correspond to a plurality of feature values in the first image feature; and perform the point-wise addition operation on the corresponding feature values in the first image feature and the fourth image feature to obtain the processed image feature.
  21. The apparatus according to any one of claims 17 to 20, wherein the nonlinear transform processing further comprises a second nonlinear operation between the convolution processing and the point-wise multiplication operation, and the second nonlinear operation is the same as or different from the first nonlinear operation.
  22. The apparatus according to any one of claims 17 to 21, wherein the first nonlinear operation comprises Sigmoid, Tanh, or piecewise linear mapping.
  23. The apparatus according to claim 21 or 22, wherein the second nonlinear operation comprises Sigmoid, Tanh, or piecewise linear mapping.
  24. The apparatus according to any one of claims 17 to 23, further comprising:
    a training module, configured to: construct a nonlinear transform unit for a training phase, wherein the nonlinear transform unit for the training phase comprises a first nonlinear operation layer, a convolution processing layer, a point-wise multiplication operation layer, and a point-wise addition operation layer; and train the nonlinear transform unit based on pre-obtained training data to obtain a trained nonlinear transform unit, wherein the trained nonlinear transform unit is configured to implement the nonlinear transform processing.
  25. A decoding apparatus, comprising:
    an obtaining module, configured to obtain a first image feature to be processed;
    a transform module, configured to perform nonlinear transform processing on the first image feature to obtain a processed image feature, wherein the nonlinear transform processing comprises a first nonlinear operation, convolution processing, and a point-wise multiplication operation; and
    a reconstruction module, configured to obtain a reconstructed image according to the processed image feature.
  26. The apparatus according to claim 25, wherein the transform module is specifically configured to: perform the first nonlinear operation on each feature value in the first image feature to obtain a second image feature; perform the convolution processing on the second image feature to obtain a third image feature, wherein a plurality of feature values in the third image feature correspond to a plurality of feature values in the first image feature; and perform the point-wise multiplication operation on the corresponding feature values in the first image feature and the third image feature to obtain the processed image feature.
  27. The apparatus according to claim 25, wherein the nonlinear transform processing further comprises a point-wise addition operation after the point-wise multiplication operation.
  28. The apparatus according to claim 27, wherein the transform module is specifically configured to: perform the first nonlinear operation on each feature value in the first image feature to obtain a second image feature; perform the convolution processing on the second image feature to obtain a third image feature, wherein a plurality of feature values in the third image feature correspond to a plurality of feature values in the first image feature; perform the point-wise multiplication operation on the corresponding feature values in the first image feature and the third image feature to obtain a fourth image feature, wherein a plurality of feature values in the fourth image feature correspond to a plurality of feature values in the first image feature; and perform the point-wise addition operation on the corresponding feature values in the first image feature and the fourth image feature to obtain the processed image feature.
  29. The apparatus according to any one of claims 25 to 28, wherein the nonlinear transform processing further comprises a second nonlinear operation between the convolution processing and the point-wise multiplication operation, and the second nonlinear operation is the same as or different from the first nonlinear operation.
  30. The apparatus according to any one of claims 25 to 29, wherein the first nonlinear operation comprises Sigmoid, Tanh, or piecewise linear mapping.
  31. The apparatus according to claim 29 or 30, wherein the second nonlinear operation comprises Sigmoid, Tanh, or piecewise linear mapping.
  32. The apparatus according to any one of claims 25 to 31, further comprising:
    a training module, configured to: construct a nonlinear transform unit for a training phase, wherein the nonlinear transform unit for the training phase comprises a first nonlinear operation layer, a convolution processing layer, a point-wise multiplication operation layer, and a point-wise addition operation layer; and train the nonlinear transform unit based on pre-obtained training data to obtain a trained nonlinear transform unit, wherein the trained nonlinear transform unit is configured to implement the nonlinear transform processing.
  33. An encoder, comprising:
    one or more processors; and
    a non-transitory computer-readable storage medium coupled to the processors and storing a program for execution by the processors, wherein the program, when executed by the processors, causes the encoder to perform the method according to any one of claims 1 to 8.
  34. A decoder, comprising:
    one or more processors; and
    a non-transitory computer-readable storage medium coupled to the processors and storing a program for execution by the processors, wherein the program, when executed by the processors, causes the decoder to perform the method according to any one of claims 9 to 16.
  35. A computer program product, comprising program code, wherein when the program code is executed on a computer or a processor, the program code is used to perform the method according to any one of claims 1 to 16.
  36. A computer-readable storage medium, comprising instructions that, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 16.
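Purely as an editorial illustration of the processing order recited in claims 1 to 8 (and mirrored in claims 9 to 16), the following is a minimal PyTorch-style sketch of a nonlinear transform unit: a first nonlinear operation applied to each feature value, convolution processing, an optional second nonlinear operation, a point-wise multiplication with the input feature, and an optional point-wise addition of the input feature. The class name NonlinearTransformUnit, the channel count, the kernel size, the choice of Tanh and Sigmoid, and the toy training step are assumptions made only for this sketch; they are not specified by the claims and do not represent the patented implementation.

```python
# A minimal sketch (not the patented implementation) of a nonlinear transform unit
# following claims 1-5: first nonlinear operation, convolution processing, optional
# second nonlinear operation, point-wise multiplication with the input feature, and
# optional point-wise addition of the input feature. All hyperparameters are assumed.
import torch
import torch.nn as nn


class NonlinearTransformUnit(nn.Module):
    def __init__(self, channels: int = 192,
                 use_second_nonlinear: bool = True,
                 use_pointwise_add: bool = True):
        super().__init__()
        # First nonlinear operation layer (Sigmoid, Tanh, or a piecewise linear map per claim 6).
        self.first_nonlinear = nn.Tanh()
        # Convolution processing layer; padding keeps the spatial size of the input.
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Optional second nonlinear operation between the convolution and the multiplication (claim 5).
        self.second_nonlinear = nn.Sigmoid() if use_second_nonlinear else nn.Identity()
        self.use_pointwise_add = use_pointwise_add

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f2 = self.first_nonlinear(x)                # first -> second image feature (element-wise)
        f3 = self.second_nonlinear(self.conv(f2))   # second -> third image feature (same shape as x)
        f4 = x * f3                                 # point-wise multiplication of corresponding values
        return x + f4 if self.use_pointwise_add else f4  # optional point-wise addition (claim 3)


if __name__ == "__main__":
    # Training-phase construction and a single optimization step in the spirit of claims 8 and 16;
    # random tensors stand in for pre-obtained training data and for the supervision signal.
    unit = NonlinearTransformUnit(channels=192)
    optimizer = torch.optim.Adam(unit.parameters(), lr=1e-4)

    feature = torch.randn(4, 192, 16, 16)           # stand-in for first image features to be processed
    target = torch.randn(4, 192, 16, 16)            # stand-in for the training target

    processed = unit(feature)                       # processed image feature
    loss = nn.functional.mse_loss(processed, target)
    loss.backward()                                 # gradients flow through both the multiply and add branches
    optimizer.step()
    print(processed.shape, float(loss))
```

In practice, a unit like this would be trained jointly with the rest of the encoder and decoder networks on a rate-distortion objective rather than with the placeholder loss used here; the sketch only shows that the trained unit contains all four layers and that its parameters are updated from pre-obtained training data.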
PCT/CN2022/135204 2021-12-03 2022-11-30 Image encoding and decoding method and apparatus WO2023098688A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111470979.5A CN116260983A (zh) 2021-12-03 Image encoding and decoding method and apparatus
CN202111470979.5 2021-12-03

Publications (1)

Publication Number Publication Date
WO2023098688A1 true WO2023098688A1 (zh) 2023-06-08

Family

ID=86611535

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/135204 WO2023098688A1 (zh) Image encoding and decoding method and apparatus

Country Status (4)

Country Link
CN (1) CN116260983A (zh)
AR (1) AR127852A1 (zh)
TW (1) TWI826160B (zh)
WO (1) WO2023098688A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116452696A (zh) * 2023-06-16 2023-07-18 Shandong Computer Science Center (National Supercomputer Center in Jinan) Image compressive sensing reconstruction method and system based on dual-domain feature sampling

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259904A (zh) * 2020-01-16 2020-06-09 Southwest University of Science and Technology Semantic image segmentation method and system based on deep learning and clustering
WO2021077620A1 (zh) * 2019-10-22 2021-04-29 SenseTime International Pte. Ltd. Image processing method and apparatus, processor, and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005354610A (ja) * 2004-06-14 2005-12-22 Canon Inc Image processing apparatus, image processing method, and image processing program
CN109658352B (zh) * 2018-12-14 2021-09-14 Shenzhen SenseTime Technology Co., Ltd. Image information optimization method and apparatus, electronic device, and storage medium
CN111754592A (zh) * 2020-03-31 2020-10-09 Nanjing University of Aeronautics and Astronautics End-to-end multispectral remote sensing image compression method based on feature channel information
CN113709455B (zh) * 2021-09-27 2023-10-24 Beijing Jiaotong University Multi-level image compression method using a Transformer

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021077620A1 (zh) * 2019-10-22 2021-04-29 SenseTime International Pte. Ltd. Image processing method and apparatus, processor, and storage medium
CN111259904A (zh) * 2020-01-16 2020-06-09 Southwest University of Science and Technology Semantic image segmentation method and system based on deep learning and clustering

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116452696A (zh) * 2023-06-16 2023-07-18 Shandong Computer Science Center (National Supercomputer Center in Jinan) Image compressive sensing reconstruction method and system based on dual-domain feature sampling
CN116452696B (zh) * 2023-06-16 2023-08-29 Shandong Computer Science Center (National Supercomputer Center in Jinan) Image compressive sensing reconstruction method and system based on dual-domain feature sampling

Also Published As

Publication number Publication date
AR127852A1 (es) 2024-03-06
CN116260983A (zh) 2023-06-13
TW202324308A (zh) 2023-06-16
TWI826160B (zh) 2023-12-11

Similar Documents

Publication Publication Date Title
WO2022021938A1 (zh) Image processing method and apparatus, and neural network training method and apparatus
KR20220137076A (ko) Image processing method and related device
WO2022155974A1 (zh) Video encoding and decoding method and apparatus, and model training method and apparatus
WO2021249290A1 (zh) Loop filtering method and apparatus
WO2023098688A1 (zh) Image encoding and decoding method and apparatus
US20240161488A1 (en) Independent positioning of auxiliary information in neural network based picture processing
US20240037802A1 (en) Configurable positions for auxiliary information input into a picture data processing neural network
CN115604485A (zh) Video picture decoding method and apparatus
WO2023193629A1 (zh) Encoding and decoding method and apparatus for a region enhancement layer
US20240007637A1 (en) Video picture encoding and decoding method and related device
CN114915783A (zh) Encoding method and apparatus
TW202337211A (zh) Conditional image compression
WO2022179509A1 (zh) Layered compression method and apparatus for audio/video or images
WO2023066536A1 (en) Attention based context modelling for image and video compression
WO2022022176A1 (zh) Image processing method and related device
WO2023172153A1 (en) Method of video coding by multi-modal processing
CN115409697A (zh) Image processing method and related apparatus
WO2024012227A1 (zh) Image display method applied to an electronic device, encoding method, and related apparatus
WO2023165487A1 (zh) Feature-domain optical flow determination method and related device
TWI834087B (zh) Method and apparatus for reconstructing a picture from a bitstream and encoding a picture into a bitstream, and computer program product
WO2023050433A1 (zh) Video encoding and decoding method, encoder, decoder, and storage medium
US20240078414A1 (en) Parallelized context modelling using information shared between patches
WO2024083405A1 (en) Neural network with a variable number of channels and method of operating the same
WO2023160835A1 (en) Spatial frequency transform based image modification using inter-channel correlation information
CN118120233A (zh) Attention-based context modeling for image and video compression

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22900492

Country of ref document: EP

Kind code of ref document: A1