CN113315970B

CN113315970B - Image compression method, image decoding method, intelligent terminal and storage medium

Info

Publication number: CN113315970B
Application number: CN202010121098.1A
Authority: CN
Inventors: 陈巍
Original assignee: Wuhan TCL Group Industrial Research Institute Co Ltd
Current assignee: Wuhan TCL Group Industrial Research Institute Co Ltd
Priority date: 2020-02-26
Filing date: 2020-02-26
Publication date: 2023-08-01
Anticipated expiration: 2040-02-26
Also published as: CN113315970A

Abstract

The invention discloses an image compression method, an image decoding method, an intelligent terminal and a storage medium. The compression method comprises: acquiring a target image and encoding the target image to obtain a plurality of feature maps; performing cluster quantization on the plurality of feature maps to obtain quantized feature map data; and performing probability estimation and arithmetic coding on the feature map data through a probability estimation network to obtain binary data, wherein the binary data is the image compression data of the target image. The decoding method comprises: acquiring the binary data; recovering the plurality of cluster-quantized feature maps from the binary data through the probability estimation network and arithmetic decoding; and outputting a reconstructed decoded image corresponding to the target image. According to the invention, image compression and decoding are carried out by jointly optimizing the multi-scale autoencoder network and the probability estimation network, so that the probability estimation network can better estimate the probabilities of the data compressed by the lossy model, and a better image processing effect is achieved.

Description

Image compression method, image decoding method, intelligent terminal and storage medium

Technical Field

The present invention relates to the field of image processing technologies, and in particular, to an image compression method, an image decoding method, an intelligent terminal, and a storage medium.

Background

An image compression model based on a deep convolutional autoencoder network learns the data distribution of image data through the convolutional autoencoder network. Because the image data is treated as Gaussian distributed, the image compression model can fit the distribution of the image information through parameter learning, and the whole image compression model can be learned end to end.

Image compression models based on deep convolutional autoencoder networks are widely used because convolutional networks represent abstract image features well. To compress the feature data further, entropy coding (a coding method that, in accordance with the entropy principle, loses no information during coding) is generally applied to the data produced by the image compression model. Entropy coding alone is lossless compression, that is, the entropy coding process involves no loss of information.

At present, in multi-scale image compression methods, the lossy compression model (the multi-scale compression model) and the probability estimation network used in entropy coding need to be trained and optimized separately. The probability estimation network is a network model that learns and estimates the occurrence probability of each pixel value of the quantized feature maps, because entropy coding requires the probability of the object to be encoded. In other words, the multi-scale compression model is trained and optimized first, and once it has been trained it is fixed while the probability estimation network is trained and optimized. When the multi-scale compression model and the probability estimation network are instead jointly optimized, the decoded image becomes abnormal or the compressed file doubles in size, because the original fixed-value quantization prevents the two models from learning their respective parameters along the correct directions during joint training.

Accordingly, the prior art is still in need of improvement and development.

Disclosure of Invention

The invention mainly aims to provide an image compression method, an image decoding method, an intelligent terminal and a storage medium, and aims to solve the problem that in the prior art, the decoded image is abnormal or the size of a compressed file is doubled when a multi-scale compression model and a probability estimation network are jointly optimized.

In order to achieve the above object, the present invention provides an image compression method, comprising the steps of:

acquiring a target image, and encoding the target image to obtain a plurality of feature maps;

performing cluster quantization on the plurality of feature maps to obtain quantized feature map data;

and performing probability estimation and arithmetic coding on the feature map data through a probability estimation network to obtain binary data, wherein the binary data is image compression data of the target image.

Optionally, in the image compression method, the acquiring a plurality of feature maps specifically includes:

sequentially performing a preset multiple downsampling operation, a convolution operation, normalization and nonlinear transformation, and a channel segmentation operation on the target image to obtain four feature maps.

Optionally, in the image compression method, sequentially performing the preset multiple downsampling operation, the convolution operation, the normalization and nonlinear transformation, and the channel segmentation operation on the target image to obtain four feature maps specifically includes:

sequentially performing preset multiple downsampling operation, first convolution operation, normalization and nonlinear transformation and second convolution operation on the target image;

sequentially performing a preset multiple downsampling operation, normalization and nonlinear transformation, a third convolution operation and a first channel segmentation operation on the image subjected to the second convolution operation, and outputting a first feature map;

sequentially performing a preset multiple downsampling operation, normalization and nonlinear transformation, a fourth convolution operation and a second channel segmentation operation on the image subjected to the first channel segmentation operation, and outputting a second feature map;

sequentially performing a preset multiple downsampling operation, normalization and nonlinear transformation, a fifth convolution operation and a third channel segmentation operation on the image subjected to the second channel segmentation operation, and outputting a third feature map;

and sequentially performing a preset multiple downsampling operation, normalization and nonlinear transformation and a sixth convolution operation on the image subjected to the third channel segmentation operation, and outputting a fourth feature map.

Optionally, in the image compression method, the image after the third channel segmentation operation is sequentially subjected to a preset multiple downsampling operation, a normalization and nonlinear transformation and a sixth convolution operation, and a fourth feature map is output, and then the method further includes:

performing a first preset multiple downsampling operation, a second preset multiple downsampling operation and a third preset multiple downsampling operation on the first feature map, the second feature map and the third feature map respectively, so that the sizes of the first feature map, the second feature map and the third feature map are the same as that of the fourth feature map, and merging the first feature map, the second feature map and the third feature map with the fourth feature map in the channel dimension.

Optionally, in the image compression method, the preset multiple downsampling operation is a 2-fold downsampling operation, and the 2-fold downsampling operation is used for reducing the size of the image by half;

the convolution kernel in the first convolution operation has a size of 3*3, the number of output channels is 128, the step length is 1, and the pixel filling is 1;

the convolution kernel in the second convolution operation has a size of 3*3, the number of output channels is 64, the step size is 1, and the pixel filling is 1;

the convolution kernel in the third convolution operation has a size of 3*3, the number of output channels is 128+the number of channels of the first feature map, the step length is 1, and the pixel filling is 1;

The convolution kernel in the fourth convolution operation has a size of 3*3, the number of output channels is 256+the number of channels of the second feature map, the step length is 1, and the pixel filling is 1;

the convolution kernel in the fifth convolution operation has a size of 3*3, the number of output channels is 512+the number of channels of the third feature map, the step length is 1, and the pixel filling is 1;

the convolution kernel in the sixth convolution operation is 3*3, the number of output channels is the number of channels of the fourth feature map, the step size is 1, and the pixel filling is 1.

Optionally, in the image compression method, the first channel segmentation operation is used for splitting a tensor whose channel number is 128 + the channel number of the first feature map into two tensors, one with a channel number of 128 and one with the channel number of the first feature map;

the second channel segmentation operation is used for splitting a tensor whose channel number is 256 + the channel number of the second feature map into two tensors, one with a channel number of 256 and one with the channel number of the second feature map;

the third channel segmentation operation is used for splitting a tensor whose channel number is 512 + the channel number of the third feature map into two tensors, one with a channel number of 512 and one with the channel number of the third feature map.

Optionally, in the image compression method, the clustering quantization processing is performed on the plurality of feature maps, and specifically includes:

acquiring a plurality of feature maps;

given cluster quantization center points $C = \{c_1, c_2, \ldots, c_L\}$, calculating the distance between each point in input_x and the quantization center points, and setting the nearest center point as its quantized value, i.e., $Q(\mathrm{input\_x}_i) := \arg\min_j (\mathrm{input\_x}_i - c_j)$, where $i$ denotes the $i$-th datum of input_x, $j$ denotes the $j$-th quantization center, $j \in [1, L]$, and $L$ is the number of cluster quantization center points;

performing soft quantization during training by means of a soft quantization function soft_Q, wherein $\sigma$ is a hyperparameter;

transitioning from soft quantization to hard quantization and rounding, $\mathrm{stop\_gradient}(Q(\mathrm{input\_x}_i) - \mathrm{soft\_Q}(\mathrm{input\_x}_i)) + \mathrm{soft\_Q}(\mathrm{input\_x}_i)$;

rounding, round(x), and outputting the quantized feature map data.

In addition, to achieve the above object, the present invention provides an image decoding method comprising the steps of:

acquiring binary data, wherein the binary data is image compression data of a target image;

and recovering a plurality of cluster-quantized feature maps from the binary data through a probability estimation network and arithmetic decoding, and outputting a reconstructed decoded image corresponding to the target image.

Optionally, the image decoding method, wherein the plurality of feature maps include: a first feature map, a second feature map, a third feature map, and a fourth feature map.

Optionally, in the image decoding method, recovering the plurality of cluster-quantized feature maps from the binary data through the probability estimation network and arithmetic decoding, and outputting the reconstructed decoded image, specifically includes:

sequentially performing a seventh convolution operation, normalization and nonlinear transformation and a first preset multiple up-sampling operation on the fourth feature map;

sequentially performing a first channel merging operation, an eighth convolution operation, normalization and nonlinear transformation and a second preset multiple up-sampling operation on the image subjected to the first preset multiple up-sampling operation and the third feature map;

sequentially performing a second channel merging operation, a ninth convolution operation, normalization and nonlinear transformation and a third preset multiple up-sampling operation on the image subjected to the second preset multiple up-sampling operation and the second feature map;

sequentially performing third channel merging operation, tenth convolution operation, normalization and nonlinear transformation and fourth preset multiple up-sampling operation on the image subjected to the third preset multiple up-sampling operation and the first feature map;

and sequentially performing eleventh convolution operation, normalization and nonlinear transformation, twelfth convolution operation and fifth preset multiple up-sampling operation on the image subjected to the fourth preset multiple up-sampling operation, and outputting a reconstructed decoded image.

Optionally, in the image decoding method, a convolution kernel in the seventh convolution operation is 3*3, the number of output channels is 2048, a step size is 1, and a pixel filling is 1;

the convolution kernel in the eighth convolution operation has a size of 3*3, the number of output channels is 1024, the step size is 1, and the pixel filling is 1;

the convolution kernel in the ninth convolution operation has a size of 3*3, the number of output channels is 512, the step size is 1, and the pixel filling is 1;

the convolution kernel in the tenth convolution operation has a size of 3*3, the number of output channels is 256, the step size is 1, and the pixel filling is 1;

the convolution kernel in the eleventh convolution operation has a size of 3*3, the number of output channels is 128, the step size is 1, and the pixel filling is 1;

the convolution kernel in the twelfth convolution operation has a size of 3*3, the number of output channels is 12, the step size is 1, and the pixel fill is 1.

In addition, to achieve the above object, the present invention further provides an intelligent terminal, where the intelligent terminal includes: a memory, a processor, and an image compression program or an image decoding program stored in the memory and executable on the processor, wherein the image compression program, when executed by the processor, implements the steps of the image compression method described above, or the image decoding program, when executed by the processor, implements the steps of the image decoding method described above.

In addition, in order to achieve the above object, the present invention also provides a storage medium storing an image compression program or an image decoding program which, when executed by a processor, implements the steps of the image compression method described above or which, when executed by a processor, implements the steps of the image decoding method described above.

In the present invention, the image compression method includes: acquiring a target image, and encoding the target image to obtain a plurality of feature maps; performing cluster quantization on the plurality of feature maps to obtain quantized feature map data; and performing probability estimation and arithmetic coding on the feature map data through a probability estimation network to obtain binary data, wherein the binary data is image compression data of the target image. The image decoding method includes: acquiring binary data, wherein the binary data is image compression data of the target image; and recovering the plurality of cluster-quantized feature maps from the binary data through the probability estimation network and arithmetic decoding, and outputting a reconstructed decoded image corresponding to the target image. According to the invention, image compression and decoding are carried out by jointly optimizing the multi-scale autoencoder network and the probability estimation network, so that the probability estimation network can better estimate the probabilities of the data compressed by the lossy model, and a better image processing effect is achieved.

Drawings

FIG. 1 is a flow chart of a preferred embodiment of the image compression method of the present invention;

FIG. 2 is a flow chart of a preferred embodiment of the image decoding method of the present invention;

FIG. 3 is a schematic diagram of the entire process of performing image compression and image decoding in the image compression method and the image decoding method of the present invention;

FIG. 4 is a schematic diagram of an operating environment of a smart terminal according to a preferred embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

As shown in fig. 1, the image compression method according to the preferred embodiment of the present invention includes the following steps:

Step S11, acquiring a target image, and encoding the target image to obtain a plurality of feature maps.

Specifically, the target image is sequentially subjected to a preset multiple (preferably 2-fold) downsampling operation, a convolution operation, normalization and nonlinear transformation, and a channel segmentation operation, so as to obtain four feature maps of different sizes. That is, the target image is preferably encoded into four feature maps, as shown in fig. 3: a first feature map (C1), a second feature map (C2), a third feature map (C3) and a fourth feature map (C4). The four feature maps differ in scale, where scale refers to the width and height of a feature map: the feature maps produced by different E_blocks have different widths and heights, and because each E_block downsamples by a factor of 2, the widths and heights of the feature maps produced by two adjacent E_blocks differ by a factor of 2.

E_block refers to a module in the encoding process that contains a 2-fold downsampling operation (space2depth ↓2), a convolution operation (conv 3*3/channel number), normalization and nonlinear transformation (BN+Relu), and a channel splitting operation (split); some E_blocks do not contain a split operation.

The main purpose of downsampling (space2depth ↓x) is to shrink an image, for example so that the image fits the size of a display area or to generate a thumbnail of the image; for example, if the target image size is 512×512, the image size after 2-fold downsampling is 256×256. In the invention, space2depth ↓2 means that the spatial size (the width and height of the feature map) is halved by a certain data extraction pattern, and the extracted data is rearranged into the channel dimension of the feature map. The total amount of data is unchanged by the transformation, and because the width and height are each halved, the number of feature maps (channels) after rearrangement is 4 times the number before the transformation.
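The rearrangement described above can be sketched in plain Python. The source does not specify the data extraction pattern, so the channel ordering used below is an assumption, and the names `space_to_depth` and `shape` are hypothetical helpers. Only the shape change (4 times the channels, half the width and height, unchanged data volume) is taken from the text.

```python
def space_to_depth(x, block=2):
    """Rearrange a [C][H][W] nested list into [C*block*block][H//block][W//block].

    Assumed pattern: output channel (dy, dx, c) holds the pixels of input
    channel c at rows dy, dy+block, ... and columns dx, dx+block, ...
    """
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    if height % block or width % block:
        raise ValueError("height and width must be divisible by block")
    out = []
    for dy in range(block):
        for dx in range(block):
            for c in range(channels):
                out.append([[x[c][i][j] for j in range(dx, width, block)]
                            for i in range(dy, height, block)])
    return out


def shape(x):
    """Return (channels, height, width) of a nested-list feature map."""
    return (len(x), len(x[0]), len(x[0][0]))


# A 3-channel 4x4 "image" with distinct values in every position.
image = [[[c * 16 + i * 4 + j for j in range(4)] for i in range(4)]
         for c in range(3)]
packed = space_to_depth(image)
print(shape(image), "->", shape(packed))
```

No values are created or dropped: the same numbers appear before and after, only their positions change.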

The convolution operation (conv): in the traditional mathematical definition, a convolution consists of element-wise multiplication followed by summation. The purpose of the convolution operation is to extract features of the image, and different feature maps can be obtained with different convolution kernels and different calculation modes.

Normalization and nonlinear transformation (BN+Relu): normalization here is batch normalization (BN), which normalizes each batch of data during model training. The nonlinear transformation is a nonlinear operation that uses the Relu function, that is, y = Relu(x). The Relu function is an activation function that runs on the neurons of an artificial neural network and maps the inputs of a neuron to its output; the activation function is introduced to increase the nonlinearity of the neural network model.
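As a minimal sketch of the BN+Relu step, the following normalizes a list of values to zero mean and unit variance and then applies Relu. It is a simplification under stated assumptions: real batch normalization works per channel over a batch and has a learned scale and shift, which are omitted here, and the `eps` value and function names are hypothetical.

```python
import math


def batch_norm(values, eps=1e-5):
    """Normalize to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def relu(values):
    """y = Relu(x): negative inputs become 0, others pass through."""
    return [max(0.0, v) for v in values]


activations = [-2.0, 0.0, 1.0, 5.0]   # hypothetical pre-activation values
normalized = batch_norm(activations)
output = relu(normalized)
print(normalized)
print(output)
```

Neither step changes the number of values, which matches the statement that BN+Relu leaves the data dimension unchanged.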

Channel splitting operation (split): the split() function divides data in the channel dimension. The data handled by the model has the four dimensions NCHW, where N is the batch size, C is the number of channels, and H and W are the height and width of the feature map; the split operation is performed in dimension C. As shown in fig. 3, all of these operations act on feature maps. For example, the target image is fed into the first E_block (E_block refers to a module in the encoding process); the target image can also be regarded as a 3-channel feature map. Assuming the batch size is 1, that is, one image is processed at a time, and the width and height of the original image are both 512, the dimension of the feature map can be expressed as (1, 3, 512, 512). After space2depth ↓2 the number of channels becomes 12 and the data dimension is (1, 12, 256, 256). A convolution operation is then performed; since the number of convolution kernels is 128, the data dimension after the convolution is (1, 128, 256, 256). Normalization and the nonlinear operation do not change the data dimension, so it remains (1, 128, 256, 256). The first E_block has no split operation. In an E_block that has a split operation, for example the second E_block in fig. 3, the data with 128+C1 channels is divided in the channel dimension into two pieces of data, one with 128 channels and one with C1 channels, the latter being the first feature map. Here "C1" in "128+C1" denotes the number of channels of the first feature map, and every "+" in the present invention denotes a plus sign.

Further, as shown in fig. 3, the target image (an RGB three-channel image; a channel is also called a feature map) is input and is sequentially subjected to a first 2-fold downsampling operation (space2depth ↓2, with a scale parameter of 2), a first convolution operation (conv 3*3/128), normalization and nonlinear transformation (BN+Relu), and a second convolution operation (conv 3*3/64); the image after the second convolution operation is sequentially subjected to a second 2-fold downsampling operation (space2depth ↓2), normalization and nonlinear transformation (BN+Relu), a third convolution operation (conv 3*3/128+C1) and a first channel segmentation operation (split), and the first feature map (C1) is output; the image after the first channel segmentation operation is sequentially subjected to a third 2-fold downsampling operation (space2depth ↓2), normalization and nonlinear transformation (BN+Relu), a fourth convolution operation (conv 3*3/256+C2) and a second channel segmentation operation (split), and the second feature map (C2) is output; the image after the second channel segmentation operation is sequentially subjected to a fourth 2-fold downsampling operation (space2depth ↓2), normalization and nonlinear transformation (BN+Relu), a fifth convolution operation (conv 3*3/512+C3) and a third channel segmentation operation (split), and the third feature map (C3) is output; and the image after the third channel segmentation operation is sequentially subjected to a fifth 2-fold downsampling operation (space2depth ↓2), normalization and nonlinear transformation (BN+Relu) and a sixth convolution operation (conv 3*3/C4), and the fourth feature map (C4) is output.

Specifically, the convolution kernel size in the first convolution operation (conv 3*3/128) is 3*3, the number of output channels is 128, the step size is 1, and the pixel fill is 1; the convolution kernel size in the second convolution operation (conv 3*3/64) is 3*3, the number of output channels is 64, the step size is 1, and the pixel fill is 1; the convolution kernel size in the third convolution operation (conv 3*3/128+C1) is 3*3, the number of output channels is 128+the number of channels of the first feature map, the step size is 1, and the pixel filling is 1; the convolution kernel size in the fourth convolution operation (conv 3*3/256+C2) is 3*3, the number of output channels is 256+the number of channels of the second feature map, the step size is 1, and the pixel filling is 1; the convolution kernel size in the fifth convolution operation (conv 3*3/512+c3) is 3*3, the number of output channels is 512+the number of channels of the third feature map, the step size is 1, and the pixel filling is 1; the convolution kernel size in the sixth convolution operation (conv 3*3/C4) is 3*3, the number of output channels is the number of channels of the fourth feature map, the step size is 1, and the pixel fill is 1.

The step size (stride) is the distance, in pixels, that the convolution kernel moves between successive positions; for example, a step size of 1 means that after the current pixel region of the feature map has been processed, the convolution kernel slides by one pixel to process the next pixel region. Pixel padding (padding) specifies whether pixels are added on the top, bottom, left and right sides of the feature map when the convolution operation is performed.
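To illustrate why a 3*3 kernel with step size 1 and pixel padding 1 leaves the width and height unchanged, here is a small single-channel sketch. It assumes zero-valued padding (the source gives only the padding width, not the fill value) and computes the sliding-window sum of products commonly used in deep learning; the function names are hypothetical.

```python
def conv_output_size(size, kernel=3, stride=1, padding=1):
    """Output length along one axis: (size + 2*padding - kernel) // stride + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def conv2d_single(image, kernel, stride=1, padding=1):
    """Single-channel 2D convolution with assumed zero padding on all sides."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    padded = [[0.0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for i in range(h):
        for j in range(w):
            padded[i + padding][j + padding] = image[i][j]
    out = []
    for oi in range(conv_output_size(h, k, stride, padding)):
        row = []
        for oj in range(conv_output_size(w, k, stride, padding)):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    # Multiply element-wise, then sum over the window.
                    acc += padded[oi * stride + di][oj * stride + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out


image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
box = [[1.0] * 3 for _ in range(3)]       # hypothetical all-ones kernel
result = conv2d_single(image, box)
print(len(result), len(result[0]))        # same height and width as the input
print(conv_output_size(256))              # a 256-wide feature map stays 256 wide
```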

Specifically, the first channel segmentation operation splits a tensor whose channel number is 128 + the channel number of the first feature map (C1) into two tensors, one with 128 channels and one with C1 channels; the second channel segmentation operation splits a tensor whose channel number is 256 + the channel number of the second feature map (C2) into two tensors, one with 256 channels and one with C2 channels; the third channel segmentation operation splits a tensor whose channel number is 512 + the channel number of the third feature map (C3) into two tensors, one with 512 channels and one with C3 channels.

A tensor is a term of art in deep learning that refers to a multi-dimensional matrix; for example, the feature maps mentioned above are tensors, expressed in the four dimensions NCHW.

Further, the first feature map (C1), the second feature map (C2) and the third feature map (C3) are respectively subjected to a first preset multiple (preferably 8-fold) downsampling operation, a second preset multiple (preferably 4-fold) downsampling operation and a third preset multiple (preferably 2-fold) downsampling operation, so that the sizes of the first feature map (C1), the second feature map (C2) and the third feature map (C3) are the same as that of the fourth feature map (C4), and the three are then merged with the fourth feature map (C4) in the channel dimension.
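The encoder described above can be checked with a shape-only trace. The sketch below tracks (channels, height, width) through the five downsampling stages and shows why 8-fold, 4-fold and 2-fold downsampling bring C1, C2 and C3 to the size of C4. The channel counts passed for C1 to C4 are hypothetical, since the source does not give them, and the function name is an assumption.

```python
def encoder_shapes(height, width, c1, c2, c3, c4, in_channels=3):
    """Return {name: (channels, height, width)} for the four feature maps."""
    def down(shape):
        c, h, w = shape
        return (c * 4, h // 2, w // 2)       # space2depth by 2

    def conv(shape, out_channels):
        _, h, w = shape
        return (out_channels, h, w)          # 3*3, step size 1, padding 1

    x = (in_channels, height, width)
    x = conv(conv(down(x), 128), 64)         # first and second convolutions
    maps = {}
    for name, keep, side in (("C1", 128, c1), ("C2", 256, c2), ("C3", 512, c3)):
        x = conv(down(x), keep + side)       # third, fourth, fifth convolutions
        maps[name] = (side, x[1], x[2])      # split: side branch is the feature map
        x = (keep, x[1], x[2])               # split: main branch continues
    maps["C4"] = conv(down(x), c4)           # sixth convolution
    return maps


# Hypothetical channel counts for C1..C4 on a 512x512 RGB image.
maps = encoder_shapes(512, 512, c1=4, c2=4, c3=4, c4=8)
factors = {name: s[1] // maps["C4"][1] for name, s in maps.items()}
for name, s in maps.items():
    print(name, s, "downsampling factor to match C4:", factors[name])
```

The trace only accounts for spatial sizes at the merge step; how the channel counts of C1 to C3 change under the extra downsampling depends on the downsampling method, which the text does not fix.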

Step S12, performing cluster quantization on the plurality of feature maps to obtain quantized feature map data.

First, the data generated in the encoding stage (here, the feature map produced by the last E_block) must be quantized in order to achieve data compression. For input_x, the quantization mode of the original multi-scale model is as follows:

(1) A batch normalization (BN) operation;

(2) clip[0, u], where clip denotes clipping, after which the result is mapped to N levels. N and u are both hyperparameters that must be set manually: N is the quantized data range, for example N = 7 means quantization to [0, 6], while u has no special meaning and serves only to narrow the data range of the feature map, for example u is set to 4;

(3) A soft quantization function is then applied to ensure error back-propagation, where α is a hyperparameter, for example set to 0.5;

(4) Finally, the rounding operation round(x).

However, this quantization approach prevents the multi-scale model and the probability estimation network (such as a PixelCNN model; the parallel multi-scale PixelCNN network is a multi-scale neural network capable of generating multiple pixel values in parallel) from being jointly optimized during network training. This hinders the probability estimation network from modelling the probabilities of the quantized data well, so that better compression cannot be achieved. Therefore, drawing on the idea of clustering, the invention adopts a new quantization mode, cluster quantization, which solves the problem that the two networks cannot be jointly optimized.

Namely, the clustering quantization mode adopted by the invention is as follows:

(1) Given cluster quantization center points $C = \{c_1, c_2, \ldots, c_L\}$, calculate the distance between each point in input_x and the quantization center points, and set the nearest center point as its quantized value, i.e., $Q(\mathrm{input\_x}_i) := \arg\min_j (\mathrm{input\_x}_i - c_j)$, where $i$ denotes the $i$-th datum of input_x, $j$ denotes the $j$-th quantization center, $j \in [1, L]$, and $L$ is the number of cluster quantization center points;

(2) During training, soft quantization by a soft quantization function soft_Q is also required in order to guarantee error back-propagation, where $\sigma$ is a hyperparameter, for example set to 1.0;

(3) Then soft quantization is transitioned to hard quantization and rounded. The expression $\mathrm{stop\_gradient}(Q(\mathrm{input\_x}_i) - \mathrm{soft\_Q}(\mathrm{input\_x}_i))$ means that no gradient tracking is performed on $Q(\mathrm{input\_x}_i) - \mathrm{soft\_Q}(\mathrm{input\_x}_i)$, that is, this part is not differentiated; it is the mathematical representation of a processing step in the program code;

The expression $\mathrm{stop\_gradient}(Q(\mathrm{input\_x}_i) - \mathrm{soft\_Q}(\mathrm{input\_x}_i)) + \mathrm{soft\_Q}(\mathrm{input\_x}_i)$ means that in forward propagation the quantized value is $Q(\mathrm{input\_x}_i)$, whereas in the backward computation of gradients the quantization is treated as $\mathrm{soft\_Q}(\mathrm{input\_x}_i)$, which ensures that the quantization step is differentiable in the chain-rule derivation of the model;

(4) Finally, round(x) is applied and the quantized feature map data is output.
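The cluster quantization steps can be sketched as follows. Two points are assumptions rather than the patent's stated method: the distance is taken as the absolute difference, and because the soft quantization formula is not legible in this text, the sketch substitutes a commonly used softmax-weighted average of the centers with sharpness σ. All function names and the center values are hypothetical. Plain Python has no automatic differentiation, so only the forward value of the stop_gradient expression is shown; in an autodiff framework the wrapped term would contribute no gradient, leaving the gradient of soft_Q.

```python
import math


def hard_quantize(x, centers):
    """Q(x): the center nearest to x (assumed absolute distance)."""
    return min(centers, key=lambda c: abs(x - c))


def soft_quantize(x, centers, sigma=1.0):
    """Assumed soft_Q(x): softmax(-sigma * |x - c_j|) weighted sum of centers."""
    scores = [-sigma * abs(x - c) for c in centers]
    top = max(scores)                        # subtract the max for stability
    weights = [math.exp(s - top) for s in scores]
    return sum(w * c for w, c in zip(weights, centers)) / sum(weights)


def straight_through_forward(x, centers, sigma=1.0):
    """Forward value of stop_gradient(Q - soft_Q) + soft_Q.

    Numerically this equals Q(x) up to floating-point rounding.
    """
    hard = hard_quantize(x, centers)
    soft = soft_quantize(x, centers, sigma)
    return (hard - soft) + soft


centers = [0.0, 1.0, 2.0, 3.0]               # hypothetical cluster centers
for value in (0.2, 1.6, 2.9):
    print(value,
          hard_quantize(value, centers),
          round(soft_quantize(value, centers), 3),
          straight_through_forward(value, centers))
```

With a large σ the soft value approaches the hard value, which is one way to see soft quantization as a differentiable stand-in for the nearest-center rule.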

Step S13, performing probability estimation and arithmetic coding on the feature map data through a probability estimation network to obtain binary data, wherein the binary data is image compression data of the target image.

In the invention, the autoencoder network consists of a downsampling encoding network and an upsampling decoding network, and the encoding and decoding processes are usually symmetric in structure. The multi-scale autoencoder network builds on the autoencoder network: feature maps of different scales extracted during the encoding process are retained according to a hyperparameter and fed into the decoding network.

Further, the invention also discloses an image decoding method, as shown in fig. 2, according to the preferred embodiment of the invention, the image decoding method comprises the following steps:

Step S21, acquiring binary data, wherein the binary data is image compression data of a target image.

In the present invention, the image decoding method and the image compression method are processes corresponding to each other, and the binary data is obtained after the image compression is completed, that is, the binary data is the image compression data of the target image, so the image decoding method needs to acquire the binary data first.

Step S22, a plurality of cluster-quantized feature maps are obtained from the binary data through a probability estimation network and arithmetic decoding, and a reconstructed decoded image corresponding to the target image is output.

Specifically, after the feature maps have been cluster-quantized, probability estimation is performed on them by the probability estimation network (probability estimation serves entropy coding, because entropy coding requires knowing the probability of the data or symbols to be coded). The probability estimation network adopts a parallel PixelCNN model. After probability estimation, the quantized feature map data is converted into binary data by conventional arithmetic coding and stored. The binary data is then decoded by combining the probability estimation network with arithmetic decoding, so as to obtain a quantized first feature map (C1), a quantized second feature map (C2), a quantized third feature map (C3) and a quantized fourth feature map (C4).
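The role of probability estimation in entropy coding can be illustrated with a toy arithmetic coder. The sketch below uses exact fractions and a fixed, hand-written probability table; in the method described here the probabilities would instead come from the probability estimation network, and a practical coder would use finite-precision integer arithmetic. All function names and the example symbols are hypothetical.

```python
import math
from fractions import Fraction

def cumulative(probs):
    """Map each symbol to its [low, high) slice of the unit interval."""
    table, low = {}, Fraction(0)
    for symbol, p in probs.items():
        table[symbol] = (low, low + p)
        low += p
    return table

def encode(symbols, probs):
    """Narrow [0, 1) once per symbol, then emit the shortest bit string inside it."""
    table = cumulative(probs)
    low, width = Fraction(0), Fraction(1)
    for s in symbols:
        lo, hi = table[s]
        low, width = low + width * lo, width * (hi - lo)
    n = 0
    while True:
        n += 1
        k = math.ceil(low * 2 ** n)        # smallest multiple of 2^-n that is >= low
        if Fraction(k, 2 ** n) < low + width:
            return format(k, "0{}b".format(n))

def decode(bits, probs, count):
    """Recover `count` symbols; the count is passed in because no end marker is coded."""
    table = cumulative(probs)
    value = Fraction(int(bits, 2), 2 ** len(bits))
    out = []
    for _ in range(count):
        for symbol, (lo, hi) in table.items():
            if lo <= value < hi:
                out.append(symbol)
                value = (value - lo) / (hi - lo)   # rescale to the unit interval
                break
    return out

# Hypothetical quantized symbols and a fixed probability model.
probs = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 4)}
symbols = [0, 0, 1, 0, 2, 0]
bits = encode(symbols, probs)
restored = decode(bits, probs, len(symbols))
```

The encoder and decoder must use identical probabilities, which is why the same probability estimation network is needed on both the compression side and the decoding side. The more probable a symbol is under the model, the less it narrows the interval and the fewer bits it costs.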

That is, the obtained C1, C2, C3 and C4 are processed sequentially by D_block modules, as shown in Fig. 3, to finally obtain a reconstructed decoded image. D_block is the module at the decoding end; it comprises a convolution operation (conv 3*3/channel number), normalization and nonlinear transformation (BN+ReLU), 2-fold upsampling (depth2space) and a channel merging operation (concat; some of the modules have no channel merging operation). The convolution, normalization and nonlinear transformation are as described for E_block at the encoding end. The difference from E_block is that the channel merging operation and the upsampling are the inverse operations of the channel splitting and the downsampling in E_block. The channel merging operation merges feature maps in the channel dimension, that is, the channel dimensions are added: for example, if the data dimension of feature map A is (N, 32, H, W) and the data dimension of feature map B is (N, 64, H, W), then merging A and B in the channel dimension forms a new feature map C whose data dimension is (N, 96, H, W). The depth2space upsampling operation is the inverse of space2depth and follows the same value-taking and arrangement rule: the data of the channel dimension is rearranged into the width and height dimensions, so for 2-fold upsampling the number of channels becomes 1/4 of the original and the width and height each become 2 times the original. Likewise, the object operated on by the D_block module is a feature map, whose data is a four-dimensional tensor (N, C, H, W).
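A minimal sketch of the two decoder-side tensor operations described above, written for nested Python lists of shape (C, H, W) with the batch dimension N omitted. The element ordering used in depth_to_space follows a common pixel-shuffle convention and is an assumption; the text only states that it mirrors the rule used at the encoding end. The helper names are hypothetical.

```python
def shape(t):
    """Shape (C, H, W) of a nested-list tensor."""
    return (len(t), len(t[0]), len(t[0][0]))

def concat_channels(a, b):
    """Channel merging: stack the channels of b after those of a."""
    assert shape(a)[1:] == shape(b)[1:], "height and width must match"
    return a + b

def depth_to_space(x, r=2):
    """Move groups of r*r channels into an r-times larger spatial grid."""
    c, h, w = shape(x)
    assert c % (r * r) == 0, "channel count must be divisible by r*r"
    out_c = c // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(out_c)]
    for oc in range(out_c):
        for i in range(r):
            for j in range(r):
                src = x[oc * r * r + i * r + j]     # ASSUMED channel ordering
                for y in range(h):
                    for col in range(w):
                        out[oc][y * r + i][col * r + j] = src[y][col]
    return out

def zeros(c, h, w):
    return [[[0] * w for _ in range(h)] for _ in range(c)]

# Channel merging: 32 + 64 channels give 96 channels, as in the A/B example.
merged = concat_channels(zeros(32, 4, 4), zeros(64, 4, 4))

# 2-fold upsampling: 4 channels of size 1x1 become 1 channel of size 2x2.
up = depth_to_space([[[1]], [[2]], [[3]], [[4]]])
```

Here shape(merged) is (96, 4, 4), and up holds the four input values arranged as a single 2x2 channel, so the channel count drops to 1/4 while width and height double.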

Further, as shown in Fig. 3, the fourth feature map (C4) is sequentially subjected to a seventh convolution operation (conv 3*3/2048), normalization and nonlinear transformation (BN+ReLU), and a first 2-fold upsampling operation (the preset-multiple upsampling operation in the present invention is preferably a 2-fold upsampling operation, i.e. depth2space 2x, which is the inverse of the downsampling operation and enlarges the image); the image after the first 2-fold upsampling operation and the third feature map (C3) are sequentially subjected to a first channel merging operation (concat), an eighth convolution operation (conv 3*3/1024), normalization and nonlinear transformation (BN+ReLU) and a second 2-fold upsampling operation (depth2space 2x); the image after the second 2-fold upsampling operation and the second feature map (C2) are sequentially subjected to a second channel merging operation (concat), a ninth convolution operation (conv 3*3/512), normalization and nonlinear transformation (BN+ReLU) and a third 2-fold upsampling operation (depth2space 2x); the image after the third 2-fold upsampling operation and the first feature map (C1) are sequentially subjected to a third channel merging operation (concat), a tenth convolution operation (conv 3*3/256), normalization and nonlinear transformation (BN+ReLU) and a fourth 2-fold upsampling operation (depth2space 2x); and the image after the fourth 2-fold upsampling operation is sequentially subjected to an eleventh convolution operation (conv 3*3/128), normalization and nonlinear transformation (BN+ReLU), a twelfth convolution operation (conv 3*3/12) and a fifth 2-fold upsampling operation (depth2space 2x), and a reconstructed decoded image (reconstruction, i.e. a 3-channel RGB image) is output.

Wherein the convolution kernel size in the seventh convolution operation (conv 3*3/2048) is 3*3, the number of output channels is 2048, the stride is 1, and the pixel padding is 1; the convolution kernel size in the eighth convolution operation (conv 3*3/1024) is 3*3, the number of output channels is 1024, the stride is 1, and the pixel padding is 1; the convolution kernel size in the ninth convolution operation (conv 3*3/512) is 3*3, the number of output channels is 512, the stride is 1, and the pixel padding is 1; the convolution kernel size in the tenth convolution operation (conv 3*3/256) is 3*3, the number of output channels is 256, the stride is 1, and the pixel padding is 1; the convolution kernel size in the eleventh convolution operation (conv 3*3/128) is 3*3, the number of output channels is 128, the stride is 1, and the pixel padding is 1; the convolution kernel size in the twelfth convolution operation (conv 3*3/12) is 3*3, the number of output channels is 12, the stride is 1, and the pixel padding is 1.

The channel merging operation (concat) merges tensors in the channel dimension. For example, merging a tensor with 512 channels and a tensor with c4 channels in the channel dimension yields one tensor with 512+c4 channels. Similarly, after all D_blocks have been processed, a 3-channel RGB image is obtained.
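The channel and size bookkeeping of the decoding end can be checked with a short shape trace. The sketch below tracks only (channels, height, width) tuples through the sequence of operations described for Fig. 3; the channel counts and sizes chosen for C1 to C4 are hypothetical placeholders, since those depend on hyperparameters not fixed here.

```python
def conv3x3(shape, out_channels):
    """3*3 convolution with stride 1 and padding 1: spatial size is unchanged."""
    _, h, w = shape
    return (out_channels, (h + 2 * 1 - 3) // 1 + 1, (w + 2 * 1 - 3) // 1 + 1)

def depth_to_space(shape, r=2):
    """2-fold upsampling: channels / 4, height and width * 2."""
    c, h, w = shape
    assert c % (r * r) == 0
    return (c // (r * r), h * r, w * r)

def concat(a, b):
    """Channel merging: channel counts add, spatial sizes must agree."""
    assert a[1:] == b[1:], "spatial sizes must match before merging"
    return (a[0] + b[0], a[1], a[2])

def decoder_shapes(c1, c2, c3, c4):
    x = depth_to_space(conv3x3(c4, 2048))              # 2048 -> 512 channels
    x = depth_to_space(conv3x3(concat(x, c3), 1024))   # 1024 -> 256 channels
    x = depth_to_space(conv3x3(concat(x, c2), 512))    # 512 -> 128 channels
    x = depth_to_space(conv3x3(concat(x, c1), 256))    # 256 -> 64 channels
    x = conv3x3(x, 128)
    x = depth_to_space(conv3x3(x, 12))                 # 12 -> 3 channels (RGB)
    return x

# Hypothetical feature-map shapes: 8 channels each, C4 being the coarsest scale.
result = decoder_shapes(c1=(8, 32, 32), c2=(8, 16, 16), c3=(8, 8, 8), c4=(8, 4, 4))
```

With these placeholder inputs the trace ends at a 3-channel tensor whose width and height are 32 times those of C4, which matches the five 2-fold upsampling operations and the 12 output channels of the twelfth convolution.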

Further, as shown in fig. 4, based on the above image compression method, the present invention further provides an intelligent terminal, which includes a processor 10, a memory 20 and a display 30. Fig. 4 shows only some of the components of the intelligent terminal, but it should be understood that not all of the illustrated components are required to be implemented, and more or fewer components may alternatively be implemented.

The memory 20 may in some embodiments be an internal storage unit of the intelligent terminal, such as a hard disk or a memory of the intelligent terminal. The memory 20 may in other embodiments be an external storage device of the intelligent terminal, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card) or the like. Further, the memory 20 may include both an internal storage unit and an external storage device of the intelligent terminal. The memory 20 is used for storing application software installed on the intelligent terminal and various data, such as the program code installed on the intelligent terminal. The memory 20 may also be used to temporarily store data that has been output or is to be output. In an embodiment, the memory 20 stores an image compression program or an image decoding program 40, and the image compression program or the image decoding program 40 may be executed by the processor 10, thereby implementing the image compression method or the image decoding method in the present application.

The processor 10 may in some embodiments be a central processing unit (Central Processing Unit, CPU), microprocessor or other data processing chip for executing program code or processing data stored in the memory 20, for example for performing the image compression method or the image decoding method, etc.

The display 30 may in some embodiments be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display 30 is used for displaying information on the intelligent terminal and for displaying a visual user interface. The components 10-30 of the intelligent terminal communicate with each other via a system bus.

In one embodiment, the following steps are implemented when the processor 10 executes the image compression program 40 in the memory 20:

The obtaining of a plurality of feature maps specifically includes:

The step of sequentially performing downsampling operation, convolution operation, normalization and nonlinear transformation and channel segmentation operation of the target image by a preset multiple to obtain four feature images specifically includes:

And sequentially performing a preset multiple downsampling operation, normalization and nonlinear transformation and a sixth convolution operation on the image subjected to the third channel segmentation operation, and outputting a fourth feature map, wherein the method further comprises the following steps:

The preset multiple downsampling operation is 2 times downsampling operation, and the 2 times downsampling operation is used for reducing the size of an image by half;

The first channel segmentation operation is used for splitting a tensor whose channel number is 128 + the channel number of the first feature map into two tensors, one with channel number 128 and one with the channel number of the first feature map;

The clustering quantization processing for the plurality of feature images specifically comprises the following steps:

acquiring a plurality of feature maps;

Applying round(x), and outputting the quantized feature map data.

Or in another embodiment the following steps are implemented when the processor 10 executes the image decoding program 40 in said memory 20:

Wherein the plurality of feature maps comprises: a first feature map, a second feature map, a third feature map, and a fourth feature map.

The binary data is subjected to a probability estimation network and arithmetic decoding to obtain a plurality of feature images after clustering quantization, and a reconstructed decoded image is output, and the method specifically comprises the following steps:

The convolution kernel in the seventh convolution operation has a size of 3*3, the number of output channels is 2048, the stride is 1, and the pixel padding is 1;

The present invention also provides a storage medium storing an image compression program which, when executed by a processor, implements the steps of the image compression method as described above.

In summary, the present invention provides an image compression method, an image decoding method, an intelligent terminal and a storage medium. The image compression method includes: acquiring a target image, and performing coding processing on the target image to acquire a plurality of feature maps; performing cluster quantization on the plurality of feature maps to obtain quantized feature map data; and performing probability estimation and arithmetic coding on the feature map data through a probability estimation network to obtain binary data, wherein the binary data is the image compression data of the target image. The image decoding method includes: acquiring binary data, wherein the binary data is the image compression data of the target image; and obtaining a plurality of cluster-quantized feature maps from the binary data through a probability estimation network and arithmetic decoding, and outputting a reconstructed decoded image. According to the invention, image compression and decoding are carried out by jointly optimizing the multi-scale self-coding network and the probability estimation network, so that the probability estimation network can better estimate the probability of the data compressed by the lossy model, thereby achieving a better image processing effect.

Of course, those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program instructing relevant hardware (such as a processor or a controller). The program may be stored in a computer-readable storage medium and, when executed, may include the steps of the above-described method embodiments. The storage medium may be a memory, a magnetic disk, an optical disk, or the like.

It is to be understood that the invention is not limited in its application to the examples described above, but is capable of modification and variation in light of the above teachings by those skilled in the art, and that all such modifications and variations are intended to be included within the scope of the appended claims.

Claims

1. An image compression method, characterized in that the image compression method comprises the steps of:

carrying out probability estimation and arithmetic coding on the feature map data through a probability estimation network to obtain binary data, wherein the binary data is image compression data of the target image;

acquiring a plurality of feature maps;

given cluster quantization center points C = {c_1, c_2, ..., c_L}, calculating the distance between each point in input_x and the quantization center points, and setting the nearest center point as its quantized value, i.e., Q(input_x_i) := arg min_j (input_x_i - c_j), where i denotes the i-th datum of input_x, j denotes the j-th quantization center, j ∈ [1, L], and L is the number of cluster quantization center points;

soft quantization transitions to hard quantization with rounding,

wherein stop_gradient(Q(input_x_i) - soft_Q(input_x_i)) + soft_Q(input_x_i) means that no gradient tracking is performed on the result of Q(input_x_i) - soft_Q(input_x_i), i.e. this part is not differentiated, which is a mathematical expression of the processing in the program code;

stop_gradient(Q(input_x_i) - soft_Q(input_x_i)) + soft_Q(input_x_i) means that in forward propagation the quantized value is Q(input_x_i), whereas in the backward computation of the gradient the quantized value is soft_Q(input_x_i), so as to ensure that the quantization processing is differentiable in the chain-rule derivation of the model;

applying round(x), and outputting the quantized feature map data.

2. The image compression method according to claim 1, wherein the obtaining a plurality of feature maps specifically includes:

3. The image compression method according to claim 2, wherein the sequentially performing a downsampling operation, a convolution operation, a normalization and nonlinear transformation, and a channel segmentation operation on the target image by a preset multiple to obtain four feature maps specifically includes:

4. The image compression method according to claim 3, wherein the image subjected to the third channel segmentation operation is sequentially subjected to a preset multiple downsampling operation, a normalization and nonlinear transformation operation, and a sixth convolution operation, and a fourth feature map is output, and further comprising:

5. The image compression method according to claim 3, wherein the preset multiple downsampling operation is a 2-times downsampling operation for reducing the size of the image by half;

6. The image compression method according to claim 3, wherein the first channel division operation is for dividing a tensor whose channel number is 128 + the channel number of the first feature map into two tensors, one with channel number 128 and one with the channel number of the first feature map;

7. An image decoding method, characterized in that the image decoding method comprises the steps of:

the binary data are subjected to probability estimation network and arithmetic decoding to obtain a plurality of feature images after clustering quantization, and reconstructed decoded images corresponding to the target images are output;

the clustering quantization specifically comprises the following steps:

acquiring a plurality of feature maps;

soft quantization transitions to hard quantization with rounding, wherein stop_gradient(Q(input_x_i) - soft_Q(input_x_i)) + soft_Q(input_x_i) means that no gradient tracking is performed on the result of Q(input_x_i) - soft_Q(input_x_i), i.e. this part is not differentiated, which is a mathematical expression of the processing in the program code;

applying round(x), and outputting the quantized feature map data.

8. The image decoding method of claim 7, wherein the plurality of feature maps comprises: a first feature map, a second feature map, a third feature map, and a fourth feature map.

9. The image decoding method according to claim 8, wherein the clustering quantized feature maps are obtained by performing a probability estimation network and an arithmetic decoding on the binary data, and the reconstructed decoded image is output, specifically comprising:

10. The image decoding method according to claim 9, wherein the convolution kernel in the seventh convolution operation has a size of 3*3, the number of output channels is 2048, the stride is 1, and the pixel padding is 1;

11. An intelligent terminal, characterized in that, the intelligent terminal includes: memory, a processor and an image compression program or an image decoding program stored on the memory and executable on the processor, the image compression program implementing the steps of the image compression method according to any one of claims 1 to 6 when executed by the processor or the image decoding program implementing the steps of the image decoding method according to any one of claims 7 to 10 when executed by the processor.

12. A storage medium storing an image compression program or an image decoding program, which when executed by a processor implements the steps of the image compression method according to any one of claims 1 to 6 or which when executed by a processor implements the steps of the image decoding method according to any one of claims 7 to 10.