WO2023050720A1 - Image processing method, image processing apparatus, and model training method - Google Patents

Image processing method, image processing apparatus, and model training method

Info

Publication number
WO2023050720A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
feature
coding unit
information
processed
Prior art date
Application number
PCT/CN2022/078897
Other languages
French (fr)
Chinese (zh)
Inventor
任聪
刘衡祁
徐科
孔德辉
宋剑军
易自尧
杨维
Original Assignee
深圳市中兴微电子技术有限公司
Priority date
Filing date
Publication date
Application filed by 深圳市中兴微电子技术有限公司
Publication of WO2023050720A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • G06T9/40Tree coding, e.g. quadtree, octree

Definitions

  • the present application relates to the technical field of image processing, and in particular to an image processing method, an image processing device, and a model training method.
  • in order to improve coding quality, an image usually adopts a quadtree block partition structure based on coding units (Coding Unit, CU): the optimal coding units are divided at minimum rate-distortion cost and each coding unit is encoded separately, i.e., block-based coding is used. As the bit rate decreases, quantization becomes coarse and discontinuities appear at block boundaries, forming obvious defects in the reconstructed image known as blocking artifacts. Because of these visible differences between adjacent blocks, the original video is distorted after encoding and decoding, resulting in a poor user experience.
  • in some cases, image quality enhancement algorithms such as histogram equalization or gamma correction can optimize the codec results, but such algorithms mainly perform image enhancement based on manually summarized experience and characteristics of the human eye, are largely constrained by the image scene, and thus have limited ability to improve image quality.
  • convolutional neural networks in deep learning are also used to enhance image quality, but convolution extracts features through local receptive fields, which to a certain extent ignores the correlation between blocks, so the image quality is still difficult to guarantee.
  • the present application proposes an image processing method, an image processing device, a model training method, a training device, and a computer-readable storage medium.
  • an embodiment of the present application provides an image processing method, the method comprising: acquiring an image to be processed, the image to be processed being obtained by decoding an original image; acquiring coding unit division information of the original image during encoding, the coding unit division information including first position information and first size information of each coding unit; dividing the image to be processed according to the first position information and the first size information to obtain a plurality of feature blocks corresponding to the coding units; and establishing, through a self-attention mechanism of a Transformer module, connections between the plurality of feature blocks to obtain a first output image corresponding to the original image.
  • an embodiment of the present application provides an image processing device, including a division module and a Transformer module. The division module is configured to obtain an image to be processed, obtained by decoding an original image, to obtain coding unit division information of the original image during encoding, the coding unit division information including first position information and first size information of each coding unit, and to divide the image to be processed according to the first position information and the first size information to obtain a plurality of feature blocks corresponding to the coding units. The Transformer module is configured to establish connections between the plurality of feature blocks through a self-attention mechanism to obtain a first output image corresponding to the original image.
  • an embodiment of the present application provides a model training method, the model including a Transformer module, and the method comprising: acquiring an image to be processed, the image to be processed being a training sample in a constructed training set and obtained by decoding an original image; acquiring coding unit division information of the original image during encoding, the coding unit division information including first position information and first size information of each coding unit; inputting the image to be processed and the coding unit division information into the model, and dividing the image to be processed according to the first position information and the first size information to obtain a plurality of feature blocks corresponding to the coding units; establishing, through the self-attention mechanism of the Transformer module, connections between the plurality of feature blocks to obtain a first output image corresponding to the original image; and training the model according to the first output image and an objective function to obtain a trained model.
  • an embodiment of the present application provides an image processing device, including at least one control processor and a memory communicatively connected to the at least one control processor; the memory stores instructions executable by the at least one control processor, and the instructions are executed by the at least one control processor so that the at least one control processor can execute the image processing method described in the first aspect above.
  • an embodiment of the present application provides a training device, including at least one control processor and a memory communicatively connected to the at least one control processor; the memory stores instructions executable by the at least one control processor, and the instructions are executed by the at least one control processor so that the at least one control processor can execute the model training method described in the third aspect above.
  • an embodiment of the present application provides a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to cause a computer to execute the image processing method described in the first aspect above or the model training method described in the third aspect above.
  • Fig. 1 is a flowchart of the steps of an image processing method provided by an embodiment of the present application;
  • Fig. 2 is a flowchart of the steps of an image processing method provided by another embodiment of the present application;
  • Fig. 3 is a flowchart of the steps of an image processing method provided by another embodiment of the present application;
  • Fig. 4 is a flowchart of the steps of an image processing method provided by another embodiment of the present application;
  • Fig. 5 is a flowchart of the steps of an image processing method provided by another embodiment of the present application;
  • Fig. 6 is a schematic structural diagram of an image processing device provided by another embodiment of the present application;
  • Fig. 7 is a schematic structural diagram of a Transformer module provided by another embodiment of the present application;
  • Fig. 8 is a flowchart of the steps of a model training method provided by another embodiment of the present application;
  • Fig. 9 is a flowchart of the steps of a model training method provided by another embodiment of the present application;
  • Fig. 10 is a flowchart of the steps of a model training method provided by another embodiment of the present application;
  • Fig. 11 is a flowchart of the steps of a model training method provided by another embodiment of the present application;
  • Fig. 12 is a schematic structural diagram of an image processing device provided by another embodiment of the present application;
  • Fig. 13 is a schematic structural diagram of a training device provided by another embodiment of the present application.
  • the embodiment of the first aspect of the present application provides an image processing method, including but not limited to step S110, step S120, step S130 and step S140:
  • Step S110 Acquire the image to be processed, which is obtained by decoding the original image;
  • in order to improve the transmission efficiency and storage reliability of image data, the original image generally needs to be encoded and decoded. Since encoding and decoding cause a certain loss to the original image, the image quality is affected and the user experience suffers, so it is necessary to enhance the picture quality of the image obtained after decoding.
  • the picture quality is related to the degree of loss of the picture: the greater the loss, the lower the picture quality correspondingly. It can be understood that the image to be processed may be a text image or a video image.
  • Step S120 Obtain coding unit division information of the original image during encoding, where the coding unit division information includes first position information and first size information of each coding unit;
  • the quadtree block partition structure based on coding units is adopted to divide the optimal coding units at minimum rate-distortion cost, and each coding unit is encoded separately. This flexibly adapts to the texture features of various images and significantly improves coding efficiency.
  • the divided coding units support different sizes. The advantage of this division is that, on the one hand, a larger coding unit can greatly improve the coding efficiency of a flat area, and on the other hand, a smaller coding unit can handle local image details well, which makes the prediction of complex images more accurate.
  • the first position information and the first size information of each coding unit can be obtained, so that the spatial information of each coding unit can be clearly determined.
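  • As a minimal illustration only (the patent gives no code, and all names here are hypothetical), the coding unit division information could be represented as one record per coding unit carrying its first position information and first size information:

```python
from dataclasses import dataclass

@dataclass
class CUInfo:
    x: int        # first position information: top-left column of the CU
    y: int        # first position information: top-left row of the CU
    width: int    # first size information
    height: int   # first size information

# e.g. a 64x32 region that the encoder split into one 32x32 CU and
# four 16x16 CUs (values are purely illustrative)
cu_list = [CUInfo(0, 0, 32, 32),
           CUInfo(32, 0, 16, 16), CUInfo(48, 0, 16, 16),
           CUInfo(32, 16, 16, 16), CUInfo(48, 16, 16, 16)]
```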
  • Step S130 Divide the image to be processed according to the first position information and the first size information to obtain a plurality of feature blocks corresponding to the coding units;
  • in some cases, a convolutional neural network is used to enhance the quality of the image. Convolution extracts features through local receptive fields, a method that ignores the correlation between blocks to a certain extent; moreover, convolutional neural networks generally enhance image quality based on fixed feature blocks, so the final enhanced and optimized image is not as good as expected.
  • in this embodiment of the present application, the local features of the image to be processed are extracted in combination with the coding unit division information, that is, according to the first position information and the first size information, so as to obtain a plurality of feature blocks corresponding to the coding units.
  • the division of the feature blocks is determined according to the coding unit division information, so that each feature block corresponds to a coding unit, and the correlation between feature blocks obtained by combining position and size information is stronger.
  • Step S140 Establish the connections between the multiple feature blocks through the self-attention mechanism of the Transformer module, and obtain a first output image corresponding to the original image.
  • the self-attention mechanism of the Transformer module, used in natural language processing tasks, can effectively overcome the limitations of the convolutional inductive bias and take more of the global information of the input into account. Therefore, in order to learn and infer non-local components, the embodiment of the present application establishes the connections between the multiple feature blocks through the Transformer module. Since the feature blocks are divided according to the coding unit division information, the self-attention mechanism of the Transformer module can capture the long-distance dependencies among the coding units used in encoding and learn the correlation between different feature blocks to establish global information, so that the established global information better conforms to the rules of encoding, thereby greatly reducing the differences between adjacent blocks and making the transitions between blocks smoother.
  • in the embodiment of the present application, the coding unit division information includes the first position information and the first size information of each coding unit, and the image to be processed is divided into multiple feature blocks according to the coding unit division information, making full use of the local coding information so that the divided feature blocks correspond to the coding units. The self-attention mechanism of the Transformer module is then used to establish the connections between the feature blocks, that is, to establish global information. Through the interaction between local information and global information, the differences between adjacent feature blocks can be better removed and the transitions between blocks made smoother, thereby better enhancing the picture quality of the encoded and decoded image.
  • in step S130, dividing the image to be processed according to the first position information and the first size information to obtain a plurality of feature blocks corresponding to the coding units includes:
  • dividing the image to be processed into a plurality of feature blocks according to the first position information and the first size information, so that the positions and sizes of the feature blocks are the same as those of the coding units into which the original image was divided during encoding.
  • since the coding unit division information includes the first position information and the first size information of each coding unit, the image to be processed is divided according to the first position information and the first size information, effectively using the local coding information to obtain a plurality of feature blocks corresponding to the coding units into which the original image was divided during encoding. Each feature block corresponds one-to-one in position and size to a coding unit, that is, the divided feature blocks CU_1, CU_2, ..., CU_n are consistent with the encoding, and the correlation between these feature blocks, established through the self-attention mechanism in the Transformer, contains rich global information.
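  • A sketch of step S130 under the hypothetical CUInfo records above (not the patent's implementation): one feature block is cropped per coding unit so positions and sizes match the encoder's division.

```python
import numpy as np

def divide_into_feature_blocks(image: np.ndarray, cu_list) -> list:
    """Crop one feature block CU_i per coding unit; each block's position
    and size match the CU used when the original image was encoded."""
    return [image[cu.y:cu.y + cu.height, cu.x:cu.x + cu.width]
            for cu in cu_list]

# usage sketch: a decoded 32x64 luma frame divided along the CU grid
decoded = np.zeros((32, 64), dtype=np.float32)
blocks = divide_into_feature_blocks(decoded, cu_list)  # CU_1, ..., CU_n
```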
  • before the connections between the multiple feature blocks are established through the self-attention mechanism of the Transformer module in step S140, the method further includes, but is not limited to, step S210 and step S220:
  • Step S210 Flatten multiple feature blocks into multiple first feature data to obtain a first feature sequence, wherein the first feature data is represented by a one-dimensional vector;
  • Step S220 Input the first feature sequence to the Transformer module.
  • the divided feature blocks are two-dimensional vector data, and the multiple feature blocks need to be flattened into first feature data represented by one-dimensional vectors; the first feature sequence is composed of the multiple first feature data. This makes it convenient to input the first feature sequence into the Transformer module for image enhancement processing, where the correlation between the different one-dimensional data is learned through the Transformer's self-attention mechanism, thereby establishing global information.
  • each of CU_1, CU_2, ..., CU_n is flattened into the corresponding CU_f1, CU_f2, ..., CU_fn to obtain a one-dimensional data sequence [CU_f1, CU_f2, ..., CU_fn]; since the sizes of the divided feature blocks may differ, the lengths of the first feature data may also be inconsistent.
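  • A hedged sketch of this flattening (steps S210/S220); because CU sizes differ, the vectors CU_f1, ..., CU_fn have different lengths, so the original shapes are kept for later restoration:

```python
def flatten_blocks(blocks):
    """Flatten each 2-D feature block CU_i into a 1-D vector CU_fi,
    keeping the original shapes so the blocks can be restored later."""
    shapes = [b.shape for b in blocks]
    first_feature_sequence = [b.reshape(-1) for b in blocks]
    return first_feature_sequence, shapes
```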
  • in step S140, establishing the connections between the multiple feature blocks through the self-attention mechanism of the Transformer module includes but is not limited to step S310, step S320, step S330, step S340 and step S350:
  • Step S310 According to the first feature data and the first preset matrix, obtain a second feature sequence composed of a plurality of second feature data with the same length;
  • Step S320 Establish the correlation between the multiple second feature data through the self-attention mechanism of the Transformer module, and obtain a third feature sequence through residual connection and transformation processing, wherein the third feature sequence is composed of multiple third feature data;
  • Step S330 According to the third feature data and the second preset matrix, a fourth feature sequence composed of a plurality of fourth feature data is obtained, wherein the fourth feature data is represented by a one-dimensional vector;
  • Step S340 Restore the fourth feature data into feature blocks represented by two-dimensional vectors;
  • Step S350 Obtain the first output image according to the plurality of feature blocks.
  • when the first feature sequence [CU_f1, CU_f2, ..., CU_fn] is input to the Transformer module, since the lengths of the first feature data may be inconsistent, the multiple first feature data are all converted to the same length. Each first feature data is expressed in the form of a row matrix and multiplied by the corresponding first preset matrix; the number of rows of the first preset matrix equals the number of columns of the first feature data, and the number of columns of the first preset matrix is a preset length, so that a plurality of second feature data of the same length are obtained, forming the second feature sequence.
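  • A hedged sketch of this length-unification step: each first feature vector (a 1 × len_fi row matrix) is multiplied by a learned len_fi × d_model first preset matrix. The common length d_model and the learnable matrices are assumptions for illustration, not values from the patent:

```python
import torch
import torch.nn as nn

class FirstPresetProjection(nn.Module):
    """Maps each variable-length first feature vector (1 x len_fi) to a
    fixed-length second feature vector (1 x d_model): the rows of each
    preset matrix match the columns of the feature data, and its columns
    equal the preset length d_model."""
    def __init__(self, lengths, d_model=256):
        super().__init__()
        self.proj = nn.ParameterList(
            [nn.Parameter(torch.randn(L, d_model) / L ** 0.5) for L in lengths])

    def forward(self, first_feature_sequence):
        rows = [f.unsqueeze(0) @ W
                for f, W in zip(first_feature_sequence, self.proj)]
        return torch.cat(rows, dim=0)  # second feature sequence: n x d_model
```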
  • n is the number of divided feature blocks, which is generated optimally according to different coding objects.
  • the second preset matrices are a series of matrices of sizes d_model × len_f1, d_model × len_f2, ..., d_model × len_fn, so that a plurality of fourth feature data with inconsistent lengths are obtained, restored to the original lengths, and form a one-dimensional data sequence [CU_p1, CU_p2, ..., CU_pn], that is, the fourth feature sequence.
  • the feature blocks represented by two-dimensional vectors are then restored, that is, restored to their original sizes.
  • multiple feature blocks are combined into a complete image, so that the obtained first output image FM_p has the same size as the original image.
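  • The inverse path (steps S330–S350) might be sketched as follows, again under the assumptions above: the fourth feature vectors CU_p1, ..., CU_pn are reshaped back into 2-D blocks and written to their recorded CU positions so that FM_p matches the original image size.

```python
import numpy as np

def restore_and_stitch(fourth_feature_sequence, shapes, cu_list, out_shape):
    """Reshape each 1-D fourth feature vector CU_pi back into its 2-D
    block and place it at its CU position, yielding the first output
    image FM_p with the same size as the original image."""
    out = np.zeros(out_shape, dtype=np.float32)
    for vec, shape, cu in zip(fourth_feature_sequence, shapes, cu_list):
        out[cu.y:cu.y + cu.height,
            cu.x:cu.x + cu.width] = np.asarray(vec).reshape(shape)
    return out
```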
  • in step S320, establishing the correlation between the multiple second feature data through the self-attention mechanism of the Transformer module includes but is not limited to step S410 and step S420:
  • Step S410 Obtain the second position information and second size information of each feature block during division;
  • Step S420 According to the second position information and the second size information, establish the correlation between multiple second feature data through the self-attention mechanism of the Transformer module.
  • the second position information and the second size information correspond to the first position information and the first size information respectively. By obtaining the second position information and the second size information of each feature block, the spatial information of the feature block can be clearly determined; the correlation between the multiple second feature data is established through the self-attention mechanism of the Transformer module, and combining the second position information and the second size information facilitates information interaction between adjacent feature blocks.
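  • The patent does not specify how the second position and size information enters the self-attention computation; one plausible sketch (an assumption, not the patent's method) adds a learned embedding of each block's (x, y, width, height) to its token before attention:

```python
import torch
import torch.nn as nn

class PosSizeEmbedding(nn.Module):
    """Injects each feature block's second position/size information by
    adding a learned embedding of (x, y, width, height) to its token."""
    def __init__(self, d_model=256):
        super().__init__()
        self.fc = nn.Linear(4, d_model)

    def forward(self, tokens, cu_list):
        geo = torch.tensor([[cu.x, cu.y, cu.width, cu.height]
                            for cu in cu_list], dtype=torch.float32)
        return tokens + self.fc(geo)  # tokens: n x d_model
```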
  • in step S350, obtaining the first output image according to the plurality of feature blocks includes:
  • the plurality of feature blocks are stitched into a first output image according to the second position information.
  • in this way, the position representation of the feature blocks in the two-dimensional space can be enhanced, which helps greatly improve the image processing efficiency.
  • in an embodiment, the details of the image are reconstructed through a Resblock convolutional network structure, thereby enhancing the useful information in the image.
  • the ResNet50 structure can be used to enhance the details of the first output image, improving the image quality and the visual effect of the image. It should be noted that other convolution structures may also be used, which are not specifically limited in this embodiment of the present application.
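  • As one hedged sketch of the Resblock structure mentioned here (the channel count and kernel size are illustrative assumptions, not taken from the patent):

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """Minimal residual block for the detail-enhancement stage: the skip
    connection lets the network refine details on top of the input."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1))

    def forward(self, x):
        return x + self.body(x)
```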
  • in an embodiment in which the image to be processed is a video image, the image processing method includes but is not limited to the following steps:
  • Step S510 Obtain the image to be processed obtained after decoding the original video image
  • Step S520 Obtain coding unit division information of the original image during encoding, the coding unit division information includes first position information and first size information of each coding unit;
  • Step S530 Divide the image to be processed according to the first position information and the first size information to obtain a plurality of feature blocks corresponding to the coding units;
  • Step S540 Flatten multiple feature blocks into corresponding first feature data to obtain a first feature sequence, wherein the first feature data is represented by a one-dimensional vector, and input the first feature sequence to the Transformer module;
  • Step S550 Multiply the first feature sequence by the plurality of corresponding first preset matrices to obtain a second feature sequence composed of a plurality of second feature data of the same length;
  • Step S560 Obtain the second position information and second size information of each feature block during division
  • Step S570 according to the second position information and the second size information, establish the correlation between multiple second feature data through the self-attention mechanism of the Transformer module;
  • Step S580 Obtain a third feature sequence through residual connection and normalization followed by nonlinear transformation processing, wherein the third feature sequence is composed of a plurality of third feature data;
  • Step S590 Multiply the third feature sequence by the plurality of corresponding second preset matrices to obtain a fourth feature sequence composed of a plurality of fourth feature data, wherein the fourth feature data are represented by one-dimensional vectors;
  • Step S5100 Restore the fourth feature data into a feature block represented by a two-dimensional vector
  • Step S5110 splicing a plurality of feature blocks into a first output image according to the second position information
  • Step S5120 Perform detail enhancement processing on the first output image to obtain a second output image.
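  • Tying steps S510–S5120 together, a highly simplified forward pass might be wired as below. Every module name is one of the hypothetical sketches from this section (nothing here is the patent's actual implementation), and the second preset matrices are passed in as `unembed`, one d_model × len_fi matrix per block:

```python
import torch

def enhance_frame(decoded_frame, cu_list, embed, pos_embed, encoder,
                  unembed, reconstruct):
    """decoded_frame: H x W float tensor; returns the second output image."""
    blocks = [decoded_frame[cu.y:cu.y + cu.height, cu.x:cu.x + cu.width]
              for cu in cu_list]                              # S530
    flat = [b.reshape(-1) for b in blocks]                    # S540
    tokens = embed(flat)                                      # S550: n x d_model
    tokens = pos_embed(tokens, cu_list)                       # S560
    tokens = encoder(tokens.unsqueeze(0)).squeeze(0)          # S570/S580
    out = torch.zeros_like(decoded_frame)
    for i, cu in enumerate(cu_list):                          # S590-S5110
        vec = tokens[i] @ unembed[i]   # second preset matrix: d_model x len_fi
        out[cu.y:cu.y + cu.height,
            cu.x:cu.x + cu.width] = vec.reshape(blocks[i].shape)
    return reconstruct(out)                                   # S5120
```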
  • the embodiment of the second aspect of the present application provides an image processing device, including a division module 110 and a Transformer module 130.
  • the division module 110 is configured to obtain the image to be processed, obtained after the original image is decoded, and to obtain the coding unit division information of the original image during encoding, the coding unit division information including the first position information and the first size information of each coding unit, and to divide the image to be processed according to the first position information and the first size information to obtain multiple feature blocks corresponding to the coding units;
  • the Transformer module 130 is configured to establish a connection between multiple feature blocks through a self-attention mechanism to obtain a first output image corresponding to the original image.
  • in the embodiment of the present application, the division module 110 obtains the coding unit division information of the original image during encoding. The coding unit division information includes the first position information and the first size information of each coding unit, and the division module divides the image to be processed into multiple feature blocks according to this information, making full use of the local coding information so that the divided feature blocks correspond to the coding units. The self-attention mechanism of the Transformer module 130 is then used to establish the connections between the feature blocks, that is, to establish global information. Through the interaction of local information and global information, the differences between adjacent feature blocks can be better removed and the transitions between blocks made smoother, so as to better enhance the picture quality of the image processed by encoding and decoding.
  • in an embodiment, dividing the image to be processed according to the first position information and the first size information to obtain a plurality of feature blocks corresponding to the coding units includes:
  • dividing the image to be processed into a plurality of feature blocks according to the first position information and the first size information, so that the positions and sizes of the feature blocks are the same as those of the coding units into which the original image was divided during encoding.
  • in an embodiment, a linear mapping module 120 is also included; the linear mapping module 120 is configured to flatten the plurality of feature blocks into a plurality of first feature data to obtain a first feature sequence and input it to the Transformer module 130, wherein the first feature data are represented by one-dimensional vectors.
  • in an embodiment, a reconstruction module 140 is also included; the reconstruction module 140 is configured to perform detail enhancement processing on the first output image to obtain a second output image.
  • the division module 110 divides the input image to be processed according to the coding unit division information to obtain a plurality of feature blocks CU_1, CU_2, ..., CU_n. The linear mapping module 120 then flattens each of CU_1, CU_2, ..., CU_n into the corresponding one-dimensional data sequence [CU_f1, CU_f2, ..., CU_fn], and this one-dimensional data sequence, that is, the first feature sequence, is input into the Transformer module 130. The correlation between the different one-dimensional data is learned through the self-attention mechanism of the Transformer module 130 to obtain the first output image corresponding to the original image, after which the reconstruction module 140 performs detail enhancement processing on the first output image to obtain a second output image.
  • the Transformer module 130 comprises an embedding layer (Embedding) 131, a plurality of encoding blocks (Encoder) 132 and a stitching layer (Jigsaw Puzzle) 133; the encoding blocks 132 are stacked on top of each other, with N representing the number of stacks.
  • the encoding block 132 includes, sequentially adjacent, a self-attention mechanism layer (Self-attention), a summation and normalization layer (Add&Norm), a feed-forward network layer (Feed-forward) and another summation and normalization layer.
  • the embedding layer 131 is used to obtain, according to the first feature data and the first preset matrices, a second feature sequence composed of a plurality of second feature data of the same length. The self-attention mechanism layer is used to establish the correlation among the multiple second feature data; the output data of the self-attention mechanism layer is processed through the summation and normalization layer, and the third feature sequence is obtained through the nonlinear transformation of the feed-forward network layer, wherein the third feature sequence is composed of multiple third feature data; the third feature sequence is then input into the second summation and normalization layer for processing, and finally the output of the encoding block 132 is input to the stitching layer 133. The stitching layer 133 is used to obtain, according to the third feature data and the second preset matrices, a fourth feature sequence composed of a plurality of fourth feature data, wherein the fourth feature data are represented by one-dimensional vectors, to restore the fourth feature data into feature blocks represented by two-dimensional vectors, and to obtain a first output image according to the multiple feature blocks.
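  • The encoding block 132 as described (self-attention, add & norm, feed-forward, add & norm) corresponds to a standard Transformer encoder layer; a sketch follows, with the head count and layer widths as assumptions:

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Self-attention -> add & norm -> feed-forward -> add & norm,
    mirroring the described encoding block 132; N such blocks are
    stacked. Head count and widths are illustrative assumptions."""
    def __init__(self, d_model=256, n_heads=8, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                 # x: batch x n x d_model
        a, _ = self.attn(x, x, x)         # correlation between CU tokens
        x = self.norm1(x + a)             # residual connection + normalization
        return self.norm2(x + self.ffn(x))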
  • in an embodiment, establishing the correlation between the multiple second feature data includes: obtaining the second position information and second size information of each feature block during division, and establishing the correlation between the multiple second feature data through the self-attention mechanism of the Transformer module 130 according to the second position information and the second size information.
  • the second position information and the second size information correspond to the first position information and the first size information respectively, and by obtaining the second position information and the second size information of each feature block, the spatial information of the feature block can be clearly determined.
  • obtaining the first output image according to the multiple feature blocks includes: stitching the multiple feature blocks into the first output image according to the second position information.
  • the above image processing device can be deployed in image processing equipment, which may be a mobile terminal such as a smart phone, a tablet computer or a camera, or a device capable of processing image data such as a desktop computer, a robot or a server.
  • the embodiment of the third aspect of the present application provides a model training method, the model including a Transformer module; the model training method includes but is not limited to step S610, step S620, step S630, step S640 and step S650:
  • Step S610 Obtain an image to be processed, which is a training sample in the constructed training set, wherein the image to be processed is obtained by decoding the original image;
  • Step S620 Obtain coding unit division information of the original image during encoding, where the coding unit division information includes first position information and first size information of each coding unit;
  • Step S630 Input the image to be processed and the coding unit division information into the model, divide the image to be processed according to the first position information and the first size information, and obtain a plurality of feature blocks corresponding to the coding units;
  • Step S640 Establish the connections between the multiple feature blocks through the self-attention mechanism of the Transformer module, and obtain the first output image corresponding to the original image;
  • Step S650 Train the model according to the first output image, the original image and the objective function to obtain a trained model.
  • in the embodiment of the present application, the coding unit division information includes the first position information and the first size information of each coding unit, and the image to be processed is divided into multiple feature blocks according to the coding unit division information; that is, training blocks are extracted in combination with the coding unit division information, making full use of the local coding information so that the divided feature blocks correspond to the coding units. The self-attention mechanism of the Transformer module is then used to establish the connections between the feature blocks, that is, to establish global information, so as to obtain the first output image of the training sample, and the model is trained according to the first output image and the objective function to obtain the trained model. Through the interaction of local information and global information, the differences between adjacent feature blocks can be better removed and the transitions between blocks made smoother, so that the trained model can better enhance the picture quality of images.
  • training continues until the objective function converges, so that the first output image output by the model is as close as possible to the target image, continuously improving the model's ability to generate the target image.
  • corresponding training sets and objective functions can be designed to train the model, so as to obtain models suitable for different image enhancement tasks. For example, the model can be trained on a training set composed of low-resolution image samples and corresponding high-resolution image samples to obtain an image enhancement model applicable to super-resolution tasks, or on a training set composed of blurred image samples and corresponding clear image samples to obtain an image enhancement model applicable to deblurring tasks.
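  • A hedged training-loop sketch for this step: the patent only speaks of "an objective function", so the L1 loss and Adam optimizer below are illustrative assumptions, and `loader` is a hypothetical iterator over (decoded image, CU division info, target image) triples:

```python
import torch
import torch.nn as nn

def train_model(model, loader, epochs=100, lr=1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.L1Loss()  # objective function: an assumption for illustration
    for _ in range(epochs):
        for degraded, cu_info, target in loader:
            output = model(degraded, cu_info)  # first/second output image
            loss = loss_fn(output, target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```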
  • the trained model can be deployed on devices, for example, on mobile terminals such as smartphones, laptops and cameras, or on devices capable of processing image data such as desktop computers, robots and servers.
  • in step S630, dividing the image to be processed according to the first position information and the first size information to obtain a plurality of feature blocks corresponding to the coding units includes:
  • dividing the image to be processed into a plurality of feature blocks according to the first position information and the first size information, so that the positions and sizes of the feature blocks are the same as those of the coding units into which the original image was divided during encoding.
  • the image to be processed is divided according to the first position information and the first size information, effectively using the local coding information to obtain a plurality of feature blocks corresponding to the coding units into which the original image was divided during encoding. The positions of the feature blocks correspond one-to-one to the coding units and their sizes are consistent; that is, the divided feature blocks CU_1, CU_2, ..., CU_n are consistent with the encoding, and the correlation between these feature blocks, established through the self-attention mechanism in the Transformer, contains rich global information.
  • in an embodiment, in step S650, training the model according to the first output image and the objective function to obtain the trained model includes: performing detail enhancement processing on the first output image to obtain a second output image, and training the model according to the second output image and the objective function to obtain the trained model.
  • in another embodiment, the model training method includes but is not limited to the following steps:
  • Step S710 Obtain an image to be processed, which is a training sample in the constructed training set, wherein the image to be processed is obtained by decoding the original image;
  • Step S720 Obtain coding unit division information of the original image during encoding, where the coding unit division information includes first position information and first size information of each coding unit;
  • Step S730 Input the image to be processed and the coding unit division information into the model, divide the image to be processed according to the first position information and the first size information, and obtain a plurality of feature blocks corresponding to the coding units;
  • Step S740 Flatten multiple feature blocks into multiple first feature data to obtain a first feature sequence, wherein the first feature data is represented by a one-dimensional vector, and input the first feature sequence to the Transformer module;
  • Step S750 Establish the connections between the multiple feature blocks through the self-attention mechanism of the Transformer module to obtain a first output image corresponding to the original image;
  • Step S760 Perform detail enhancement processing on the first output image to obtain a second output image;
  • Step S770 Train the model according to the second output image and the objective function to obtain a trained model.
  • in step S640, establishing the connections between the multiple feature blocks through the self-attention mechanism of the Transformer module includes the following steps:
  • according to the first feature data and the first preset matrix, a second feature sequence consisting of a plurality of second feature data with the same length is obtained;
  • the correlation between the multiple second feature data is established through the self-attention mechanism of the Transformer module, and a third feature sequence composed of multiple third feature data is obtained through residual connection and transformation processing;
  • according to the third feature data and the second preset matrix, a fourth feature sequence composed of a plurality of fourth feature data is obtained, wherein the fourth feature data are represented by one-dimensional vectors;
  • the fourth feature data are restored into feature blocks represented by two-dimensional vectors, and a first output image is obtained according to the plurality of feature blocks.
  • for the specific implementation of establishing the connections between the multiple feature blocks through the self-attention mechanism of the Transformer module in step S640 and the corresponding technical effects, reference may be made to the implementation corresponding to Fig. 3 in the image processing method described above.
  • in another embodiment, the model training method includes but is not limited to the following steps:
  • Step S810 Obtain an image to be processed, which is a training sample in the constructed training set, wherein the image to be processed is obtained by decoding the original image;
  • Step S820 Obtain coding unit division information of the original image during encoding, where the coding unit division information includes first position information and first size information of each coding unit;
  • Step S830 Input the image to be processed and the coding unit division information into the model, divide the image to be processed according to the first position information and the first size information, and obtain a plurality of feature blocks corresponding to the coding units;
  • Step S840 Flatten multiple feature blocks into multiple first feature data to obtain a first feature sequence, wherein the first feature data is represented by a one-dimensional vector, and input the first feature sequence to the Transformer module;
  • Step S850 According to the first feature data and the first preset matrix, obtain a second feature sequence composed of a plurality of second feature data with the same length;
  • Step S860 Establish the correlation between the multiple second feature data through the self-attention mechanism of the Transformer module, and obtain a third feature sequence through residual connection and transformation processing, wherein the third feature sequence is composed of multiple third feature data;
  • Step S870 According to the third feature data and the second preset matrix, a fourth feature sequence composed of a plurality of fourth feature data is obtained, wherein the fourth feature data is represented by a one-dimensional vector;
  • Step S880 Restore the fourth feature data into feature blocks represented by two-dimensional vectors;
  • Step S890 Obtain the first output image according to the plurality of feature blocks;
  • Step S8100 Perform detail enhancement processing on the first output image to obtain a second output image;
  • Step S8110 Train the model according to the second output image and the objective function to obtain a trained model.
  • in step S860, establishing the correlation between the plurality of second feature data through the self-attention mechanism of the Transformer module includes the following steps:
  • the second position information and the second size information of each feature block during division are obtained, and according to the second position information and the second size information, the correlation between the multiple second feature data is established through the self-attention mechanism of the Transformer module.
  • in step S890, obtaining the first output image according to the plurality of feature blocks includes:
  • the plurality of feature blocks are stitched into a first output image according to the second position information.
  • the second position information and the second size information correspond to the first position information and the first size information respectively, and by obtaining the second position information and the second size information of each feature block, the spatial information of the feature block can be clearly determined.
  • in this way, the position representation of the feature blocks in the two-dimensional space can be enhanced, which helps greatly improve the image processing efficiency.
  • Step S910 Determine, according to the preset standard, whether the trained model meets the standard, and obtain a test result;
  • Step S920 If the test result meets the standard, save the parameters of the model and complete the training;
  • Step S930 If the test result does not meet the standard, continue to train the model.
  • the preset standard can be used to determine whether the trained model is up to standard; the test results provide effective reference data, and whether the model is up to standard can be judged based on network performance.
  • the preset standard can be subjective quality or objective indicators; for example, objective indicators such as Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) can be used. If the test result does not meet the standard, training continues; if the test result meets the standard, the trained model parameters are saved, and image quality enhancement can then be performed directly with this model.
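  • For the objective indicators named here, PSNR can be computed directly from the mean squared error; the pass threshold below is an illustrative assumption, not a value from the patent:

```python
import numpy as np

def psnr(out: np.ndarray, ref: np.ndarray, peak: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in dB between output and reference."""
    mse = np.mean((out.astype(np.float64) - ref.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

# e.g. save the trained parameters only once the test result meets the
# preset standard, such as: psnr(test_output, ground_truth) >= 38.0
```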
  • the embodiment of the fourth aspect of the present application provides an image processing device, which includes: a memory 1210, a control processor 1220, and a computer program stored in the memory 1210 and operable on the control processor 1220 .
  • control processor 1220 and the memory 1210 may be connected through a bus or in other ways.
  • the non-transitory software programs and instructions required to realize the image processing method of the above embodiments are stored in the memory 1210; when executed by the control processor 1220, the image processing method in the above embodiments is performed, for example, method steps S110 to S140 in Fig. 1, method steps S210 and S220 in Fig. 2, method steps S310 to S350 in Fig. 3, method steps S410 and S420 in Fig. 4, and method steps S510 to S5120 in Fig. 5 described above.
  • the embodiment of the fifth aspect of the present application provides a training device, which includes: a memory 1310, a control processor 1320, and a computer program stored on the memory 1310 and operable on the control processor 1320 .
  • control processor 1320 and the memory 1310 may be connected through a bus or in other ways.
  • the non-transitory software programs and instructions required to realize the model training method of the above embodiments are stored in the memory 1310; when executed by the control processor 1320, the model training method in the above embodiments is performed, for example, method steps S610 to S650 in Fig. 8, method steps S710 to S770 in Fig. 9, method steps S810 to S8110 in Fig. 10, and method steps S910 to S930 in Fig. 11 described above.
  • the device embodiments described above are only illustrative, and the units described as separate components may or may not be physically separated; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • the embodiment of the sixth aspect of the present application provides a computer-readable storage medium storing computer-executable instructions, and the computer-executable instructions can be used to cause a computer to execute the image processing method of the first aspect above or the model training method of the third aspect above, for example, to execute method steps S110 to S140 in Fig. 1, method steps S210 and S220 in Fig. 2, method steps S310 to S350 in Fig. 3, method steps S410 and S420 in Fig. 4, and method steps S510 to S5120 in Fig. 5 described above, or to execute method steps S610 to S650 in Fig. 8, method steps S710 to S770 in Fig. 9, method steps S810 to S8110 in Fig. 10, and method steps S910 to S930 in Fig. 11.
  • Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • communication media typically embody computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

An image processing method, an image processing apparatus, and a model training method. The image processing method comprises: obtaining an image to be processed, said image being obtained by decoding an original image (S110); obtaining encoding unit division information of the original image during encoding, the encoding unit division information comprising first position information and first size information of each encoding unit (S120); dividing said image according to the first position information and the first size information to obtain a plurality of feature blocks corresponding to the encoding unit (S130); and establishing a relationship between the plurality of feature blocks by means of a self-attention mechanism of a Transformer module to obtain a first output image corresponding to the original image (S140).

Description

Image processing method, image processing device, and model training method
Cross-Reference to Related Applications
This application is based on, and claims priority to, the Chinese patent application with application number 202111144470.1 filed on September 28, 2021, the entire content of which is incorporated herein by reference.
Technical Field
The present application relates to the technical field of image processing, and in particular to an image processing method, an image processing device, and a model training method.
Background
With the continuous development of technology, the demand for image quality keeps rising. When the amount of data is too large, factors such as network bandwidth or storage space make transmission or storage difficult; uncompressed digital video, for example, involves enormous amounts of data. The original data therefore needs to be encoded and compressed during transmission or storage to remove spatial and temporal redundancy; the compressed data is transmitted from the encoding end to the decoding end through the transmission system, where decoding restores the original data. To improve coding quality, an image usually adopts a quadtree block partition structure based on coding units (Coding Unit, CU): the optimal coding units are divided at minimum rate-distortion cost and each coding unit is encoded separately, i.e., block-based coding is used. As the bit rate decreases, quantization becomes coarse and discontinuities appear at block boundaries, forming obvious defects in the reconstructed image known as blocking artifacts. Because of these visible differences between adjacent blocks, the original video is distorted after encoding and decoding, resulting in a poor user experience.
In some cases, image quality enhancement algorithms such as histogram equalization or gamma correction can optimize the codec results, but such algorithms mainly rely on manually summarized experience and characteristics of the human eye, are largely constrained by the image scene, and thus have limited ability to improve image quality. Convolutional neural networks in deep learning have also been used for quality enhancement, but convolution extracts features through local receptive fields, which to a certain extent ignores the correlation between blocks, so the image quality is still difficult to guarantee.
Summary of the Invention
In view of this, the present application proposes an image processing method, an image processing device, a model training method, a training device, and a computer-readable storage medium.
In a first aspect, an embodiment of the present application provides an image processing method, the method comprising: acquiring an image to be processed, the image to be processed being obtained by decoding an original image; acquiring coding unit division information of the original image during encoding, the coding unit division information including first position information and first size information of each coding unit; dividing the image to be processed according to the first position information and the first size information to obtain a plurality of feature blocks corresponding to the coding units; and establishing, through a self-attention mechanism of a Transformer module, connections between the plurality of feature blocks to obtain a first output image corresponding to the original image.
In a second aspect, an embodiment of the present application provides an image processing device, including a division module and a Transformer module. The division module is configured to obtain an image to be processed, obtained by decoding an original image, to obtain coding unit division information of the original image during encoding, the coding unit division information including first position information and first size information of each coding unit, and to divide the image to be processed according to the first position information and the first size information to obtain a plurality of feature blocks corresponding to the coding units. The Transformer module is configured to establish connections between the plurality of feature blocks through a self-attention mechanism to obtain a first output image corresponding to the original image.
In a third aspect, an embodiment of the present application provides a model training method, the model including a Transformer module, and the method comprising: acquiring an image to be processed, the image to be processed being a training sample in a constructed training set and obtained by decoding an original image; acquiring coding unit division information of the original image during encoding, the coding unit division information including first position information and first size information of each coding unit; inputting the image to be processed and the coding unit division information into the model, and dividing the image to be processed according to the first position information and the first size information to obtain a plurality of feature blocks corresponding to the coding units; establishing, through the self-attention mechanism of the Transformer module, connections between the plurality of feature blocks to obtain a first output image corresponding to the original image; and training the model according to the first output image and an objective function to obtain a trained model.
In a fourth aspect, an embodiment of the present application provides an image processing device, including at least one control processor and a memory communicatively connected to the at least one control processor; the memory stores instructions executable by the at least one control processor, and the instructions are executed by the at least one control processor so that the at least one control processor can execute the image processing method described in the first aspect above.
In a fifth aspect, an embodiment of the present application provides a training device, including at least one control processor and a memory communicatively connected to the at least one control processor; the memory stores instructions executable by the at least one control processor, and the instructions are executed by the at least one control processor so that the at least one control processor can execute the model training method described in the third aspect above.
In a sixth aspect, an embodiment of the present application provides a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to cause a computer to execute the image processing method described in the first aspect above or the model training method described in the third aspect above.
Additional features and advantages of the present application will be set forth in the description that follows and will in part be apparent from the description or learned by practice of the application. The objectives and other advantages of the application are realized and attained by the structures particularly pointed out in the description, the claims, and the accompanying drawings.
Description of the drawings
The accompanying drawings provide a further understanding of the technical solution of the present application and constitute a part of the specification. Together with the embodiments of the present application, they serve to explain the technical solution and do not limit it.
The present application is further described below with reference to the accompanying drawings and embodiments.
Fig. 1 is a flowchart of the steps of an image processing method provided by an embodiment of the present application;
Fig. 2 is a flowchart of the steps of an image processing method provided by another embodiment of the present application;
Fig. 3 is a flowchart of the steps of an image processing method provided by another embodiment of the present application;
Fig. 4 is a flowchart of the steps of an image processing method provided by another embodiment of the present application;
Fig. 5 is a flowchart of the steps of an image processing method provided by another embodiment of the present application;
Fig. 6 is a schematic structural diagram of an image processing apparatus provided by another embodiment of the present application;
Fig. 7 is a schematic structural diagram of a Transformer module provided by another embodiment of the present application;
Fig. 8 is a flowchart of the steps of a model training method provided by another embodiment of the present application;
Fig. 9 is a flowchart of the steps of a model training method provided by another embodiment of the present application;
Fig. 10 is a flowchart of the steps of a model training method provided by another embodiment of the present application;
Fig. 11 is a flowchart of the steps of a model training method provided by another embodiment of the present application;
Fig. 12 is a schematic structural diagram of an image processing device provided by another embodiment of the present application;
Fig. 13 is a schematic structural diagram of a training device provided by another embodiment of the present application.
Detailed description
This part describes the embodiments of the present application in detail. Several embodiments of the application are shown in the accompanying drawings, whose purpose is to supplement the textual description graphically so that each technical feature and the overall technical solution can be understood intuitively and vividly; they are not to be construed as limiting the protection scope of the present application.
In the description of the present application, terms such as "first" and "second" serve only to distinguish technical features and are not to be understood as indicating or implying relative importance, the number of the indicated technical features, or their order. It should be noted that, although functional modules are divided in the device diagrams and a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed with a module division different from that in the device, or in an order different from that in the flowcharts.
In the description of the present application, unless otherwise explicitly defined, words such as "set", "install", and "connect" are to be understood broadly, and those skilled in the art can reasonably determine their specific meanings in the present application in light of the specific content of the technical solution.
The embodiments of the present application are further described below with reference to the accompanying drawings.
Referring to Fig. 1, an embodiment of the first aspect of the present application provides an image processing method including, but not limited to, steps S110 to S140:
Step S110: acquire an image to be processed, the image to be processed being obtained by decoding an original image.
It should be noted that, to improve the transmission efficiency and storage reliability of image data, an original image generally undergoes encoding and decoding. Since encoding and decoding cause a certain loss to the original image, picture quality is affected and the user experience suffers, so the image obtained after decoding needs quality enhancement. Picture quality is related to the degree of loss: if image information is lost during encoding, picture quality drops accordingly. It can be understood that the image to be processed may be a text image or a video image.
Step S120: acquire coding unit division information used when the original image was encoded, the coding unit division information including first position information and first size information of each coding unit.
It should be noted that, to better improve coding quality, a quadtree block partition structure based on coding units is adopted: optimal coding units are divided at minimum rate-distortion cost and each coding unit is encoded separately, which flexibly adapts to the texture features of various images and significantly improves coding efficiency. The divided coding units support different sizes; larger coding units greatly improve the coding efficiency of flat areas, while smaller coding units handle local image detail well, making the prediction of complex images more accurate. By acquiring the coding unit division information used when the original image was encoded, the first position information and first size information of each coding unit are obtained, so the spatial layout of the coding units is known exactly.
Step S130: divide the image to be processed according to the first position information and the first size information to obtain a plurality of feature blocks corresponding to the coding units.
Because the original image exhibits blocking artifacts after encoding and decoding, the picture quality of the output image is degraded. In some cases a convolutional neural network is used for quality enhancement, but convolution extracts features through local receptive fields, which to a certain extent ignores the correlation between blocks; such networks generally enhance quality on fixed feature blocks, so the final enhanced image falls short of expectations. In contrast, the embodiments of the present application extract local features of the image to be processed in combination with the coding unit division information, that is, according to the first position information and the first size information, thereby obtaining a plurality of feature blocks corresponding to the coding units. It can be understood that the way the feature blocks are divided is determined by the coding unit division information, so that each feature block corresponds to a coding unit, and feature blocks divided using position and size information are more strongly correlated.
Step S140: establish connections among the plurality of feature blocks through the self-attention mechanism of the Transformer module to obtain a first output image corresponding to the original image.
It should be noted that the self-attention mechanism of the Transformer module used in natural language processing tasks can effectively overcome the limitations of convolutional inductive bias and take global information into account. Therefore, to learn and reason about non-local components, the embodiments of the present application establish connections among the feature blocks through the Transformer module. Since the feature blocks are divided according to the coding unit division information, the self-attention mechanism can capture the long-range dependencies among coding units and learn the correlations between different feature blocks to build global information, so that the resulting global information better conforms to the rules used during encoding, greatly reducing the differences between adjacent blocks and making block-to-block transitions smoother.
According to the solution provided by the embodiments of the present application, the coding unit division information used when the original image was encoded is acquired, including the first position information and first size information of each coding unit; the image to be processed is divided into a plurality of feature blocks according to this information to make full use of the local coding information, with the divided feature blocks corresponding to the coding units; the self-attention mechanism of the Transformer module then establishes connections among the feature blocks, that is, builds global information. Through the interaction of local and global information, the differences between adjacent feature blocks can be better removed and block-to-block transitions become smoother, so the picture quality of the encoded and decoded image is better enhanced.
In the above image processing method, dividing the image to be processed according to the first position information and the first size information in step S130 to obtain a plurality of feature blocks corresponding to the coding units includes:
dividing the image to be processed into a plurality of feature blocks according to the first position information and the first size information, so that the position and size of each feature block are the same as those of a coding unit into which the original image was divided during encoding.
It should be noted that the coding unit division information includes the first position information and first size information of each coding unit. Dividing the image to be processed according to the first position information and the first size information makes effective use of the local coding information and yields a plurality of feature blocks corresponding to the coding units into which the original image was divided during encoding; each feature block corresponds one-to-one in position and matches in size with a coding unit, so the divided feature blocks CU_1, CU_2, ..., CU_n are consistent with those used during encoding. The correlations among these feature blocks are then established through the self-attention mechanism in the Transformer, which therefore contains rich global information.
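As a minimal sketch, the partition step can be written as follows, assuming the coding unit division information is available as a list of (x, y, w, h) tuples read from the bitstream; the function name and the data layout are illustrative assumptions, not part of the original disclosure:

```python
import torch

def partition_by_cu(image, cu_info):
    """Split a decoded image into feature blocks that mirror the encoder's
    CU partition. `image` is an (H, W) luma tensor; `cu_info` is a list of
    (x, y, w, h) tuples taken from the coding unit division information."""
    blocks = []
    for (x, y, w, h) in cu_info:
        # Each block takes exactly the position and size of one coding unit.
        blocks.append(image[y:y + h, x:x + w])
    return blocks
```

For example, a 64×64 area that the quadtree split into one 32×32 leaf and twelve 16×16 leaves yields thirteen blocks whose positions and sizes match the encoder's coding units exactly.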
As shown in Fig. 2, in the above image processing method, before the connections among the plurality of feature blocks are established through the self-attention mechanism of the Transformer module in step S140, the method further includes, but is not limited to, steps S210 and S220:
Step S210: flatten the plurality of feature blocks into a plurality of pieces of first feature data to obtain a first feature sequence, where the first feature data are represented as one-dimensional vectors.
Step S220: input the first feature sequence to the Transformer module.
It should be noted that the divided feature blocks are two-dimensional data; the feature blocks therefore need to be flattened into first feature data represented as one-dimensional vectors, and the pieces of first feature data form the first feature sequence, which is convenient to input to the Transformer module for image enhancement. The self-attention mechanism of the Transformer then learns the correlations between the different one-dimensional data and builds global information. Specifically, each of CU_1, CU_2, ..., CU_n is flattened into a corresponding CU_f1, CU_f2, ..., CU_fn to obtain the one-dimensional data sequence [CU_f1, CU_f2, ..., CU_fn]; since the divided feature blocks may differ in size, the first feature data may also differ in length.
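Continuing the sketch above, the flattening step is a simple reshape of each block tensor:

```python
def flatten_blocks(blocks):
    # Each 2-D block CU_i of size h×w becomes a 1-D vector CU_fi of length
    # h*w; blocks of different CU sizes give vectors of different lengths.
    return [block.reshape(-1) for block in blocks]
```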
As shown in Fig. 3, in the above image processing method, establishing the connections among the plurality of feature blocks through the self-attention mechanism of the Transformer module in step S140 includes, but is not limited to, steps S310 to S350:
Step S310: obtain, according to the first feature data and first preset matrices, a second feature sequence composed of a plurality of pieces of second feature data of the same length.
Step S320: establish correlations among the plurality of pieces of second feature data through the self-attention mechanism of the Transformer module, and obtain a third feature sequence through residual connection and transformation processing, where the third feature sequence is composed of a plurality of pieces of third feature data.
Step S330: obtain, according to the third feature data and second preset matrices, a fourth feature sequence composed of a plurality of pieces of fourth feature data, where the fourth feature data are represented as one-dimensional vectors.
Step S340: restore the fourth feature data into feature blocks represented as two-dimensional data.
Step S350: obtain the first output image from the plurality of feature blocks.
It should be noted that the first feature sequence [CU_f1, CU_f2, ..., CU_fn] is input to the Transformer module. Since the pieces of first feature data may differ in length, they are first converted to a common length: each piece of first feature data, expressed as a row vector, is multiplied by its corresponding first preset matrix, whose number of rows equals the length of that piece of first feature data and whose number of columns is a preset length, so that a plurality of pieces of second feature data of the same length are computed to form the second feature sequence. Specifically, the first preset matrices are a series of matrices of sizes len_f1×d_model, len_f2×d_model, ..., len_fn×d_model, where len_f1, len_f2, ..., len_fn correspond one-to-one to the lengths of CU_f1, CU_f2, ..., CU_fn, and d_model is a preset length that can be set according to actual needs; this embodiment takes d_model = 1024. Multiplying each piece of first feature data in [CU_f1, CU_f2, ..., CU_fn] by its corresponding first preset matrix unifies the lengths of the second feature data in the second feature sequence [CU_em_1, CU_em_2, ..., CU_em_n]. The self-attention mechanism of the Transformer module then performs information interaction among the different pieces of second feature data to obtain global information, and the third feature sequence [CU_en_1, CU_en_2, ..., CU_en_n] is obtained through residual connection and normalization steps followed by a nonlinear transformation. Note that n is the number of divided feature blocks, which is generated optimally for each encoded object.
To restore the third feature sequence to the original size of each feature block, the pieces of third feature data are first converted back to their original lengths by multiplying them with the corresponding second preset matrices, which are a series of matrices of sizes d_model×len_f1, d_model×len_f2, ..., d_model×len_fn. This yields a plurality of pieces of fourth feature data of differing (original) lengths, which form the one-dimensional data sequence [CU_p1, CU_p2, ..., CU_pn], that is, the fourth feature sequence. Then, using the property that a coding unit has size 2n×2n with n = 4, 8, 16, or 32, each piece is restored to a feature block represented as two-dimensional data, that is, to its original size. Finally, the feature blocks are assembled into a complete image, so the resulting first output image FM_p has the same size as the original image.
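Since the CU sizes come from the fixed 2n×2n family named above (flattened lengths 64, 256, 1024, and 4096 for 8×8 through 64×64 blocks), the first and second preset matrices can be held as one learnable pair per distinct length. The following is a hedged sketch of that idea; the class name and the choice of one shared matrix per length (rather than one per block) are assumptions for illustration:

```python
import torch
import torch.nn as nn

D_MODEL = 1024                    # the preset length d_model used in the text
CU_LENS = [64, 256, 1024, 4096]   # flattened lengths of 8x8 ... 64x64 CUs

class CuEmbedding(nn.Module):
    """One learnable first/second preset matrix pair per distinct CU length:
    len_fi x d_model to embed, d_model x len_fi to restore."""
    def __init__(self):
        super().__init__()
        self.embed = nn.ModuleDict(
            {str(l): nn.Linear(l, D_MODEL, bias=False) for l in CU_LENS})
        self.restore = nn.ModuleDict(
            {str(l): nn.Linear(D_MODEL, l, bias=False) for l in CU_LENS})

    def forward(self, flat_blocks):
        # CU_fi (length len_fi) -> CU_em_i (length d_model)
        return [self.embed[str(b.numel())](b) for b in flat_blocks]

    def invert(self, tokens, lengths):
        # CU_en_i (length d_model) -> CU_pi (original length len_fi)
        return [self.restore[str(l)](t) for t, l in zip(tokens, lengths)]
```

Using `bias=False` keeps each mapping a pure matrix multiplication, matching the description of multiplying the feature data by the preset matrices.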
As shown in Fig. 4, in the above image processing method, establishing the correlations among the plurality of pieces of second feature data through the self-attention mechanism of the Transformer module in step S320 includes, but is not limited to, steps S410 and S420:
Step S410: obtain second position information and second size information recorded when each feature block was divided.
Step S420: establish the correlations among the plurality of pieces of second feature data through the self-attention mechanism of the Transformer module according to the second position information and the second size information.
It should be noted that the second position information and second size information correspond to the first position information and first size information, respectively. Obtaining the second position information and second size information of each feature block makes the spatial layout of the feature blocks explicit; establishing the correlations among the pieces of second feature data through the self-attention mechanism of the Transformer module in combination with the second position information and second size information facilitates information interaction between adjacent feature blocks.
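The text does not specify how the position and size information enters the attention computation. One common way, shown here purely as an assumption, is an additive learned encoding of each block's (x, y, w, h) added to its token before self-attention:

```python
import torch
import torch.nn as nn

class CuPositionEncoding(nn.Module):
    """Projects each block's (x, y, w, h) into d_model dimensions and adds
    it to the block's token, so attention can see position and size."""
    def __init__(self, d_model=1024):
        super().__init__()
        self.proj = nn.Linear(4, d_model)

    def forward(self, tokens, cu_info):
        # tokens: (n, d_model); cu_info: list of n (x, y, w, h) tuples
        pos = torch.tensor(cu_info, dtype=tokens.dtype, device=tokens.device)
        return tokens + self.proj(pos)
```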
Obtaining the first output image from the plurality of feature blocks in step S350 includes:
stitching the plurality of feature blocks into the first output image according to the second position information.
Obtaining the second position information and stitching the feature blocks according to it strengthens the positional representation of the feature blocks in two-dimensional space, which helps greatly improve image processing efficiency.
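The stitching step is the inverse of the partition sketch given earlier: each restored two-dimensional block is written back at its recorded CU position.

```python
import torch

def stitch_blocks(blocks, cu_info, height, width):
    """Reassemble the first output image FM_p by writing each restored
    2-D block back at its recorded (x, y) position with size (w, h)."""
    out = torch.zeros(height, width)
    for block, (x, y, w, h) in zip(blocks, cu_info):
        out[y:y + h, x:x + w] = block
    return out
```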
The above image processing method further includes the following step:
performing detail enhancement processing on the first output image to obtain a second output image.
Image details are reconstructed through a Resblock convolutional network structure, enhancing the useful information in the image. Specifically, a ResNet50 structure may be used to perform detail enhancement on the first output image, improving image quality and the visual effect. It should be noted that other convolutional structures may also be used; the embodiments of the present application impose no specific limitation.
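The following is a minimal sketch of the Resblock idea, not the exact ResNet50 configuration (which uses bottleneck blocks); channel count and layer choices are illustrative assumptions:

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """A plain two-convolution residual block; the skip connection lets the
    block learn only the residual detail to add back to the image."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1))

    def forward(self, x):
        return x + self.body(x)
```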
As shown in Fig. 5, the technical solution of the present application is described below with a specific embodiment in which the image to be processed is a video image. The image processing method includes, but is not limited to, the following steps:
Step S510: acquire an image to be processed, obtained by decoding an original video image.
Step S520: acquire coding unit division information used when the original image was encoded, the coding unit division information including first position information and first size information of each coding unit.
Step S530: divide the image to be processed according to the first position information and the first size information to obtain a plurality of feature blocks corresponding to the coding units.
Step S540: flatten the plurality of feature blocks into corresponding first feature data to obtain a first feature sequence, where the first feature data are represented as one-dimensional vectors, and input the first feature sequence to the Transformer module.
Step S550: multiply the first feature sequence by the corresponding first preset matrices to obtain a second feature sequence composed of a plurality of pieces of second feature data of the same length.
Step S560: obtain second position information and second size information recorded when each feature block was divided.
Step S570: establish correlations among the plurality of pieces of second feature data through the self-attention mechanism of the Transformer module according to the second position information and the second size information.
Step S580: obtain a third feature sequence through residual connection and normalization steps followed by a nonlinear transformation, where the third feature sequence is composed of a plurality of pieces of third feature data.
Step S590: multiply the third feature sequence by the corresponding second preset matrices to obtain a fourth feature sequence composed of a plurality of pieces of fourth feature data, where the fourth feature data are represented as one-dimensional vectors.
Step S5100: restore the fourth feature data into feature blocks represented as two-dimensional data.
Step S5110: stitch the plurality of feature blocks into a first output image according to the second position information.
Step S5120: perform detail enhancement processing on the first output image to obtain a second output image.
Referring to Fig. 6, an embodiment of the second aspect of the present application provides an image processing apparatus including a division module 110 and a Transformer module 130. The division module 110 is configured to obtain an image to be processed, obtained by decoding an original image, and to obtain coding unit division information used when the original image was encoded, the coding unit division information including first position information and first size information of each coding unit, and to divide the image to be processed according to the first position information and the first size information to obtain a plurality of feature blocks corresponding to the coding units. The Transformer module 130 is configured to establish connections among the plurality of feature blocks through a self-attention mechanism to obtain a first output image corresponding to the original image.
According to the solution provided by the embodiments of the present application, the division module 110 obtains the coding unit division information used when the original image was encoded, including the first position information and first size information of each coding unit, and divides the image to be processed into a plurality of feature blocks according to this information to make full use of the local coding information, with the divided feature blocks corresponding to the coding units. The self-attention mechanism of the Transformer module 130 then establishes connections among the feature blocks, that is, builds global information. Through the interaction of local and global information, the differences between adjacent feature blocks can be better removed and block-to-block transitions become smoother, so the picture quality of the encoded and decoded image is better enhanced.
It should be noted that, for the specific implementations and corresponding technical effects of the image processing apparatus of this embodiment, reference may be made to the specific implementations and corresponding technical effects of the image processing method described above.
In the above image processing apparatus, dividing the image to be processed according to the first position information and the first size information to obtain a plurality of feature blocks corresponding to the coding units includes:
dividing the image to be processed into a plurality of feature blocks according to the first position information and the first size information, so that the position and size of each feature block are the same as those of a coding unit into which the original image was divided during encoding.
As shown in Figs. 6 and 7, the above image processing apparatus further includes a linear mapping module 120, which is configured to flatten the plurality of feature blocks into a plurality of pieces of first feature data to obtain a first feature sequence and input it to the Transformer module 130, where the first feature data are represented as one-dimensional vectors.
The above image processing apparatus further includes a reconstruction module 140, which is configured to perform detail enhancement processing on the first output image to obtain a second output image.
Exemplarily, the division module 110 divides the input image to be processed according to the coding unit division information to obtain a plurality of feature blocks CU_1, CU_2, ..., CU_n; the linear mapping module 120 then flattens each of CU_1, CU_2, ..., CU_n into the corresponding one-dimensional data sequence [CU_f1, CU_f2, ..., CU_fn], that is, the first feature sequence, which is input to the Transformer module 130. The self-attention mechanism of the Transformer module 130 learns the correlations between the different one-dimensional data to obtain the first output image corresponding to the original image, and the reconstruction module 140 then performs detail enhancement processing on the first output image to obtain the second output image.
As shown in Figs. 6 and 7, in the above image processing apparatus the Transformer module 130 includes an embedding layer (Embedding) 131, a plurality of encoder blocks (Encoder) 132, and a stitching layer (Jigsaw Puzzle) 133. The encoder blocks 132 are stacked on one another, N denoting the number of stacked blocks. Each encoder block 132 includes, in sequence, a self-attention layer (Self-attention), an add-and-normalize layer (Add&Norm), a feed-forward network layer (Feed-forward), and another add-and-normalize layer.
The embedding layer 131 is used to obtain, according to the first feature data and the first preset matrices, a second feature sequence composed of a plurality of pieces of second feature data of the same length. The self-attention layer is used to establish the correlations among the pieces of second feature data; its output is processed by the add-and-normalize layer and then nonlinearly transformed by the feed-forward network layer to obtain a third feature sequence composed of a plurality of pieces of third feature data, which is processed by the second add-and-normalize layer. Finally, the output of the encoder blocks 132 is input to the stitching layer 133, which obtains, according to the third feature data and the second preset matrices, a fourth feature sequence composed of a plurality of pieces of fourth feature data represented as one-dimensional vectors, restores the fourth feature data into feature blocks represented as two-dimensional data, and obtains the first output image FM_p from the plurality of feature blocks.
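A hedged sketch of one encoder block in this layer order follows; the head count and feed-forward width are illustrative assumptions, since only d_model = 1024 is fixed by the text:

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One of the N stacked encoder blocks 132: self-attention, add & norm,
    feed-forward, add & norm, in the order described above."""
    def __init__(self, d_model=1024, n_heads=8, d_ff=4096):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (batch, n_blocks, d_model) sequence of CU tokens
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)        # residual connection + Add&Norm
        return self.norm2(x + self.ff(x))   # Feed-forward + Add&Norm
```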
In the above Transformer module 130, establishing the correlations among the plurality of pieces of second feature data includes: obtaining the second position information and second size information recorded when each feature block was divided, and establishing the correlations among the pieces of second feature data through the self-attention mechanism of the Transformer module 130 according to the second position information and the second size information.
It should be noted that the second position information and second size information correspond to the first position information and first size information, respectively. Obtaining the second position information and second size information of each feature block makes the spatial layout of the feature blocks explicit; establishing the correlations among the pieces of second feature data through the self-attention mechanism of the Transformer module 130 in combination with the second position information and second size information facilitates information interaction between adjacent feature blocks.
In the above Transformer module 130, obtaining the first output image from the plurality of feature blocks includes stitching the plurality of feature blocks into the first output image according to the second position information. Obtaining the second position information and stitching the feature blocks according to it strengthens the positional representation of the feature blocks in two-dimensional space, which helps greatly improve image processing efficiency.
Exemplarily, with N = 8, the first feature sequence [CU_f1, CU_f2, ..., CU_fn] is input to the Transformer module 130. The learnable embedding layer 131 first converts all lengths to d_model, for example d_model = 1024; the embedding layer 131 holds a series of matrices of sizes len_f1×d_model, len_f2×d_model, ..., len_fn×d_model, where len_f1, len_f2, ..., len_fn correspond one-to-one to the lengths of CU_f1, CU_f2, ..., CU_fn, the number of rows of a first preset matrix equaling the length of the corresponding first feature data. Multiplying each of CU_f1, CU_f2, ..., CU_fn by its corresponding first preset matrix unifies the lengths of the second feature data in the second feature sequence [CU_em_1, CU_em_2, ..., CU_em_n]. The second feature sequence, combined with the second position information and second size information, is input to the subsequent encoder blocks 132; the self-attention layer performs information interaction among the different pieces of second feature data, the output is processed by the add-and-normalize layer and nonlinearly transformed by the feed-forward network layer to obtain the third feature sequence [CU_en_1, CU_en_2, ..., CU_en_n], and the data are processed again by the add-and-normalize layer. Finally, the output of the encoder blocks 132 is input to the stitching layer 133, which restores the third feature sequence to the original size of each feature block: it multiplies by the series of second preset matrices of sizes d_model×len_f1, d_model×len_f2, ..., d_model×len_fn to compute a plurality of pieces of fourth feature data of differing lengths, forming the one-dimensional data sequence [CU_p1, CU_p2, ..., CU_pn], that is, the fourth feature sequence; then, using the property that a coding unit has size 2n×2n with n = 4, 8, 16, or 32, each piece is restored to a feature block represented as two-dimensional data; and finally the feature blocks are assembled into a complete image, so the resulting first output image FM_p has the same size as the original image.
It should be noted that the above image processing apparatus may be deployed in an image processing device, which may be a mobile terminal such as a smartphone, tablet computer, or camera, or a device capable of processing image data such as a desktop computer, robot, or server.
Referring to Fig. 8, an embodiment of the third aspect of the present application provides a model training method. The model includes a Transformer module, and the model training method includes, but is not limited to, steps S610 to S650:
Step S610: acquire an image to be processed, the image to be processed being a training sample in a constructed training set and obtained by decoding an original image.
Step S620: acquire coding unit division information used when the original image was encoded, the coding unit division information including first position information and first size information of each coding unit.
Step S630: input the image to be processed and the coding unit division information into the model, and divide the image to be processed according to the first position information and the first size information to obtain a plurality of feature blocks corresponding to the coding units.
Step S640: establish connections among the plurality of feature blocks through the self-attention mechanism of the Transformer module to obtain a first output image corresponding to the original image.
Step S650: train the model according to the first output image, the original image, and an objective function to obtain a trained model.
According to the solution provided by the embodiments of the present application, the image to be processed, a training sample in the constructed training set, is acquired, along with the coding unit division information used when the original image was encoded, including the first position information and first size information of each coding unit. The image to be processed is divided into a plurality of feature blocks according to the coding unit division information, that is, the training blocks are extracted in combination with the coding unit division information to make full use of the local coding information, with the divided feature blocks corresponding to the coding units. The self-attention mechanism of the Transformer module then establishes connections among the feature blocks, that is, builds global information, producing the first output image for the training sample, and the model is trained according to the first output image and the objective function to obtain the trained model. Through the interaction of local and global information, the differences between adjacent feature blocks can be better removed, making block-to-block transitions smoother, so the trained model can better enhance picture quality.
It should be noted that the objective function is given by the following formula:
loss = ||I_recon − I_GT||_1, where I_recon is the first output image, I_GT is the ground-truth image, that is, the annotated target image, and || · ||_1 denotes the L1 norm.
During training, the objective function is driven to converge so that the first output image produced by the model is as close as possible to the target image, continually improving the model's ability to generate the target image.
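A direct rendering of this objective follows; the formula leaves the reduction over pixels unspecified, so the mean is used here as an assumption:

```python
import torch

def objective(recon, ground_truth):
    # loss = ||I_recon - I_GT||_1, taken as the mean absolute error
    # over all pixels of the reconstructed and ground-truth images.
    return torch.mean(torch.abs(recon - ground_truth))
```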
It should be noted that, for different kinds of image enhancement tasks, corresponding training sets and objective functions can be designed to train the model, yielding models suited to those tasks. For example, training the model on a set of low-resolution image samples paired with corresponding high-resolution image samples yields an image enhancement model for super-resolution, while training on a set of blurred image samples paired with corresponding sharp image samples yields an image enhancement model for deblurring.
It should be noted that the trained model may be deployed on a training device, for example a mobile terminal such as a smartphone, laptop computer, or camera, or a device capable of processing image data such as a desktop computer, robot, or server.
In the above model training method, dividing the image to be processed according to the first position information and the first size information in step S630 to obtain a plurality of feature blocks corresponding to the coding units includes:
dividing the image to be processed into a plurality of feature blocks according to the first position information and the first size information, so that the position and size of each feature block are the same as those of a coding unit into which the original image was divided during encoding.
It should be noted that dividing the image to be processed according to the first position information and the first size information makes effective use of the local coding information and yields a plurality of feature blocks corresponding to the coding units into which the original image was divided during encoding; each feature block corresponds one-to-one in position and matches in size with a coding unit, so the divided feature blocks CU_1, CU_2, ..., CU_n are consistent with those used during encoding. The correlations among these feature blocks are established through the self-attention mechanism in the Transformer, which therefore contains rich global information.
In the above model training method, before the connections among the plurality of feature blocks are established through the self-attention mechanism of the Transformer module in step S640, the method further includes the following steps:
flattening the plurality of feature blocks into a plurality of pieces of first feature data to obtain a first feature sequence, where the first feature data are represented as one-dimensional vectors;
inputting the first feature sequence to the Transformer module.
The above model training method further includes the following step:
performing detail enhancement processing on the first output image to obtain a second output image.
In the above model training method, training the model according to the first output image and the objective function in step S650 to obtain the trained model includes:
training the model according to the second output image and the objective function to obtain the trained model.
As shown in Fig. 9, the technical solution of the present application is described below with a specific embodiment. The model training method includes, but is not limited to, the following steps:
Step S710: acquire an image to be processed, the image to be processed being a training sample in a constructed training set and obtained by decoding an original image.
Step S720: acquire coding unit division information used when the original image was encoded, the coding unit division information including first position information and first size information of each coding unit.
Step S730: input the image to be processed and the coding unit division information into the model, and divide the image to be processed according to the first position information and the first size information to obtain a plurality of feature blocks corresponding to the coding units.
Step S740: flatten the plurality of feature blocks into a plurality of pieces of first feature data to obtain a first feature sequence, where the first feature data are represented as one-dimensional vectors, and input the first feature sequence to the Transformer module.
Step S750: establish connections among the plurality of feature blocks through the self-attention mechanism of the Transformer module to obtain a first output image corresponding to the original image.
Step S760: perform detail enhancement processing on the first output image to obtain a second output image.
Step S770: train the model according to the second output image and the objective function to obtain a trained model.
In the above model training method, establishing the connections among the plurality of feature blocks through the self-attention mechanism of the Transformer module in step S640 includes the following steps:
obtaining, according to the first feature data and first preset matrices, a second feature sequence composed of a plurality of pieces of second feature data of the same length;
establishing correlations among the plurality of pieces of second feature data through the self-attention mechanism of the Transformer module, and obtaining a third feature sequence through residual connection and transformation processing, where the third feature sequence is composed of a plurality of pieces of third feature data;
obtaining, according to the third feature data and second preset matrices, a fourth feature sequence composed of a plurality of pieces of fourth feature data, where the fourth feature data are represented as one-dimensional vectors;
restoring the fourth feature data into feature blocks represented as two-dimensional data;
obtaining the first output image from the plurality of feature blocks.
It should be noted that, for the specific implementation of establishing the connections among the plurality of feature blocks through the self-attention mechanism of the Transformer module in step S640 and the corresponding technical effects, reference may be made to the implementation corresponding to Fig. 3 in the image processing method described above and its technical effects.
As shown in Fig. 10, the technical solution of the present application is described below with a specific embodiment. The model training method includes, but is not limited to, the following steps:
Step S810: acquire an image to be processed, the image to be processed being a training sample in a constructed training set and obtained by decoding an original image.
Step S820: acquire coding unit division information used when the original image was encoded, the coding unit division information including first position information and first size information of each coding unit.
Step S830: input the image to be processed and the coding unit division information into the model, and divide the image to be processed according to the first position information and the first size information to obtain a plurality of feature blocks corresponding to the coding units.
Step S840: flatten the plurality of feature blocks into a plurality of pieces of first feature data to obtain a first feature sequence, where the first feature data are represented as one-dimensional vectors, and input the first feature sequence to the Transformer module.
Step S850: obtain, according to the first feature data and first preset matrices, a second feature sequence composed of a plurality of pieces of second feature data of the same length.
Step S860: establish correlations among the plurality of pieces of second feature data through the self-attention mechanism of the Transformer module, and obtain a third feature sequence through residual connection and transformation processing, where the third feature sequence is composed of a plurality of pieces of third feature data.
Step S870: obtain, according to the third feature data and second preset matrices, a fourth feature sequence composed of a plurality of pieces of fourth feature data, where the fourth feature data are represented as one-dimensional vectors.
Step S880: restore the fourth feature data into feature blocks represented as two-dimensional data.
Step S890: obtain a first output image from the plurality of feature blocks.
Step S8100: perform detail enhancement processing on the first output image to obtain a second output image.
Step S8110: train the model according to the second output image and the objective function to obtain a trained model.
In the above model training method, establishing the correlations among the plurality of pieces of second feature data through the self-attention mechanism of the Transformer module in step S860 includes the following steps:
obtaining second position information and second size information recorded when each feature block was divided;
establishing the correlations among the plurality of pieces of second feature data through the self-attention mechanism of the Transformer module according to the second position information and the second size information.
Obtaining the first output image from the plurality of feature blocks in step S890 includes:
stitching the plurality of feature blocks into the first output image according to the second position information.
It should be noted that the second position information and second size information correspond to the first position information and first size information, respectively. Obtaining the second position information and second size information of each feature block makes the spatial layout of the feature blocks explicit; establishing the correlations among the pieces of second feature data through the self-attention mechanism of the Transformer module in combination with the second position information and second size information facilitates information interaction between adjacent feature blocks. Obtaining the second position information and stitching the feature blocks according to it strengthens the positional representation of the feature blocks in two-dimensional space, which helps greatly improve image processing efficiency.
As shown in FIG. 11, the above model training method further includes the following steps:
Step S910: determine, according to a preset standard, whether the trained model meets the standard, and obtain a test result;
Step S920: if the test result meets the standard, save the parameters of the model and complete the training;
Step S930: if the test result does not meet the standard, continue to train the model.
It should be noted that the preset standard is used to determine whether the trained model meets the standard, and the test result provides effective reference data; whether the model is up to standard can be judged from network performance. The preset standard may be subjective quality or an objective metric; for example, objective metrics such as Peak Signal to Noise Ratio (PSNR) and Structural Similarity (SSIM) may be used. If the test result does not meet the standard, training continues; if the test result meets the standard, the trained model parameters are saved, and image quality enhancement can then be performed directly with this model. A sketch of such an objective-metric check follows below.
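As an illustration of such an objective check, the sketch below gates training on an average PSNR score. The threshold value, the function names, and the use of PSNR alone (SSIM is omitted for brevity) are assumptions, not requirements of the specification.

```python
import torch

def psnr(pred, target, max_val=255.0):
    # Peak Signal to Noise Ratio between the enhanced and reference images.
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

def meets_standard(model, test_pairs, psnr_threshold=35.0):
    # Returns True when the mean PSNR over the test set reaches the
    # preset standard; 35 dB is an arbitrary placeholder threshold.
    model.eval()
    with torch.no_grad():
        scores = [psnr(model(x), y) for x, y in test_pairs]
    return (sum(scores) / len(scores)).item() >= psnr_threshold
```

In a training loop, a False result from meets_standard would trigger further training epochs (S930), while a True result would trigger saving the model parameters (S920).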
It should be noted that, for the specific implementations and corresponding technical effects of the model training method of the embodiments of the present application, reference may be made to the specific implementations and corresponding technical effects of the image processing method described above.
As shown in FIG. 12, an embodiment of the fourth aspect of the present application provides an image processing apparatus, which includes: a memory 1210, a control processor 1220, and a computer program stored in the memory 1210 and executable on the control processor 1220.
The control processor 1220 and the memory 1210 may be connected by a bus or in other ways.
The non-transitory software programs and instructions required to implement the image processing method of the above embodiments are stored in the memory 1210; when executed by the control processor 1220, they perform the image processing method of the above embodiments, for example, method steps S110 to S140 in FIG. 1, method steps S210 and S220 in FIG. 2, method steps S310 to S350 in FIG. 3, method steps S410 and S420 in FIG. 4, and method steps S510 to S5120 in FIG. 5 described above.
As shown in FIG. 13, an embodiment of the fifth aspect of the present application provides a training device, which includes: a memory 1310, a control processor 1320, and a computer program stored in the memory 1310 and executable on the control processor 1320.
The control processor 1320 and the memory 1310 may be connected by a bus or in other ways.
The non-transitory software programs and instructions required to implement the model training method of the above embodiments are stored in the memory 1310; when executed by the control processor 1320, they perform the model training method of the above embodiments, for example, method steps S610 to S650 in FIG. 8, method steps S710 to S770 in FIG. 9, method steps S810 to S8110 in FIG. 10, and method steps S910 to S930 in FIG. 11 described above.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, an embodiment of the sixth aspect of the present application provides a computer-readable storage medium storing computer-executable instructions, which may be used to cause a computer to execute the image processing method of the first aspect above or the model training method of the third aspect above, for example, method steps S110 to S140 in FIG. 1, method steps S210 and S220 in FIG. 2, method steps S310 to S350 in FIG. 3, method steps S410 and S420 in FIG. 4, and method steps S510 to S5120 in FIG. 5 described above, or method steps S610 to S650 in FIG. 8, method steps S710 to S770 in FIG. 9, method steps S810 to S8110 in FIG. 10, and method steps S910 to S930 in FIG. 11 described above.
Those of ordinary skill in the art will understand that all or some of the steps and systems in the methods disclosed above may be implemented as software, firmware, hardware, and appropriate combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for the storage of information, such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
The foregoing describes several implementations of the present application in detail, but the present application is not limited to the above embodiments. Those skilled in the art may make various equivalent variations or substitutions without departing from the spirit of the present application, and such equivalent variations or substitutions fall within the scope defined by the claims of the present application.

Claims (18)

  1. An image processing method, the method comprising:
    acquiring an image to be processed, the image to be processed being obtained by decoding an original image;
    acquiring coding unit division information of the original image at encoding time, the coding unit division information comprising first position information and first size information of each coding unit;
    dividing the image to be processed according to the first position information and the first size information to obtain a plurality of feature blocks corresponding to the coding units; and
    establishing connections among the plurality of feature blocks through the self-attention mechanism of a Transformer module to obtain a first output image corresponding to the original image.
  2. The image processing method according to claim 1, wherein dividing the image to be processed according to the first position information and the first size information to obtain a plurality of feature blocks corresponding to the coding units comprises:
    dividing the image to be processed into a plurality of feature blocks according to the first position information and the first size information, so that each feature block matches, in position and size, the coding unit into which the original image was divided at encoding time.
  3. The image processing method according to claim 1, wherein, before establishing connections among the plurality of feature blocks through the self-attention mechanism of the Transformer module, the method further comprises:
    flattening the plurality of feature blocks into a plurality of first feature data to obtain a first feature sequence, wherein the first feature data are represented as one-dimensional vectors; and
    inputting the first feature sequence to the Transformer module.
  4. The image processing method according to claim 3, wherein establishing connections among the plurality of feature blocks through the self-attention mechanism of the Transformer module comprises:
    obtaining, from the first feature data and a first preset matrix, a second feature sequence composed of a plurality of second feature data of equal length;
    establishing correlations among the plurality of second feature data through the self-attention mechanism of the Transformer module, and obtaining a third feature sequence through residual connection and transformation processing, wherein the third feature sequence is composed of a plurality of third feature data;
    obtaining, from the third feature data and a second preset matrix, a fourth feature sequence composed of a plurality of fourth feature data, wherein the fourth feature data are represented as one-dimensional vectors;
    restoring the fourth feature data into feature blocks represented as two-dimensional vectors; and
    obtaining the first output image from the plurality of feature blocks.
  5. The image processing method according to claim 4, wherein establishing correlations among the plurality of second feature data through the self-attention mechanism of the Transformer module comprises:
    obtaining second position information and second size information of each feature block at division time; and
    establishing correlations among the plurality of second feature data through the self-attention mechanism of the Transformer module according to the second position information and the second size information;
    and wherein obtaining the first output image from the plurality of feature blocks comprises:
    stitching the plurality of feature blocks into the first output image according to the second position information.
  6. The image processing method according to claim 1, further comprising: performing detail enhancement processing on the first output image to obtain a second output image.
  7. An image processing apparatus, comprising:
    a division module, configured to acquire an image to be processed obtained by decoding an original image, to acquire coding unit division information of the original image at encoding time, the coding unit division information comprising first position information and first size information of each coding unit, and to divide the image to be processed according to the first position information and the first size information to obtain a plurality of feature blocks corresponding to the coding units; and
    a Transformer module, configured to establish connections among the plurality of feature blocks through a self-attention mechanism to obtain a first output image corresponding to the original image.
  8. The image processing apparatus according to claim 7, wherein dividing the image to be processed according to the first position information and the first size information to obtain a plurality of feature blocks corresponding to the coding units comprises:
    dividing the image to be processed into a plurality of feature blocks according to the first position information and the first size information, so that each feature block matches, in position and size, the coding unit into which the original image was divided at encoding time.
  9. The image processing apparatus according to claim 7, further comprising a linear mapping module, configured to flatten the plurality of feature blocks into a plurality of first feature data, obtain a first feature sequence, and input it to the Transformer module, wherein the first feature data are represented as one-dimensional vectors.
  10. The image processing apparatus according to claim 7, further comprising a reconstruction module, configured to perform detail enhancement processing on the first output image to obtain a second output image.
  11. A model training method, wherein the model comprises a Transformer module, the method comprising:
    acquiring an image to be processed, the image to be processed being a training sample in a constructed training set, wherein the image to be processed is obtained by decoding an original image;
    acquiring coding unit division information of the original image at encoding time, the coding unit division information comprising first position information and first size information of each coding unit;
    inputting the image to be processed and the coding unit division information into the model, and dividing the image to be processed according to the first position information and the first size information to obtain a plurality of feature blocks corresponding to the coding units;
    establishing connections among the plurality of feature blocks through the self-attention mechanism of the Transformer module to obtain a first output image corresponding to the original image; and
    training the model according to the first output image and an objective function to obtain a trained model.
  12. The model training method according to claim 11, wherein dividing the image to be processed according to the first position information and the first size information to obtain a plurality of feature blocks corresponding to the coding units comprises:
    dividing the image to be processed into a plurality of feature blocks according to the first position information and the first size information, so that each feature block matches, in position and size, the coding unit into which the original image was divided at encoding time.
  13. The model training method according to claim 11, wherein, before establishing connections among the plurality of feature blocks through the self-attention mechanism of the Transformer module, the method further comprises:
    flattening the plurality of feature blocks into a plurality of first feature data to obtain a first feature sequence, wherein the first feature data are represented as one-dimensional vectors; and
    inputting the first feature sequence to the Transformer module.
  14. The model training method according to claim 11, further comprising: performing detail enhancement processing on the first output image to obtain a second output image;
    wherein training the model according to the first output image and the objective function to obtain a trained model comprises:
    training the model according to the second output image and the objective function to obtain the trained model.
  15. The model training method according to claim 11, further comprising:
    determining, according to a preset standard, whether the trained model meets the standard, and obtaining a test result;
    if the test result meets the standard, saving the parameters of the model and completing the training; and
    if the test result does not meet the standard, continuing to train the model.
  16. An image processing apparatus, comprising at least one control processor and a memory communicatively connected to the at least one control processor, wherein the memory stores instructions executable by the at least one control processor, and the instructions, when executed by the at least one control processor, enable the at least one control processor to execute the image processing method according to any one of claims 1 to 6.
  17. A training device, comprising at least one control processor and a memory communicatively connected to the at least one control processor, wherein the memory stores instructions executable by the at least one control processor, and the instructions, when executed by the at least one control processor, enable the at least one control processor to execute the model training method according to any one of claims 11 to 15.
  18. A computer-readable storage medium, wherein the computer-readable storage medium stores computer-executable instructions for causing a computer to execute the image processing method according to any one of claims 1 to 6 or the model training method according to any one of claims 11 to 15.
PCT/CN2022/078897 2021-09-28 2022-03-02 Image processing method, image processing apparatus, and model training method WO2023050720A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111144470.1 2021-09-28
CN202111144470.1A CN115880381A (en) 2021-09-28 2021-09-28 Image processing method, image processing apparatus, and model training method

Publications (1)

Publication Number Publication Date
WO2023050720A1 (en)

Family

ID=85763607

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/078897 WO2023050720A1 (en) 2021-09-28 2022-03-02 Image processing method, image processing apparatus, and model training method

Country Status (2)

Country Link
CN (1) CN115880381A (en)
WO (1) WO2023050720A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9740949B1 (en) * 2007-06-14 2017-08-22 Hrl Laboratories, Llc System and method for detection of objects of interest in imagery
CN111680447A (en) * 2020-04-21 2020-09-18 深圳睿心智能医疗科技有限公司 Blood flow characteristic prediction method, blood flow characteristic prediction device, computer equipment and storage medium
CN113191953A (en) * 2021-06-04 2021-07-30 山东财经大学 Transformer-based face image super-resolution method
CN113435210A (en) * 2021-06-30 2021-09-24 平安科技(深圳)有限公司 Social image text recognition method and device, computer equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117036832A (en) * 2023-10-09 2023-11-10 之江实验室 Image classification method, device and medium based on random multi-scale blocking
CN117036832B (en) * 2023-10-09 2024-01-05 之江实验室 Image classification method, device and medium based on random multi-scale blocking
CN117788473A (en) * 2024-02-27 2024-03-29 北京大学第一医院(北京大学第一临床医学院) Method, system and equipment for predicting blood pressure based on binocular fusion network
CN117788473B (en) * 2024-02-27 2024-05-14 北京大学第一医院(北京大学第一临床医学院) Method, system and equipment for predicting blood pressure based on binocular fusion network

Also Published As

Publication number Publication date
CN115880381A (en) 2023-03-31

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22874123

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE