CN111382845B - Template reconstruction method based on self-attention mechanism - Google Patents
- Publication number: CN111382845B (application CN202010171371.1A)
- Authority: CN (China)
- Prior art keywords: image, level, layer, sampling, template
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
- G06N3/045 — Computing arrangements based on biological models; neural networks; architecture; combinations of networks
- G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods
- G06V10/751 — Image or video recognition or understanding; pattern matching; comparing pixel values or feature values having positional relevance, e.g. template matching
Abstract
The invention provides a template reconstruction method based on a self-attention mechanism. The method first down-samples an input image level by level to obtain down-sampled images and extracts features from them; it then applies a self-attention table look-up mapping to the features of each level to obtain the residual corresponding to that level; finally, it up-samples the down-sampled images level by level while fusing in each level's residual information to generate the corresponding up-sampled images. The up-sampled image generated last is the reconstructed template image.
Description
Technical Field
The invention belongs to the field of machine vision detection, and particularly relates to a template reconstruction method based on a self-attention mechanism.
Background
Defect detection based on template comparison is widely used in machine vision inspection because of its speed and precision. In template-comparison defect detection, template reconstruction is usually performed as a preprocessing step in order to raise the detection rate and lower the false-detection rate by eliminating individual differences among the objects under inspection. Template reconstruction methods can broadly be divided into rigid transformation methods and non-rigid transformation methods; the non-rigid methods have attracted growing attention in recent years because they apply to a wider range of scenarios.
According to the reconstruction principle, non-rigid template reconstruction methods can be divided into methods based on image registration and methods based on feature reconstruction; the present invention belongs to the latter class. Conventional methods in this field include those based on Kernel Principal Component Analysis (KPCA). With the recent spread of deep learning, learning-based methods have come to the fore, most notably the application of image restoration based on deep auto-encoding (AutoEncoder) to template reconstruction.
In template reconstruction methods built around deep auto-encoding, a convolutional neural network (CNN) is used to construct two parts: an encoding network (encoder) that down-samples the image and a decoding network (decoder) that up-samples the features. The encoder extracts low-dimensional features from the input image, and the decoder completes the reconstruction from features back to image. The reconstructed image is constrained to be approximately the same as the input image in content while individual differences such as displacement, color deviation, and defects are eliminated; comparing the reconstructed image with the input image then reveals any defect content the input image may contain.
The above prior art has the following disadvantages:
1. In a deep auto-encoding network, the low-dimensional features extracted by the encoder generally cannot retain all the detail features useful for reconstruction, so the reconstructed image is blurred to varying degrees, causing detection false alarms. Although means such as generative adversarial networks (GANs) have been proposed in recent years to alleviate the blurring, they rely on statistical information from other training images rather than from the image to be reconstructed itself, so the false-alarm problem cannot be fully solved.
2. Decoding networks usually up-sample features with transposed convolution (transposed-conv/deconv). Because transposed convolution is an ill-posed operation, it often produces checkerboard artifacts in the final reconstruction, which degrade the detection result. Methods such as sub-pixel convolution (subpixel conv) have been proposed to alleviate this phenomenon, but they cannot eliminate it.
3. Because convolutional neural networks are poorly interpretable, it is difficult to feed observations of the reconstruction quality back into the design of the deep auto-encoding network. Hyper-parameters such as the low-dimensional feature size and the number of convolution kernels are usually set by experience, which makes it hard to reach the best balance between quality and efficiency and hinders transferring a reconstruction method between different application scenarios.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a template reconstruction method based on a self-attention mechanism: a table look-up mapping driven by self-attention computes residuals from the features extracted from the image, and the template image is reconstructed from these residuals. The invention reduces false detections, avoids checkerboard artifacts, and makes the reconstruction result easier to inspect and control.
The specific implementation content of the invention is as follows:
A template reconstruction method based on a self-attention mechanism: an input image is first down-sampled level by level to obtain down-sampled images, and features are extracted from the down-sampled images; a self-attention table look-up mapping is then applied to the features of each level to obtain the residual corresponding to that level; finally, the down-sampled images are up-sampled level by level while the residual information of each level is fused in to generate the up-sampled image of each level, and the up-sampled image generated last is the reconstructed template image.
In order to better implement the present invention, the specific steps of "applying the self-attention table look-up mapping to the features of each level to obtain the residual corresponding to each level" are as follows:
S1, define the features of each level of the input image as a three-dimensional tensor with channel dimension c, so that each spatial position in a level's features is a vector of length c;
S2, define a two-dimensional tensor look-up table of size c x n, and multiply the vector at any spatial position of each level's features with this look-up table to obtain a vector of length n;
S3, apply softmax normalization to the length-n vector obtained in step S2 to obtain another length-n vector whose n elements are all greater than or equal to 0 and sum to 1; these n elements are the attention weights;
S4, use the attention weights obtained in step S3 to compute a weighted sum of n codebook entries of the same spatial size, giving the residual of the corresponding level; the residual contains the high-frequency detail information that is lost when the next (coarser) level's image is up-sampled.
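As a minimal sketch (not the patented implementation), steps S1–S4 can be written in NumPy as follows; the function name `lookup_residual` and the toy sizes H, W, c, n are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lookup_residual(features, table, codebook):
    """S1-S4: map per-position feature vectors to one level's residual.

    features: (H, W, c) tensor, one length-c vector per spatial position (S1)
    table:    (c, n) two-dimensional look-up table (S2)
    codebook: (n, H, W) codebook entries of the level's spatial size (S4)
    """
    logits = features @ table            # (H, W, n): dot product per position (S2)
    weights = softmax(logits, axis=-1)   # (H, W, n): non-negative, sum to 1 (S3)
    # S4: per-pixel weighted sum of the n codebook entries
    return np.einsum('hwn,nhw->hw', weights, codebook)

rng = np.random.default_rng(0)
H, W, c, n = 4, 4, 8, 16
residual = lookup_residual(
    rng.standard_normal((H, W, c)),
    rng.standard_normal((c, n)),
    rng.standard_normal((n, H, W)),
)
print(residual.shape)  # (4, 4)
```

The `features @ table` product is exactly the 1x1 convolution that Example 2 says implements this step efficiently in a network.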
In order to better implement the present invention, further, both the two-dimensional tensor look-up table and the codebook data are network parameters learned automatically during network training.
In order to better implement the present invention, the step of "down-sampling the input image level by level to obtain down-sampled images and extracting features from them" specifically comprises: performing h rounds of down-sampling calculation on the input image to obtain h levels of down-sampled images; then extracting features from the input image and from the down-sampled images of all levels except level h. The feature A extracted from the input image is the feature of the input image's level, and the feature of each down-sampled level other than level h is the fusion of the features extracted from that level's down-sampled image with the features of the previous (finer) level.
To better implement the present invention, further, the down-sampling calculation may be nearest-neighbor, bilinear, or bicubic interpolation.
In order to better implement the present invention, further, the reconstructed template image is generated as follows: up-sampling calculation is performed level by level starting from the level-h down-sampled image. The level-h down-sampled image is up-sampled and then added pixel by pixel to the residual of level h-1 to generate the up-sampled image of level h-1; in the same way, the up-sampled image of each level from level h-1 up to the top is up-sampled and then added pixel by pixel to the residual of the previous (finer) level to generate that level's up-sampled image. The up-sampled image obtained after h rounds of up-sampling and residual addition is the reconstructed template image, at the same level as the input image.
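The level-by-level generation can be sketched as a short loop; this is a toy NumPy version with nearest-neighbour 2x up-sampling and h = 2, and the helper names are assumptions for illustration:

```python
import numpy as np

def upsample2x(img):
    # nearest-neighbour 2x up-sampling; any simple up-sampling rule would do here
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def reconstruct_template(coarsest, residuals):
    """Start from the level-h down-sampled image and apply h rounds of
    'up-sample, then add the previous level's residual pixel by pixel'.

    coarsest:  level-h down-sampled image, shape (H/2**h, W/2**h)
    residuals: residuals for levels h-1 ... 0 (finest last), each with the
               spatial size of the up-sampled image it corrects
    """
    img = coarsest
    for r in residuals:
        img = upsample2x(img) + r
    return img  # same size and level as the input image

coarse = np.zeros((2, 2))
template = reconstruct_template(coarse, [np.ones((4, 4)), np.ones((8, 8))])
print(template.shape)  # (8, 8)
```

Note that the model only ever produces the residuals; the low-frequency content is carried by the up-sampled image itself, which is the residual-correction idea stressed in the advantages below.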
Compared with the prior art, the method has the following advantages and beneficial effects:
1. When computing residuals from features, the invention combines an attention mechanism with a codebook. This avoids the ill-posed problem of inverting a transposed convolution and thereby avoids checkerboard artifacts;
2. The high-frequency information held in the codebook is easy to visualize, the physical meaning of the codebook size n is clear and easy to adjust, and the look-up-table dimension c is clearly positively correlated with n. The network structure and its hyper-parameter settings are therefore more controllable and can be tuned intuitively by inspecting the reconstruction result. In practice, small values of the dimension hyper-parameters c and n, for example below 64, already give good reconstructions; compared with the feature dimensions of hundreds used in traditional reconstruction models, this markedly reduces the parameter count and computation, so the model runs fast and is easy to deploy on low-cost hardware with limited computing power;
3. The template is reconstructed by step-by-step up-sampling with residual correction: the low-frequency information of the reduced-then-enlarged image is reused, so the model only needs to estimate the residual, i.e. the high-frequency detail, instead of estimating both high- and low-frequency content directly as traditional models do. The problem is thereby simplified, the model performs better, the reconstructed template is sharper, and the probability of detection false alarms is further reduced.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
fig. 2 is a block diagram of a template reconstruction system according to the present invention.
Detailed Description
To illustrate the technical solutions of the embodiments more clearly, they are described below completely with reference to the drawings. The described embodiments are only a part of the embodiments of the present invention, not all of them, and should not be regarded as limiting the scope of protection. All other embodiments obtained by a person skilled in the art from these embodiments without inventive effort fall within the scope of the present invention.
Example 1:
A template reconstruction method based on a self-attention mechanism: an input image is first down-sampled level by level to obtain down-sampled images, and features are extracted from the down-sampled images; a self-attention table look-up mapping is then applied to the features of each level to obtain the residual corresponding to that level; finally, the down-sampled images are up-sampled level by level while the residual information of each level is fused in to generate the up-sampled image of each level, and the up-sampled image generated last is the reconstructed template image.
Example 2:
Based on embodiment 1 above, as shown in fig. 1, taking a network with four rounds of down-sampling as an example, the template reconstruction network has an overall U-shaped structure: starting from the input image, the left wing down-samples step by step, and starting from the smallest down-sampled image, the right wing up-samples step by step. In the left-wing path, a thin downward arrow denotes a simple image down-sampling calculation such as nearest-neighbor, bilinear, or bicubic interpolation; in the right-wing path, a thin upward arrow denotes an image up-sampling calculation. As shown in fig. 1, the input image is down-sampled once to produce the 1/2 down-sampled image, which is down-sampled again to produce the 1/4 image, then the 1/8 image, and finally the 1/16 down-sampled image;
In addition, features are extracted at every level from the input image down to the 1/8 down-sampled image. As shown in fig. 1, the feature extracted from the input image is feature A; at each level below the input image, the level's feature is the fusion of the features extracted from that level's down-sampled image with the features of the level above. For example, feature B at the 1/2 level is the fusion of feature A with the features extracted from the 1/2 down-sampled image;
After the 1/16 down-sampled image and the features of every level have been obtained, the residual corresponding to each level's features is computed through the self-attention mechanism; the 1/16 down-sampled image is then up-sampled four times, combining the residual of each level, to finally generate the reconstructed template image. Specifically, the reconstructed template image is generated as follows: the 1/16 down-sampled image is up-sampled and added pixel by pixel to residual D of the 1/8 level to generate the 1/8 up-sampled image; the 1/8 up-sampled image is up-sampled and added pixel by pixel to residual C of the 1/4 level to generate the 1/4 up-sampled image; and the reconstructed image is generated after two more such operations;
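The left-wing pyramid of fig. 1 can be sketched as follows; 2x2 block averaging stands in for the interpolation choices listed above, and `build_pyramid` is an assumed helper name, not terminology from the patent:

```python
import numpy as np

def downsample2x(img):
    # 2x2 block averaging; the text allows nearest-neighbour, bilinear,
    # or bicubic interpolation here just as well
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def build_pyramid(img, h=4):
    """Input image plus h down-sampled levels: [1, 1/2, 1/4, 1/8, 1/16] for h=4."""
    levels = [img]
    for _ in range(h):
        levels.append(downsample2x(levels[-1]))
    return levels

pyramid = build_pyramid(np.zeros((32, 32)), h=4)
print([p.shape for p in pyramid])  # [(32, 32), (16, 16), (8, 8), (4, 4), (2, 2)]
```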
the specific operation of calculating the residual error of each level through the self-attention mechanism is as follows:
1. Define each level's features as a three-dimensional tensor with some typical channel dimension c, such as 8 or 16; that is, each spatial position in the features is a vector of length c;
2. Set up a two-dimensional tensor look-up table of size c x n and multiply the feature vector at each spatial position with it to obtain a vector of length n; in a network this operation can be implemented efficiently as a 1x1 convolution;
3. Apply softmax normalization to the length-n vector obtained in step 2 to obtain another length-n vector whose n elements are all greater than or equal to 0 and sum to 1; this vector is called the attention weight;
4. Use the n-dimensional attention weight obtained in step 3 to compute a weighted sum of n codebook entries of the same spatial size, giving the residual used for reconstructing this level; the residual mainly contains the high-frequency detail information lost when the reconstructed image of the next (coarser) level is up-sampled.
Other parts of this embodiment are the same as those of embodiment 1, and thus are not described again.
Example 3:
Based on the above method, as shown in fig. 2, the apparatus comprises a training module and a detection module connected to each other. The training module comprises a training data set and a reconstruction model module connected in sequence; the detection module comprises a sensor, an input image module, and a difference detection module connected in sequence, and further comprises a reconstruction template module connected to the input image module and the difference detection module; the training module and the detection module are linked through the reconstruction model module and the reconstruction template module.
The training module is an external server used to learn a reconstruction model from a series of training sets. The detection module is the main part of the apparatus; in the figure, thin arrows denote simple data transfers and calculations, and hollow thick arrows denote calculations performed by a convolutional neural network. The main workflow of the detection module is as follows:
1. Acquire an image of the object under inspection with a sensor such as a camera or video camera to obtain the input image;
2. Feed the input image to the reconstruction model produced by the training module; the principle and flow are as described in embodiments 1 and 2 and are not repeated here. Inference with the reconstruction model, computed by the convolutional neural network, yields the reconstructed image, i.e. the template;
3. Take the pixel-wise difference between the input image and the reconstructed template, take its absolute value, apply suitable post-processing such as high/low-threshold filtering and connected-component analysis, and finally obtain and output the detection result.
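A minimal sketch of this detection step, assuming a single threshold in place of the high/low-threshold filtering and omitting the connected-component analysis; the function name and threshold value are illustrative:

```python
import numpy as np

def detect_defects(input_img, template, thresh=0.3):
    """Pixel-wise |input - template|, thresholded into a defect mask.
    The high/low-threshold filtering and connected-component post-processing
    mentioned in the text are reduced to one threshold for illustration."""
    diff = np.abs(input_img.astype(float) - template.astype(float))
    return diff, diff > thresh

inp = np.array([[0.0, 0.5], [1.0, 0.2]])
tpl = np.array([[0.0, 0.0], [0.0, 0.2]])
diff, mask = detect_defects(inp, tpl)
print(mask.astype(int))  # [[0 1] [1 0]] printed over two rows
```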
The above description is only a preferred embodiment of the present invention and is not intended to limit it in any way; any simple modification or equivalent variation of the above embodiment made according to the technical spirit of the present invention falls within the scope of the present invention.
Claims (4)
1. A template reconstruction method based on a self-attention mechanism, characterized in that: an input image is first down-sampled level by level to obtain down-sampled images, and features are extracted from the down-sampled images; a self-attention table look-up mapping is then applied to the features of each level to obtain the residual corresponding to that level; finally, the down-sampled images are up-sampled level by level while the residual information of each level is fused in to generate the up-sampled image of each level, and the up-sampled image generated last is the reconstructed template image;
the specific steps of applying the self-attention table look-up mapping to the features of each level to obtain the residual corresponding to each level are:
S1, defining the features of each level of the input image as a three-dimensional tensor with channel dimension c, so that each spatial position in a level's features is a vector of length c;
S2, defining a two-dimensional tensor look-up table of size c x n, and multiplying the vector at any spatial position of each level's features with this look-up table to obtain a vector of length n;
S3, applying softmax normalization to the length-n vector obtained in step S2 to obtain another length-n vector whose n elements are all greater than or equal to 0 and sum to 1; these n elements are the attention weights;
S4, using the attention weights obtained in step S3 to compute a weighted sum of n codebook entries of the same spatial size, giving the residual of the corresponding level; the residual contains the high-frequency detail information that is lost when the next (coarser) level's image is up-sampled;
the step of down-sampling the input image level by level to obtain down-sampled images and extracting features from them specifically comprises: performing h rounds of down-sampling calculation on the input image to obtain h levels of down-sampled images; then extracting features from the input image and from the down-sampled images of all levels except level h; the feature A extracted from the input image is the feature of the input image's level; and the feature of each down-sampled level other than level h is the fusion of the features extracted from that level's down-sampled image with the features of the previous (finer) level.
2. The template reconstruction method based on the self-attention mechanism as claimed in claim 1, wherein the two-dimensional tensor look-up table and the codebook data are both automatically learned in network training as network parameters.
3. The self-attention mechanism-based template reconstruction method of claim 1, wherein the downsampling calculation comprises nearest neighbor interpolation, bilinear interpolation or bicubic interpolation calculation.
4. The template reconstruction method based on a self-attention mechanism according to claim 1, characterized in that the reconstructed template image is generated as follows: up-sampling calculation is performed level by level starting from the level-h down-sampled image; the level-h down-sampled image is up-sampled and then added pixel by pixel to the residual of level h-1 to generate the up-sampled image of level h-1; in the same way, the up-sampled image of each level from level h-1 up to the top is up-sampled and then added pixel by pixel to the residual of the previous (finer) level to generate that level's up-sampled image; the up-sampled image obtained after h rounds of up-sampling and residual addition is the reconstructed template image, at the same level as the input image.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202010171371.1A (CN111382845B) | 2020-03-12 | 2020-03-12 | Template reconstruction method based on self-attention mechanism |
Publications (2)
| Publication Number | Publication Date |
| --- | --- |
| CN111382845A | 2020-07-07 |
| CN111382845B | 2022-09-02 |
Family
ID=71219046
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
| --- | --- | --- | --- |
| CN202010171371.1A (CN111382845B, Expired - Fee Related) | Template reconstruction method based on self-attention mechanism | 2020-03-12 | 2020-03-12 |
Country Status (1)
| Country | Link |
| --- | --- |
| CN | CN111382845B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
| --- | --- | --- | --- | --- |
| CN112861071B * | 2021-02-05 | 2022-09-02 | Harbin Engineering University | High-speed rail traction system anomaly detection method based on depth self-coding |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
| --- | --- | --- | --- | --- |
| WO2016014354A1 * | 2014-07-25 | 2016-01-28 | Siemens Aktiengesellschaft | Iterative analysis-based non-convex prior for enhanced sparse recovery |
| CN106709875A * | 2016-12-30 | 2017-05-24 | Beijing University of Technology | Compressed low-resolution image restoration method based on combined deep network |
| CN110151181A * | 2019-04-16 | 2019-08-23 | Hangzhou Dianzi University | Rapid magnetic resonance imaging method based on the U-shaped network of recurrence residual error |
| CN110223234A * | 2019-06-12 | 2019-09-10 | Yang Yong | Depth residual error network image super resolution ratio reconstruction method based on cascade shrinkage expansion |
| CN110659727A * | 2019-09-24 | 2020-01-07 | University of Science and Technology of China | Sketch-based image generation method |
Legal events
- 2020-03-12: CN application CN202010171371.1A filed; granted as CN111382845B; status: Expired - Fee Related
Non-Patent Citations (3)
Title |
---|
A Fast Surface Defect Detection Method Based on Background Reconstruction; Chengkan Lv et al.; International Journal of Precision Engineering and Manufacturing; 20191109; 363–375 * |
Research on Image Super-Resolution Reconstruction Based on Sparse Representation; Li Xin; China Doctoral Dissertations Full-text Database, Information Science and Technology; 20170515 (No. 05); I138-11 * |
Self-Attention and Domain-Adaptive Adversarial Template Reconstruction Method; Jia Ke et al.; Modern Information Technology; 20200925; Vol. 4 (No. 18); 1-6 * |
Also Published As
Publication number | Publication date |
---|---|
CN111382845A (en) | 2020-07-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109509152B (en) | Image super-resolution reconstruction method based on a feature-fusion generative adversarial network | |
CN113139907B (en) | Generation method, system, device and storage medium for visual resolution enhancement | |
Lim et al. | DSLR: Deep stacked Laplacian restorer for low-light image enhancement | |
CN111062872B (en) | Image super-resolution reconstruction method and system based on edge detection | |
CN107154023B (en) | Face super-resolution reconstruction method based on generative adversarial network and sub-pixel convolution | |
CN110136062B (en) | Super-resolution reconstruction method combining semantic segmentation | |
CN112419150B (en) | Arbitrary-scale image super-resolution reconstruction method based on a bilateral upsampling network | |
CN111861880B (en) | Image super-fusion method based on regional information enhancement and block self-attention | |
CN114549308B (en) | Perception-oriented image super-resolution reconstruction method and system with a large receptive field | |
CN110796622B (en) | Image bit enhancement method based on multi-layer characteristics of series neural network | |
CN110689483A (en) | Image super-resolution reconstruction method based on depth residual error network and storage medium | |
CN111242999A (en) | Disparity estimation optimization method based on upsampling and accurate re-matching | |
CN113421186A (en) | Apparatus and method for unsupervised video super-resolution using a generative adversarial network | |
CN115345791A (en) | Infrared image deblurring algorithm based on an attention-mechanism residual network model | |
CN113724134A (en) | Aerial image blind super-resolution reconstruction method based on residual distillation network | |
Yang et al. | A survey of super-resolution based on deep learning | |
CN116563100A (en) | Blind super-resolution reconstruction method based on kernel guided network | |
CN111382845B (en) | Template reconstruction method based on self-attention mechanism | |
Sharma et al. | Different techniques of image SR using deep learning: a review | |
Chen et al. | Dynamic degradation intensity estimation for adaptive blind super-resolution: A novel approach and benchmark dataset | |
CN117593187A (en) | Remote sensing image super-resolution reconstruction method based on meta-learning and Transformer | |
CN116188265A (en) | Spatially-variant kernel-aware blind super-resolution reconstruction method based on real degradation | |
CN110211059A (en) | Image reconstruction method based on deep learning | |
CN115511733A (en) | Image degradation modeling method, neural network training method and device | |
CN115131414A (en) | Unmanned aerial vehicle image alignment method based on deep learning, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20220902 |