CN115578565B

CN115578565B - Attention scale perception guided lightweight U-net method, device and storage medium

Info

Publication number: CN115578565B
Application number: CN202211394805.XA
Authority: CN
Inventors: 周展; 李朋超; 蔡丽蓉
Original assignee: Beijing Jushi Intelligent Technology Co ltd
Current assignee: Beijing Jushi Intelligent Technology Co ltd
Priority date: 2022-11-09
Filing date: 2022-11-09
Publication date: 2023-04-14
Anticipated expiration: 2042-11-09
Also published as: CN115578565A

Abstract

The invention relates to a light U-net method, a device and a storage medium for attention scale perception guidance, which are applied to the technical field of workpiece surface defect region segmentation and comprise the following steps: acquiring an average pooling result and a maximum pooling result of the deepest layer feature F through an attention sensing module, and respectively applying 1 × 1 convolutional layers shared by weights to the average pooling result and the maximum pooling result; generating attention A based on the obtained features F1 and F2, generating a feature Fscale after scaling operation, finally carrying out element summation operation between the feature Fscale and the input deepest feature F through a learnable parameter a, and outputting the feature F _SSAM (ii) a According to the scheme, the attention scale perception module is used, the scale features of the multi-defect target are learned and judged through the attention mechanism capable of perceiving the scale, the discriminant features of the defect target are effectively focused, the interference of a complex background is restrained, and the problem that features such as background textures are easily confused with the defects is effectively avoided.

Description

Attention scale perception guided lightweight U-net method, device and storage medium

Technical Field

The invention relates to the technical field of workpiece surface defect region segmentation, in particular to a light U-net method, a device and a storage medium guided by attention scale perception.

Background

In recent years, high-precision workpiece surface defect region segmentation based on a deep learning algorithm is rapidly developed. The current representative method adopts a coder-decoder framework such as U-Net or a DeeplabV3 method, and realizes effective fusion of multi-scale features by fusing multi-level features such as bottom-layer spatial detail and high-layer discrimination semantics of an image, or aggregates context information in different distance ranges by expansion convolution pyramids in different receptive field ranges, so as to realize prediction of a defect region;

however, as industrial practical application scenes are complex and changeable, defect forms are varied, and features such as background textures and the like are easily confused with defects, the problems of large intra-class difference and small inter-class difference are formed, in the process of dividing a defect region, representative methods such as U-Net and the like only focus on fusion of multi-scale features of different levels, fusion of context information of different receptive fields or a focus mechanism of space and channel dimensions, and complex background interference is difficult to effectively suppress.

Disclosure of Invention

In view of this, an object of the present invention is to provide a method, an apparatus, and a storage medium for attention-scale-aware-guided lightweight U-net, so as to solve the problem in the prior art that, in the defect region segmentation process, only multi-scale features of different levels are fused, context information of different receptive fields is fused, or an attention mechanism of space and channel dimensions is focused, but complex background interference is ignored, so that features such as background texture are easily confused with defects.

According to a first aspect of embodiments of the present invention, there is provided an attention-scale-aware-guided lightweight U-net method, comprising:

inputting an image to be segmented into a segmentation network, obtaining a plurality of layers of feature maps with different levels after multilayer convolution and pooling operation, and selecting a deepest feature F;

inputting the deepest layer features F into an attention scale perception module for obtaining an average pooling result and a maximum pooling result of the deepest layer features F;

the attention sensing module respectively applies the 1 × 1 convolutional layer shared by the weights to the average pooling result and the maximum pooling result to obtain a feature F1 and a feature F2;

generating attention A emphasizing corresponding features in the deepest features F based on the features F1 and the features F2;

generating a scaled feature Fscale based on the feature F1, the feature F2, and the attention a;

element summation operation is carried out between the feature Fscale and the input deepest feature F through the learnable parameter a to obtain the feature F _SSAM ；

Mutual injectionFeature F output by intention scale perception module _SSAM Executing up-sampling operation, and fusing the sampling result with the multilayer feature graphs of different levels to obtain a feature X _m The feature X _m And outputting a segmentation result of the defect area of the image to be segmented after one-layer convolution.

Preferably, the first and second electrodes are formed of a metal,

inputting the deepest layer features F into the attention scale perception module, wherein the step of obtaining an average pooling result and a maximum pooling result of the deepest layer features F comprises the following steps:

inputting the deepest layer features F into an attention scale perception module;

and the attention scale perception module carries out aggregation processing on the deepest layer characteristic F through parallel maximum pooling operation and average pooling operation respectively to obtain an average pooling result and a maximum pooling result of the deepest layer characteristic F.

Preferably, the first and second electrodes are formed of a metal,

the outputting of the segmentation result of the defective region of the image to be segmented includes:

feature F output by attention scale perception module _SSAM Performing an upsampling operation;

carrying out U-net type channel dimension splicing fusion on the sampling result and the upper layer feature diagram of the deepest feature F, and carrying out convolution operation and activation operation on the fusion result to obtain a feature X1;

then, the characteristic X1 is subjected to up-sampling operation, the steps are repeated until the U-net type channel dimensionality splicing fusion is carried out on the characteristic X and the shallowest layer characteristic graph, and the fusion result is subjected to convolution operation and activation operation to obtain the characteristic X _m ；

Will be characterized by X _m And outputting the segmentation result of the defect area of the image to be segmented after one layer of convolution.

Preferably, the first and second liquid crystal display panels are,

the generating scaled features Fscale includes:

an element multiplication operation is performed between attention a, feature F1 and feature F2, respectively, for generating a scaled feature Fscale.

Preferably, the first and second electrodes are formed of a metal,

the generating of the attention A emphasizing the corresponding feature in the deepest feature F based on the feature F1 and the feature F2 comprises:

based on the features F1 and F2, attention a emphasizing the corresponding feature in the deepest feature F is generated by the softmax function.

Preferably, the first and second electrodes are formed of a metal,

inputting the image to be segmented into a segmentation network, obtaining a plurality of layers of feature maps with different levels after multilayer convolution and pooling operation, wherein the step of selecting the deepest feature F comprises the following steps:

inputting an image to be segmented into a segmentation network, and obtaining a characteristic M after convolution and pooling _N Said feature M _N The feature map of the shallowest layer is obtained;

for feature M _N After convolution and pooling operation, the characteristic M is obtained _N-1 ；

And repeating the steps, and obtaining the deepest layer characteristic F after convolution and pooling operation for preset times.

According to a second aspect of embodiments of the present invention, there is provided an attention-scale perceptually-guided lightweight U-net device, the device comprising:

a characteristic diagram acquisition module: the method comprises the steps that images to be segmented are input into a segmentation network, after multilayer convolution and pooling operation, multilayer feature maps with different levels are obtained, and the deepest feature F is selected;

an input module: the system comprises an attention scale sensing module, a clustering module and a storage module, wherein the attention scale sensing module is used for inputting the deepest layer characteristics F into the attention scale sensing module and obtaining an average pooling result and a maximum pooling result of the deepest layer characteristics F;

a convolution application module: the attention sensing module is used for respectively applying the 1 × 1 convolutional layer shared by the weights to the average pooling result and the maximum pooling result to obtain a feature F1 and a feature F2;

the attention map generation module: generating attention A for emphasizing corresponding features in the deepest features F based on the features F1 and the features F2;

a scaling module: for generating a scaled feature Fscale based on feature F1, feature F2 and attention a;

an element summation module: the method is used for carrying out element summation operation on the feature Fscale and the input deepest feature F through the learnable parameter a to obtain the feature F _SSAM ；

An output module: feature F for sensing module output for attention scale _SSAM Executing up-sampling operation, and fusing the sampling result with the multilayer feature graphs of different levels to obtain a feature X _m The feature X _m And outputting the segmentation result of the defect area of the image to be segmented after one layer of convolution.

According to a third aspect of embodiments of the present invention, there is provided a storage medium storing a computer program which, when executed by a processor, implements each step in the attention metric perception guided lightweight U-net method as described in any one of the above.

The technical scheme provided by the embodiment of the invention can have the following beneficial effects:

according to the method, an average pooling result and a maximum pooling result of the deepest feature F are obtained through an attention sensing module, and then 1 × 1 convolutional layers shared by weights are applied to the average pooling result and the maximum pooling result respectively through the attention sensing module to obtain a feature F1 and a feature F2; generating attention A emphasizing corresponding features in the deepest features F based on the features F1 and the features F2, generating features Fscale after scaling operation, and finally performing element summation operation between the features Fscale and the input deepest features F through learnable parameters a by an attention sensing module to output the features F _SSAM (ii) a According to the scheme, the attention scale sensing module is used, the scale characteristics of the multi-defect target are learned and judged through the attention mechanism capable of sensing scales, the remote dependency relationship can be effectively captured with low calculation cost, the distinguishing characteristics of the defect target are effectively focused, the interference of a complex background is inhibited, and the problem that the characteristics such as background textures are easily confused with the defects is effectively avoided.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.

FIG. 1 is a flow diagram illustrating an attention metric perceptually-guided lightweight U-net method according to an exemplary embodiment;

FIG. 2 is a schematic diagram of an overall flow process shown in accordance with another exemplary embodiment;

FIG. 3 is a schematic diagram of a scaling process shown in accordance with another exemplary embodiment;

FIG. 4 is a system diagram illustrating an attention-scale perceptually-guided lightweight U-net device, according to another exemplary embodiment;

in the drawings: the method comprises the following steps of 1-a characteristic diagram acquisition module, 2-an input module, 3-a convolution application module, 4-an attention diagram generation module, 5-a scaling module, 6-an element summation module and 7-an output module.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.

Example one

Fig. 1 is a flowchart illustrating an attention metric perception guided lightweight U-net method according to an exemplary embodiment, as shown in fig. 1, including:

s1, inputting an image to be segmented into a segmentation network, obtaining a plurality of layers of feature maps with different levels after multilayer convolution and pooling operation, and selecting a deepest feature F;

s2, inputting the deepest layer characteristics F into an attention scale sensing module for obtaining an average pooling result and a maximum pooling result of the deepest layer characteristics F;

s3, the attention sensing module respectively applies the 1 × 1 convolutional layers shared by the weights to the average pooling result and the maximum pooling result to obtain a feature F1 and a feature F2;

s4, generating attention A emphasizing corresponding features in the deepest features F based on the features F1 and the features F2;

s5, generating a scaled feature Fscale based on the feature F1, the feature F2 and the attention A;

s6, element summation operation is carried out between the characteristic Fscale and the input deepest characteristic F through the learnable parameter a, and the characteristic F is obtained _SSAM ；

S7, outputting the characteristics F of the attention scale perception module _SSAM Executing up-sampling operation, and fusing the sampling result with multi-layer characteristic graphs of different levels to obtain a characteristic X _m The feature X _m Outputting a segmentation result of a defect area of the image to be segmented after one layer of convolution;

it can be understood that the scheme provides a lightweight U-net based on attention scale perception guidance of an encoder-decoder structure, and a lightweight ResNet18 is adopted as an encoder to generate feature maps of different levels from different stages of convolution; the model structure is shown in fig. 2, wherein the bottommost feature map is processed by the attention scale perception module and then input into the decoder;

image to be segmented

Inputting the characteristic graphs into a segmentation network, obtaining multilayer characteristic graphs with different levels after multilayer convolution and pooling operations, wherein the deepest characteristic graph is F, obtaining an average pooling result and a maximum pooling result of the deepest characteristic F through an attention sensing module, and respectively applying 1 multiplied by 1 convolution layers shared by weight to the average pooling result and the maximum pooling result through the attention sensing module to obtain a characteristic F1 and a characteristic F2; generating attention A emphasizing corresponding features in the deepest features F based on the features F1 and the features F2, generating features Fscale after zooming operation, and finally enabling the attention sensing module to learn through learningThe parameter a carries out element summation operation between the feature Fscale and the input deepest feature F, and the feature F is output _SSAM (ii) a Features F of decoder to attention sensing module output _SSAM Carrying out up-sampling operation, and fusing the sampling result with multi-layer characteristic graphs of different levels to obtain a characteristic X _m The feature X _m Outputting a segmentation result of a defect area of the image to be segmented after one layer of convolution; according to the scheme, the attention scale sensing module is used, the scale characteristics of the multi-defect target are learned and judged through the attention mechanism capable of sensing scales, the remote dependency relationship can be effectively captured with low calculation cost, the distinguishing characteristics of the defect target are effectively focused, the interference of a complex background is inhibited, and the problem that the characteristics such as background textures are easily confused with the defects is effectively avoided.

Preferably, the first and second electrodes are formed of a metal,

the attention scale perception module carries out aggregation processing on the deepest layer characteristics F through parallel maximum pooling operation and average pooling operation to obtain average pooling results and maximum pooling results of the deepest layer characteristics F;

it will be appreciated that as shown in FIG. 3, the feature F is processed by the attention scale perception module, which will first use parallel max-pooling or average-pooling operations

Is polymerized to

、/>

Whereby highly relevant context information is extracted from each line of the feature F, operating->

And &>

As shown in

formulas

1 and 2, respectively:

（1）

（2）

in the formula (I), the compound is shown in the specification,

represents a maximum pooling or average pooling operation;

preferably, the first and second electrodes are formed of a metal,

the outputting of the segmentation result of the defective region of the image to be segmented comprises:

carrying out U-net type channel dimension splicing fusion on the sampling result and a feature map on the upper layer of the deepest feature F, and carrying out convolution operation and activation operation on the fusion result to obtain a feature X1;

Will be characterized by X _m Outputting a segmentation result of a defect area of the image to be segmented after one layer of convolution;

it will be appreciated that the feature F output by the attention scale sensing module is shown in FIG. 2 _SSAM Executing up-sampling operation, then carrying out U-net type channel dimension splicing and fusion with M1, then carrying out convolution and activation on the fusion result to obtain characteristic X1, and similarly carrying out up-sampling on X1, andand fusing the characteristics M2, convolving the fusion result, activating to obtain the characteristics M2, sequentially obtaining X3 and X4 in the same way, and finally outputting the segmentation result of the defect area after the X4 is convolved by one layer.

Preferably, the first and second liquid crystal display panels are,

generating attention A emphasizing corresponding features in the deepest features F through a softmax function on the basis of the features F1 and the features F2;

it can be appreciated that the attention metric perception module applies the 1 × 1 convolutional layer for weight sharing to the 1 × 1 convolutional layer

And

in the above (equation 3 and equation 4), the feature F1 and the feature F2 are obtained, the height context information of each feature in F is transferred, and an attention map (equation 5) is generated by using a softmax function, the importance of the corresponding feature in F is emphasized, the attention map can dynamically select a proper scale feature, and the features of different scales are fused through self-learning:

（3）

（4）

（5）

in the formula (I), the compound is shown in the specification,

representing the convolution operation of weight sharing.

Preferably, the first and second liquid crystal display panels are,

the generating scaled features Fscale includes:

performing element multiplication operations among the attention A, the feature F1 and the feature F2 respectively for generating a scaled feature Fscale;

it will be appreciated that attention is directed to A and F, respectively ₁ 、F ₂ Performs element multiplication operation to generate scaled feature Fscale (equation 6), and finally, utilizes learnable parameters

To F _scale And performing element summation operation between the input characteristic F and the output characteristic F to obtain the final output characteristic F _SSAM (formula 7):

（6）

（7）

preferably, the first and second electrodes are formed of a metal,

Repeating the steps, and obtaining the deepest layer characteristic F after convolution and pooling operation for preset times;

it will be appreciated that the images are shown in FIG. 2

Inputting the segmentation network, and obtaining a plurality of characteristics with different levels after multi-layer convolution and poolingM4, M3, M2, M1 and F, wherein F is the deepest layer, and M4 is the shallowest layer.

Example two

The embodiment also discloses a system schematic diagram of a lightweight attention scale perception guided U-net apparatus, as shown in fig. 4, including:

the characteristic diagram obtaining module 1: the method comprises the steps that images to be segmented are input into a segmentation network, after multilayer convolution and pooling operation, multilayer feature maps with different levels are obtained, and the deepest feature F is selected;

an input module 2: the system comprises an attention scale sensing module, a clustering module and a storage module, wherein the attention scale sensing module is used for inputting the deepest layer characteristics F into the attention scale sensing module and obtaining an average pooling result and a maximum pooling result of the deepest layer characteristics F;

convolution application module 3: the attention sensing module is used for respectively applying the 1 × 1 convolutional layer shared by the weights to the average pooling result and the maximum pooling result to obtain a feature F1 and a feature F2;

the attention map generation module 4: generating attention A for emphasizing corresponding features in the deepest features F based on the features F1 and the features F2;

the scaling module 5: for generating a scaled feature Fscale based on feature F1, feature F2 and attention a;

element summation module 6: the method is used for carrying out element summation operation on the feature Fscale and the input deepest feature F through the learnable parameter a to obtain the feature F _SSAM ；

An output module 7: features F for sensing module output for attention scale _SSAM Executing up-sampling operation, and fusing the sampling result with the multilayer feature graphs of different levels to obtain a feature X _m Let the feature X _m Outputting a segmentation result of a defect area of the image to be segmented after one layer of convolution;

it can be understood that the image to be segmented is input into the segmentation network through the feature map acquisition module 1, after multilayer convolution and pooling operations, multilayer feature maps with different levels are obtained, the deepest feature F is selected, and the deepest feature F is input into the attention scale perception module through the input module 2 for acquisitionAverage pooling results and maximum pooling results of the deepest layer features F; the convolution application module 3 respectively applies the 1 × 1 convolution layer shared by the weights to the average pooling result and the maximum pooling result through the attention sensing module to obtain a feature F1 and a feature F2; the attention map generation module 4: generating attention A for emphasizing corresponding features in the deepest features F based on the features F1 and the features F2; the attention map generation module 4 generates attention a emphasizing a corresponding feature in the deepest feature F based on the feature F1 and the feature F2; the scaling module 5 generates a scaled feature Fscale based on the feature F1, the feature F2 and the attention a; the element summation module 6 carries out element summation operation between the feature Fscale and the input deepest feature F through the learnable parameter a to obtain the feature F _SSAM (ii) a The output module 7 is used for sensing the characteristics F output by the module on the attention scale _SSAM Executing up-sampling operation, and fusing the sampling result with the multilayer feature graphs of different levels to obtain a feature X _m Let the feature X _m Outputting a segmentation result of a defect area of the image to be segmented after one layer of convolution; according to the scheme, the attention scale perception module is used, the scale characteristics of the multi-defect target are learned and judged through the attention mechanism capable of perceiving the scale, the remote dependency relationship can be effectively captured with low calculation cost, the discriminant characteristics of the defect target are effectively focused, the interference of a complex background is inhibited, and the problem that the characteristics such as background textures are easily confused with the defects is effectively avoided.

EXAMPLE III

The present embodiment also discloses a storage medium storing a computer program which, when executed by a processor, implements each step in the attention metric perception guided lightweight U-net method as described in any one of the above;

it will be appreciated that the storage medium referred to above may be a read-only memory, a magnetic or optical disk, or the like.

It is understood that the same or similar parts in the above embodiments may be mutually referred to, and the same or similar parts in other embodiments may be referred to for the content which is not described in detail in some embodiments.

It should be noted that the terms "first," "second," and the like in the description of the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the present invention, the meaning of "a plurality" means at least two unless otherwise specified.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.

It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following technologies, which are well known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

It will be understood by those skilled in the art that all or part of the steps carried out in the method of implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and the program, when executed, includes one or a combination of the steps of the method embodiments.

In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.

In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are exemplary and not to be construed as limiting the present invention, and that changes, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims

1. An attention scale perceptually-guided lightweight U-net method, the method comprising:

the attention sensing module respectively applies the 1 multiplied by 1 convolutional layers shared by the weights to the average pooling result and the maximum pooling result to obtain a characteristic F1 and a characteristic F2;

based on the features F1 and the features F2, generating attention A emphasizing corresponding features in the deepest features F through a softmax function, wherein the attention A is used for emphasizing the importance of the corresponding features in the deepest features F;

generating a scaled feature Fscale based on the feature F1, the feature F2 and the attention A;

Feature F output by attention scale perception module _SSAM Executing up-sampling operation, and fusing the sampling result with multi-layer characteristic graphs of different levels to obtain a characteristic X _m Let the feature X _m And outputting the segmentation result of the defect area of the image to be segmented after one layer of convolution.

2. The method of claim 1,

inputting the deepest layer characteristics F into an attention scale perception module;

3. The method of claim 2,

then, the characteristic X1 is subjected to up-sampling operation, the steps are repeated until U-net type channel dimension splicing fusion is carried out on the characteristic X1 and the shallowest layer characteristic diagram, and convolution operation and activation operation are carried out on the fusion result to obtain the characteristic X _m ；

Will feature X _m And outputting a segmentation result of the defect area of the image to be segmented after one-layer convolution.

4. The method of claim 3,

the generating a scaled feature Fscale comprises:

an element multiplication operation is performed between the attention a, the feature F1 and the feature F2, respectively, for generating a scaled feature Fscale.

5. The method of claim 4,

6. An attention metric perceptually-guided lightweight U-net device, said device comprising:

a feature map acquisition module: the system is used for inputting an image to be segmented into a segmentation network, obtaining a plurality of layers of feature maps with different levels after multilayer convolution and pooling operation, and selecting a deepest feature F;

an input module: the system is used for inputting the deepest features F into the attention scale perception module and obtaining an average pooling result and a maximum pooling result of the deepest features F;

a convolution application module: the 1 × 1 convolutional layer is used for respectively applying the weight sharing values to the average pooling result and the maximum pooling result through the attention sensing module to obtain a characteristic F1 and a characteristic F2;

7. A storage medium storing a computer program which, when executed by a processor, performs the steps of the attention metric aware guided lightweight U-net method according to any of claims 1 to 5.