CN112668584A

CN112668584A - Intelligent detection method for portrait of air conditioner external unit based on visual attention and multi-scale convolutional neural network

Info

Publication number: CN112668584A
Application number: CN202011545170.XA
Authority: CN
Inventors: 袁东风; 狄子钧; 梁聪
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2020-12-24
Filing date: 2020-12-24
Publication date: 2021-04-16

Abstract

The invention relates to an intelligent detection method for air conditioner outdoor unit portraits based on visual attention and a multi-scale convolutional neural network, comprising the following steps: (1) data preprocessing: manually classifying the portrait samples of the air conditioner outdoor unit to generate correct and wrong labels; (2) reading the preprocessed sample images, inputting them into a visual attention network, and generating attention distribution maps; (3) inputting the attention distribution maps into a multi-scale network for training to obtain a deep fusion feature vector; (4) training with the deep fusion feature vector as the input of a softmax classifier model; (5) inputting the verification sample set into the softmax classifier model to verify the classification accuracy, obtaining a trained softmax classifier model; (6) inputting the test sample set into the trained softmax classifier model to obtain a correct or wrong classification result for the test sample set. Residual connections allow gradients to propagate during the backward pass, so that a deeper model can be trained successfully and the performance of the network is improved.

Description

Intelligent detection method for portrait of air conditioner external unit based on visual attention and multi-scale convolutional neural network

Technical Field

The invention relates to the technical field of intelligent portrait detection of an air conditioner outdoor unit, in particular to an intelligent portrait detection method of the air conditioner outdoor unit based on visual attention and a multi-scale network.

Background

Air conditioner outdoor units of different models carry different icons and use different connecting pipes. Manual inspection is time-consuming and labor-intensive. Against the background of the industrial internet, it is desirable to apply neural networks to portrait detection in place of the traditional manual inspection, so that whether the connecting pipes and the icon match the product can be judged quickly and the results fed back to the factory in real time. Detection of the air conditioner outdoor unit is thereby completed efficiently and at low cost, so that the production line is managed more effectively, flexibility is enhanced, production cost is reduced, and enterprise benefits are improved. At present there is no intelligent detection technology for images of air conditioner outdoor units, and research on such a technology is expected to be applied to this scenario.

When processing an image, a conventional neural network treats all features of the image equally. By selectively assigning attention to different portions of the input, in the manner of human vision, regions of interest can be extracted from a picture or video. The extracted regions are processed and the information is progressively combined to build a dynamic internal representation of the scene or environment. A visual attention model can be used to extract the features of a target region of interest in an image, and this concept has been applied in the field of visual recognition and classification. In the problem of intelligent portrait detection of an air conditioner outdoor unit, the regions of interest are the icon and the colors of the pipe orifices, and the pixel values of these attention areas differ between air conditioner models.

Convolutional neural networks are widely used in image recognition. They extract multi-scale information from images through convolution operations, and deeper architectures can extract more subtle features. Convolutional neural networks also use sparse connections, which helps large networks avoid overfitting. In order to extract the multi-resolution features of an image, Hu proposed an improved multi-scale convolutional neural network. The network has a three-branch structure in which the branches contain different numbers of convolutional layers, and layers whose feature maps have the same size are joined by residual connections. The model can effectively extract relevant features at different levels of abstraction, and the residual connections markedly improve network optimization. Although the multi-scale network has advantages in feature extraction, it has limitations in portrait detection: when it extracts features, irrelevant information and information of interest are trained together with equal weights, which increases the difficulty of computation and analysis and lowers the efficiency of information processing.

Disclosure of Invention

Aiming at the problem of intelligent detection of the image of the air conditioner external unit, the invention aims to provide an intelligent detection method of the image of the air conditioner external unit based on visual attention and a multi-scale convolutional neural network.

In the present invention, first, the difference in pixel values between the attention areas of a picture is used as a label, and an attention mechanism is introduced to process the visual information, thereby learning the attention area and its surrounding structure and generating an attention distribution map. Second, the generated attention distribution map is input into a multi-scale network for training. The multi-scale network has a three-branch structure; the three branches contain different numbers of convolutional layers and can extract features at different resolutions, and the features are finally merged through a fully connected layer to achieve feature fusion. Within the three branches, layers whose feature maps have the same scale are joined by residual connections, so that features can pass through the network by identity mapping during the forward pass and gradients can propagate during the backward pass, which improves the performance of the network.

Interpretation of terms:

The softmax classifier model is a common linear classifier that generalizes logistic regression to multi-class classification. It is modeled as a multinomial distribution and can therefore assign samples to a number of mutually exclusive categories.
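The softmax function behind this classifier can be illustrated with a minimal sketch. The two class names mirror the correct/wrong labels used in this document; the input scores are made-up values for illustration and do not come from the invention.

```python
import math


def softmax(logits):
    """Map raw class scores to probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing on large scores.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for the two mutually exclusive classes.
CLASSES = ["correct", "wrong"]
probs = softmax([2.0, 0.5])
predicted = CLASSES[probs.index(max(probs))]
```

Because the outputs sum to 1, the class with the largest probability can be read off directly as the predicted label.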

The technical scheme of the invention is as follows:

an intelligent detection method for an outdoor unit portrait of an air conditioner based on visual attention and a multi-scale convolutional neural network comprises the following steps:

(1) data preprocessing: manually classifying the portrait samples of the air conditioner outdoor unit, wherein the areas of interest in the portrait samples are the icon and the colors of the pipe orifices of the connecting pipes; generating correct and wrong labels according to whether an icon is attached, whether the icon matches the model of the outdoor unit, whether the connecting pipes are present, and whether the colors of the pipe orifices of the connecting pipes match the model of the outdoor unit, wherein a sample with the correct label is an image of the air conditioner outdoor unit that includes the icon and the pipe orifices of the two outdoor unit connecting pipes, and in which the icon and the colors of the pipe orifices match the model of the air conditioner outdoor unit; otherwise, the sample is given the wrong label;

(2) reading the sample image preprocessed in step (1), inputting the sample image into a visual attention network, learning the regions that require attention, namely the icon and the pipe orifices of the two outdoor unit connecting pipes, together with their surrounding structures, and generating an attention distribution map; dividing the generated attention distribution maps into a training sample set, a verification sample set and a test sample set;

(3) inputting the attention distribution map into a multi-scale network for training, and realizing the feature fusion of the three convolutional layers through full connection to obtain a deep fusion feature vector;

(4) taking the deep fusion feature vector of the training sample set as the input of a softmax classifier model, taking the correct and wrong labels as the output of the softmax classifier model, and training the model formed by the multi-scale network and the softmax classifier model;

(5) inputting the verification sample set into a softmax classifier model to verify classification precision, and updating model parameters of the softmax classifier to obtain a trained softmax classifier model;

(6) inputting the test sample set into the trained softmax classifier model to obtain a correct or wrong classification result for the test sample set.

Preferably, according to the present invention, the visual attention network is formed by stacking a plurality of residual attention modules, each of which includes two branches: a trunk branch and a mask branch;

the trunk branch is a basic residual network structure that performs feature extraction on the image and generates a feature map of the same size as the original image;

the mask branch combines top-down and bottom-up structures: residual modules and down-sampling layers gradually extract high-level features and enlarge the receptive field, with down-sampling performed by pooling; the feature map is then enlarged back to the size of the original image by the same number of up-sampling layers as down-sampling layers, with up-sampling performed by bilinear interpolation; an attention mask is finally generated, so that the mask branch acts as a feature selector;

the feature map output by the trunk branch and the attention mask output by the mask branch are multiplied pixel by pixel, which assigns the weights of the attention mask to the feature map of the trunk branch. The activation function of the mask branch is a sigmoid function, so the mask values lie in (0,1); multiplying the feature map by these values weakens its output response, and after several residual attention modules are stacked, the values of the final attention distribution map become smaller and smaller, which makes training difficult. Therefore, following the residual network structure, the product is added pixel by pixel to the feature map output by the trunk branch, and the attention distribution map is finally output.

Further preferably, in the step (2), the sample image preprocessed in the step (1) is input into a visual attention network to generate an attention distribution map, specifically:

inputting the sample image x preprocessed in step (1) into the visual attention network, wherein the trunk branch extracts features and outputs a feature map T(x), and the mask branch outputs an attention mask M(x); T(x) learns attention over its features through the corresponding M(x), which acts as a soft weight on T(x); with identity mapping added to the residual attention module, the attention distribution map H(x) output by the visual attention network is given by formula (I):

H(x) = (1 + M(x)) * T(x)    (I)

in formula (I), M(x) takes values in [0,1]; when M(x) is close to 0, H(x) is close to the feature map T(x). M(x) enhances good features and suppresses noise from the trunk branch.
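Formula (I) can be checked numerically with a small sketch. The 2x2 trunk output and mask below are illustrative values only; in the method, both would come from the trunk and mask branches of a residual attention module.

```python
def attention_output(trunk, mask):
    """Apply formula (I), H = (1 + M) * T, pixel by pixel."""
    return [
        [(1.0 + m) * t for t, m in zip(t_row, m_row)]
        for t_row, m_row in zip(trunk, mask)
    ]


# Illustrative trunk feature map T and attention mask M (values in [0, 1]).
T = [[2.0, 4.0], [6.0, 8.0]]
M = [[0.0, 0.5], [1.0, 0.25]]
H = attention_output(T, M)

# Without the identity term, five stacked modules with a mask value of 0.5
# would shrink a unit response to 0.5 ** 5 of its original value.
naive = 1.0
for _ in range(5):
    naive *= 0.5
```

Where the mask is 0 the trunk value passes through unchanged, which is the identity behaviour the residual form is meant to preserve.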

Preferably, according to the present invention, the multi-scale network consists of three branch models with different numbers of convolutional layers, which can effectively extract feature abstractions at different levels. The scale is adjusted through the number of convolutional layers: smaller resolution can display more local features, higher resolution can display more global features, and combining the two can effectively improve network performance. In the three branch models, layers whose feature maps have the same scale are joined by residual connections. The residual connections help the features in the multi-scale network to pass through by identity mapping during the forward pass, so that when the output of a shallow layer is already optimal, the later layers of the deep network can act as an identity mapping; they also help gradients to propagate during the backward pass, so that a deeper model can be trained successfully and the performance of the network is improved. The features are then merged through a fully connected layer, the feature maps of different resolutions being fused in parallel into a one-dimensional vector to achieve feature fusion across levels, and the output is finally obtained through the softmax classifier model.

Further preferably, the three branch models include a first branch model, a second branch model, a third branch model,

after the attention distribution map is input into the multi-scale network, it passes through the 5 convolutional layers of the first branch model, wherein after the 1st convolutional layer the feature map size is reduced to 1/4 of the original image size and the number of feature maps is increased to 4 times that of the original image; after the 2nd convolutional layer the feature map size is reduced to 1/16 of the original image size and the number of feature maps is increased to 16 times; after the 3rd convolutional layer the feature map size is reduced to 1/64 and the number of feature maps is increased to 64 times; after the 4th convolutional layer the feature map size is reduced to 1/256 and the number of feature maps is increased to 256 times; after the 5th convolutional layer the feature map size is reduced to 1/1024 and the number of feature maps is increased to 1024 times that of the original image;

it passes through the 2 convolutional layers of the second branch model, wherein after the 1st convolutional layer the feature map size is reduced to 1/16 of the original image size and the number of feature maps is increased to 16 times that of the original image; after the 2nd convolutional layer the feature map size is reduced to 1/256 of the original image size and the number of feature maps is increased to 256 times that of the original image;

it passes through the 1 convolutional layer of the third branch model, after which the feature map size is reduced to 1/16 of the original image size and the number of feature maps is increased to 16 times that of the original image.
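The size and count figures stated for the three branches follow from a fixed per-layer factor, which the sketch below reproduces. The per-layer factors (4 for the first branch, 16 for the second and third) are inferred from the figures above, and the names `BRANCH_FACTORS` and `branch_scales` are hypothetical.

```python
from fractions import Fraction

# Per-layer reduction factors, inferred from the stated sizes (assumption).
BRANCH_FACTORS = {
    "first": [4, 4, 4, 4, 4],
    "second": [16, 16],
    "third": [16],
}


def branch_scales(factors):
    """Return (relative feature-map size, feature-map multiplier) per layer."""
    scales = []
    size = Fraction(1)
    maps = 1
    for factor in factors:
        size /= factor   # feature-map size shrinks by the factor
        maps *= factor   # number of feature maps grows by the same factor
        scales.append((size, maps))
    return scales


scales = {name: branch_scales(f) for name, f in BRANCH_FACTORS.items()}
```

Using exact fractions avoids floating-point noise when the relative sizes of layers in different branches are compared.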

Further preferably, in the multi-scale network, identity mappings are introduced between layers having the same size and number of feature maps: residual connections are made between the 1st convolutional layer of the first branch model and the 1st convolutional layer of the second branch model, between the 4th convolutional layer of the first branch model and the 2nd convolutional layer of the second branch model, and between the 1st convolutional layer of the second branch model and the convolutional layer of the third branch model.
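The identity-mapping property of a residual connection can be shown with a generic sketch. This is not the wiring of FIG. 3; `residual_block` and the toy transforms are hypothetical, and features are represented as flat lists for brevity.

```python
def residual_block(x, transform):
    """Return transform(x) + x, the basic residual connection."""
    fx = transform(x)
    if len(fx) != len(x):
        # A skip connection needs matching shapes on both paths.
        raise ValueError("residual connection requires equal sizes")
    return [a + b for a, b in zip(fx, x)]


features = [1.0, -2.0, 3.0]

# If the learned transform outputs zeros, the block reduces to an identity map.
identity_out = residual_block(features, lambda v: [0.0] * len(v))

# Otherwise the transform only has to learn a correction on top of the input.
corrected = residual_block(features, lambda v: [0.5 * a for a in v])
```

The shape check reflects why the connections are made only between layers with the same size and number of feature maps.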

Further preferably, the features output by the three branch models are merged through a fully connected layer, and the feature maps of different resolutions are fused in parallel into a one-dimensional vector, thereby achieving feature fusion across levels.
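One way to picture this fusion is to flatten each branch's feature maps, concatenate them into a single vector, and pass that vector through a fully connected layer. The sketch below makes that assumption; the tiny feature maps, weights and bias are made-up values, and the function names are hypothetical.

```python
def flatten(feature_maps):
    """Flatten a list of 2-D feature maps into one flat list."""
    return [v for fmap in feature_maps for row in fmap for v in row]


def fuse(branch_outputs):
    """Concatenate the flattened outputs of all branches into one vector."""
    fused = []
    for maps in branch_outputs:
        fused.extend(flatten(maps))
    return fused


def fully_connected(vector, weights, bias):
    """Dense layer: one weighted sum plus bias per output unit."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


# Toy branch outputs with different resolutions and map counts.
branch1 = [[[1.0]]]                           # one 1x1 map
branch2 = [[[2.0, 3.0], [4.0, 5.0]]]          # one 2x2 map
branch3 = [[[6.0, 7.0]], [[8.0, 9.0]]]        # two 1x2 maps
fused = fuse([branch1, branch2, branch3])

weights = [[1.0] * len(fused), [0.0] * len(fused)]  # illustrative weights
scores = fully_connected(fused, weights, [0.0, 0.5])
```

The two resulting scores would then be the inputs to the softmax classifier model.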

The invention has the beneficial effects that:

1. The invention uses a visual attention model to generate the attention distribution map, and can extract the features of the attention area and its surrounding structure in the image.

2. The invention uses a multi-scale convolutional neural network, which can fully exploit the features of the attention distribution map at multiple resolutions; the features are merged through a fully connected layer and the feature maps of different resolutions are fused in parallel into a one-dimensional vector, achieving feature fusion across levels.

3. In the three branches of the multi-scale convolutional neural network, layers whose feature maps have the same scale are joined by residual connections. During the forward pass, the features in the network can pass through by identity mapping, so that when the output of a shallow layer is already optimal, the later layers of the deep network can act as an identity mapping. During the backward pass, the residual connections help gradients to propagate, so that a deeper model can be trained successfully and the performance of the network is improved.

Drawings

FIG. 1 is a flow chart diagram of an intelligent detection method for an outdoor unit portrait of an air conditioner based on visual attention and a multi-scale convolutional neural network.

FIG. 2 is a schematic diagram of the structure of the visual attention model of the present invention.

Fig. 3 is a schematic structural diagram of the multi-scale network of the present invention.

Detailed Description

The invention is further described below with reference to the drawings and examples, but is not limited thereto.

Example 1

(6) inputting the test sample set into the trained softmax classifier model to obtain the accuracy of the correct/wrong classification on the test sample set.

Example 2

The intelligent detection method for the portrait of the outdoor unit of the air conditioner based on the visual attention and the multi-scale convolutional neural network is characterized in that:

As shown in FIG. 2, the visual attention network is formed by stacking a plurality of residual attention modules, each of which includes two branches: a trunk branch and a mask branch. The trunk branch is a basic residual network structure that performs feature extraction on the image and generates a feature map of the same size as the original image. The mask branch combines top-down and bottom-up structures: residual modules and down-sampling layers gradually extract high-level features and enlarge the receptive field, with down-sampling performed by pooling; the feature map is then enlarged back to the size of the original image by the same number of up-sampling layers as down-sampling layers, with up-sampling performed by bilinear interpolation; an attention mask is finally generated, so that the mask branch acts as a feature selector. The feature map output by the trunk branch and the attention mask output by the mask branch are multiplied pixel by pixel, which assigns the weights of the attention mask to the feature map of the trunk branch. The activation function of the mask branch is a sigmoid function, so the mask values lie in (0,1); multiplying the feature map by these values weakens its output response, and after several residual attention modules are stacked, the values of the final attention distribution map become smaller and smaller, which makes training difficult. Therefore, following the residual network structure, the product is added pixel by pixel to the feature map output by the trunk branch, and the attention distribution map is finally output.

Inputting the sample image preprocessed in the step (1) into a visual attention network to generate an attention distribution map, specifically comprising the following steps:

H(x) = (1 + M(x)) * T(x)    (I)

The generated attention distribution maps H(x) are divided into a training set S1, a verification set S2, and a test set S3.
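The division into S1, S2 and S3 might be implemented along the following lines. The split proportions are not stated in this document, so the 70/15/15 percentages, the fixed seed and the helper name `split_samples` are all assumptions made for illustration.

```python
import random


def split_samples(samples, train_pct=70, val_pct=15, seed=0):
    """Shuffle samples and split them into training, verification and test sets.

    The percentages are hypothetical; the remainder goes to the test set.
    """
    items = list(samples)
    random.Random(seed).shuffle(items)  # seeded for a reproducible split
    n_train = len(items) * train_pct // 100
    n_val = len(items) * val_pct // 100
    s1 = items[:n_train]
    s2 = items[n_train:n_train + n_val]
    s3 = items[n_train + n_val:]
    return s1, s2, s3


# Stand-in for 100 attention distribution maps, identified by index.
S1, S2, S3 = split_samples(range(100))
```

Shuffling before slicing keeps each subset from being biased by the order in which the samples were collected.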

Example 3

The multi-scale network consists of three branch models with different numbers of convolutional layers, which can effectively extract feature abstractions at different levels. The scale is adjusted through the number of convolutional layers: smaller resolution can display more local features, higher resolution can display more global features, and combining the two can effectively improve network performance. In the three branch models, layers whose feature maps have the same scale are joined by residual connections. The residual connections help the features in the multi-scale network to pass through by identity mapping during the forward pass, so that when the output of a shallow layer is already optimal, the later layers of the deep network can act as an identity mapping; they also help gradients to propagate during the backward pass, so that a deeper model can be trained successfully and the performance of the network is improved. The features are then merged through a fully connected layer, the feature maps of different resolutions being fused in parallel into a one-dimensional vector to achieve feature fusion across levels, and the output is finally obtained through the softmax classifier model.

As shown in FIG. 3, the three branch models include a first branch model, a second branch model, and a third branch model. After the attention distribution map is input into the multi-scale network, it passes through the 5 convolutional layers of the first branch model, wherein after the 1st convolutional layer the feature map size is reduced to 1/4 of the original image size and the number of feature maps is increased to 4 times that of the original image; after the 2nd convolutional layer the feature map size is reduced to 1/16 and the number of feature maps is increased to 16 times; after the 3rd convolutional layer the feature map size is reduced to 1/64 and the number of feature maps is increased to 64 times; after the 4th convolutional layer the feature map size is reduced to 1/256 and the number of feature maps is increased to 256 times; after the 5th convolutional layer the feature map size is reduced to 1/1024 and the number of feature maps is increased to 1024 times that of the original image.

In the multi-scale network, identity mappings are introduced between layers having the same size and number of feature maps: residual connections are made between the 1st convolutional layer of the first branch model and the 1st convolutional layer of the second branch model, between the 4th convolutional layer of the first branch model and the 2nd convolutional layer of the second branch model, and between the 1st convolutional layer of the second branch model and the convolutional layer of the third branch model.

The features output by the three branch models are merged through a fully connected layer, and the feature maps of different resolutions are fused in parallel into a one-dimensional vector, achieving feature fusion across levels.

The parameter settings for the multi-scale convolutional neural network are shown in table 1.

TABLE 1

Parameter        Value
Epochs           10
Batch size       64
Learning rate    0.0003
Optimizer        Adam
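For reference, the settings in Table 1 could be held in a small configuration object such as the one sketched below. The class name `TrainingConfig`, the helper `steps_per_epoch` and the training-set size of 1000 are hypothetical; only the four parameter values come from Table 1.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Training settings taken from Table 1."""
    epochs: int = 10
    batch_size: int = 64
    learning_rate: float = 0.0003
    optimizer: str = "Adam"


def steps_per_epoch(num_samples, batch_size):
    """Number of mini-batches needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


config = TrainingConfig()
NUM_TRAIN_SAMPLES = 1000  # hypothetical training-set size, not from the source
total_steps = config.epochs * steps_per_epoch(NUM_TRAIN_SAMPLES, config.batch_size)
```

Freezing the dataclass keeps the settings from being changed accidentally between runs.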

Table 2 shows comparative data of the training results and the classification results of the test set for the multi-scale network with and without the visual attention network.

TABLE 2

Multi-scale convolutional neural network    Verification set accuracy (%)    Test set accuracy (%)
With visual attention                       87.00                            53.00
Without visual attention                    15.95                            18.75

Claims

1. An intelligent detection method for an outdoor unit portrait of an air conditioner based on visual attention and a multi-scale convolutional neural network is characterized by comprising the following steps:

(1) data preprocessing: manually classifying the portrait samples of the air conditioner outdoor unit to generate correct and wrong labels, wherein a sample is given the correct label if it is an image of the air conditioner outdoor unit that includes the icon and the pipe orifices of the two outdoor unit connecting pipes, and in which the icon and the colors of the pipe orifices of the connecting pipes match the model of the air conditioner outdoor unit; otherwise, the sample is given the wrong label;

2. The method as claimed in claim 1, wherein the visual attention network is formed by stacking a plurality of residual attention modules, each residual attention module comprising two branches: a trunk branch and a mask branch;

the feature map output by the trunk branch and the attention mask output by the mask branch are multiplied pixel by pixel, assigning the weights of the attention mask to the feature map of the trunk branch; the product is then added pixel by pixel to the feature map output by the trunk branch, and the attention distribution map is finally output.

3. The intelligent detection method for the portrait of the outdoor unit of the air conditioner based on the visual attention and the multi-scale convolutional neural network as claimed in claim 1, wherein in the step (2), the sample image preprocessed in the step (1) is input into the visual attention network to generate an attention distribution map, specifically:

H(x) = (1 + M(x)) * T(x)    (I)

in formula (I), M(x) takes values in [0,1]; when M(x) is close to 0, H(x) is close to the feature map T(x).

4. The intelligent detection method for the portrait of the outdoor unit of the air conditioner based on the visual attention and the multi-scale convolutional neural network as claimed in any one of claims 1 to 3, wherein the multi-scale network comprises three branch models with different numbers of convolutional layers, which extract feature abstractions at different levels, the scale being adjusted through the number of convolutional layers; in the three branch models, layers whose feature maps have the same scale are joined by residual connections; the residual connections help the features in the multi-scale network to pass through by identity mapping during the forward pass, so that when the output of a shallow layer is already optimal, the later layers of the deep network can act as an identity mapping, and help gradients to propagate during the backward pass, so that a deeper model can be trained successfully and the performance of the network is improved; the features are then merged through a fully connected layer, the feature maps of different resolutions being fused in parallel into a one-dimensional vector to achieve feature fusion across levels, and the output is finally obtained through a softmax classifier model.

5. The intelligent detection method for portrait of outdoor unit of air conditioner based on visual attention and multi-scale convolutional neural network of claim 4, wherein the three branch models comprise a first branch model, a second branch model and a third branch model,

6. The method as claimed in claim 5, wherein identity mappings are introduced in the multi-scale network between layers having the same size and number of feature maps, and residual connections are made between the 1st convolutional layer of the first branch model and the 1st convolutional layer of the second branch model, between the 4th convolutional layer of the first branch model and the 2nd convolutional layer of the second branch model, and between the 1st convolutional layer of the second branch model and the convolutional layer of the third branch model.

7. The intelligent detection method for the portrait of the outdoor unit of the air conditioner based on the visual attention and the multi-scale convolutional neural network as claimed in claim 6, wherein the features output by the three branch models are merged through a fully connected layer, and the feature maps of different resolutions are fused in parallel into a one-dimensional vector, thereby achieving feature fusion across levels.