CN113656628B - Crane image retrieval method based on attention mechanism and feature fusion - Google Patents


Info

Publication number
CN113656628B
CN113656628B
Authority
CN
China
Prior art keywords
crane
VAMAC
GRMAAC
similarity
pooling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110565871.8A
Other languages
Chinese (zh)
Other versions
CN113656628A (en)
Inventor
Zhang Yanchao (张燕超)
Li Xiangdong (李向东)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Special Equipment Safety Supervision Inspection Institute of Jiangsu Province
Original Assignee
Special Equipment Safety Supervision Inspection Institute of Jiangsu Province
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Special Equipment Safety Supervision Inspection Institute of Jiangsu Province
Priority to CN202110565871.8A
Publication of CN113656628A
Application granted
Publication of CN113656628B
Legal status: Active

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 — Information retrieval of still image data
    • G06F 16/58 — Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583 — Retrieval using metadata automatically derived from the content
    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a crane image retrieval method based on an attention mechanism and feature fusion, which comprises the following steps: fine-tuning a backbone network with a fine-grained image classification task as the training target; extracting the VAMAC and GRMAAC features of the crane pictures in the database and storing them in the corresponding VAMAC and GRMAAC feature libraries; extracting the VAMAC and GRMAAC features of the crane picture to be queried; computing the cosine similarity between the VAMAC features of the query picture and the VAMAC features in the VAMAC feature library, and likewise between its GRMAAC features and the GRMAAC feature library; fusing and ranking the similarities; and selecting the several most similar pictures. The invention can quickly retrieve the matching crane and fittings from the database in real time, saves labor cost, eliminates the interference of subjective human factors, and offers high efficiency, high accuracy, and strong practicability.

Description

Crane image retrieval method based on attention mechanism and feature fusion
Technical Field
The invention belongs to the technical field of image retrieval, and particularly relates to a crane image retrieval method based on attention mechanism and feature fusion.
Background
In recent years, with China's economic development and social progress, the scale of infrastructure construction has kept growing, and with it the demand for cranes. In actual engineering construction, a crane must be selected according to the requirements, or suitable fittings must be found and a crane meeting the appropriate production standard assembled on site. In many cases, however, the manufacturer does not know the parameters and usage standards required for a particular crane or fitting. Finding the right equipment by manually comparing every piece of equipment is highly repetitive, inefficient, and labor-intensive. Image retrieval methods are therefore needed to help production personnel obtain the specific parameters and usage standards of a particular crane or fitting from pictures taken on site.
Conventional image retrieval methods include:
(1) Non-deep-learning methods: SIFT features (i.e., Scale-Invariant Feature Transform) are extracted from the database pictures and stored as a feature database; SIFT features are then extracted from the query picture, and the pictures with the greatest similarity are matched as the retrieval result. However, this approach is time-consuming and cannot satisfy real-time query requirements, and its accuracy is too low to meet the demands of practical work;
(2) Deep-learning-based methods: a backbone network first produces convolutional-layer features, a pooling method then derives the features of the database images and the query image, and the most similar images are matched as the retrieval result.
In summary, although image retrieval technology has developed greatly, it has rarely been applied to the retrieval of machinery images. Existing image retrieval methods are reliable in their own specific fields, but those fields differ considerably from mechanical equipment: the characteristics of crane equipment are not taken into account, so the accuracy in this domain is still not high enough and needs to be improved further.
Disclosure of Invention
The invention provides a crane image retrieval method, based on an attention mechanism and feature fusion, that achieves high accuracy.
The technical scheme adopted by the invention is as follows:
a crane image retrieval method based on attention mechanism and feature fusion comprises the following steps:
s1, constructing a backbone network, and performing fine tuning training by taking an image fine classification task as a target;
s2, extracting VAMAC characteristics of the crane picture in the database by a MAC pooling method integrated with a variable attention mechanism, and storing the VAMAC characteristics in a VAMAC characteristic library;
s3, extracting GRMAAC characteristics of crane pictures in a database by a GRMAC multi-scale frame pooling method of Avg-pooling and Lp-pooling, and storing the GRMAAC characteristics in a GRMAAC characteristic library;
s4, extracting VAMAC characteristics and GRMAAC characteristics of the crane picture to be inquired by the methods of S2 and S3;
s5, calculating the similarity between VAMAC characteristics of the crane picture to be inquired and VAMAC characteristics in a VAMAC characteristic library by adopting cosine similarity, calculating the similarity between GRMAAC characteristics of the crane picture to be inquired and GRMAAC characteristics in a GRMAAC characteristic library, and respectively generating the VAMAC similarity and the GRMAAC similarity; adding the two similarities by a specific coefficient to obtain a total similarity table, and sorting the total similarity table according to the similarities;
and S6, selecting a plurality of similarity degrees which are ranked most at the front in the total similarity table, taking crane pictures in a plurality of corresponding databases as output results, and associating the crane pictures in each database with information and technical parameters of corresponding cranes or equipment for inquiry.
Further, step S1 comprises: selecting ResNet101 pre-trained on the ImageNet dataset as the backbone network, and fine-tuning it with a screened and cleaned Products-10k dataset.
Further, step S2 comprises: generating a variable attention mask with a variable attention mechanism; filtering the convolutional features extracted by the backbone network with the generated mask, retaining the information of the crane target while filtering out background information that interferes with retrieval; and applying Max-pooling to the filtered convolutional features to obtain the VAMAC features.
Further, the variable attention mechanism is computed as follows: for the convolutional feature map F (H×W×C) output by the backbone network, where H, W, and C denote its height, width, and number of channels, the features are summed along the channel dimension to obtain a summed feature map A (H×W); the minimum pixel value A_min of A is selected; the p-norm of the difference between each pixel value of A and A_min is taken, and the results are summed and normalized to obtain the mask discrimination threshold T; in A, pixels whose value is greater than the threshold T are set to 1 and pixels whose value is smaller than T are set to 0, yielding the variable attention mask, in which pixels of value 1 belong to the target and pixels of value 0 belong to the background; a mask adapted to the task characteristics is obtained by adjusting the value of p.
Further, step S3 comprises: filtering the convolutional features extracted by the backbone network with the variable attention mask, and then extracting the GRMAAC features with a GRMAC multi-scale frame pooling method fused with Avg-pooling and Lp-pooling.
In the GRMAC multi-scale frame pooling method, the MAC pooling used in the frames of all scales of the multi-scale pooling framework is improved: Avg-pooling is adopted for the large-scale frame; Max-pooling is retained for the small-scale frames; and Lp-pooling is adopted for the medium-scale frames between the large and small scales.
Further, in steps S2 and S3, the VAMAC and GRMAAC features of the database crane pictures are extracted offline; in step S4, the VAMAC and GRMAAC features of the crane picture to be queried are extracted online in real time.
Further, in step S6, the seven database crane pictures with the highest similarity are selected and arranged in descending order as the image retrieval result.
The invention has the beneficial effects that:
the invention can quickly retrieve the crane and accessories in the database in real time through the pictures shot on site, saves the labor cost and eliminates the interference of human subjective factors, and has the advantages of high efficiency, high accuracy, strong practicability and the like. Meanwhile, the method can be popularized and applied to graph retrieval in other fields, and applicability is strong.
Drawings
FIG. 1 is a flow chart of a crane image retrieval method based on attention mechanism and feature fusion according to the present invention;
FIG. 2 is a schematic diagram of an algorithm framework of the present invention;
FIG. 3 is a schematic diagram of a variable attention mask according to the present invention;
FIG. 4 is a schematic diagram of an improved multi-scale frame of the present invention;
FIG. 5 is a diagram illustrating the search result according to the present invention.
Detailed Description
The crane image retrieval method based on attention mechanism and feature fusion of the present invention is further described in detail below with reference to the accompanying drawings and specific embodiments.
A crane image retrieval method based on attention mechanism and feature fusion is disclosed; its flow chart is shown in FIG. 1 and its algorithm framework in FIG. 2. The method comprises the following steps:
S1, constructing a backbone network, and performing fine-tuning training with a fine-grained image classification task as the target. This comprises the following:
ResNet101 pre-trained on the ImageNet dataset is selected as the backbone network and fine-tuned with a screened and cleaned Products-10k dataset (the pictures in the processed dataset are similar to cranes and their fittings), so that the backbone network acquires the ability to extract fine-grained image features, which are well suited to the image retrieval task.
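For illustration, a minimal sketch of this fine-tuning step in PyTorch follows. The patent specifies only the backbone (ResNet101 pre-trained on ImageNet) and the dataset (a screened and cleaned Products-10k); the dataset path, class count, and training hyperparameters below are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

# ResNet101 pre-trained on ImageNet, as selected in step S1.
backbone = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)

# Replace the classification head for the fine-grained classification task.
num_classes = 1000  # hypothetical: classes kept after screening Products-10k
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
# Hypothetical directory layout: one sub-folder per class.
train_set = datasets.ImageFolder("products10k_cleaned/train", transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

optimizer = torch.optim.SGD(backbone.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

backbone.train()
for images, labels in loader:  # one epoch shown; the schedule is an assumption
    optimizer.zero_grad()
    loss = criterion(backbone(images), labels)
    loss.backward()
    optimizer.step()
```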
S2, extracting the VAMAC features of the crane pictures in the database with a MAC pooling method fused with a variable attention mechanism, and storing them in a VAMAC feature library.
This comprises the following: a variable attention mask is generated with a variable attention mechanism (the mask is shown in FIG. 3); the convolutional features extracted by the backbone network are filtered with the generated mask, retaining the information of the crane target while filtering out background information that interferes with retrieval; and Max-pooling is applied to the filtered convolutional features to obtain the VAMAC features. Finally, the extracted VAMAC features are stored in the VAMAC feature library.
The variable attention mechanism is computed as follows. For the convolutional feature map F (H×W×C) output by the backbone network, H, W, and C denote its height, width, and number of channels. The features are summed along the channel dimension to obtain a summed feature map A (H×W), and the minimum pixel value A_min of A is selected. The p-norm of the difference between each pixel value of A and A_min is taken, and the results are summed and normalized to obtain the mask discrimination threshold T. In A, pixels whose value is greater than the threshold T are set to 1, and pixels whose value is smaller than T are set to 0, yielding the variable attention mask: pixels of value 1 belong to the target and pixels of value 0 belong to the background. By adjusting the p value, the variable attention mask can be adapted to the characteristics of crane pictures (through extensive experiments with accuracy as the criterion and all other parameters fixed, a suitable p value is found that makes the mask retain more crane information and filter out more background information). Most crane and crane-fitting targets occupy a large area and are sensitive to global features but insensitive to local ones, so a small p value is needed to generate a wide mask. As shown in Table 1, the experimental results confirm that a smaller p value improves the image retrieval accuracy for crane targets. (A code sketch of this computation follows Table 1.)
Table 1. VAMAC parameter adjustment

No. | p | mAP@7
1 | 0.5 | 0.4558
2 | 1.0 | 0.4597
3 | 2.0 | 0.4572
4 | 3.0 | 0.4514
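As a minimal sketch of the mask computation and VAMAC pooling described above, assuming one plausible reading of the threshold rule (the per-pixel differences A − A_min are raised to the power p, averaged over the map as the normalization, and the p-th root added back to A_min; the patent does not fully pin down the normalization):

```python
import torch

def vamac(feature_map: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    """VAMAC sketch: variable-attention mask followed by MAC (max) pooling.

    feature_map: convolutional features F of shape (C, H, W) from the backbone.
    The threshold normalization is an assumption, as noted above.
    """
    # Sum F along the channel dimension -> summed feature map A (H, W).
    A = feature_map.sum(dim=0)
    A_min = A.min()
    # p-norm of the per-pixel differences, summed and normalized (assumed:
    # mean over pixels), giving the mask discrimination threshold T.
    T = A_min + (A - A_min).pow(p).mean().pow(1.0 / p)
    # Variable attention mask: 1 = crane target, 0 = background.
    mask = (A > T).float()
    # Filter the features with the mask, then Max-pooling over space (MAC).
    filtered = feature_map * mask            # broadcasts over channels
    return filtered.amax(dim=(1, 2))         # VAMAC descriptor, shape (C,)
```

A small p flattens the differences (A − A_min)^p, which lowers T and widens the mask, matching the observation above that wide masks suit large crane targets.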
S3, extracting the GRMAAC features of the crane pictures in the database with a GRMAC multi-scale frame pooling method fused with Avg-pooling and Lp-pooling, and storing them in a GRMAAC feature library. This comprises the following:
The convolutional features extracted by the backbone network are filtered with the variable attention mask; the GRMAAC features are then extracted with the GRMAC multi-scale frame pooling method fused with Avg-pooling and Lp-pooling and stored as the GRMAAC feature library.
The GRMAC multi-scale frame pooling method applies MAC pooling in the frames of every scale of the multi-scale pooling framework; the improved multi-scale framework is shown in FIG. 4. The multi-scale frames in the GRMAC method come in three scales; the number of frames differs per scale, the frames of each scale overlap one another, and together they cover the whole convolutional feature map. For the large-scale frame, which contains both salient and detail information, average pooling (Avg-pooling) is adopted to extract features carrying both. The small-scale frames are numerous and each contains little information, so the most important content must be extracted; maximum pooling (Max-pooling) is used to capture the most salient information. For the medium-scale frames, Lp-pooling, which lies between maximum and average pooling, extracts features that balance salient and detail information. The three sets of features are then summed with suitable coefficients, obtained by parameter tuning, to produce the GRMAAC features. As shown in Table 2, good accuracy is obtained by adjusting the summation coefficients through extensive experiments. (A code sketch of this pooling scheme follows Table 2.)
For the multi-scale frame pooling summation coefficients q_t (t indexes the frame scale) in GRMAAC, the optimal summation coefficients are obtained experimentally, with accuracy as the criterion and all other parameters fixed, so that the extracted features contain sufficient salient and detail information. The GRMAAC similarity and the VAMAC similarity between the crane picture to be queried and each database crane picture are computed separately; the two similarities are then summed with a suitable weighting coefficient p_s, found by fixing the other parameters and testing with accuracy as the criterion, which ensures that the more informative similarity takes the larger share of the total similarity.
Table 2. GRMAC parameter adjustment

No. | p_1 | p_2 | mAP@7
1 | 0.5 | 0.5 | 0.4402
2 | 0.5 | 1.0 | 0.4395
3 | 0.5 | 2.0 | 0.4306
4 | 1.0 | 0.5 | 0.4369
5 | 1.0 | 1.0 | 0.4325
6 | 1.0 | 2.0 | 0.4257
7 | 2.0 | 0.5 | 0.4213
8 | 2.0 | 1.0 | 0.4129
9 | 2.0 | 2.0 | 0.4076
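A sketch of this improved multi-scale frame pooling follows. The patent specifies three scales of overlapping frames covering the whole feature map, with Avg-pooling for the large frame, Lp-pooling for the medium frames, and Max-pooling for the small frames, but not the exact frame grid; the R-MAC-style layout, the default summation coefficients q, and the medium-scale exponent below are therefore assumptions.

```python
import torch

def lp_pool(x: torch.Tensor, p: float) -> torch.Tensor:
    # Lp-pooling over the spatial dims: p = 1 gives average pooling and
    # p -> infinity approaches max pooling.
    return x.clamp(min=0).pow(p).mean(dim=(1, 2)).pow(1.0 / p)

def grmaac(filtered: torch.Tensor, q=(1.0, 1.0, 1.0), p_mid: float = 2.0) -> torch.Tensor:
    """GRMAAC sketch: mask-filtered features (C, H, W) pooled over overlapping
    frames at three scales and summed with coefficients q_t."""
    C, H, W = filtered.shape
    descriptor = torch.zeros(C)
    for t, scale in enumerate((1, 2, 3)):      # scale 1 = large ... 3 = small
        # Overlapping square frames, R-MAC style (assumed layout).
        size = max(1, min(H, W) * 2 // (scale + 1))
        ys = torch.linspace(0, H - size, steps=scale).long().tolist()
        xs = torch.linspace(0, W - size, steps=scale).long().tolist()
        pooled = torch.zeros(C)
        for y in ys:
            for x in xs:
                region = filtered[:, y:y + size, x:x + size]
                if scale == 1:                 # large frame: Avg-pooling
                    pooled += region.mean(dim=(1, 2))
                elif scale == 3:               # small frames: Max-pooling
                    pooled += region.amax(dim=(1, 2))
                else:                          # medium frames: Lp-pooling
                    pooled += lp_pool(region, p_mid)
        descriptor += q[t] * pooled            # weighted sum with q_t
    return descriptor
```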
S4, the VAMAC and GRMAAC features of the crane picture to be queried (shot by the user on site) are extracted by the methods of S2 and S3.
In steps S2 and S3, the VAMAC and GRMAAC features of the database crane pictures are extracted offline. In step S4, the VAMAC and GRMAAC features of the crane picture to be queried are extracted online in real time.
S5, the cosine similarity between the VAMAC features of the crane picture to be queried and the VAMAC features in the VAMAC feature library is computed, and likewise between its GRMAAC features and the GRMAAC features in the GRMAAC feature library, yielding the VAMAC similarity and the GRMAAC similarity respectively. The two similarities are summed with a specific coefficient (a coefficient selected by experiment) to obtain a total similarity table, which is sorted by similarity.
S6, the top-ranked entries of the total similarity table are selected and the corresponding database crane pictures are output as the result; each database crane picture is associated with the information and technical parameters of the corresponding crane or equipment for query. In this embodiment, the seven crane pictures with the highest similarity are selected and arranged in descending order as the image retrieval result. (A code sketch of steps S5 and S6 follows.)
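A sketch of the similarity fusion and top-7 selection in steps S5-S6 follows; the fusion-weight name p_s comes from the tuning discussion above, and its default value is taken from the best row of Table 3, so both should be read as illustrative.

```python
import torch
import torch.nn.functional as F

def retrieve(query_vamac, query_grmaac, db_vamac, db_grmaac,
             p_s: float = 1.7, top_k: int = 7):
    """Return indices of the top_k most similar database pictures.

    db_vamac / db_grmaac: (N, C) feature libraries computed offline;
    query_*: (C,) descriptors of the picture shot on site.
    p_s = 1.7 corresponds to the best mAP@7 row in Table 3 (illustrative).
    """
    sim_vamac = F.cosine_similarity(db_vamac, query_vamac.unsqueeze(0), dim=1)
    sim_grmaac = F.cosine_similarity(db_grmaac, query_grmaac.unsqueeze(0), dim=1)
    total = sim_vamac + p_s * sim_grmaac       # weighted sum -> total similarity
    return torch.topk(total, k=top_k).indices  # already in descending order
```

The returned indices map back to the database pictures, whose associated crane or equipment information and technical parameters are then shown to the user.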
Steps S4 to S6 are all completed online in real time; they take very little time, giving a good user experience. Table 3 shows the parameter tuning process of the VM-Net technique of the invention, which fuses the GRMAAC similarity and the VAMAC similarity. In FIG. 5, the first five pictures are selected as the retrieval result, and the correct results are marked with a box. (A sketch of the tuning procedure follows Table 3.)
Table 3. VM-Net fusion parameter adjustment

No. | q_t | mAP@7
1 | 0.8 | 0.4767
2 | 0.9 | 0.4775
3 | 1.0 | 0.4832
4 | 1.1 | 0.4895
5 | 1.2 | 0.4875
6 | 1.3 | 0.4886
7 | 1.4 | 0.4947
8 | 1.5 | 0.4930
9 | 1.6 | 0.4952
10 | 1.7 | 0.4961
11 | 1.8 | 0.4932
12 | 1.9 | 0.4824
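The tuning procedure behind Tables 1-3 fixes all parameters except one and scores each candidate value by mAP@7 ("Map@7" in the original tables). A sketch follows, reusing the retrieve function sketched earlier; the exact metric definition is an assumption, since the patent names the measure but does not define it.

```python
def average_precision_at_k(ranked_rel, k: int = 7) -> float:
    # AP@k for one query: ranked_rel[i] is True when the i-th retrieved
    # picture is a correct match (assumed metric definition).
    hits, precision_sum = 0, 0.0
    for i, rel in enumerate(ranked_rel[:k]):
        if rel:
            hits += 1
            precision_sum += hits / (i + 1)
    return precision_sum / max(hits, 1)

def tune_fusion_weight(queries, db_vamac, db_grmaac, relevance, candidates):
    # Grid search over the fusion coefficient with other parameters fixed,
    # scoring each candidate by mean AP@7 over all queries (cf. Table 3).
    best_w, best_map = None, -1.0
    for w in candidates:
        aps = []
        for (q_vamac, q_grmaac), rel in zip(queries, relevance):
            idx = retrieve(q_vamac, q_grmaac, db_vamac, db_grmaac, p_s=w)
            aps.append(average_precision_at_k([rel[i] for i in idx.tolist()]))
        mean_ap = sum(aps) / len(aps)
        if mean_ap > best_map:
            best_w, best_map = w, mean_ap
    return best_w, best_map
```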
The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the invention is not limited to the embodiments described above: the foregoing embodiments and description merely illustrate its principles, and various changes and modifications may be made without departing from the spirit and scope of the invention as defined by the appended claims, the specification, and their equivalents.

Claims (3)

1. A crane image retrieval method based on attention mechanism and feature fusion is characterized by comprising the following steps:
s1, constructing a backbone network, and performing fine tuning training by taking an image fine classification task as a target;
s2, extracting VAMAC characteristics of the crane picture in the database by a MAC pooling method integrated with a variable attention mechanism, and storing the VAMAC characteristics in a VAMAC characteristic library;
s3, extracting GRMAAC characteristics of crane pictures in a database by a GRMAC multi-scale frame pooling method of Avg-pooling and Lp-pooling, and storing the GRMAAC characteristics in a GRMAAC characteristic library;
s4, extracting VAMAC characteristics and GRMAAC characteristics of the crane picture to be inquired by the methods of S2 and S3;
s5, calculating the similarity between VAMAC characteristics of the crane picture to be inquired and VAMAC characteristics in a VAMAC characteristic library by adopting cosine similarity, calculating the similarity between GRMAAC characteristics of the crane picture to be inquired and GRMAAC characteristics in a GRMAAC characteristic library, and respectively generating the VAMAC similarity and the GRMAAC similarity; adding the two similarities by a specific coefficient to obtain a total similarity table, and sorting the total similarity table according to the similarities;
s6, selecting a plurality of similarity degrees which are ranked most forward in the total similarity table, taking the crane pictures in the corresponding databases as output results, and associating the information and technical parameters of the corresponding crane or equipment with the crane pictures in each database for inquiry;
the step S2 comprises the following steps: generating a variable attention mask using a variable attention mechanism, with the variable attention generatedThe method comprises the steps that an intention mask filters convolutional layer characteristics extracted by a main network, information of a crane target is reserved, background information influencing retrieval is filtered, and Max-posing is carried out on the filtered convolutional layer characteristics to obtain VAMAC characteristics; the specific calculation method of the variable attention mechanism comprises the following steps: for the convolutional layer characteristics F (H multiplied by W multiplied by C) output by the backbone network, H, W and C sequentially represent the height, width and channel number of the convolutional layer characteristics; adding the convolution layer features along the dimension of the channel to obtain a sum feature layer A (H multiplied by W), and selecting the minimum pixel value A in the sum feature layer A min Calculating the sum of each pixel value in the sum feature layer A min The difference value is subjected to p norm, and the p norm is added and standardized to obtain a mask discrimination threshold T; for the sum feature layer A, judging that the pixel value of the pixel which is larger than the threshold value T is 1; judging the pixel value smaller than the threshold value T to be 0 to obtain a variable attention mask, wherein the pixel with the value of 1 in the mask belongs to the target, and the pixel with the value of 0 belongs to the background; obtaining a mask suitable for task characteristics by adjusting the p value; the step S3 comprises the following steps: for the convolutional layer characteristics extracted from the backbone network, filtering by using a variable attention mask, and then extracting the GRMAAC characteristics by using a GRMAC multi-scale frame pooling method fused with Avg-posing and Lp-posing;
for the GRMAC multi-scale frame pooling method, the MAC pooling method is used for improvement in all scales of the multi-scale pooling frame; for the large-size frame, the Avg-pooling method is adopted; for small-scale frames, the Max-pooling method is still used; adopting an Lp-pooling method for a medium-scale frame between a large-scale frame and a small-scale frame; in the step S2 and the step S3, the process of extracting VAMAC characteristics and GRMAAC characteristics of the crane pictures in the database is performed offline; in the step S4, the process of extracting VAMAC characteristics and GRMAAC characteristics of the crane picture to be inquired is carried out on line in real time.
2. The crane image retrieval method based on attention mechanism and feature fusion according to claim 1, wherein step S1 comprises: selecting ResNet101 pre-trained on the ImageNet dataset as the backbone network, and fine-tuning it with a screened and cleaned Products-10k dataset.
3. The crane image retrieval method based on attention mechanism and feature fusion according to claim 1, wherein in step S6, the seven database crane pictures with the highest similarity are selected and arranged in descending order as the image retrieval result.
CN202110565871.8A 2021-05-24 2021-05-24 Crane image retrieval method based on attention mechanism and feature fusion Active CN113656628B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110565871.8A CN113656628B (en) 2021-05-24 2021-05-24 Crane image retrieval method based on attention mechanism and feature fusion


Publications (2)

Publication Number | Publication Date
CN113656628A | 2021-11-16
CN113656628B | 2023-03-28

Family

Family ID: 78488920

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110565871.8A Active CN113656628B (en) 2021-05-24 2021-05-24 Crane image retrieval method based on attention mechanism and feature fusion

Country Status (1)

Country Link
CN (1) CN113656628B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105701507A (en) * 2016-01-13 2016-06-22 吉林大学 Image classification method based on dynamic random pooling convolution neural network
CN111259850A (en) * 2020-01-23 2020-06-09 同济大学 Pedestrian re-identification method integrating random batch mask and multi-scale representation learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220075815A1 * 2018-11-13 2022-03-10 Semiconductor Energy Laboratory Co., Ltd. Image retrieval system and image retrieval method
CN110516085B (en) * 2019-07-11 2022-05-17 西安电子科技大学 Image text mutual retrieval method based on bidirectional attention


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Waseem Rawat. Deep convolutional neural networks for image classification: A comprehensive review. ResearchGate. Full text. *
Jiang Hongyi. A survey of object detection models and their optimization methods. Acta Automatica Sinica, 2021, Vol. 47 (No. 47). Full text. *

Also Published As

Publication number Publication date
CN113656628A (en) 2021-11-16

Similar Documents

Publication Publication Date Title
CN106778604B (en) Pedestrian re-identification method based on matching convolutional neural network
CN108898479B (en) Credit evaluation model construction method and device
CN108090499B (en) Data active labeling method and system based on maximum information triple screening network
CN111340123A (en) Image score label prediction method based on deep convolutional neural network
CN109740679B (en) Target identification method based on convolutional neural network and naive Bayes
CN108614997B (en) Remote sensing image identification method based on improved AlexNet
CN110348516B (en) Data processing method, data processing device, storage medium and electronic equipment
CN113111716A (en) Remote sensing image semi-automatic labeling method and device based on deep learning
CN114140696A (en) Commodity identification system optimization method, commodity identification system optimization device, commodity identification equipment and storage medium
CN108052918A (en) A kind of person's handwriting Compare System and method
CN113656628B (en) Crane image retrieval method based on attention mechanism and feature fusion
CN107193979B (en) Method for searching homologous images
CN113850748A (en) Point cloud quality evaluation system and method
CN110298399B (en) Freeman chain code and moment feature fusion-based pumping well fault diagnosis method
CN117079272A (en) Bullet bottom socket mark feature identification method combining manual features and learning features
CN113159199B (en) Cross-domain image classification method based on structural feature enhancement and class center matching
CN113222811B (en) Face attribute migration method based on image mask
CN111144233B (en) Pedestrian re-identification method based on TOIM loss function
CN110750672B (en) Image retrieval method based on deep measurement learning and structure distribution learning loss
CN114202694A (en) Small sample remote sensing scene image classification method based on manifold mixed interpolation and contrast learning
CN111369124A (en) Image aesthetic prediction method based on self-generation global features and attention
CN113486977B (en) Unmanned aerial vehicle surveying and mapping method and system based on deep learning
CN117392440B (en) Textile fabric retrieval method and system based on tissue structure and color classification
CN116306875B (en) Drainage pipe network sample increment learning method based on space pre-learning and fitting
CN117011719B (en) Water resource information acquisition method based on satellite image

Legal Events

Code | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant