CN112396115B - Attention mechanism-based target detection method and device and computer equipment - Google Patents


Info

Publication number
CN112396115B
CN112396115B (application CN202011322670.7A)
Authority
CN
China
Prior art keywords
feature
layer
image
feature map
detected
Prior art date
Legal status
Active
Application number
CN202011322670.7A
Other languages
Chinese (zh)
Other versions
CN112396115A (en)
Inventor
张国辉
杨国青
宋晨
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202011322670.7A priority Critical patent/CN112396115B/en
Publication of CN112396115A publication Critical patent/CN112396115A/en
Priority to PCT/CN2021/083935 priority patent/WO2021208726A1/en
Application granted granted Critical
Publication of CN112396115B publication Critical patent/CN112396115B/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an attention mechanism-based target detection method and apparatus and a computer device, wherein the method comprises the following steps: receiving an image to be detected input by a user; inputting the image to be detected into a convolutional neural network model, and extracting a multi-layer feature map of the image to be detected; weighting the multi-layer feature map according to an attention mechanism to obtain a weighted feature map; generating a feature pyramid of the image to be detected according to the multi-layer feature map; fusing the weighted feature map with each layer of feature maps in the feature pyramid respectively to obtain a fused feature pyramid; acquiring a feature map matched with the target image from the fused feature pyramid; and performing target detection on the matched feature map according to a target detection model to obtain the target image. Based on neural network technology in artificial intelligence, the invention fuses the features of the convolution output layers by introducing an attention mechanism, thereby greatly improving accuracy across different target detection tasks.

Description

Attention mechanism-based target detection method and device and computer equipment
Technical Field
The present invention relates to the field of target detection technologies, and in particular, to a method and apparatus for detecting a target based on an attention mechanism, and a computer device.
Background
In existing target detection technology, whether the multi-layer features of the two-stage Faster R-CNN or those of the single-stage YOLO are being fused, a feature pyramid is adopted to splice the upsampled high-layer features with the adjacent low-layer features for feature fusion. When a small-target detection task needs to be executed, a large-size feature map in the feature pyramid is needed to detect the target; when a large-target detection task needs to be executed, a small-size feature map in the feature pyramid is adopted. Although detection with a feature pyramid achieves fairly good accuracy, it still cannot reach the desired accuracy. Therefore, how to improve detection accuracy when performing different target detection tasks on the basis of a feature pyramid is the problem to be solved by the present invention.
Disclosure of Invention
The embodiments of the invention provide an attention mechanism-based target detection method, a target detection apparatus and a computer device, aiming to solve the prior-art problem that the detection accuracy achieved when performing different target detection tasks on the basis of a feature pyramid cannot meet detection requirements.
In a first aspect, an embodiment of the present invention provides a method for detecting an object based on an attention mechanism, including:
receiving an image to be detected input by a user;
inputting the image to be detected into a preset convolutional neural network model, and extracting a multi-layer feature map of the image to be detected;
weighting the multi-layer feature map according to a preset attention mechanism to obtain a weighted feature map;
generating a feature pyramid of the image to be detected according to the multi-layer feature map;
fusing the weighted feature map with each layer of feature maps in the feature pyramid respectively to obtain a fused feature pyramid;
acquiring a feature map matched with a target image in the image to be detected from the fused feature pyramid;
and performing target detection on the feature map matched with the target image according to a preset target detection model to obtain the target image in the image to be detected.
In a second aspect, an embodiment of the present invention provides an attention mechanism-based object detection apparatus, including:
the receiving unit is used for receiving the image to be detected input by the user;
the first generation unit is used for inputting the image to be detected into a preset convolutional neural network model and extracting a multi-layer feature map of the image to be detected;
the second generation unit is used for weighting the multi-layer feature map according to a preset attention mechanism to obtain a weighted feature map;
a third generating unit, configured to generate a feature pyramid of the image to be detected according to the multi-layer feature map;
the fusion unit is used for fusing the weighted feature map with each layer of feature maps in the feature pyramid respectively to obtain a fused feature pyramid;
the acquisition unit is used for acquiring a feature map matched with the target image in the image to be detected from the fused feature pyramid;
and the target detection unit is used for performing target detection on the feature map matched with the target image according to a preset target detection model to obtain the target image in the image to be detected.
In a third aspect, an embodiment of the present invention further provides a computer device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor executes the computer program to implement the method for detecting an object based on an attention mechanism according to the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a computer readable storage medium, where the computer readable storage medium stores a computer program, where the computer program when executed by a processor causes the processor to perform the method for detecting an object based on an attention mechanism according to the first aspect.
The embodiment of the invention provides an attention mechanism-based target detection method and apparatus and a computer device, wherein the method comprises the following steps: receiving an image to be detected input by a user; inputting the image to be detected into a preset convolutional neural network model, and extracting a multi-layer feature map of the image to be detected; weighting the multi-layer feature map according to a preset attention mechanism to obtain a weighted feature map; generating a feature pyramid of the image to be detected according to the multi-layer feature map; fusing the weighted feature map with each layer of feature maps in the feature pyramid respectively to obtain a fused feature pyramid; acquiring a feature map matched with the target image in the image to be detected from the fused feature pyramid; and performing target detection on the matched feature map according to a preset target detection model to obtain the target image in the image to be detected. By this method, the weights of the different feature layers can be adjusted adaptively when a target detection task is performed, making the final fused features better suited to the task, so the detection accuracy can be greatly improved at a small additional time cost.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flowchart of an attention mechanism-based target detection method according to an embodiment of the present invention;
FIG. 2 is a schematic sub-flowchart of an attention-based target detection method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of another sub-flowchart of the method for detecting an object based on an attention mechanism according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of another sub-flowchart of the method for detecting an object based on an attention mechanism according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of another sub-flowchart of the method for detecting an object based on an attention mechanism according to an embodiment of the present invention;
FIG. 6 is a schematic block diagram of an attention-based object detection device provided by an embodiment of the present invention;
FIG. 7 is a schematic block diagram of a subunit of an attention-based object detection device according to an embodiment of the present invention;
FIG. 8 is a schematic block diagram of another subunit of an attention-based object detection device according to an embodiment of the present invention;
FIG. 9 is a schematic block diagram of another subunit of an attention-based object detection device according to an embodiment of the present invention;
FIG. 10 is a schematic block diagram of another subunit of an attention-based object detection device according to an embodiment of the present invention;
FIG. 11 is a schematic block diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Referring to fig. 1, fig. 1 is a flowchart of an attention mechanism-based target detection method according to an embodiment of the present invention. The method is deployed and runs on a server. After receiving an image to be detected sent by an intelligent terminal device such as a portable computer or a tablet computer, the server extracts features from the image to obtain its multi-layer feature map, and weights the multi-layer feature map according to a preset attention mechanism to obtain a weighted feature map corresponding to each layer of the multi-layer feature map. It then convolves each layer of the multi-layer feature map again to obtain a feature pyramid of the image to be detected, and finally fuses the weighted feature map with each layer of the feature pyramid to obtain a fused feature pyramid. The fused feature pyramid is better suited to detecting the target image, so the detection accuracy can be greatly improved at a small additional time cost.
The method for detecting the target based on the attention mechanism is described in detail below. As shown in fig. 1, the method includes the following steps S110 to S170.
S110, receiving an image to be detected input by a user.
An image to be detected input by a user is received. Specifically, the image to be detected contains the feature information of the target image. The user sends the image to be detected to a server through a terminal device such as a portable computer, a tablet computer or a smartphone; after receiving it, the server can execute the attention mechanism-based target detection method to obtain the fused feature pyramid of the image to be detected, so as to adapt to different target detection tasks.
S120, inputting the image to be detected into a preset convolutional neural network model, and extracting a multi-layer feature map of the image to be detected.
The image to be detected is input into a preset convolutional neural network model, and a multi-layer feature map of the image to be detected is extracted. Specifically, the convolutional neural network model is trained in advance to extract features from an input image and produce its multi-layer feature map. After the image to be detected is input into the model, it passes in turn through a number of convolution layers, pooling layers and activation function layers; from bottom to top, the number of channels of each layer of the multi-layer feature map gradually increases while its size gradually decreases, and the features extracted by each layer are fed to the next layer as input. In other words, the multi-layer feature map consists of the feature maps of the successive convolution stages that the image passes through, with the richness of semantic information increasing and the resolution decreasing from bottom to top. The bottom-most feature map has the least semantic information and the highest resolution, and is not suitable for detecting large targets; the top-most feature map has the richest semantics and the lowest resolution, and is not suitable for detecting small targets. The convolutional neural network may be a VGG (Visual Geometry Group) convolutional neural network, a deep ResNet (Residual Network), or the like. For example, when the convolution process of the network comprises the four stages conv1, conv2, conv3 and conv4, the feature map of the last layer of each of these stages is extracted, thereby obtaining the multi-layer feature map of the image to be detected.
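By way of illustration only, the following is a minimal PyTorch sketch of this kind of multi-stage extraction, collecting the last feature map of every convolution stage; the stage layout, channel counts and input size are assumptions made for the example, not the patent's actual backbone.

```python
# Hedged sketch: a toy backbone exposing the last feature map of each stage.
import torch
import torch.nn as nn

class MultiStageBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        def stage(cin, cout):
            # convolution + pooling + activation, as described in the text
            return nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=3, padding=1),
                nn.MaxPool2d(2),
                nn.ReLU(inplace=True),
            )
        self.stages = nn.ModuleList(
            [stage(3, 64), stage(64, 128), stage(128, 256), stage(256, 512)]
        )

    def forward(self, x):
        feats = []
        for s in self.stages:      # channels grow, spatial size shrinks
            x = s(x)
            feats.append(x)        # keep the last map of every stage
        return feats               # bottom-to-top multi-layer feature map

feats = MultiStageBackbone()(torch.randn(1, 3, 224, 224))
print([f.shape for f in feats])    # [.., 64, 112, 112] ... [.., 512, 14, 14]
```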
S130, weighting the multi-layer feature map according to a preset attention mechanism to obtain a weighted feature map.
The multi-layer feature map is weighted according to a preset attention mechanism to obtain a weighted feature map. In essence, the attention mechanism resembles human selective visual attention: its core idea is to pick out, from a large amount of information, the information most relevant to the current task objective. Here the attention mechanism is used to acquire the weight of each layer of the multi-layer feature map; once these weights are acquired, the feature values of each layer are multiplied by the corresponding weights and added, thereby weighting the multi-layer feature map and obtaining the weighted feature map.
In another embodiment, as shown in fig. 2, step S130 includes: substep S131 and substep S132.
S131, obtaining the weight of each layer of feature maps in the multi-layer feature map from the convolutional neural network model according to the attention mechanism.
The weight of each layer of the multi-layer feature map is acquired from the convolutional neural network model according to the attention mechanism. In the embodiment of the invention, the attention mechanism is a spatial attention mechanism: after the image to be detected is input into the convolutional neural network model and the multi-layer feature map is obtained, each layer of the multi-layer feature map has a corresponding weight. Since the output of each layer is real-valued and the weights of all layers should sum to 1, the weights obtained according to the attention mechanism are normalized to the interval (0, 1); in the embodiment of the invention, a Sigmoid function is adopted to normalize the weight of each layer of the feature map.
S132, weighting the multi-layer feature map according to the weight of each layer of feature maps in the multi-layer feature map to obtain the weighted feature map.
After the weight of each layer of feature maps in the multi-layer feature map is acquired through the attention mechanism, the feature values of each layer are multiplied by the corresponding weights and the results are added, yielding a feature map of moderate size and semantic richness, namely the weighted feature map. The feature value of the weighted feature map is computed as F = f1×w1 + f2×w2 + … + fi×wi, where fi is the feature value of the i-th feature map in the multi-layer feature map and wi is the weight of that feature map.
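The weighted fusion above can be sketched as follows. Because the layers of the multi-layer feature map differ in size and channel count, this sketch assumes each layer is first projected by a 1×1 convolution and resized to a common shape (the feature map of moderate size mentioned above) before the weighted sum, and it normalizes the raw weights with the Sigmoid function of step S131; the helper names and shapes are illustrative.

```python
# Hedged sketch of F = f1*w1 + f2*w2 + ... + fi*wi over resized feature maps.
import torch
import torch.nn as nn
import torch.nn.functional as F

def weighted_feature_map(feats, raw_weights, projs, out_hw):
    w = torch.sigmoid(torch.stack(raw_weights))        # normalize weights to (0, 1)
    fused = 0.0
    for fi, wi, proj in zip(feats, w, projs):
        fi = proj(fi)                                  # unify channel count (1x1 conv)
        fi = F.interpolate(fi, size=out_hw, mode="bilinear", align_corners=False)
        fused = fused + wi * fi                        # fi x wi, accumulated
    return fused

feats = [torch.randn(1, c, s, s) for c, s in [(64, 56), (128, 28), (256, 14), (512, 7)]]
projs = [nn.Conv2d(c, 256, kernel_size=1) for c in (64, 128, 256, 512)]
raw_w = [nn.Parameter(torch.zeros(())) for _ in feats]  # learnable scalar per layer
fmap = weighted_feature_map(feats, raw_w, projs, out_hw=(28, 28))
print(fmap.shape)   # torch.Size([1, 256, 28, 28])
```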
S140, generating a feature pyramid of the image to be detected according to the multi-layer feature map.
A feature pyramid of the image to be detected is generated according to the multi-layer feature map. Specifically, the feature pyramid is built top-down from the multi-layer feature map and can serve target detection tasks of different scales. When a small target in the image to be detected needs to be detected, rich semantic information can be obtained by performing recognition on a large-size feature map in the feature pyramid alone; when a large target needs to be detected, a small-size feature map in the feature pyramid suffices.
In another embodiment, as shown in FIG. 3, step S140 includes sub-steps S141 and S142.
S141, convolving each layer of feature maps in the multi-layer feature map with a preset convolution kernel to obtain the convolved multi-layer feature map.
Each layer of the multi-layer feature map is convolved with a preset convolution kernel to obtain the convolved multi-layer feature map. Specifically, after this convolution, the numbers of channels of all layers in the multi-layer feature map are equal, so that a feature pyramid can be constructed from it. The size of the convolution kernel may be set according to the actual situation and is not limited here. For example, if the layers of the multi-layer feature map are C1, C2, C3, C4 and C5 in sequence from top to bottom, 1×1 convolution kernels are used to convolve C1, C2, C3, C4 and C5 so that the convolved C1, C2, C3, C4 and C5 have equal channel numbers.
S142, generating a feature pyramid of the image to be detected according to the convolved multi-layer feature map.
A feature pyramid of the image to be detected is generated according to the convolved multi-layer feature map. Specifically, the channel numbers of all layers of the convolved multi-layer feature map are equal, the number of layers of the feature pyramid equals the number of layers of the convolved multi-layer feature map, and corresponding layers have equal sizes.
In another embodiment, as shown in FIG. 4, step S142 includes sub-steps S1421 and S1422.
S1421, constructing a feature map of the top layer of the feature pyramid according to the feature map of the top layer in the convolved multi-layer feature map.
The feature map of the top layer of the feature pyramid is constructed from the top-most feature map of the convolved multi-layer feature map. Specifically, the top-most feature map of the convolved multi-layer feature map has the smallest size and the richest semantics among all its layers, so it can be used directly as the top-layer feature map of the feature pyramid.
S1422, constructing a feature map below the top layer of the feature pyramid according to the feature map of the top layer of the feature pyramid.
The feature maps below the top layer of the feature pyramid are constructed from the feature map of the top layer. The specific process is as follows: the top layer of the feature pyramid is upsampled and added to the feature map adjacent to the topmost layer in the convolved multi-layer feature map, giving the layer of the feature pyramid adjacent to the top; in this addition, the adjacent feature map in the convolved multi-layer feature map is scaled to twice its original size before being added. Repeating this operation from top to bottom in turn constructs the feature pyramid. For example: the convolved C1 is taken as the top-layer feature map P1 of the feature pyramid; P1 is upsampled while the convolved C2 is scaled to twice its original size, and the upsampled P1 and the scaled C2 are added to obtain P2, the layer adjacent to P1 in the feature pyramid; and so on, so that the feature maps of the feature pyramid are obtained in order from top to bottom: P1, P2, P3, P4, P5.
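As a rough sketch, the top-down construction below follows the standard feature-pyramid pattern of 1×1 lateral convolutions followed by 2× upsampling and element-wise addition; this is one plausible reading of the scaling step described above, not a verbatim implementation of the patent.

```python
# Hedged sketch: standard top-down pyramid construction (assumed reading).
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_pyramid(feats_top_down, laterals):
    # feats_top_down: [C1 (smallest) ... C5 (largest)], as in the example above
    ps = [laterals[0](feats_top_down[0])]               # P1 = 1x1 conv of C1
    for ci, lat in zip(feats_top_down[1:], laterals[1:]):
        up = F.interpolate(ps[-1], scale_factor=2, mode="nearest")  # 2x upsample
        ps.append(lat(ci) + up)                         # add to the next layer down
    return ps                                           # [P1, P2, P3, P4, P5]

chans = [512, 256, 128, 64, 32]                         # illustrative channel counts
cs = [torch.randn(1, c, 7 * 2 ** i, 7 * 2 ** i) for i, c in enumerate(chans)]
lats = nn.ModuleList(nn.Conv2d(c, 256, kernel_size=1) for c in chans)
print([p.shape for p in build_pyramid(cs, lats)])
```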
S150, fusing the weighted feature map with each layer of feature maps in the feature pyramid respectively to obtain a fused feature pyramid.
The weighted feature map is fused with each layer of feature maps in the feature pyramid to obtain the fused feature pyramid. Specifically, when the image to be detected is convolved in the convolutional neural network model, the objects being convolved are sets of multidimensional matrices: each layer of the resulting multi-layer feature map is a set of multidimensional matrices, as is each layer of the feature pyramid constructed from it, and the weighted feature map, obtained by multiplying the feature values of each layer by the corresponding weights and adding them, is likewise a set of multidimensional matrices. Fusing the weighted feature map with each layer of the feature pyramid therefore adds the corresponding matrices, i.e. the weighted feature map is spliced end to end with each layer of feature maps, and the resulting set of new multidimensional matrices is the fused feature pyramid. Each layer of the fused feature pyramid contains richer semantic information than the corresponding layer of the original feature pyramid, so the accuracy of target detection can be greatly improved when detecting targets of different tasks.
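A minimal sketch of this fusion step is given below; it assumes the weighted feature map is resized to each pyramid level and added element-wise, with channel-wise concatenation being the alternative reading of the end-to-end splicing described above.

```python
# Hedged sketch: fuse the weighted map into every pyramid level (assumed reading).
import torch
import torch.nn.functional as F

def fuse_pyramid(pyramid, weighted_map):
    fused = []
    for p in pyramid:
        w = F.interpolate(weighted_map, size=p.shape[-2:],
                          mode="bilinear", align_corners=False)  # match level size
        fused.append(p + w)                # element-wise matrix addition
    return fused

pyr = [torch.randn(1, 256, s, s) for s in (7, 14, 28, 56, 112)]
wmap = torch.randn(1, 256, 28, 28)         # the weighted feature map
print([f.shape for f in fuse_pyramid(pyr, wmap)])
```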
S160, acquiring a feature map matched with the target image in the image to be detected from the fused feature pyramid.
A feature map matched with the target image in the image to be detected is acquired from the fused feature pyramid. Specifically, the matched feature map is selected from the fused feature pyramid according to the target size of the target image in the image to be detected. Generally, when sending the image to be detected, the user also sends instruction information requesting target detection on the image; the target size of the target image can be obtained from this instruction information, a feature map suited to that size is selected from the fused feature pyramid, and the selected feature map is then input into a pre-trained target detection model to obtain the target image.
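One possible heuristic for this size-based selection is sketched below, in the spirit of the level-assignment rule of feature pyramid networks; the formula and its constants are assumptions, since the patent does not specify the mapping.

```python
# Hedged sketch: map a target size to a pyramid level (hypothetical rule).
import math

def pick_level(target_w, target_h, num_levels=5, base_size=224):
    # Levels are indexed from 0 (highest resolution, for small targets)
    # to num_levels - 1 (lowest resolution, for large targets).
    k = math.log2(math.sqrt(target_w * target_h) / base_size) + num_levels - 1
    return max(0, min(num_levels - 1, int(round(k))))

print(pick_level(32, 32))    # small target -> high-resolution level (1)
print(pick_level(400, 300))  # large target -> low-resolution level (4)
```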
S170, performing target detection on the feature map matched with the target image according to a preset target detection model to obtain the target image in the image to be detected.
Target detection is performed on the feature map matched with the target image according to a preset target detection model, to obtain the target image in the image to be detected. Specifically, the target detection model is a model for extracting a plurality of rectangular bounding boxes, i.e. the candidate boxes, from the matched feature map. After the matched feature map is input into the target detection model, the model outputs a plurality of candidate boxes, among which is the target detection box. The candidate boxes relate to the target image in the image to be detected and contain part or all of its feature information, whereby the target image in the image to be detected is obtained.
In another embodiment, as shown in FIG. 5, step S170 includes sub-steps S171 and S172.
S171, inputting the feature map matched with the target image in the image to be detected into a preset region generation network model to obtain a plurality of candidate boxes.
The feature map matched with the target image in the image to be detected is input into a preset region generation network model to obtain a plurality of candidate boxes. Specifically, the region generation network model is trained in advance to extract, from the matched feature map, a plurality of candidate boxes containing the target detection box. After the matched feature map is input into the region generation network model, the candidate boxes containing the target detection box are generated by size transformation centered on the anchor points of a sliding window of preset size; in the embodiment of the invention, the size of the sliding window is 3×3.
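The anchor-centered box generation can be illustrated by the sketch below; the scales and aspect ratios are placeholder values, since the patent only fixes the 3×3 sliding-window size.

```python
# Hedged sketch: size-transformed candidate boxes around sliding-window anchors.
import numpy as np

def make_anchors(fmap_h, fmap_w, stride, scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for y in range(fmap_h):
        for x in range(fmap_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # anchor point
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)    # size transformation
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

print(make_anchors(2, 2, stride=16).shape)  # (36, 4): 2*2 positions x 9 boxes
```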
S172, screening the target detection box from the candidate boxes according to a preset non-maximum suppression algorithm to obtain the target image.
The target detection box is screened from the plurality of candidate boxes according to a preset non-maximum suppression algorithm, to obtain the target image. Specifically, the non-maximum suppression algorithm, abbreviated NMS, is commonly used in computer vision for edge detection, face detection, target detection and the like; in this embodiment it is used to perform target detection on the image to be detected. Since a large number of candidate boxes are generated at the same target position during detection and may overlap one another, the target detection box must be found among the candidate boxes by non-maximum suppression. When the region generation network model outputs the candidate boxes, it simultaneously outputs the confidence of each candidate box, i.e. the probability that the target image lies within that box, and the NMS algorithm screens the candidate boxes by these confidences to obtain the target detection box. The specific flow of the algorithm is as follows: first, the candidate boxes are sorted in descending order of confidence, those whose confidence is below a preset first threshold are removed, and the area of each remaining box is calculated; then the IoU between the remaining box of highest confidence and each of the other remaining boxes is calculated in turn, and whether each calculated IoU exceeds a preset second threshold is judged; if so, the corresponding box is removed. The box finally remaining is the target detection box, through which the target image is obtained. Here IoU (intersection over union) is a concept used in target detection that expresses the overlap between a candidate box and the original marked box, namely the ratio of the area of their intersection to the area of their union. In this embodiment, the preset first threshold is set to 0.3 and the preset second threshold to 0.5.
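The described flow, with the embodiment's thresholds of 0.3 (confidence) and 0.5 (IoU), can be sketched in NumPy as follows; the box format and variable names are our assumptions.

```python
# Hedged sketch of the NMS flow described above.
import numpy as np

def nms(boxes, scores, conf_thresh=0.3, iou_thresh=0.5):
    """boxes: (N, 4) as [x1, y1, x2, y2]; scores: (N,) confidences."""
    keep_conf = scores >= conf_thresh                 # drop low-confidence boxes
    boxes, scores = boxes[keep_conf], scores[keep_conf]
    order = scores.argsort()[::-1]                    # sort by confidence, descending
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        i = order[0]                                  # highest-confidence survivor
        keep.append(i)
        rest = order[1:]
        # IoU of box i with every remaining box: intersection / union
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_thresh]               # remove boxes overlapping box i
    return boxes[keep], scores[keep]

b = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
s = np.array([0.9, 0.8, 0.7])
print(nms(b, s))   # the second box is suppressed (IoU with the first is about 0.68)
```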
In the attention mechanism-based target detection method provided by the embodiment of the invention, an image to be detected input by a user is received; the image to be detected is input into a preset convolutional neural network model and a multi-layer feature map of the image to be detected is extracted; the multi-layer feature map is weighted according to a preset attention mechanism to obtain a weighted feature map; a feature pyramid of the image to be detected is generated according to the multi-layer feature map; the weighted feature map is fused with each layer of feature maps in the feature pyramid respectively to obtain a fused feature pyramid; a feature map matched with the target image in the image to be detected is acquired from the fused feature pyramid; and target detection is performed on the matched feature map according to a preset target detection model to obtain the target image in the image to be detected. By this method, the weights of the different feature layers can be adjusted adaptively when performing a target detection task, so that the final fused features better fit the detection task, and the detection accuracy can be greatly improved at a small additional time cost.
The embodiment of the present invention further provides an attention mechanism based object detection device 100 for performing any of the embodiments of the aforementioned attention mechanism based object detection method. In particular, referring to fig. 6, fig. 6 is a schematic block diagram of an attention-based object detection apparatus 100 according to an embodiment of the present invention.
As shown in fig. 6, the attention mechanism-based object detection device 100 includes a receiving unit 110, a first generating unit 120, a second generating unit 130, a third generating unit 140, a fusion unit 150, an acquisition unit 160, and an object detection unit 170.
The receiving unit 110 is configured to receive an image to be detected input by a user.
The first generating unit 120 is configured to input the image to be detected into a preset convolutional neural network model, and extract a multi-layer feature map of the image to be detected.
The second generating unit 130 is configured to weight the multi-layer feature map according to a preset attention mechanism, so as to obtain a weighted feature map.
In other embodiments of the invention, as shown in fig. 7, the second generating unit 130 includes a weight obtaining unit 131 and a fourth generating unit 132.
The weight obtaining unit 131 is configured to obtain the weight of each layer of feature maps in the multi-layer feature map from the convolutional neural network model according to the attention mechanism.
The fourth generating unit 132 is configured to weight the multi-layer feature map according to the weight of each layer of feature maps in the multi-layer feature map, so as to obtain the weighted feature map.
The third generating unit 140 is configured to generate a feature pyramid of the image to be detected according to the multi-layer feature map.
In other embodiments of the invention, as shown in fig. 8, the third generating unit 140 includes a convolution unit 141 and a fifth generating unit 142.
The convolution unit 141 is configured to convolve each layer of feature maps in the multi-layer feature map with a preset convolution kernel, so as to obtain the convolved multi-layer feature map.
The fifth generating unit 142 is configured to generate a feature pyramid of the image to be detected according to the convolved multi-layer feature map.
In other embodiments of the invention, as shown in fig. 9, the fifth generating unit 142 includes a first building unit 1421 and a second building unit 1422.
The first building unit 1421 is configured to construct the feature map of the top layer of the feature pyramid according to the top-most feature map of the convolved multi-layer feature map.
The second building unit 1422 is configured to construct the feature maps below the top layer of the feature pyramid according to the feature map of the top layer of the feature pyramid.
The fusion unit 150 is configured to fuse the weighted feature map with each layer of feature maps in the feature pyramid respectively, so as to obtain a fused feature pyramid.
The obtaining unit 160 is configured to obtain a feature map matched with the target image in the image to be detected from the fused feature pyramid.
The target detection unit 170 is configured to perform target detection on the feature map matched with the target image according to a preset target detection model, so as to obtain the target image in the image to be detected.
In other embodiments of the invention, as shown in fig. 10, the object detection unit 170 includes a sixth generating unit 171 and a screening unit 172.
The sixth generating unit 171 is configured to input the feature map matched with the target image in the image to be detected into a preset region generation network model, so as to obtain a plurality of candidate boxes.
The screening unit 172 is configured to screen the target detection box from the candidate boxes according to a preset non-maximum suppression algorithm, so as to obtain the target image.
The attention mechanism-based object detection device 100 provided by the embodiment of the present invention is configured to perform the foregoing method: receiving an image to be detected input by a user; inputting the image to be detected into a preset convolutional neural network model and extracting a multi-layer feature map of the image to be detected; weighting the multi-layer feature map according to a preset attention mechanism to obtain a weighted feature map; generating a feature pyramid of the image to be detected according to the multi-layer feature map; fusing the weighted feature map with each layer of feature maps in the feature pyramid respectively to obtain a fused feature pyramid; acquiring a feature map matched with the target image in the image to be detected from the fused feature pyramid; and performing target detection on the matched feature map according to a preset target detection model to obtain the target image in the image to be detected.
Referring to fig. 11, fig. 11 is a schematic block diagram of a computer device according to an embodiment of the present invention.
With reference to FIG. 11, the device 500 includes a processor 502, memory, and a network interface 505 connected by a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032, when executed, may cause the processor 502 to perform an attention mechanism based object detection method.
The processor 502 is used to provide computing and control capabilities to support the operation of the overall device 500.
The internal memory 504 provides an environment for the execution of a computer program 5032 in the non-volatile storage medium 503, which computer program 5032, when executed by the processor 502, causes the processor 502 to perform an attention mechanism based object detection method.
The network interface 505 is used for network communication, such as providing for transmission of data information, etc. It will be appreciated by those skilled in the art that the structure shown in fig. 11 is merely a block diagram of a portion of the structure associated with the present inventive arrangements and is not limiting of the apparatus 500 to which the present inventive arrangements are applied, and that a particular apparatus 500 may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
The processor 502 is configured to execute the computer program 5032 stored in the memory to perform the following functions: receiving an image to be detected input by a user; inputting the image to be detected into a preset convolutional neural network model and extracting a multi-layer feature map of the image to be detected; weighting the multi-layer feature map according to a preset attention mechanism to obtain a weighted feature map; generating a feature pyramid of the image to be detected according to the multi-layer feature map; fusing the weighted feature map with each layer of feature maps in the feature pyramid respectively to obtain a fused feature pyramid; acquiring a feature map matched with the target image in the image to be detected from the fused feature pyramid; and performing target detection on the matched feature map according to a preset target detection model to obtain the target image in the image to be detected.
Those skilled in the art will appreciate that the embodiment of the device 500 shown in fig. 11 does not limit the specific construction of the device 500; in other embodiments, the device 500 may include more or fewer components than illustrated, certain components may be combined, or the components may be arranged differently. For example, in some embodiments the device 500 may include only the memory and the processor 502; in such embodiments the structure and function of the memory and the processor 502 are consistent with the embodiment shown in fig. 11 and are not repeated here.
It should be appreciated that, in an embodiment of the invention, the processor 502 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or any conventional processor.
In another embodiment of the invention, a computer storage medium is provided. The storage medium may be a non-volatile computer-readable storage medium. The storage medium stores a computer program 5032 which, when executed by the processor 502, performs the following steps: receiving an image to be detected input by a user; inputting the image to be detected into a preset convolutional neural network model and extracting a multi-layer feature map of the image to be detected; weighting the multi-layer feature map according to a preset attention mechanism to obtain a weighted feature map; generating a feature pyramid of the image to be detected according to the multi-layer feature map; fusing the weighted feature map with each layer of feature maps in the feature pyramid respectively to obtain a fused feature pyramid; acquiring a feature map matched with the target image in the image to be detected from the fused feature pyramid; and performing target detection on the matched feature map according to a preset target detection model to obtain the target image in the image to be detected.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus, device and unit described above may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein. Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, and for example, the division of the units is merely a logical function division, there may be another division manner in actual implementation, or units having the same function may be integrated into one unit, for example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices, or elements, or may be an electrical, mechanical, or other form of connection.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment of the present invention.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units may be stored in a storage medium if implemented in the form of software functional units and sold or used as stand-alone products. Based on such understanding, the technical solution of the present invention may be essentially or a part contributing to the prior art, or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing an apparatus 500 (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, an optical disk, or other various media capable of storing program codes.
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions of equivalents may be made and equivalents will be apparent to those skilled in the art without departing from the scope of the invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (6)

1. An attention mechanism-based target detection method is characterized by comprising the following steps:
receiving an image to be detected input by a user;
inputting the image to be detected into a preset convolutional neural network model, and extracting a multi-layer feature map of the image to be detected;
acquiring the weight of each layer of feature map in the multi-layer feature map from the convolutional neural network model according to a preset attention mechanism;
multiplying the feature values of each layer of feature maps in the multi-layer feature map by the corresponding weights and then adding them to obtain a weighted feature map, wherein the calculation formula of the feature value of the weighted feature map is expressed as: F = f1×w1 + f2×w2 + … + fi×wi, where fi is the feature value of a certain feature map in the multi-layer feature map, and wi is the weight of that feature map;
convolving each layer of feature maps in the multi-layer feature map with a preset convolution kernel to obtain the convolved multi-layer feature map;
constructing a feature map of the top layer of the feature pyramid according to the feature map of the top layer in the convolved multi-layer feature map;
upsampling the top layer of the feature pyramid and adding it to the feature map adjacent to the topmost layer in the convolved multi-layer feature map to obtain the feature map below the top layer of the feature pyramid, wherein, in the process of upsampling the top layer of the feature pyramid and adding it to the feature map adjacent to the topmost layer in the convolved multi-layer feature map, the adjacent feature map is scaled to twice its original size before the addition, and the additions are performed in order from top to bottom so as to construct the feature pyramid;
fusing the weighted feature map with each layer of feature maps in the feature pyramid respectively to obtain a fused feature pyramid;
acquiring a feature map matched with a target image in the image to be detected from the fused feature pyramid;
inputting the feature map matched with the target image into a preset region generation network model to obtain a plurality of candidate boxes, wherein the region generation network model is a model trained in advance for extracting, from the feature map matched with the target image in the image to be detected, a plurality of candidate boxes containing a target detection box, and after the feature map matched with the target image in the image to be detected is input into the region generation network model, the plurality of candidate boxes containing the target detection box are generated by size transformation centered on the anchor points of a sliding window of preset size;
and screening the target detection box from the plurality of candidate boxes according to a preset non-maximum suppression algorithm to obtain the target image, wherein, when the region generation network model outputs the plurality of candidate boxes, it simultaneously outputs the confidence of each candidate box, the confidence being the probability that the target image lies within that candidate box, and the non-maximum suppression algorithm screens according to the confidence of each candidate box to obtain the target detection box.
2. The method for detecting an object based on an attention mechanism according to claim 1, wherein the fusing the weighted feature map with each layer of feature maps in the feature pyramid respectively to obtain a fused feature pyramid includes:
splicing the weighted feature map end to end with each layer of feature maps in the feature pyramid respectively, to obtain the fused feature pyramid.
3. The method for detecting an object based on an attention mechanism according to claim 1, wherein the obtaining, from the fused feature pyramid, a feature map matching with an object image in the image to be detected includes:
And acquiring a feature map matched with the target image in the image to be detected from the fused feature pyramid according to the target size of the target image in the image to be detected.
4. An attention mechanism-based object detection apparatus, comprising:
the receiving unit is used for receiving the image to be detected input by the user;
the first generation unit is used for inputting the image to be detected into a preset convolutional neural network model and extracting a multi-layer feature map of the image to be detected;
a weight obtaining unit, configured to obtain the weight of each layer of feature maps in the multi-layer feature map from the convolutional neural network model according to a preset attention mechanism;
a fourth generating unit, configured to multiply the feature values of each layer of feature maps in the multi-layer feature map by the corresponding weights and then add them to obtain a weighted feature map, wherein the calculation formula of the feature value of the weighted feature map is expressed as: F = f1×w1 + f2×w2 + … + fi×wi, where fi is the feature value of a certain feature map in the multi-layer feature map, and wi is the weight of that feature map;
the convolution unit is used for convolving each layer of feature maps in the multi-layer feature map with a preset convolution kernel to obtain the convolved multi-layer feature map;
a first construction unit, configured to construct the feature map of the top layer of the feature pyramid according to the feature map of the topmost layer in the convolved multi-layer feature map;
the second construction unit is used for upsampling the top layer of the feature pyramid and adding it to the feature map adjacent to the topmost layer in the convolved multi-layer feature map to obtain the feature map below the top layer of the feature pyramid, wherein, in the process of upsampling the top layer of the feature pyramid and adding it to the feature map adjacent to the topmost layer in the convolved multi-layer feature map, the adjacent feature map is scaled to twice its original size before the addition, and the additions are performed in order from top to bottom so as to construct the feature pyramid;
a fusion unit, configured to fuse the weighted feature map with each layer of feature maps in the feature pyramid respectively to obtain a fused feature pyramid;
an acquisition unit, configured to acquire, from the fused feature pyramid, the feature map matching the target image in the image to be detected;
a sixth generating unit, configured to input the feature map matching the target image into a preset region generation network model to obtain a plurality of candidate frames, wherein the region generation network model is a model trained in advance to extract, from the feature map matching the target image in the image to be detected, a plurality of candidate frames containing the target detection frame; the feature map matching the target image in the image to be detected is input into the region generation network model, and the plurality of candidate frames containing the target detection frame are generated through size transformation centered on the anchor points of a sliding window of a preset size (see the sketch after this claim); and
a screening unit, configured to screen the target detection frame from the plurality of candidate frames according to a preset non-maximum suppression algorithm to obtain the target image, wherein, when the region generation network model outputs the plurality of candidate frames, it simultaneously outputs a confidence for each of the candidate frames, the confidence being the probability that the candidate frame contains the target image, and the non-maximum suppression algorithm screens the candidate frames according to their confidences to obtain the target detection frame.
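For claim 4's generating and construction units, the sketch below illustrates in PyTorch the three computations the claim spells out: the attention-weighted sum F = f1×w1 + f2×w2 + … + fi×wi, the top-down pyramid construction in which the upper layer is enlarged to twice its size and added to the layer below, and the size-transformed candidate frames centered on an anchor point. The 1×1 lateral convolutions, the 256 output channels, the nearest-neighbour up-sampling, and the anchor base/scales/ratios are all assumptions; the claim fixes only the weighted sum, the 2× up-sample-and-add pattern, and the anchor-centered size transformation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def weighted_feature_map(feature_maps, weights):
    """Fourth generating unit: F = f1*w1 + f2*w2 + ... + fi*wi.

    feature_maps: list of tensors, assumed here to share one shape (N, C, H, W)
    weights: per-layer scalar attention weights
    """
    return sum(f * w for f, w in zip(feature_maps, weights))


class TopDownPyramid(nn.Module):
    """First/second construction units: build the feature pyramid top-down."""

    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        # the "preset convolution kernel" of the convolution unit;
        # a 1x1 kernel is an assumption, not fixed by the claim
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels
        )

    def forward(self, feature_maps):
        # feature_maps ordered bottom (largest) to top (smallest),
        # each level assumed to halve the spatial resolution
        laterals = [conv(f) for conv, f in zip(self.lateral, feature_maps)]
        pyramid = [laterals[-1]]  # top layer of the pyramid
        for lower in reversed(laterals[:-1]):
            # enlarge the current top-down map to twice its size, add it to
            # the adjacent lower layer, and continue from top to bottom
            upsampled = F.interpolate(pyramid[0], scale_factor=2, mode="nearest")
            pyramid.insert(0, lower + upsampled)
        return pyramid


def anchors_at(cx, cy, base=16, scales=(1, 2, 4), ratios=(0.5, 1.0, 2.0)):
    """Sixth generating unit: candidate frames produced by size
    transformation centered on one anchor point of the sliding window.

    base, scales, and ratios are illustrative values only.
    """
    frames = []
    for s in scales:
        for r in ratios:  # r is the height/width aspect ratio
            w = base * s / (r ** 0.5)
            h = base * s * (r ** 0.5)
            frames.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return frames
```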
5. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the attention mechanism-based target detection method according to any one of claims 1 to 3.
6. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the attention mechanism-based target detection method according to any one of claims 1 to 3.
CN202011322670.7A 2020-11-23 2020-11-23 Attention mechanism-based target detection method and device and computer equipment Active CN112396115B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011322670.7A CN112396115B (en) 2020-11-23 2020-11-23 Attention mechanism-based target detection method and device and computer equipment
PCT/CN2021/083935 WO2021208726A1 (en) 2020-11-23 2021-03-30 Target detection method and apparatus based on attention mechanism, and computer device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011322670.7A CN112396115B (en) 2020-11-23 2020-11-23 Attention mechanism-based target detection method and device and computer equipment

Publications (2)

Publication Number Publication Date
CN112396115A (en) 2021-02-23
CN112396115B true (en) 2023-12-22

Family

ID=74606965

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011322670.7A Active CN112396115B (en) 2020-11-23 2020-11-23 Attention mechanism-based target detection method and device and computer equipment

Country Status (2)

Country Link
CN (1) CN112396115B (en)
WO (1) WO2021208726A1 (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112396115B (en) * 2020-11-23 2023-12-22 平安科技(深圳)有限公司 Attention mechanism-based target detection method and device and computer equipment
CN113177133B (en) * 2021-04-23 2024-03-29 深圳依时货拉拉科技有限公司 Image retrieval method, device, equipment and storage medium
CN113327226B (en) * 2021-05-07 2024-06-21 北京工业大学 Target detection method, target detection device, electronic equipment and storage medium
CN113361502B (en) * 2021-08-10 2021-11-02 江苏久智环境科技服务有限公司 Garden perimeter intelligent early warning method based on edge group calculation
CN113822871A (en) * 2021-09-29 2021-12-21 平安医疗健康管理股份有限公司 Target detection method and device based on dynamic detection head, storage medium and equipment
CN114022682A (en) * 2021-11-05 2022-02-08 天津大学 Weak and small target detection method based on attention secondary feature fusion mechanism
CN113868542B (en) * 2021-11-25 2022-03-11 平安科技(深圳)有限公司 Attention model-based push data acquisition method, device, equipment and medium
CN114821121B (en) * 2022-05-09 2023-02-03 盐城工学院 Image classification method based on RGB three-component grouping attention weighted fusion
CN114972860A (en) * 2022-05-23 2022-08-30 郑州轻工业大学 Target detection method based on attention-enhanced bidirectional feature pyramid network
CN115546032B (en) * 2022-12-01 2023-04-21 泉州市蓝领物联科技有限公司 Single-frame image super-resolution method based on feature fusion and attention mechanism
CN115564789A (en) * 2022-12-01 2023-01-03 北京矩视智能科技有限公司 Method and device for segmenting defect region of workpiece by cross-level fusion and storage medium
CN116228685B (en) * 2023-02-07 2023-08-22 重庆大学 Deep learning-based lung nodule detection and rejection method
CN116778346B (en) * 2023-08-23 2023-12-08 蓝茵建筑数据科技(上海)有限公司 Pipeline identification method and system based on improved self-attention mechanism
CN116787022B (en) * 2023-08-29 2023-10-24 深圳市鑫典金光电科技有限公司 Heat dissipation copper bottom plate welding quality detection method and system based on multi-source data
CN117237746B (en) * 2023-11-13 2024-03-15 光宇锦业(武汉)智能科技有限公司 Small target detection method, system and storage medium based on multi-intersection edge fusion


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016054778A1 (en) * 2014-10-09 2016-04-14 Microsoft Technology Licensing, Llc Generic object detection in images
CN111915613B (en) * 2020-08-11 2023-06-13 华侨大学 Image instance segmentation method, device, equipment and storage medium
CN112396115B (en) * 2020-11-23 2023-12-22 平安科技(深圳)有限公司 Attention mechanism-based target detection method and device and computer equipment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110782420A (en) * 2019-09-19 2020-02-11 杭州电子科技大学 Small target feature representation enhancement method based on deep learning
CN111738110A (en) * 2020-06-10 2020-10-02 杭州电子科技大学 Remote sensing image vehicle target detection method based on multi-scale attention mechanism

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Liu Shuchun et al., Deep Practice OCR: Text Recognition Based on Deep Learning, China Machine Press, 2020, pp. 185-189. *
Huang Xinhan, Micro-assembly Robots, National Defense Industry Press, 2020, pp. 112-116. *

Also Published As

Publication number Publication date
CN112396115A (en) 2021-02-23
WO2021208726A1 (en) 2021-10-21

Similar Documents

Publication Publication Date Title
CN112396115B (en) Attention mechanism-based target detection method and device and computer equipment
CN112308200B (en) Searching method and device for neural network
CN106934397B (en) Image processing method and device and electronic equipment
CN111369440B (en) Model training and image super-resolution processing method, device, terminal and storage medium
CN108876792B (en) Semantic segmentation method, device and system and storage medium
US8463025B2 (en) Distributed artificial intelligence services on a cell phone
CN111047516A (en) Image processing method, image processing device, computer equipment and storage medium
CN113139543B (en) Training method of target object detection model, target object detection method and equipment
CN111062964B (en) Image segmentation method and related device
CN110059728B (en) RGB-D image visual saliency detection method based on attention model
CN111985374B (en) Face positioning method and device, electronic equipment and storage medium
CN111476719A (en) Image processing method, image processing device, computer equipment and storage medium
US20230401833A1 (en) Method, computer device, and storage medium, for feature fusion model training and sample retrieval
CN111914654B (en) Text layout analysis method, device, equipment and medium
CN113781164B (en) Virtual fitting model training method, virtual fitting method and related devices
CN111652054A (en) Joint point detection method, posture recognition method and device
CN116188929A (en) Small target detection method and small target detection system
CN111275126A (en) Sample data set generation method, device, equipment and storage medium
CN111967478B (en) Feature map reconstruction method, system, storage medium and terminal based on weight overturn
CN115578261A (en) Image processing method, deep learning model training method and device
CN115115835A (en) Image semantic segmentation method, device, equipment, storage medium and program product
CN113610856A (en) Method and device for training image segmentation model and image segmentation
CN112084371A (en) Film multi-label classification method and device, electronic equipment and storage medium
CN113256556A (en) Image selection method and device
CN115526775B (en) Image data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40040406

Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant