CN112016569A - Target detection method, network, device and storage medium based on attention mechanism - Google Patents


Info

Publication number: CN112016569A
Application number: CN202010727998.0A
Authority: CN (China)
Prior art keywords: network, layer, attention, feature map, feature
Legal status: Pending (the legal status listed is an assumption and not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Inventors: 任豪, 郭钰
Current assignee: Uisee Technology Zhejiang Co Ltd (the listed assignees may be inaccurate)
Original assignee: Yushi Technology Nanjing Co Ltd
Application filed by Yushi Technology Nanjing Co Ltd; priority to CN202010727998.0A

Classifications

    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06F 18/24 Classification techniques
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 2201/07 Target detection

Abstract

The application relates to the technical field of image processing, and discloses a target detection method, network, device and storage medium based on an attention mechanism. The method comprises the following steps: extracting features of an image to be detected through a feature extraction network to obtain a first feature map; processing the first feature map through a region generation network to obtain at least one candidate region; processing the first feature map and the candidate region through an ROI pooling network to obtain a second feature map; and processing the second feature map through a classification network and a regression network, respectively, to obtain a category score and bounding box information of at least one target. At least one of the feature extraction network, the classification network and the regression network comprises an attention subnetwork. The attention subnetwork determines a previous layer and a temporary layer; determines an attention weight based on the previous layer and the temporary layer; and multiplies the previous layer by the attention weight to add attention to the previous layer. By this technical scheme, both the speed and the accuracy of target detection are improved.

Description

Target detection method, network, device and storage medium based on attention mechanism
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a target detection method, a network, a device, and a storage medium based on an attention mechanism.
Background
Target detection is a technique for identifying targets and their positions in an image: it determines both a bounding box locating each target in the image (target localization) and the class to which each target belongs (image classification). Target detection is the basis of many computer vision tasks and has wide application prospects in fields such as automatic driving, face recognition, behavior recognition and target counting.
At present, mainstream target detection algorithms are mainly based on deep learning models and can be divided into two categories. One is the one-stage detection algorithm, which directly generates the class probability and coordinate position of the target without generating candidate regions. The other is the two-stage detection algorithm, which divides the detection problem into two stages: candidate regions are first generated, and classification and bounding box regression are then performed based on the candidate regions to obtain the class probability and coordinate position of the target.
The main problems of the above target detection models are as follows: (1) inaccurate target localization; (2) weak model generalization (the model's adaptability to new samples), leading to classification errors; (3) relatively slow detection speed.
Disclosure of Invention
To solve the above technical problem or at least partially solve the above technical problem, the present application provides an attention-based target detection method, a network, a device and a storage medium.
In a first aspect, the present application provides a method for target detection based on an attention mechanism, comprising:
extracting features of an image to be detected through a feature extraction network to obtain a first feature map;
processing the first feature map through a region generation network to obtain at least one candidate region;
processing the first feature map and the at least one candidate region through an ROI pooling network to obtain a second feature map, wherein the second feature map comprises the at least one candidate region;
respectively processing the second feature map through a classification network and a regression network to obtain the category score and the bounding box information of at least one target;
wherein at least one of the feature extraction network, the classification network, and the regression network comprises an attention subnetwork;
wherein the attention subnetwork is configured to: determine a previous layer, the previous layer being a feature map to be configured with an attention weight; determine a temporary layer; determine an attention weight based on the previous layer and the temporary layer; and multiply the previous layer by the attention weight to add attention to the previous layer.
In a second aspect, the present application provides an attention-based target detection network comprising:
a feature extraction network, a region generation network, an ROI pooling network, a classification network and a regression network;
wherein at least one of the feature extraction network, the classification network, and the regression network comprises an attention subnetwork;
the attention subnetwork is used for determining a previous layer, the previous layer being a feature map to be configured with an attention weight; determining a temporary layer; determining an attention weight based on the previous layer and the temporary layer; and multiplying the previous layer by the attention weight to add attention to the previous layer;
the feature extraction network is used for extracting features of an image to be detected to obtain a first feature map;
the area generation network is used for processing the first feature map to obtain at least one candidate area;
the ROI pooling network is used for processing the first feature map and the at least one candidate region to obtain a second feature map, and the second feature map comprises the at least one candidate region;
the classification network is used for processing the second feature map to obtain a category score of at least one target;
and the regression network is used for processing the second feature map to obtain the bounding box information of the at least one target.
In a third aspect, the present application provides an electronic device, including:
one or more processors;
a storage device for storing one or more programs which,
when executed by the one or more processors, cause the one or more processors to implement any of the embodiments of the attention mechanism-based target detection method described above.
In a fourth aspect, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements any of the embodiments of the attention mechanism-based target detection method described above.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages:
firstly, an attention subnetwork is added to at least one of the feature extraction network, the classification network and the regression network in a target detection network, so that the target detection network focuses more on at least one of the features, categories and positions of the targets needing attention in an image. This improves target classification precision and target localization precision, and thereby improves overall target detection precision.
Secondly, the added attention subnetwork only introduces the operations of determining the attention weight and multiplying the previous layer by that weight on top of the original network, so the additional computation and computation time are negligible. On the contrary, the attention subnetwork enables the target detection network to focus more rapidly on the features, categories and positions of targets, which saves training time and detection time to a certain extent and improves the speed of target detection.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below; it is obvious that those skilled in the art can obtain other drawings from these drawings without inventive effort.
Fig. 1 is a block diagram of a network structure of an attention mechanism-based target detection network provided by an embodiment of the present application;
Fig. 2 is a block diagram of a network structure of an attention subnetwork provided by an embodiment of the present application;
Fig. 3 is a block diagram of a network structure of a multi-scale feature extraction network provided by an embodiment of the present application;
Fig. 4 is a block diagram of a network structure of a region generation network provided by an embodiment of the present application;
Fig. 5 is a block diagram of a network structure of a classification network and a regression network with an added attention subnetwork provided by an embodiment of the present application;
Fig. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present application;
Fig. 7 is a flowchart of an attention mechanism-based target detection method provided by an embodiment of the present application;
Fig. 8 is a flowchart of the processing of an attention subnetwork provided by an embodiment of the present application.
Detailed Description
In order that the above-mentioned objects, features and advantages of the present application may be more clearly understood, the solution of the present application will be further described below. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application, but the present application may be practiced in other ways than those described herein; it is to be understood that the embodiments described in this specification are only some embodiments of the present application and not all embodiments.
Target detection requires that as many targets as possible be detected correctly in as short a time as possible. However, among current target detection algorithms, most two-stage detection algorithms involve heavy computation and run slowly, while most one-stage detection algorithms are fast but have low classification and localization accuracy. In short, current target detection algorithms do not achieve ideal detection results. In view of this problem, an embodiment of the present application provides an attention mechanism-based target detection scheme, which adds an attention subnetwork to at least one of the feature extraction network, the classification network and the regression network in a target detection network, improving target detection accuracy while maintaining detection speed.
For a given target detection network, in some embodiments an attention subnetwork may be added to the feature extraction network, the classification network or the regression network; in some embodiments an attention subnetwork may be added to any two of these three networks; and in some embodiments an attention subnetwork may be added to all three. In each case an attention mechanism-based target detection network is obtained. When an attention subnetwork is added to the feature extraction network, the network can learn to detect and extract the current target at a more appropriate scale, so that it focuses more on the current target in the image and the accuracy of feature extraction is improved. When an attention subnetwork is added to the classification network, the network can learn to pay more attention to the features most important for class identification, so that the class of the current target is determined more accurately and classification accuracy is improved. When an attention subnetwork is added to the regression network, the network can learn where the current target appears, so that the accuracy of target localization is improved.
In addition, the attention subnetwork in the present application determines the attention weight based on the previous layer and the temporary layer, and multiplies the previous layer by the attention weight to add attention to the previous layer. Only a small number of operations are added on top of the original target detection network, and their computation amount and computation time are negligible relative to the whole network, so the detection speed is not slowed down. On the contrary, the attention subnetwork enables the target detection network to focus more rapidly on the features, categories and positions of the targets needing attention in the image, saving training time and detection time to a certain extent and improving the speed of target detection.
The attention mechanism-based target detection scheme provided by the embodiments of the present application can be applied in various scenarios requiring target detection. In some embodiments, it can be applied in automatic driving to detect objects such as vehicles, pedestrians and traffic signals, providing important basic technical support for an automatic driving system; in some embodiments, it can be applied in face detection and face recognition, for example in mobile-terminal screen locks and photo-retouching software; in some embodiments, it can be applied in behavior recognition to detect the targets whose behavior is to be recognized; in some embodiments, it can be applied in target counting, for example in analyzing storage performance or estimating pedestrian flow.
Fig. 1 is a block diagram of a network structure of an attention mechanism-based target detection network according to an embodiment of the present application. As shown in FIG. 1, an attention-based target detection network 100 includes a feature extraction network 110, a region generation network 120, an ROI pooling network 130, a classification network 140, and a regression network 150; wherein at least one of the feature extraction network 110, the classification network 140, and the regression network 150 includes an attention subnetwork 160.
The attention subnetwork 160 is used for determining a previous layer, the previous layer being a feature map to be configured with an attention weight; determining a temporary layer; determining an attention weight based on the previous layer and the temporary layer; and multiplying the previous layer by the attention weight to add attention to the previous layer. Specifically, the attention subnetwork 160 adds attention by weighting the information in the network that needs to be focused on; to ensure the reliability of the weights, they are computed from feature maps within the network itself. The attention subnetwork 160 therefore has to select a feature map containing the information of interest, i.e. the previous layer to be configured with the attention weight, and has to select a temporary layer for calculating the weight. The attention weight is then obtained through operations between the previous layer and the temporary layer. Finally, applying the attention weight to the previous layer adds attention to the previous layer.
See fig. 2 for a block diagram of the network structure of the attention subnetwork 160. In some embodiments, after the attention subnetwork 160 selects the previous layer 161, the temporary layer 162 is determined by performing a convolution operation on the previous layer 161. The temporary layer thus also comes from a feature map within the network, which avoids introducing external error and further ensures the reliability of the subsequent weights. In some embodiments, after determining the previous layer 161 and the temporary layer 162, the attention subnetwork 160 performs an addition operation on them; the feature map obtained by the addition is then convolved and passed through a sigmoid activation to obtain the attention weight 163 matching the previous layer 161. The attention weight 163 represents the attention of the network: locations with large weights are where attention concentrates, and are more likely to contain a target. Finally, the previous layer 161 is multiplied by the attention weight 163 to obtain the previous layer 164 with increased attention. In this embodiment, the attention subnetwork 160 adds only a few convolution layers and a sigmoid layer, so the added computation is negligible compared with the original network and the detection speed of the original network is maintained.
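The steps above (convolve the previous layer to obtain a temporary layer, add the two, convolve and apply a sigmoid to obtain the weight, then multiply) can be sketched in a few lines of numpy. The 1 × 1 convolutions, implemented here as per-pixel channel mixing, and the random weights are illustrative assumptions; the patent does not fix kernel sizes or weight values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_subnetwork(prior, w_temp, w_attn):
    """Sketch of the attention subnetwork on an (H, W, C) feature map.

    1. temporary layer = convolution of the previous layer (a 1x1 conv here,
       i.e. a per-pixel channel mix; the kernel size is an assumption);
    2. element-wise addition of previous and temporary layers;
    3. convolution + sigmoid on the sum to obtain the attention weight;
    4. element-wise multiplication of the previous layer by the weight.
    """
    temp = prior @ w_temp                    # step 1: 1x1 conv as channel mixing
    attn = sigmoid((prior + temp) @ w_attn)  # steps 2-3: weight in (0, 1)
    return prior * attn                      # step 4: attention-weighted previous layer

# toy example: a 4x4 feature map with 8 channels
rng = np.random.default_rng(0)
prior = rng.standard_normal((4, 4, 8))
w_temp = rng.standard_normal((8, 8)) * 0.1
w_attn = rng.standard_normal((8, 8)) * 0.1
out = attention_subnetwork(prior, w_temp, w_attn)
assert out.shape == prior.shape  # attention does not change the feature map shape
```

Because the sigmoid output lies strictly in (0, 1), the weighted map never exceeds the previous layer in magnitude; the weight only re-scales how strongly each location contributes.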
In some embodiments, the attention subnetwork 160 unifies the scale of the previous layer and the scale of the temporary layer before determining the attention weight based on them; the scales may be unified by upsampling the layer with the smaller scale or by downsampling the layer with the larger scale. If the scales of the previous layer 161 and the temporary layer 162 are not consistent, the subsequent addition operation cannot be performed, so their scales can be unified in this way. In some embodiments, considering that the previous layer 161 is a feature map used for subsequent target detection while the temporary layer 162 is merely an intermediate transition layer for calculating the attention weight, the scale of the temporary layer 162 may be unified to the scale of the previous layer 161.
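As a toy illustration of scale unification, nearest-neighbour upsampling is one common way to enlarge the smaller layer; the patent mentions upsampling and downsampling without fixing a specific method, so this choice is an assumption.

```python
import numpy as np

def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling along the two spatial axes of an (H, W, C) map."""
    return np.repeat(np.repeat(fmap, factor, axis=0), factor, axis=1)

small = np.arange(4, dtype=float).reshape(2, 2, 1)  # a 2x2 temporary layer
big = upsample_nearest(small, 2)                    # now 4x4, matching a 4x4 previous layer
assert big.shape == (4, 4, 1)
```

After this step the temporary layer can be added to the previous layer element-wise, as required by the attention computation.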
The feature extraction network 110 is configured to extract features of an image to be detected to obtain a first feature map. Specifically, the feature extraction network 110 is composed of a number of convolution layers, activation layers, pooling layers and the like, and the number and positions of the activation layers and pooling layers may be determined according to business requirements. In some embodiments, it may be a repeated combination of several sets of convolution, activation and pooling layers; in some embodiments, it may be several repeated combinations of convolution and activation layers followed by a pooling layer; in some embodiments, it may be several convolution layers followed by an activation layer and a pooling layer. The input to the feature extraction network 110 is the image to be detected. The first convolution layer extracts features from the image to generate an abstract description of it, and each subsequent convolution layer further abstracts the feature map produced by the previous layer, continuously learning edge information, complex shape information and the like in the image; the final convolution feature layer is called the first feature map. The spatial dimensions of the first feature map differ from those of the image to be detected: the first feature map carries less information along the length and width dimensions than the image to be detected, but more along the depth dimension. For example, if the image to be detected has length, width and depth dimensions of 512 × 512 × 32, then after processing by the feature extraction network 110 the first feature map may have dimensions of 32 × 32 × 512, with the depth information increased.
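The shrinking of the spatial size in the example above follows from simple pooling arithmetic. Assuming 'same'-padded convolutions and four 2 × 2, stride-2 pooling layers (an illustrative configuration, not one fixed by the patent), a spatial size of 512 reduces to 32:

```python
def pooled_size(size, n_pools, k=2, s=2):
    """Spatial size after n pooling layers with kernel k and stride s
    (convolutions are assumed 'same'-padded, so they keep the size)."""
    for _ in range(n_pools):
        size = (size - k) // s + 1
    return size

# the 512 -> 32 spatial reduction from the text: four stride-2 poolings halve 512 four times
assert pooled_size(512, 4) == 32
```

The depth growth (32 to 512 channels in the example) comes from the convolution layers increasing the number of output channels, which the formula above does not model.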
In some embodiments, the feature extraction network 110 extracts a plurality of first feature maps at different scales. See fig. 3 for a block diagram of the network structure of a multi-scale feature extraction network. In this embodiment, the feature extraction network 110 uses a feature pyramid structure to perform top-down feature extraction on the image to be detected, obtaining initial feature maps L1, L2, L3 and L4, and then performs bottom-up feature extraction on L2, L3 and L4 of the four initial feature maps to obtain three intermediate feature maps. Based on the three intermediate feature maps and the initial feature map L4, four corresponding first feature maps F1, F2, F3 and F4 at different scales are extracted, and features are further extracted from the first feature map F4 to obtain a first feature map F5 at yet another scale. Subsequent classification and bounding box regression can be carried out directly at each detection scale, so that target detection can be performed at different scales; this preserves the detection speed of the network while enriching the feature information available for target detection, further improving its accuracy. It should be noted that fig. 3 is only one example of a multi-scale feature extraction network; other feature extraction network structures may also be used.
In some embodiments, an attention subnetwork 160 can be added to the feature extraction network 110 to increase the network's attention to important features during feature extraction, making the extracted features more accurate. In some embodiments, a multi-scale feature extraction network may be combined with an attention subnetwork, i.e. an attention subnetwork 160 is added to the first feature map of at least one scale obtained by the feature extraction network 110. This enables the feature extraction network 110 to learn the weight of each scale, i.e. at which scale a target should be detected, so that during detection it focuses more quickly on extracting features at the corresponding scale, improving the accuracy and speed of target detection.
Referring to fig. 3, an attention subnetwork 160 can be added to at least one of the first feature maps F1, F2, F3, F4 and F5. Take adding an attention subnetwork 160 at one scale as an example. In some embodiments, the attention subnetwork 160 selects a first feature map of a first scale as the previous layer and a first feature map of a second scale, smaller than the first scale, as the temporary layer. For example, the first feature map F1 may be selected as the previous layer and one of the first feature maps F2, F3, F4 and F5 as the temporary layer. In this case, an upsampling operation can be applied to the temporary layer in the process of determining the attention weight, avoiding loss of feature information. The attention subnetwork 160 then produces a version of the first feature map F1 with increased attention, which can be used in the subsequent target detection process. In some embodiments, the attention subnetwork 160 selects the first feature map of any scale as the previous layer, and the temporary layer is determined as follows: first feature maps of at least two scales smaller than the scale of the previous layer are selected from the remaining scales, and a convolution operation is performed on the selected first feature maps to obtain a merged feature map, which is used as the temporary layer. For example, the first feature map F1 is selected as the previous layer, and the first feature maps F2, F3, F4 and F5 are merged to obtain the temporary layer. The attention subnetwork 160 then likewise produces a version of the first feature map F1 with increased attention for the subsequent target detection process.
The region generation network 120 is configured to process the first feature map to obtain at least one candidate region. In some embodiments, the feature extraction network 110 outputs first feature maps at a plurality of scales, and for each scale the candidate regions of the corresponding first feature map may be generated by the region generation network 120. In some embodiments, referring to the block diagram of the region generation network shown in fig. 4, the input of the region generation network 120 is the first feature map output by the feature extraction network 110, which passes through a convolution layer and a sliding-window layer to obtain the candidate region (ROI) information. The number and size of the output candidate regions are related to the anchors preset in the sliding window. For each sliding window in the sliding-window layer of the region generation network 120, k anchors with different size ratios are set, for example with preset length-width ratios of 1:2 and 1:5, so that k candidate regions appear at each position covered by a sliding window. In some embodiments, the output of the region generation network 120 is an N × 5 matrix, where N is the number of ROIs, the first column of the matrix is the ROI identifier (e.g. its number), and the remaining four columns are the coordinate information of the ROI, given as absolute coordinates in the image to be detected. In some embodiments, the coordinate information may be expressed as the coordinates of two diagonal corners of the ROI, such as the upper-left and lower-right corners, or the upper-right and lower-left corners. In some embodiments, the coordinate information may also be expressed as the center coordinate of the ROI together with its width w and height h.
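A minimal sketch of anchor generation at one sliding-window position follows. The scale values below are illustrative assumptions (the text only gives 1:2 and 1:5 as example length-width ratios), and boxes use the diagonal-corner (x1, y1, x2, y2) form mentioned above.

```python
def make_anchors(cx, cy, scales, ratios):
    """k = len(scales) * len(ratios) anchors centred at one sliding-window
    position, returned as (x1, y1, x2, y2) boxes.  For each scale s and
    aspect ratio r = width/height, the anchor area is s*s."""
    boxes = []
    for s in scales:
        for r in ratios:
            w = s * (r ** 0.5)   # w/h = r and w*h = s*s
            h = s / (r ** 0.5)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

# 2 scales x 3 ratios -> k = 6 candidate regions at this window position
anchors = make_anchors(64, 64, scales=[32, 64], ratios=[0.5, 1.0, 2.0])
assert len(anchors) == 6
```

Stacking the ROI index alongside each box then yields the N × 5 matrix described in the text.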
In some embodiments, in order to train the region generation network 120 better, the sample candidate regions need to be screened: each sample candidate region is divided into positive samples (which contain a target), negative samples (which do not), and invalid samples (which cannot be determined either way). For example, the actual bounding box of the target is taken as the ground truth, and the ratio of the intersection to the union of the anchor's bounding box and the actual bounding box, i.e. the intersection-over-union (IoU), is used as the basis for sample division. The sample candidate region with the highest IoU with the actual bounding box, and any sample candidate region whose IoU with the actual bounding box exceeds 0.7, are divided into positive samples (or positive anchors), so one ground truth can correspond to multiple positive samples. Sample candidate regions whose IoU with the actual bounding boxes of all targets is less than 0.3 are divided into negative samples (or negative anchors). All remaining sample candidate regions are divided into invalid samples and do not participate in the training of the region generation network 120.
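The IoU computation and the 0.7 / 0.3 thresholding described above can be sketched as follows; the rule that the single highest-IoU anchor is also positive is part of the scheme above but omitted here for brevity.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def label_anchor(anchor, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Positive / negative / invalid division by the thresholds in the text."""
    best = max(iou(anchor, gt) for gt in gt_boxes)
    if best > pos_thresh:
        return "positive"
    if best < neg_thresh:
        return "negative"
    return "invalid"  # does not participate in training

gt = [(10, 10, 50, 50)]
assert label_anchor((10, 10, 50, 50), gt) == "positive"      # IoU = 1.0
assert label_anchor((100, 100, 140, 140), gt) == "negative"  # IoU = 0.0
```

Note that `label_anchor` takes the best IoU over all ground-truth boxes, matching "the actual bounding boxes of all the targets" in the negative-sample rule.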
The ROI pooling network 130 is configured to process the first feature map and the at least one candidate region to obtain a second feature map, where the second feature map comprises the at least one candidate region. Specifically, ROI pooling is a type of pooling operation applied to multiple candidate region ROIs: it extracts the local feature map corresponding to each candidate region ROI from the first feature map and scales it to a uniform size. The input to the ROI pooling network 130 is therefore the output of the feature extraction network 110 and of the region generation network 120, i.e. the first feature map and the plurality of candidate regions. The processing of the ROI pooling network 130 is as follows. First, the information of each candidate region ROI is mapped onto the first feature map. In some embodiments, if the scales of the candidate region and the first feature map are not consistent, then considering that the ROI coordinate information is given in absolute coordinates of the image to be detected, the ROI coordinates may be scaled according to the proportional relationship between the image to be detected and the first feature map, so that the candidate region coordinates and the first feature map are at the same scale. Then, a feature map of fixed size (w × h), i.e. the second feature map, is generated for each candidate region ROI by max pooling. Max pooling divides the region into equal parts and takes the maximum value within each part.
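A toy sketch of ROI max pooling on a single-channel feature map, assuming integer ROI coordinates already mapped onto the feature map; `np.array_split` handles the "divide the region into equal parts" step when the crop size does not divide evenly.

```python
import numpy as np

def roi_max_pool(fmap, roi, out_size=2):
    """Crop the ROI (x1, y1, x2, y2, integer feature-map coords) from an
    (H, W) feature map and max-pool it to out_size x out_size bins."""
    x1, y1, x2, y2 = roi
    crop = fmap[y1:y2, x1:x2]
    h_bins = np.array_split(np.arange(crop.shape[0]), out_size)
    w_bins = np.array_split(np.arange(crop.shape[1]), out_size)
    out = np.empty((out_size, out_size))
    for i, hb in enumerate(h_bins):
        for j, wb in enumerate(w_bins):
            out[i, j] = crop[np.ix_(hb, wb)].max()  # max within each bin
    return out

fmap = np.arange(16, dtype=float).reshape(4, 4)
pooled = roi_max_pool(fmap, (0, 0, 4, 4), out_size=2)
assert pooled.shape == (2, 2)
assert pooled[1, 1] == 15.0  # max of the bottom-right 2x2 bin
```

Every ROI, regardless of its original size, comes out as the same fixed out_size × out_size map, which is what allows the subsequent classification and regression networks to share one input shape.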
In some embodiments, the second feature map may be directly input into the subsequent classification network 140 and regression network 150 for classification and bounding-box regression. In some embodiments, a fully connected layer may be added after the ROI pooling network 130 to process the second feature map into a fully connected feature map, which is then input into the subsequent classification network 140 and regression network 150 for classification and bounding-box regression.
And the classification network 140 is used for processing the second feature map to obtain a category score of at least one target. In some embodiments, the classification network 140 calculates, through a network structure such as a fully connected layer, the class to which the second feature map corresponding to each candidate region belongs, and outputs the prediction probability that the second feature map belongs to each class, that is, a classification probability vector. In some embodiments, the classification network 140 performs binary classification and outputs two prediction probabilities for each candidate region: a probability score of belonging to a target and a probability score of not belonging to a target. In this embodiment, if k anchor points are set in the region generation network 120, the output of the classification network 140 is 2k classification probability scores.
In some embodiments, an attention subnetwork 160 can be added to the classification network 140; see fig. 5 for a block diagram of a classification network and a regression network with an added attention subnetwork. In the classification network 140, the second feature map is first convolved to obtain a feature map serving as the previous layer of the attention subnetwork 160, and the previous layer is then convolved to obtain a result serving as the temporary layer. Through the processing of the attention subnetwork 160, a previous layer with increased attention is obtained, which focuses more on the features corresponding to the class to which the target belongs. Finally, operations such as convolution are performed on the previous layer with increased attention to obtain 2k classification probability scores, where k is the number of candidate regions contained in the second feature map.
And the regression network 150 is used for processing the second feature map to obtain the bounding box information of the at least one target. In some embodiments, the regression network 150 uses bounding-box regression to obtain the position offset of each candidate region in the second feature map, regressing a more accurate target detection box. In some embodiments, the target detection box is characterized by the coordinates (x, y) of the center point of the candidate region and the width w and height h of the box, i.e., (x, y, w, h). In this embodiment, if k anchor points are set in the region generation network 120, the output of the regression network 150 is 4k coordinate values.
In some embodiments, an attention subnetwork 160 can be added to the regression network 150. Referring to fig. 5, in the regression network 150, a convolution operation is performed on the second feature map, and the resulting feature map is used as the previous layer of the attention subnetwork 160; a convolution operation is then performed on the previous layer, and the result is used as the temporary layer. The processing of the attention subnetwork 160 then yields a previous layer with increased attention, which focuses more on the location of each bounding box. Finally, the previous layer with increased attention is convolved to obtain 4k coordinate values, where k is the number of candidate regions included in the second feature map.
In some embodiments, a classification loss (used to evaluate the accuracy of the classification) and a regression loss (used to evaluate the accuracy of the bounding box) may be used as the basis of the loss function for training the target detection network 100. The two loss terms are combined with a certain weight (e.g., λ) to form the complete loss function of the target detection network 100. In some embodiments, the complete loss function is represented, for example, by equation (1):
$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p_i^*) + \lambda\,\frac{1}{N_{reg}}\sum_i p_i^*\,L_{reg}(t_i, t_i^*) \qquad (1)$$

where i is the index of an anchor point; $p_i$ is the predicted probability that the i-th anchor point is the target; $p_i^*$ is the ground-truth label of the actual bounding box: $p_i^* = 1$ if the anchor point is positive and $p_i^* = 0$ if the anchor point is negative; $t_i$ is a vector representing the 4 parameterized coordinates of the predicted bounding box; $t_i^*$ is a vector with the same dimension as $t_i$, namely the coordinate vector of the ground truth corresponding to the positive anchor point; λ is a balance parameter; $L_{cls}(p_i, p_i^*)$ is the classification loss function and $N_{cls}$ is the number of anchor points; $L_{cls}$ is the log loss over the two categories (target and non-target), whose specific form is shown in formula (2); $L_{reg}(t_i, t_i^*)$ is the regression loss function and $N_{reg}$ is the number of anchor-point coordinates; its specific form is shown in formula (3), where R is the smooth L1 loss function.

$$L_{cls}(p_i, p_i^*) = -\log\bigl[p_i^*\,p_i + (1 - p_i^*)(1 - p_i)\bigr] \qquad (2)$$

$$L_{reg}(t_i, t_i^*) = R(t_i - t_i^*), \qquad R(x) = \begin{cases} 0.5\,x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases} \qquad (3)$$
It should be noted that the above loss functions (formulas (1)-(3)) are existing loss function forms and are only one example in this application; the target detection network 100 may adopt other loss function forms during training according to specific training requirements.
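A small sketch of the standard loss forms referenced by formulas (1)-(3): binary log loss for classification and smooth-L1 for box regression, combined with the balance weight λ and normalizers N_cls and N_reg as in equation (1). The helper names are illustrative, not from the patent:

```python
import math

def cls_loss(p, p_star):
    # formula (2): log loss over the two categories (target / non-target)
    return -math.log(p_star * p + (1 - p_star) * (1 - p))

def smooth_l1(x):
    # the robust R(.) used in formula (3)
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def reg_loss(t, t_star):
    # sum over the 4 parameterized box coordinates
    return sum(smooth_l1(a - b) for a, b in zip(t, t_star))

def total_loss(p, p_star, t, t_star, lam=1.0):
    # equation (1): p, p_star are per-anchor probabilities / labels (0 or 1);
    # t, t_star are per-anchor 4-vectors of box coordinates
    n_cls, n_reg = len(p), 4 * len(p)
    l_cls = sum(cls_loss(pi, si) for pi, si in zip(p, p_star)) / n_cls
    # the regression term only counts positive anchors (p_i* = 1)
    l_reg = sum(si * reg_loss(ti, tsi)
                for si, ti, tsi in zip(p_star, t, t_star)) / n_reg
    return l_cls + lam * l_reg
```

A perfectly predicted positive anchor (probability 1.0, exact box) contributes zero loss, as expected from both terms of equation (1).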
In some embodiments, the feature extraction network 110, the region generation network 120, the ROI pooling network 130, the classification network 140, the regression network 150, and the attention subnetwork 160 can be integrated into the same electronic device. In some embodiments, they may be distributed across at least two electronic devices, which are communicatively connected to each other to transmit the processing data between the different networks. The electronic device may be any device with sufficient computing capability, for example, a notebook computer, a desktop computer, a server, or a service cluster.
Fig. 6 is a schematic structural diagram of an electronic device suitable for implementing an embodiment of the present application.
As shown in fig. 6, the electronic apparatus 600 includes a Central Processing Unit (CPU) 601 which can execute the various processes in the foregoing embodiments according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. The RAM 603 also stores various programs and data necessary for the operation of the electronic apparatus 600. The CPU 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a Cathode Ray Tube (CRT) or Liquid Crystal Display (LCD), and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card or a modem. The communication section 609 performs communication processing via a network such as the internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as necessary, so that a computer program read therefrom is installed into the storage section 608 as needed.
In particular, according to embodiments of the present application, the attention-mechanism-based target detection method described herein may be implemented as a computer software program. For example, embodiments of the present application include a computer program product comprising a computer program tangibly embodied on a computer-readable medium, the computer program comprising program code for performing the attention-mechanism-based target detection method. In such embodiments, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611.
As another aspect, the present application also provides a non-transitory computer-readable storage medium, which may be the computer-readable storage medium included in the electronic device in the above embodiment; or it may be a computer-readable storage medium that exists separately and is not built into the electronic device. The computer readable storage medium stores one or more programs for use by one or more processors in performing the attention-based mechanism object detection methods described herein.
Fig. 7 is a flowchart of an attention mechanism-based target detection method according to an embodiment of the present application. The method comprises the following steps 701 to 704:
701. Perform feature extraction on the image to be detected through the feature extraction network to obtain a first feature map.
Specifically, an image to be detected is input into the target detection network, the feature extraction network in the target detection network performs feature extraction on the image, and a first feature map is output. The spatial dimensions of the first feature map differ from those of the image to be detected: the first feature map carries less information in the length and width dimensions than the image to be detected, but more information in the depth dimension. For example, if the length, width, and depth dimensions of the image to be detected are 512 × 512 × 3, then after the processing by the feature extraction network 110, the length, width, and depth dimensions of the first feature map are 32 × 32 × 512, and the depth information is increased.
702. Process the first feature map through the region generation network to obtain at least one candidate region.
Specifically, the first feature map is input into a region generation network in the target detection network, and after processing, information of one or more candidate regions ROI can be output. The number and size of the output candidate regions ROI are related to anchor points (anchors) preset in a sliding window in the region generation network. If each sliding window in the area generation network is provided with k anchor points with different size proportions, k candidate areas appear at the position covered by each sliding window. In some embodiments, the output of the region generation network is an N × 5 matrix, N representing the number of ROIs, the first column in the matrix representing the ROI identification (e.g., number), and the last four columns representing the coordinate information of the ROI, which is the absolute coordinate corresponding to the image to be detected. In some embodiments, the coordinate information may be represented as coordinates of two diagonal points of the ROI, such as an upper left corner coordinate and a lower right corner coordinate, or an upper right corner coordinate and a lower left corner coordinate. In some embodiments, the coordinate information may also be expressed as a center coordinate of the ROI and a width w and a height h of the ROI.
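The two ROI coordinate encodings mentioned above can be converted into each other. A small hypothetical helper pair (corner format (x1, y1, x2, y2) versus center format (cx, cy, w, h)):

```python
def corners_to_cwh(x1, y1, x2, y2):
    """Diagonal corners -> center coordinates plus width and height."""
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)

def cwh_to_corners(cx, cy, w, h):
    """Center coordinates plus width and height -> diagonal corners."""
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```

Either encoding fills the last four columns of the N × 5 ROI matrix; the conversion is lossless, so the choice is a matter of convention.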
703. Process the first feature map and the at least one candidate region through the ROI pooling network to obtain a second feature map, where the second feature map includes the at least one candidate region.
Specifically, the first feature map and the at least one candidate region obtained above are input into the ROI pooling network, which first maps the ROI information of each candidate region into the first feature map. In some embodiments, if the scales of the candidate region and the first feature map are not consistent, considering that the ROI coordinate information is an absolute coordinate in the image to be detected, the ROI coordinates may be scaled according to the proportional relationship between the image to be detected and the first feature map so that the coordinate scales of the candidate region and the first feature map are consistent. Then, through maximum pooling, a feature map of a fixed size (w × h) is generated for each candidate region ROI, i.e., the second feature map is obtained.
704. Process the second feature map through a classification network and a regression network, respectively, to obtain a category score and bounding box information of at least one target; wherein at least one of the feature extraction network, the classification network, and the regression network comprises an attention subnetwork.
Specifically, the second feature map is input into the classification network and the regression network, respectively. Through the processing of the classification network, the class to which the second feature map corresponding to each candidate region belongs is calculated, and the prediction probability of the second feature map belonging to each class, namely a classification probability vector, is output. In some embodiments, the classification network performs binary classification and outputs two prediction probabilities for each candidate region: a probability score of belonging to a target and a probability score of not belonging to a target. In this embodiment, if k anchor points are set in the region generation network, the output of the classification network is 2k classification probability scores (i.e., category scores). Through the processing of the regression network, bounding-box regression is used to obtain the position offset of each candidate region in the second feature map, and a more accurate target detection box is regressed. In some embodiments, the target detection box is characterized by the coordinates (x, y) of the center point of the candidate region and the width w and height h of the box, i.e., (x, y, w, h). In this embodiment, if k anchor points are set in the region generation network, the output of the regression network is 4k coordinate values (i.e., bounding box information).
In some embodiments, before inputting the second feature map into the classification network and the regression network, the second feature map may be input into a fully connected layer to obtain a fully connected feature map corresponding to the second feature map, and the fully connected feature map may be input into the classification network and the regression network, respectively.
In some embodiments, at least one of the feature extraction network, the classification network, and the regression network described above comprises an attention subnetwork. In this way, the processing of the attention subnetwork is added to steps 701 and 704, so that the feature extraction process focuses more on the features of the target, the classification category, and the position of the bounding box, improving the target classification precision and the target positioning precision and thereby improving the accuracy of target detection.
In some embodiments, the attention subnetwork process includes steps 801 to 804:
801. Determine the previous layer, which is the feature map to be configured with attention weights.
Specifically, the attention subnetwork in this application increases attention by weighting the information that needs attention in the network; to ensure the reliability of the weights, the weights are calculated from a feature map in the network. Therefore, the attention subnetwork selects a feature map containing the information of interest as the previous layer to be configured with attention weights. In some embodiments, if an attention subnetwork is added to the feature extraction network, the previous layer may be the first feature map. In some embodiments, if the feature extraction network is a multi-scale feature extraction network, that is, the feature extraction network extracts a plurality of first feature maps with different scales, the previous layer may be a first feature map with one of the larger scales. In some embodiments, if the feature extraction network extracts a plurality of first feature maps with different scales, the previous layer may be a first feature map of any scale. In some embodiments, if an attention subnetwork is added to the classification network or the regression network, the previous layer may be obtained based on the second feature map, e.g., by performing a convolution operation on the second feature map, with the resulting feature map serving as the previous layer.
802. A temporary layer is determined.
In some embodiments, the attention subnetwork determines the temporary layer by performing a convolution operation on the previous layer. Thus, the temporary layer is also from the characteristic diagram in the network, thereby avoiding introducing external errors and further ensuring the reliability of subsequent weights.
In some embodiments, if the feature extraction network extracts a plurality of first feature maps with different scales, and the previous layer is a first feature map with a first scale, a first feature map with a second scale smaller than the first scale is selected as the temporary layer. For example, referring to fig. 3, the first feature map F1 may be selected as a previous layer, and one of the first feature maps F2, F3, F4, and F5 may be selected as a temporary layer. Therefore, in the process of determining the attention weight, the temporary layer can be subjected to upsampling operation, and loss of characteristic information is avoided.
In some embodiments, if the feature extraction network extracts a plurality of first feature maps with different scales, and the previous layer is a first feature map with any scale, the process of determining the temporary layer is as follows: and selecting at least two scales of first feature maps which are smaller than the scale of the previous layer from the residual scales, performing convolution operation on the selected first feature maps to obtain a combined feature map, and using the combined feature map as a temporary layer. For example, in fig. 3, the first feature map F1 is selected as the previous layer, and the first feature maps F2, F3, F4, and F5 are selected to be combined, resulting in the temporary layer.
803. Based on the previous layer and the temporary layer, an attention weight is determined.
In particular, after the attention subnetwork determines the previous layer and the temporary layer, the attention weight can be obtained by performing various operations on the previous layer and the temporary layer. In some embodiments, the add operation is performed on the previous layer and the temporary layer first; and performing convolution and sigmoid activation on the feature graph obtained by the addition operation to obtain the attention weight matched with the previous layer. The attention weight represents the attention of the network, and a place with a large weight is an attention concentration point, and the probability that the place is a target is increased.
In some embodiments, the attention subnetwork unifies the scale of the previous layer and the scale of the temporary layer before determining the attention weight based on the previous layer and the temporary layer, and the unification of the scales may be performed by upsampling a layer with a smaller scale or downsampling a layer with a larger scale. If the scales of the previous layer and the temporary layer are not uniform, the subsequent addition operation cannot be performed, so that the scales of the previous layer and the temporary layer can be uniform in an up-sampling or down-sampling mode. In some embodiments, the scale of the temporary layer may be chosen to be uniform to that of the previous layer, given that the previous layer is a feature map for subsequent target detection, while the temporary layer is just an intermediate transition layer for calculating attention weights.
804. Multiply the previous layer and the attention weight to increase the attention of the previous layer.
Specifically, the previous layer and the attention weight are multiplied to obtain the previous layer with increased attention.
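Steps 801 to 804 can be sketched as follows. This is a minimal illustration under simplifying assumptions: the convolutions are replaced by per-pixel scalings with hypothetical fixed weights (a real network would learn convolution kernels), and the previous layer and temporary layer are assumed to already share the same scale:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_subnetwork(prev_layer, conv_w=1.0, weight_w=1.0):
    # 802: derive the temporary layer from the previous layer
    # (a stand-in for a convolution operation)
    temp_layer = conv_w * prev_layer
    # 803: add the two layers, "convolve", then apply sigmoid activation
    # to obtain attention weights in (0, 1) matching the previous layer's shape
    attn = sigmoid(weight_w * (prev_layer + temp_layer))
    # 804: element-wise multiply to obtain the attention-augmented previous layer
    return prev_layer * attn
```

Locations with larger activations receive weights closer to 1 and are suppressed the least, which is the "attention concentration point" behavior described in step 803.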
It should be noted that, if the feature extraction network extracts a plurality of first feature maps with different scales, an attention subnetwork may be added to at least one scale. For example, in fig. 3, attention subnetworks may be added to one of the scales, or to two, three, four, or even five scales.
In summary, in the attention-mechanism-based target detection method provided by this application, an attention subnetwork is added to at least one of the feature extraction network, the classification network, and the regression network in the target detection network, so that the target detection network focuses more on at least one of the features, categories, and positions of the targets to be focused on in the image, improving the target classification precision and the target positioning precision and thereby improving the target detection precision. In addition, the added attention subnetwork requires only a few extra convolution and activation operations on top of the original network and barely increases the calculation amount and calculation time of the network. On the contrary, adding the attention subnetwork enables the target detection network to focus on the features, categories, and positions of the targets more rapidly, saving the training time and detection time of the network to a certain extent and improving the speed of target detection.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The above description is merely exemplary of the present application and is presented to enable those skilled in the art to understand and practice the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (14)

1. A method of target detection based on an attention mechanism, the method comprising:
extracting the characteristics of an image to be detected through a characteristic extraction network to obtain a first characteristic diagram;
processing the first feature map through a region generation network to obtain at least one candidate region;
processing the first feature map and the at least one candidate region through an ROI pooling network to obtain a second feature map, wherein the second feature map comprises the at least one candidate region;
respectively processing the second feature map through a classification network and a regression network to obtain the category score and the bounding box information of at least one target;
wherein at least one of the feature extraction network, the classification network, and the regression network comprises an attention subnetwork;
wherein the attention subnetwork is configured to: determining a previous layer, wherein the previous layer is a feature map to be configured with attention weights; determining a temporary layer; determining an attention weight based on the previous layer and the temporary layer; and multiplying the prior layer and the attention weight to realize the attention increase of the prior layer.
2. The method of claim 1, wherein the determining a temporary layer comprises: and performing convolution operation on the prior layer to obtain a temporary layer.
3. The method according to claim 1, wherein the feature extraction network extracts a plurality of first feature maps with different scales;
the determining the previous layer comprises: selecting a first feature map of a first scale as a previous layer;
the determining a temporary layer includes: selecting a first feature map of a second scale as a temporary layer;
wherein the second dimension is less than the first dimension.
4. The method according to claim 1, wherein the feature extraction network extracts a plurality of first feature maps with different scales;
the determining the previous layer comprises: selecting a first feature map of any scale as a previous layer;
the determining a temporary layer includes: selecting a plurality of first feature maps, and merging the selected first feature maps into a temporary layer through convolution operation;
wherein the selected plurality of first feature maps are all smaller in scale than the preceding layer.
5. The method of claim 1, wherein determining an attention weight based on the previous layer and the temporary layer comprises:
performing an addition operation on the previous layer and the temporary layer;
performing convolution operation on the feature map obtained by the addition operation;
and performing activation operation on the feature map obtained by the convolution operation to obtain the attention weight.
6. The method of any of claims 1 to 5, wherein prior to determining the attention weight based on the previous layer and the temporary layer, the attention subnetwork is further configured to:
unifying the scale of the previous layer and the scale of the temporary layer, wherein the unifying comprises: the layer with the smaller scale is up-sampled, or the layer with the larger scale is down-sampled.
7. An attention-based target detection network, the network comprising:
a feature extraction network, a region generation network, an ROI pooling network, a classification network and a regression network;
wherein at least one of the feature extraction network, the classification network, and the regression network comprises an attention subnetwork;
the attention subnetwork is used for determining a previous layer, and the previous layer is a feature map to be configured with attention weights; determining a temporary layer; determining an attention weight based on the previous layer and the temporary layer; multiplying the prior layer and the attention weight to realize the attention increase of the prior layer;
the characteristic extraction network is used for extracting the characteristics of the image to be detected to obtain a first characteristic diagram;
the area generation network is used for processing the first feature map to obtain at least one candidate area;
the ROI pooling network is used for processing the first feature map and the at least one candidate region to obtain a second feature map, and the second feature map comprises the at least one candidate region;
the classification network is used for processing the second feature map to obtain a category score of at least one target;
and the regression network is used for processing the second feature map to obtain the bounding box information of the at least one target.
8. The object detection network of claim 7, wherein the attention subnetwork determining the temporary layer comprises: and performing convolution operation on the prior layer to obtain a temporary layer.
9. The object detection network of claim 7, wherein the feature extraction network extracts a plurality of first feature maps with different scales;
the determining of the previous layer by the attention subnetwork of the feature extraction network comprises: selecting a first feature map of a first scale as a previous layer;
the determining of the temporary layer by the attention subnetwork of the feature extraction network comprises: selecting a first feature map of a second scale as a temporary layer;
wherein the second dimension is less than the first dimension.
10. The object detection network of claim 7, wherein the feature extraction network extracts a plurality of first feature maps with different scales;
the determining of the previous layer by the attention subnetwork of the feature extraction network comprises: selecting a first feature map of any scale as a previous layer;
the determining of the temporary layer by the attention subnetwork of the feature extraction network comprises: selecting a plurality of first feature maps, and merging the selected first feature maps into a temporary layer through convolution operation;
wherein the selected plurality of first feature maps are all smaller in scale than the preceding layer.
11. The object detection network of claim 7, wherein the attention subnetwork determining the attention weight based on the previous layer and the temporary layer comprises:
performing an addition operation on the previous layer and the temporary layer;
performing convolution operation on the feature map obtained by the addition operation;
and performing activation operation on the feature map obtained by the convolution operation to obtain the attention weight.
12. The object detection network of any one of claims 7 to 11, wherein the attention subnetwork is further configured to, before determining the attention weight based on the previous layer and the temporary layer:
unify the scale of the previous layer and the scale of the temporary layer, wherein the unification comprises: upsampling the layer with the smaller scale, or downsampling the layer with the larger scale.
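Claim 12 leaves the resampling method open. Nearest-neighbour upsampling of the smaller-scale layer is one common way to unify the two scales before the addition in claim 11, and can be sketched as:

```python
def upsample_nearest(feature_map, factor):
    """Nearest-neighbour upsampling sketch: repeat each value `factor`
    times along both axes, bringing a smaller-scale layer up to a larger
    layer's scale. The claim does not prescribe this particular method."""
    out = []
    for row in feature_map:
        # widen the row by repeating each value horizontally
        widened = [v for v in row for _ in range(factor)]
        # repeat the widened row vertically
        out.extend([list(widened) for _ in range(factor)])
    return out
```

Downsampling the larger layer (e.g. by strided pooling) is the symmetric alternative the claim also allows.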
13. An electronic device, comprising: a processor and a memory;
the processor is adapted to perform the steps of the method of any one of claims 1 to 6 by calling a program or instructions stored in the memory.
14. A non-transitory computer-readable storage medium storing a program or instructions for causing a computer to perform the steps of the method according to any one of claims 1 to 6.
CN202010727998.0A 2020-07-24 2020-07-24 Target detection method, network, device and storage medium based on attention mechanism Pending CN112016569A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010727998.0A CN112016569A (en) 2020-07-24 2020-07-24 Target detection method, network, device and storage medium based on attention mechanism

Publications (1)

Publication Number Publication Date
CN112016569A 2020-12-01

Family

ID=73498922

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010727998.0A Pending CN112016569A (en) 2020-07-24 2020-07-24 Target detection method, network, device and storage medium based on attention mechanism

Country Status (1)

Country Link
CN (1) CN112016569A (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090034809A1 (en) * 2007-08-01 2009-02-05 Fujifilm Corporation Abnormal tissue pattern detection apparatus, method and program
US20170124415A1 (en) * 2015-11-04 2017-05-04 Nec Laboratories America, Inc. Subcategory-aware convolutional neural networks for object detection
CN107368845A (en) * 2017-06-15 2017-11-21 South China University of Technology Faster R-CNN object detection method based on optimized candidate regions
CN108010060A (en) * 2017-12-06 2018-05-08 Beijing Xiaomi Mobile Software Co., Ltd. Object detection method and device
CN109255352A (en) * 2018-09-07 2019-01-22 Beijing Megvii Technology Co., Ltd. Object detection method, apparatus and system
CN109376756A (en) * 2018-09-04 2019-02-22 Affiliated Hospital of Qingdao University Deep-learning-based automatic recognition system for upper-abdomen metastatic lymph node sections, computer device and storage medium
CN109829893A (en) * 2019-01-03 2019-05-31 Wuhan Jingce Electronic Group Co., Ltd. Defect object detection method based on an attention mechanism
CN109829398A (en) * 2019-01-16 2019-05-31 Beihang University Object detection method for video based on a three-dimensional convolutional network
CN110766643A (en) * 2019-10-28 2020-02-07 University of Electronic Science and Technology of China Microaneurysm detection method for fundus images
CN111079604A (en) * 2019-12-06 2020-04-28 Chongqing Geographic Information and Remote Sensing Application Center (Chongqing Surveying and Mapping Product Quality Inspection and Testing Center) Method for fast detection of tiny targets in large-scale remote sensing images
CN111179217A (en) * 2019-12-04 2020-05-19 Tianjin University Multi-scale target detection method for remote sensing images based on an attention mechanism
CN111401201A (en) * 2020-03-10 2020-07-10 Nanjing University of Information Science and Technology Multi-scale target detection method for aerial images driven by spatial pyramid attention
CN111428726A (en) * 2020-06-10 2020-07-17 Sun Yat-sen University Panoptic segmentation method, system, device and storage medium based on a graph neural network

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112686913A (en) * 2021-01-11 2021-04-20 Tianjin University Object boundary detection and object segmentation model based on boundary attention consistency
CN112686913B (en) * 2021-01-11 2022-06-10 Tianjin University Object boundary detection and object segmentation model based on boundary attention consistency
CN113052175A (en) * 2021-03-26 2021-06-29 Beijing Baidu Netcom Science and Technology Co., Ltd. Target detection method and device, electronic device and readable storage medium
CN113052175B (en) * 2021-03-26 2024-03-29 Beijing Baidu Netcom Science and Technology Co., Ltd. Target detection method, target detection device, electronic device and readable storage medium
CN113269730A (en) * 2021-05-11 2021-08-17 Beijing Sankuai Online Technology Co., Ltd. Image processing method, image processing device, computer device and storage medium
CN113269730B (en) * 2021-05-11 2022-11-29 Beijing Sankuai Online Technology Co., Ltd. Image processing method, image processing device, computer device and storage medium
CN113326845A (en) * 2021-06-30 2021-08-31 Shanghai Yuncong Huilin Artificial Intelligence Technology Co., Ltd. Target detection method, system and storage medium based on a self-attention mechanism

Similar Documents

Publication Publication Date Title
CN108647585B (en) Traffic sign detection method based on a multi-scale recurrent attention network
CN109902677B (en) Vehicle detection method based on deep learning
CN112016569A (en) Target detection method, network, device and storage medium based on attention mechanism
CN113362329B (en) Method for training focus detection model and method for recognizing focus in image
US20180025249A1 (en) Object Detection System and Object Detection Method
CN110287960A (en) Detection and recognition method for curved text in natural scene images
CN110443258B (en) Character detection method and device, electronic equipment and storage medium
CN113642390B (en) Street view image semantic segmentation method based on local attention network
EP3859560A2 (en) Method and apparatus for visual question answering, computer device and medium
CN112418236A (en) Automobile drivable area planning method based on multitask neural network
CN111369581A (en) Image processing method, device, equipment and storage medium
CN114627052A (en) Infrared image air leakage and liquid leakage detection method and system based on deep learning
CN111461213A (en) Training method of target detection model and target rapid detection method
CN116645592B (en) Crack detection method based on image processing and storage medium
CN113850129A (en) Rotation-equivariant spatial local attention target detection method for remote sensing images
CN111461145A (en) Method for detecting target based on convolutional neural network
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
CN115631344B (en) Target detection method based on feature self-adaptive aggregation
CN115147598A (en) Target detection segmentation method and device, intelligent terminal and storage medium
CN111179270A (en) Image co-segmentation method and device based on attention mechanism
CN115995042A (en) Video SAR moving target detection method and device
CN113920538A (en) Object detection method, device, equipment, storage medium and computer program product
CN113963333B (en) Traffic sign board detection method based on improved YOLOF model
WO2022100607A1 (en) Method for determining neural network structure and apparatus thereof
CN115019201A (en) Weak and small target detection method based on a feature-refined deep network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210927

Address after: Factory Building No. 1, No. 299 Hongye Road, Dayun Town, Jiashan County, Jiaxing City, Zhejiang Province

Applicant after: UISEE TECHNOLOGY (ZHEJIANG) Co.,Ltd.

Address before: 211100 2nd floor, block B4, Jiulonghu international enterprise headquarters park, 19 Suyuan Avenue, Jiangning Development Zone, Nanjing City, Jiangsu Province (Jiangning Development Zone)

Applicant before: Yushi Technology (Nanjing) Co.,Ltd.
