CN113642572A - Image target detection method, system and device based on multi-level attention

Info

Publication number
CN113642572A
Authority
CN
China
Prior art keywords
image
attention
module
target
detection
Prior art date
Legal status
Granted
Application number
CN202110798192.5A
Other languages
Chinese (zh)
Other versions
CN113642572B (en)
Inventor
张重阳
赵炳堃
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN202110798192.5A
Publication of CN113642572A
Application granted
Publication of CN113642572B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method, a system and a device for image target detection based on multi-level attention, comprising the following steps: constructing a feature extractor based on a deep convolutional neural network as a backbone network, and inputting an image into the backbone network to extract the depth features of the image; constructing a branch network based on a convolutional neural network as an attention branch; inputting the depth features of the image into the attention branch to obtain a multi-level attention weight map; multiplying the multi-level attention weight map with the depth features of the image to obtain a weighted feature map; inputting the weighted feature map into an RPN module to obtain target candidate frames; and sending the weighted feature map corresponding to the target candidate frames to a classification and regression module to obtain the target detection frames. The invention can extract targets in an image at different levels according to their degree of interest, thereby greatly reducing false detections caused by disturbances in background regions when detecting specific image targets such as personnel in a monitoring system or defects in industrial products.

Description

Image target detection method, system and device based on multi-level attention
Technical Field
The invention relates to the technical field of image target detection, in particular to a method, a system and a device for detecting an image target based on multi-level attention.
Background
The task of target detection is to find all targets (objects) of interest in an image and determine their positions and sizes. It is a core research direction in the field of computer vision and has wide application requirements in scenes such as automatic driving, security monitoring and industrial manufacturing.
Traditional target detection is mainly based on hand-crafted features, as in methods such as HOG and DPM: features of the target of interest are designed manually and then classified with a classifier such as an SVM. Traditional target detection algorithms are only suitable for scenes with salient and simple characteristics; for complex scenes, it is difficult to design appropriate features by hand.
In recent years, with the rapid development of deep learning in the field of computer vision, target detection algorithms based on deep learning have made great progress. After R. Girshick et al. proposed the pioneering R-CNN (Regions with CNN features) algorithm in 2014, the field of target detection entered the deep learning era. The R-CNN algorithm first uses selective search to generate a series of region proposals, then feeds the proposals into a CNN model to extract features, and finally classifies all proposals with an SVM linear classifier, thereby achieving excellent detection performance. Fast R-CNN subsequently designed a multi-task loss function that unifies the classification task and bounding-box regression in the same network, which significantly improved training speed. The subsequent Faster R-CNN algorithm proposed by S. Ren et al. further broke through the speed bottleneck of Fast R-CNN by introducing an RPN network to generate region proposals instead of selective search, thereby truly realizing end-to-end target detection. In addition, there are single-stage (One-Stage) target detection algorithms, represented by YOLO and SSD, which convert the target detection task into a regression task. The YOLO algorithm abandons the earlier extract-candidates-then-verify paradigm and applies a single neural network to the whole image, thereby achieving a high detection speed. It first divides the whole image into grid cells, each of which is responsible for detecting targets whose center points fall within it, predicts a certain number of detection boxes and their confidences for each cell, and finally screens the predicted boxes through non-maximum suppression.
However, the above deep-learning-based target detection is basically aimed at general objects. In practical application scenarios the scenes are often more complex and the features of the object to be detected may not be sufficiently prominent, so the detector may be interfered with by the background or by other objects with similar features, causing a large number of false detections. For example, in defect detection of a reed switch, the detection of foreign matter in the reed contact area is easily affected by dirt or fine fibers on the glass tube wall, because these have characteristics similar to the foreign matter in the reed contact area, leading to erroneous judgments. Some existing solutions to such problems employ a two-stage detection scheme: the region to which the target object is most likely to be attached is detected first, and the target object is then detected within that region, so as to filter out false-detection samples outside the region of interest. However, firstly, this scheme runs two model inference processes, which sacrifices detection efficiency to a certain extent; secondly, it needs to train two detection models, so no end-to-end structure is formed; in addition, it computes the features of the same area twice, which wastes computational resources.
Therefore, how to design an end-to-end network structure at the algorithm level, together with a complete detection system and apparatus, to address the pain points encountered in the above practical industrial scenarios, is a problem well worth studying.
A prior-art search shows that Chinese patent CN112686304A discloses a target detection method, device and storage medium based on an attention mechanism and multi-scale feature fusion. It only focuses on global attention, does not consider the interference of specific background regions on the target to be detected, and therefore has difficulty effectively filtering out objects in background regions whose features are similar to those of the target to be detected.
Disclosure of Invention
Aiming at the problems in the above scenarios, the invention provides a method, a system and a device for image target detection based on multi-level attention, in which an attention mechanism is used to apply multi-level weighting to the feature map, so that hard samples outside the region of interest are filtered out to a certain extent and the false detection rate is reduced.
In a first aspect of the present invention, an image target detection method based on a multi-stage attention mechanism is provided, which includes:
s1, constructing a feature extractor based on the deep convolutional neural network, using the feature extractor as a backbone network, and inputting the image into the backbone network to extract the depth features of the image;
s2, constructing a branch network based on the convolutional neural network as an attention branch for extracting a multi-level attention weight map;
s3, inputting the depth feature of the image into the attention branch to obtain a multi-level attention weight map;
s4, multiplying the multilevel attention weight map and the depth feature of the image to obtain a weighted feature map;
s5, inputting the weighted feature map into an RPN module to obtain a series of target candidate frames;
and S6, sending the weighted feature map corresponding to the target candidate frame to a classification and regression module to finally obtain a target detection frame.
Preferably, the S2 includes:
s21, performing dimensionality reduction on the depth features of the image of S1 through convolution operation to obtain an output feature map with the same scale and the channel number of 1;
and S22, performing convolution operation on the output characteristic diagram obtained in the step S21 to obtain a multi-level attention weight diagram with a value between 0 and 1 as the output of the attention branch.
Preferably, the S3 further comprises providing supervision information for the attention branch, including:
S31, collecting a large number of images containing the object to be detected, constructing a training data set, labeling the training data set, and marking the position, size and category information of the object to be detected as well as the position, size and category information of the area to which the object to be detected is attached, namely the region of interest;
S32, generating a zero matrix with the same scale as the depth features of the image of S1, and transforming the position and size of the region of interest in the training image in equal proportion to obtain transformed coordinates;
S33, in the zero matrix, according to the transformed coordinates obtained in S32, assigning values to the positions corresponding to the transformed regions of interest as primary region of interest, secondary region of interest and region of no interest respectively, where different regions correspond to different values;
S34, during training, using the matrix assigned in S33 as the supervision information of the attention weight map, and calculating the loss of the attention branch through a loss function, denoted as Loss_a.
Preferably, the attention weight map output in S2 represents the degree of importance of different regions, wherein the primary region of interest has the largest weight and the secondary region of interest has the next highest weight. In S4, the weighted feature map is obtained by multiplying the attention weight map by the feature map output in S1.
Preferably, in S6, the classification and regression module includes: the classification network is used for classifying the weighted feature maps corresponding to the target candidate frames and outputting specific categories of the weighted feature maps; and the regression network is used for finely adjusting the position of the target candidate frame. The classification and regression networks respectively obtain a Loss during training, and the two losses are added with the Loss of the attention branch obtained in S3 to serve as the total Loss of the whole network, so that end-to-end training is realized.
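To make the structure of the classification and regression module concrete, the following is a PyTorch-style sketch that assumes torchvision's roi_align is used to pool the weighted features inside each candidate frame; the layer widths, pooling resolution and class count are illustrative assumptions, since the patent does not specify them, and typical choices such as cross-entropy for the classification loss and smooth-L1 for the regression loss are likewise not named by the patent.

```python
import torch.nn as nn
from torchvision.ops import roi_align

class ClsRegHead(nn.Module):
    """Illustrative classification-and-regression head: pools the weighted
    feature map M2 inside each candidate frame, then predicts a class score
    (classification network) and a frame refinement (regression network)."""
    def __init__(self, in_channels=512, num_classes=2, pool_size=7):
        super().__init__()
        self.pool_size = pool_size
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_channels * pool_size * pool_size, 1024),
            nn.ReLU(inplace=True),
        )
        self.cls_score = nn.Linear(1024, num_classes)       # classification network
        self.bbox_pred = nn.Linear(1024, 4 * num_classes)   # regression network (frame refinement)

    def forward(self, m2, boxes, spatial_scale):
        # boxes: list of (K_i, 4) candidate frames per image, in image coordinates
        feats = roi_align(m2, boxes, output_size=self.pool_size,
                          spatial_scale=spatial_scale, aligned=True)
        x = self.fc(feats)
        return self.cls_score(x), self.bbox_pred(x)
```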
In a second aspect of the present invention, there is provided an image target detection system based on a multi-stage attention mechanism, comprising:
the characteristic extraction module constructs a characteristic extractor based on the deep convolutional neural network, and the characteristic extractor is used as a backbone network to input the image into the backbone network to extract the depth characteristic of the image;
the attention branch module is used for constructing a branch network based on the convolutional neural network to serve as an attention branch and extracting a multi-level attention weight map;
a multi-level attention weight map acquisition module, which inputs the depth features of the image obtained by the feature extraction module into the attention branches constructed by the attention branch module to obtain a multi-level attention weight map;
a weighted feature map acquisition module, which multiplies the multi-level attention weight map obtained by the multi-level attention weight map acquisition module by the depth feature of the image obtained by the feature extraction module to obtain a weighted feature map;
a target candidate frame acquisition module, which inputs the weighted feature map obtained by the weighted feature map acquisition module into an RPN module to obtain a series of target candidate frames;
and the classification and regression module is used for classifying and regressing according to the weighted feature map corresponding to the target candidate frame obtained by the target candidate frame obtaining module to obtain the target detection frame.
In a third aspect of the present invention, there is provided an image object detection apparatus based on multi-level attention, comprising:
the image acquisition module is used for capturing a target to be detected, acquiring an image or a video containing the target to be detected in a specific scene and then carrying out subsequent detection;
the detection module is used for detecting the image acquired by the image acquisition module to obtain a specific detection result and displaying or feeding back the detection result to the control module; the detection adopts the image target detection method based on the multi-stage attention mechanism.
In a fourth aspect of the present invention, there is provided a computer device comprising at least one processor and at least one memory, wherein the memory stores a computer program which, when executed by the processor, enables the processor to perform the multi-level attention mechanism-based image object detection method.
In a fifth aspect of the present invention, there is provided a computer-readable storage medium, wherein instructions, when executed by a processor within a device, enable the device to perform the above-mentioned image object detection method based on a multi-level attention mechanism.
Compared with the prior art, the invention has the following advantages:
(1) the method assigns weights to the extracted depth features through a multi-level attention mechanism, giving different regions different weights according to their degree of interest, for example a higher weight to regions of greater interest and a lower weight to regions prone to false detection, which reduces to a certain extent the probability of false detections caused by background disturbances;
(2) by designing a branch structure and adding its loss to the total loss, the invention realizes end-to-end training, making the training and inference of the detection process simpler;
(3) the branch structure introduced by the invention adds only a small amount of computation, and compared with a two-stage detection method that performs inference twice, it avoids repeated computation of the feature map, thereby improving inference efficiency in the detection process;
(4) based on the method, the invention provides a detection system and a detection device to automatically detect surface defects in the production process of industrial products, thereby replacing manual labor to a certain extent and saving labor cost.
Drawings
Embodiments of the invention are further described below with reference to the accompanying drawings:
FIG. 1 is a flowchart of a multi-level attention-based image target detection method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of the attention branch according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the supervision information used during training of the attention branch according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a system and an apparatus for multi-level attention-based image target detection according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications can be made by persons skilled in the art without departing from the spirit of the invention, all of which fall within the scope of the present invention.
Fig. 1 is a flowchart illustrating a multi-level attention-based image target detection method according to an embodiment of the present invention.
Referring to fig. 1, the method for detecting an image target based on a multi-level attention mechanism of the present embodiment includes:
s1, constructing a feature extractor which is called a backbone network based on the deep convolutional neural network, inputting the image into the backbone network, and extracting the depth features of the image, which are marked as M1;
s2, constructing a branch network based on the convolutional neural network as an attention branch for extracting a multi-level attention weight map;
s3, taking the depth feature M1 of the image obtained in S1 as the input of the attention branch, and outputting a multi-level attention weight map, which is marked as W;
s4, multiplying the output W of S3 and the output M1 of S1 to obtain a weighted characteristic diagram, which is marked as M2;
s5, taking the weighted feature map M2 of S4 as the input of the RPN module to obtain a series of target candidate frames, and marking the target candidate frames as B1;
and S6, sending the weighted feature map corresponding to the target candidate box output in the S5 to a classification and regression module, and finally obtaining a target detection box marked as B2.
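As a concrete illustration of how S1-S6 fit together, the following is a minimal PyTorch-style sketch of the forward wiring; the module names (backbone, attention_branch, rpn, roi_head) and the feature shapes in the comments are placeholders for this sketch, not identifiers defined by the patent.

```python
def forward_pipeline(image, backbone, attention_branch, rpn, roi_head):
    # S1: extract the depth feature M1 with the backbone network
    m1 = backbone(image)                 # e.g. (N, 512, H', W')
    # S2/S3: the attention branch outputs a single-channel weight map W in [0, 1]
    w = attention_branch(m1)             # (N, 1, H', W')
    # S4: element-wise multiplication (broadcast over channels) gives M2
    m2 = m1 * w
    # S5: the RPN proposes candidate frames B1 from the weighted feature map
    b1 = rpn(m2)
    # S6: classification and regression on the pooled candidate features give B2
    b2 = roi_head(m2, b1)
    return b2
```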
In S1, in a preferred embodiment, feature extraction is performed through a backbone network, where ResNet-50 may be used as the backbone network; each stage applies convolution and pooling operations to reduce the size of the feature map and increase its number of channels, and the output feature map of the fourth stage is finally selected as the output of the backbone network, i.e., the depth feature M1 of the image. Of course, in other embodiments, other networks may be used, and the choice is not limited to ResNet-50.
In a specific embodiment, S1 may refer to the following operations:
S11, preprocessing the input image (for example, scaling it), and then passing it through a convolution layer and a pooling layer in sequence to obtain the first-stage features, namely shallow features, whose size is reduced to half that of the preprocessed image and whose channel number is 64;
S12, performing convolution and pooling on the shallow features to obtain the second-stage features, namely intermediate features, whose size is reduced to half that of the shallow features and whose channel number is 128;
S13, performing convolution and pooling on the intermediate features to obtain the third-stage features, namely deeper features, whose size is reduced to half that of the intermediate features and whose channel number is 256;
and S14, performing convolution and pooling on the deeper features to obtain the fourth-stage features, namely deep features, which serve as the output feature map of the backbone network; their size is reduced to half that of the deeper features and the channel number is 512.
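The four stages S11-S14 can be sketched as follows. This is only an illustration that follows the channel counts stated above (64/128/256/512) and halves the spatial size once per stage; it is simpler than the actual ResNet-50 stage structure, so it should be read as a stand-in rather than the backbone actually used.

```python
import torch.nn as nn

class FourStageBackbone(nn.Module):
    """Sketch of the S11-S14 feature extractor: each stage halves the spatial
    size via pooling and doubles the channel count (64 -> 128 -> 256 -> 512)."""
    def __init__(self, in_channels=3):
        super().__init__()
        def stage(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=2, stride=2),   # halves H and W
            )
        self.stage1 = stage(in_channels, 64)    # S11: shallow features
        self.stage2 = stage(64, 128)            # S12: intermediate features
        self.stage3 = stage(128, 256)           # S13: deeper features
        self.stage4 = stage(256, 512)           # S14: deep features, output M1

    def forward(self, x):
        return self.stage4(self.stage3(self.stage2(self.stage1(x))))
```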
In a preferred example, in S2, in order to construct a branch network for extracting a multi-level attention weight map, the method may include:
S21, applying a 3x3 convolution to the output feature map M1 of S1 to reduce its channel dimension to 1, obtaining an output feature map with the same spatial scale as M1 and a channel number of 1;
and S22, applying another 3x3 convolution to the output feature map of S21 to obtain an attention weight map with values between 0 and 1, which serves as the output of the attention branch and is denoted as W.
In this embodiment, different regions in the image are classified into different attention levels according to the degree of interest, the different attention levels correspond to different values, and the higher the attention level is, the larger the value is. Repeated calculation of the feature map is avoided through the introduced branch structure, and therefore processing efficiency is improved.
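A small sketch of the attention branch of S21-S22 follows; the sigmoid at the end is an assumption, since the text only states that the output values lie between 0 and 1 without naming the squashing function.

```python
import torch
import torch.nn as nn

class AttentionBranch(nn.Module):
    """S21: a 3x3 convolution reduces M1 to a single channel; S22: a second 3x3
    convolution refines it and the result is squashed into (0, 1) to form the
    multi-level attention weight map W."""
    def __init__(self, in_channels=512):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)   # S21
        self.refine = nn.Conv2d(1, 1, kernel_size=3, padding=1)             # S22

    def forward(self, m1):
        w = self.refine(self.reduce(m1))
        return torch.sigmoid(w)   # W has the same spatial size as M1, one channel
```

The weighted feature map of S4 is then simply m2 = m1 * w, where the single-channel W broadcasts over the channels of M1.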
In the preferred example, in S3, a multi-level attention weight map is obtained through the attention branch. In order to guide the attention branch to generate larger weights for the region of interest, the attention branch should be provided with corresponding supervision information. For example, in one embodiment, the region of interest is the contact area of the reed, since foreign matter is always attached to the reed contact area. Specifically, the method comprises the following steps:
S31, labeling the training data set: in addition to the position, size and category information of the object to be detected, i.e., the foreign matter in the reed contact area, the area to which the object to be detected is attached, i.e., the region of interest, which in this example is the reed contact area, is also labeled with its position, size and category information.
And S32, generating a zero matrix with the same scale as the output feature map M1 of S1, and scaling the position and size of the region of interest, namely the reed contact area, in the training picture in equal proportion. For example, assume that the width and height of the preprocessed input picture of S1 are W and H, and that the width and height of the output feature map M1 of S1 are W1 and H1; the position coordinates of the region of interest, i.e., the reed contact area, in the picture are ((x11, y11), (x12, y12)). The position coordinates of the corresponding area after transformation are derived by the following formulas:
x21 = x11 · W1 / W
y21 = y11 · H1 / H
x22 = x12 · W1 / W
y22 = y12 · H1 / H
where x21, y21, x22, y22 respectively denote the abscissa of the upper-left corner, the ordinate of the upper-left corner, the abscissa of the lower-right corner and the ordinate of the lower-right corner of the transformed region of interest. This set of coordinates corresponds to the coordinates of the reed contact area after scaling.
And S33, in the zero matrix, according to the transformed coordinates obtained in S32, assigning the value 1 to the positions corresponding to the transformed region of interest, namely the reed contact area; this region represents the primary region of interest, and its practical meaning is that foreign matter in the reed contact area affects the switching performance of the reed and should be detected accurately. Other positions lying on the same horizontal line as the primary region of interest are assigned 0.5 and are called the secondary region of interest; the practical meaning is that foreign matter on the reed non-contact area does not affect the performance of the reed switch for the moment, but may later move to the reed contact area, and can therefore be reported with a lower degree of confidence. The remaining area, which keeps the value 0, is called the region of no interest and represents other areas that are easy to misdetect, such as smudges or fibers on the glass tube wall, which should be filtered out. Of course, in other embodiments, other assignment rules may be adopted; the values mainly serve to distinguish different areas, and this is only an example, not a limitation of the present invention.
S34, during training, the matrix assigned in S33 is used as the supervision information of the attention weight map, and the loss of the attention branch is calculated through a loss function and is denoted as Loss_a.
The supervision information obtained in this embodiment guides the attention branch to generate larger weights for the region of interest, which increases the probability of detecting targets in that region; conversely, regions with a low probability of containing targets, such as the background, receive relatively smaller weights, so that false detections caused by background interference are reduced.
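The construction of the supervision information in S31-S34 can be sketched as follows. The 1 / 0.5 / 0 assignments follow S33; the use of a mean-squared-error loss for Loss_a is an assumption, since the patent only speaks of "a loss function".

```python
import torch
import torch.nn.functional as F

def build_attention_target(h1, w1, img_h, img_w, primary_boxes, secondary_boxes=()):
    """S32-S33: build the supervision matrix for one image.
    Boxes are (x11, y11, x12, y12) in original image coordinates; h1, w1 are the
    height and width of the feature map M1. Primary regions of interest are set
    to 1, secondary regions to 0.5, and everything else stays 0."""
    target = torch.zeros(h1, w1)

    def to_feature_coords(box):
        x11, y11, x12, y12 = box
        return (int(x11 * w1 / img_w), int(y11 * h1 / img_h),
                int(x12 * w1 / img_w), int(y12 * h1 / img_h))

    for box in secondary_boxes:                 # e.g. the reed non-contact area
        x21, y21, x22, y22 = to_feature_coords(box)
        target[y21:y22, x21:x22] = 0.5
    for box in primary_boxes:                   # e.g. the reed contact area
        x21, y21, x22, y22 = to_feature_coords(box)
        target[y21:y22, x21:x22] = 1.0
    return target

def attention_loss(weight_map, target):
    """S34: Loss_a between the predicted weight map W, shape (N, 1, H1, W1),
    and the supervision matrices stacked over the batch, shape (N, H1, W1);
    MSE is used here as a stand-in."""
    return F.mse_loss(weight_map.squeeze(1), target)
```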
In a preferred embodiment, the attention weight graph W output in S2 represents the importance of different regions, wherein the primary region of interest, i.e., the reed contact region, has the greatest weight, the secondary region of interest, i.e., the reed non-contact region, has the next greatest weight, and the other regions of no interest have a weight of 0. In S4, the attention weight map W is multiplied by the feature map M1 output in S1 to obtain a weighted feature map, which is denoted as M2.
In the preferred embodiment, in S6, the classification and regression networks respectively obtain a Loss during training, and the two losses are added to the Loss of the attention branch obtained in S3 to obtain the total Loss of the entire network, thereby achieving end-to-end training. As shown in the following equation:
Loss = Loss_a + Loss_cls + Loss_reg
where Loss represents the total loss of the entire network, Loss_a represents the loss of the attention branch, Loss_cls represents the loss of the classification network, and Loss_reg represents the loss of the bounding-box regression network.
In this embodiment, the loss of the branch is added to the total loss, so that end-to-end training is realized and the training and inference of the network are simpler.
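A short sketch of how the three loss terms are combined for end-to-end training follows; loss_cls and loss_reg stand for whatever scalar losses the classification and box-regression heads return in a concrete implementation.

```python
def training_step(optimizer, loss_a, loss_cls, loss_reg):
    """One optimisation step over the total loss Loss = Loss_a + Loss_cls + Loss_reg,
    so that the attention branch is trained jointly with the detection heads."""
    total_loss = loss_a + loss_cls + loss_reg
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return total_loss.detach()
```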
Based on the same technical concept, another embodiment of the present invention further provides an image target detection system based on a multi-stage attention mechanism, including:
the feature extraction module is used for constructing a feature extractor based on the deep convolutional neural network, and inputting the image into the backbone network to extract the depth features of the image;
the attention branch module is used for constructing a branch network based on the convolutional neural network to serve as an attention branch and extracting a multi-level attention weight map;
the multi-level attention weight map acquisition module inputs the depth features of the image obtained by the feature extraction module into the attention branches constructed by the attention branch module to obtain a multi-level attention weight map;
a weighted feature map acquisition module, which multiplies the multi-level attention weight map obtained by the multi-level attention weight map acquisition module by the depth feature of the image obtained by the feature extraction module to obtain a weighted feature map;
a target candidate frame acquisition module, which inputs the weighted feature map obtained by the weighted feature map acquisition module into the RPN module to obtain a series of target candidate frames;
and the classification and regression module is used for classifying and regressing according to the weighted feature map corresponding to the target candidate frame obtained by the target candidate frame obtaining module to obtain the target detection frame.
The specific implementation technology of each module in the embodiment of the image target detection system based on multi-level attention of the present invention may refer to the steps corresponding to the method, and will not be described herein again. The embodiment of the invention can meet the requirement of real-time detection and is more suitable for application in industrial scenes.
Based on the detection method and system, in another embodiment of the present invention, a multi-level attention-based image object detection apparatus is provided, in which the multi-level attention-based image object detection method is adopted for implementing a task of detecting a specific object in an image. Specifically, the image target detection device based on multi-level attention comprises: the image acquisition module is used for capturing a target to be detected, acquiring an image or a video containing the target to be detected in a specific scene and then carrying out subsequent detection; the detection module is used for detecting the image acquired by the image acquisition module to obtain a specific detection result and displaying or feeding back the detection result to the control module; the detection adopts the image target detection method based on the multi-stage attention mechanism in any one of the above embodiments.
Further, in order to clearly understand the technical solutions described above, the following description will be given in detail by taking a case of defect detection applied in an industrial scene as an example, but it should be understood that the example is not intended to limit the application of the present invention.
Specifically, referring to fig. 4, the image target detection method based on multi-level attention is applied to defect detection in an industrial scene. The embodiment is applied to defect detection in an industrial scene, and the detection target is a tiny foreign matter on a reed contact area in the magnetic reed switch. Since dirt or some fine fibers are often present on the wall of the reed switch glass tube, and their characteristics are similar to those of foreign objects on the reed contact area, it is difficult to completely distinguish the two by using the conventional target detection algorithm. These objects do not affect the function of the reed switch, and therefore a large number of false detections exist in the detection process. In view of this, the present embodiment adopts a product defect detecting apparatus based on multi-level attention, which includes: the device comprises a mechanical transmission module, an image acquisition module, a detection module and a software and hardware communication module. The mechanical transmission module conveys, rotates and grabs a product to be detected (a magnetic reed switch); the image acquisition module acquires images of conveyed and rotated products to be detected (magnetic reed switches) in the working process of the mechanical transmission module; the detection module processes and analyzes the image acquired by the image acquisition module, specifically adopts the image target detection method based on multi-level attention in the embodiment to obtain a specific detection result, and feeds the detection result back to the mechanical transmission module, and the mechanical transmission module carries out classified grabbing on the product to be detected (the magnetic reed switch) according to the fed-back detection result. Further, the device may further comprise a communication module for communication between the mechanical transmission module and the detection module.
Specifically, in a preferred embodiment, the mechanical transmission module performs conveying, rotating and grabbing of the product to be detected, and may include:
the material conveying module: the reed switch is automatically conveyed in a pipeline mode so as to carry out detection and classification.
The material rotation module: the magnetic reed switch can be grabbed, rotated and the like, rotated at any angle according to the axis, and continuously grabbed and subjected to image acquisition in the rotating process, so that the magnetic reed switch can be subjected to all-dimensional dead-angle-free detection.
The material dividing module: the magnetic reed switches are classified according to detection results, when the magnetic reed switches are conveyed to the tail end of the conveying belt by the conveying module, the material sorting module is used for grabbing the magnetic reed switches and putting the magnetic reed switches according to the detection results, the magnetic reed switches are divided into good products and defective products, and the defective products are further classified according to specific defect categories.
The programmable logic controller: the material distribution module is used for controlling the operation of the whole machine, and comprises a material conveying module, a material rotating module and a material distribution module, so that an operable interface is formed, and the operation such as control, parameter setting and the like can be performed on a mechanical device.
Specifically, in a preferred embodiment, the image capturing module may employ an optical image capturing device and may include: an optical microscope, used for imaging the surface of the reed switch, which magnifies the industrial product at a certain magnification for image acquisition; a light source system, comprising a light source and a light source controller, used for providing good illumination conditions for the optical microscope; and an industrial camera, used for capturing the optical images formed by the optical microscope as a series of images or videos for subsequent inspection.
Of course, in other embodiments, the optical image capturing device may alternatively be another device, including but not limited to a video microscope or a monitoring probe, for optically imaging the object to be inspected, such as an industrial product, and capturing it as a digital image.
Specifically, in a preferred embodiment, the detection module may include two parts, hardware and software. The hardware is a computer, such as a high-performance GPU computer, configured to run the detection software, detect the images acquired by the image acquisition module, and feed the detection results back to the control system, such as an industrial material sorting module or a security alarm linkage module, so as to finally realize the sorting of defective industrial products, audible and visual alarms for abnormal conditions, and the like. In this embodiment, the detection result is fed back to the sorting module, finally realizing the detection and sorting of the reed switches. The detection software mainly detects the acquired pictures and includes a graphical user interface through which a user can view the real-time pictures acquired by the image acquisition module and the result of each detection; the software also provides functions such as parameter configuration, data statistics and log recording.
Specifically, in a preferred embodiment, the hardware and software communication module is a signal conversion and transmission module, including but not limited to a switching-value module, which is used for signal conversion and communication between the control system and the high-performance GPU computer. The signal conversion is the conversion of digital quantities into analog or switching quantities, used to realize signal control of the mechanical devices. In this embodiment, a switching-value module may be used for communication between the high-performance GPU computer and mechanical devices such as the material rotation module and the material sorting module. Specifically, when the material rotation module starts rotating, it can send a start-detection signal to the computer through the switching-value module, so that the computer begins detecting the current reed switch; after the computer has finished detecting a reed switch, it can send the detection result to the material sorting module through the switching-value module, so that the reed switch can be sorted.
In another embodiment of the present invention, there is further provided a computer device, including at least one processor and at least one memory, where the memory stores a computer program, and when the program is executed by the processor, the processor is enabled to execute the image target detection method based on the multi-level attention mechanism of any one of the above embodiments.
In another embodiment of the present invention, a computer-readable storage medium is further provided, wherein when the instructions in the storage medium are executed by a processor in a device, the device is enabled to execute the image target detection method based on the multi-level attention mechanism of any one of the above embodiments.
It should be apparent to those skilled in the art that embodiments of the present invention may be provided as a method, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. An image target detection method based on a multi-stage attention mechanism is characterized by comprising the following steps:
s1, constructing a feature extractor based on the deep convolutional neural network, using the feature extractor as a backbone network, and inputting the image into the backbone network to extract the depth features of the image;
s2, constructing a branch network based on the convolutional neural network as an attention branch for extracting a multi-level attention weight map;
s3, inputting the depth feature of the image into the attention branch to obtain a multi-level attention weight map;
s4, multiplying the multilevel attention weight map and the depth feature of the image to obtain a weighted feature map;
s5, inputting the weighted feature map into an RPN module to obtain a series of target candidate frames;
and S6, sending the weighted feature map corresponding to the target candidate frame to a classification and regression module to finally obtain a target detection frame.
2. The image target detection method based on multi-stage attention mechanism as claimed in claim 1, wherein said S2 comprises:
s21, performing dimensionality reduction on the depth features of the image of S1 through convolution operation to obtain an output feature map with the same scale and the channel number of 1;
and S22, performing convolution operation on the output characteristic diagram obtained in the step S21 to obtain a multi-level attention weight diagram with a value between 0 and 1 as the output of the attention branch.
3. The method for image object detection based on multi-level attention mechanism as claimed in claim 1, wherein said S3 further comprises providing supervision information for said attention branch, including:
S31, collecting a large number of images containing the object to be detected, constructing a training data set, labeling the training data set, and marking the position, size and category information corresponding to the object to be detected and the area to which the object to be detected is attached, namely the position, size and category information of the region of interest;
S32, generating a zero matrix with the same depth characteristic scale as that of the image of S1, and carrying out equal-proportion transformation on the position and the size of the region of interest in the training image to obtain a transformation coordinate;
S33, in the zero matrix, according to the transformation coordinates obtained in S32, assigning values to the positions corresponding to the transformed interested areas according to the primary interested area, the secondary interested area and the uninteresting area respectively, wherein different areas correspond to different values;
S34, during training, using the matrix assigned in the step S33 as the supervision information of the attention weight map, and calculating the loss of the attention branch through a loss function, denoted as Loss_a.
4. The method for detecting image targets based on multi-level attention mechanism as claimed in claim 3, wherein in S32, it is assumed that the width and height of the original image inputted into S1 are W and H, that the width and height of the output feature map of S1 are W1 and H1, and that the position coordinates of the region of interest in the picture are ((x11, y11), (x12, y12)); the position coordinates of the corresponding area after transformation are derived by the following formulas:
x21 = x11 · W1 / W
y21 = y11 · H1 / H
x22 = x12 · W1 / W
y22 = y12 · H1 / H
wherein x21, y21, x22, y22 respectively represent the abscissa of the upper-left corner, the ordinate of the upper-left corner, the abscissa of the lower-right corner and the ordinate of the lower-right corner of the transformed region of interest.
5. The image target detection method based on the multi-level attention mechanism as claimed in claim 3, wherein in S33, the position corresponding to the region of interest after transformation is assigned with a larger value ranging from 0 to 1, where the region represents a primary region of interest; assigning a smaller value with the value range of 0 to 1 to other positions which are positioned in the same horizontal line with the primary interested area, and calling the smaller value as a secondary interested area; the remaining area remains 0, called the region of no interest.
6. The image target detection method based on multi-stage attention mechanism as claimed in claim 1, wherein in the step S6, the classification and regression module comprises:
the classification network is used for classifying the weighted feature maps corresponding to the target candidate frames and outputting specific categories of the weighted feature maps;
the regression network is used for finely adjusting the position of the target candidate frame;
the classification and regression networks respectively obtain a Loss during training, and the two losses are added with the Loss of the attention branch of S3 to serve as the total Loss of the whole network, so that end-to-end training is realized.
7. An image target detection system based on a multi-stage attention mechanism, comprising:
the characteristic extraction module constructs a characteristic extractor based on the deep convolutional neural network, and the characteristic extractor is used as a backbone network to input the image into the backbone network to extract the depth characteristic of the image;
the attention branch module is used for constructing a branch network based on the convolutional neural network to serve as an attention branch and extracting a multi-level attention weight map;
a multi-level attention weight map acquisition module, which inputs the depth features of the image obtained by the feature extraction module into the attention branches constructed by the attention branch module to obtain a multi-level attention weight map;
a weighted feature map acquisition module, which multiplies the multi-level attention weight map obtained by the multi-level attention weight map acquisition module by the depth feature of the image obtained by the feature extraction module to obtain a weighted feature map;
a target candidate frame acquisition module, which inputs the weighted feature map obtained by the weighted feature map acquisition module into an RPN module to obtain a series of target candidate frames;
and the classification and regression module is used for classifying and regressing according to the weighted feature map corresponding to the target candidate frame obtained by the target candidate frame obtaining module to obtain the target detection frame.
8. An image object detecting apparatus based on multi-level attention, comprising:
the image acquisition module is used for capturing a target to be detected, acquiring an image or a video containing the target to be detected in a specific scene and then carrying out subsequent detection;
the detection module is used for detecting the image acquired by the image acquisition module to obtain a specific detection result and displaying or feeding back the detection result to the control module; the detection adopts the image object detection method based on the multi-stage attention mechanism as claimed in any one of claims 1-6.
9. A computer device comprising at least one processor and at least one memory, wherein the memory stores a computer program which, when executed by the processor, enables the processor to perform the method of image object detection based on a multi-level attention mechanism of any one of claims 1-6.
10. A computer readable storage medium having instructions which, when executed by a processor within an apparatus, enable the apparatus to perform the method of image object detection based on a multi-level attention mechanism of any one of claims 1 to 6.
CN202110798192.5A 2021-07-15 2021-07-15 Image target detection method, system and device based on multi-level attention Active CN113642572B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110798192.5A CN113642572B (en) 2021-07-15 2021-07-15 Image target detection method, system and device based on multi-level attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110798192.5A CN113642572B (en) 2021-07-15 2021-07-15 Image target detection method, system and device based on multi-level attention

Publications (2)

Publication Number Publication Date
CN113642572A true CN113642572A (en) 2021-11-12
CN113642572B CN113642572B (en) 2023-10-27

Family

ID=78417381

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110798192.5A Active CN113642572B (en) 2021-07-15 2021-07-15 Image target detection method, system and device based on multi-level attention

Country Status (1)

Country Link
CN (1) CN113642572B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109213868A (en) * 2018-11-21 2019-01-15 中国科学院自动化研究所 Entity level sensibility classification method based on convolution attention mechanism network
CN110135243A (en) * 2019-04-02 2019-08-16 上海交通大学 A kind of pedestrian detection method and system based on two-stage attention mechanism
CN111046871A (en) * 2019-12-11 2020-04-21 厦门大学 Region-of-interest extraction method and system
US20200356854A1 (en) * 2017-11-03 2020-11-12 Siemens Aktiengesellschaft Weakly-supervised semantic segmentation with self-guidance
US20210012146A1 (en) * 2019-07-12 2021-01-14 Wuyi University Method and apparatus for multi-scale sar image recognition based on attention mechanism
CN112686304A (en) * 2020-12-29 2021-04-20 山东大学 Target detection method and device based on attention mechanism and multi-scale feature fusion and storage medium
CN112767466A (en) * 2021-01-20 2021-05-07 大连理工大学 Light field depth estimation method based on multi-mode information
CN113065550A (en) * 2021-03-12 2021-07-02 国网河北省电力有限公司 Text recognition method based on self-attention mechanism


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李文涛 等 (Li Wentao et al.): "多尺度通道注意力融合网络的小目标检测算法" (Small target detection algorithm with multi-scale channel attention fusion network), 《计算机科学与探索》 (Journal of Frontiers of Computer Science and Technology), pages 2390-2400 *

Also Published As

Publication number Publication date
CN113642572B (en) 2023-10-27

Similar Documents

Publication Publication Date Title
Jia et al. Detection and segmentation of overlapped fruits based on optimized mask R-CNN application in apple harvesting robot
CN112884064B (en) Target detection and identification method based on neural network
CN111899227A (en) Automatic railway fastener defect acquisition and identification method based on unmanned aerial vehicle operation
CN108985169B (en) Shop cross-door operation detection method based on deep learning target detection and dynamic background modeling
CN110009622B (en) Display panel appearance defect detection network and defect detection method thereof
CN114120093B (en) Coal gangue target detection method based on improved YOLOv algorithm
CN113642474A (en) Hazardous area personnel monitoring method based on YOLOV5
CN110599458A (en) Underground pipe network detection and evaluation cloud system based on convolutional neural network
CN114037684B (en) Defect detection method based on yolov and attention mechanism model
CN112561885B (en) YOLOv 4-tiny-based gate valve opening detection method
CN113469938A (en) Pipe gallery video analysis method and system based on embedded front-end processing server
CN113642572B (en) Image target detection method, system and device based on multi-level attention
CN116152722A (en) Video anomaly detection method based on combination of residual attention block and self-selection learning
CN112730437B (en) Spinneret plate surface defect detection method and device based on depth separable convolutional neural network, storage medium and equipment
CN113642473A (en) Mining coal machine state identification method based on computer vision
CN115564031A (en) Detection network for glass defect detection
CN112861681B (en) Pipe gallery video intelligent analysis method and system based on cloud processing
Chang et al. Deep Learning Approaches for Dynamic Object Understanding and Defect Detection
CN114387564A (en) Head-knocking engine-off pumping-stopping detection method based on YOLOv5
CN114140879A (en) Behavior identification method and device based on multi-head cascade attention network and time convolution network
CN112967335A (en) Bubble size monitoring method and device
CN113888604A (en) Target tracking method based on depth optical flow
Tennakoon et al. Visual Inspection of Storm-Water Pipe Systems using Deep Convolutional Neural Networks.
He et al. Fabric defect detection based on improved object as point
Wu et al. Express parcel detection based on improved faster regions with CNN features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant