CN113642572B - Image target detection method, system and device based on multi-level attention - Google Patents

Image target detection method, system and device based on multi-level attention

Info

Publication number: CN113642572B (application CN202110798192.5A)
Authority: CN (China)
Other versions: CN113642572A (Chinese, zh)
Prior art keywords: image, attention, interest, module, region
Inventors: 张重阳, 赵炳堃
Original and current assignee: Shanghai Jiaotong University
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status, assignees or dates listed)
Application filed by Shanghai Jiaotong University; application granted; publication of CN113642572A followed by grant publication CN113642572B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques


Abstract

The invention discloses an image target detection method, system and device based on multi-level attention, comprising the following steps: constructing a feature extractor based on a deep convolutional neural network as a backbone network, and inputting the image into the backbone network to extract the depth features of the image; constructing a branch network based on a convolutional neural network as an attention branch; inputting the depth features of the image into the attention branch to obtain a multi-level attention weight map; multiplying the multi-level attention weight map with the depth features of the image to obtain a weighted feature map; inputting the weighted feature map into an RPN module to obtain target candidate boxes; and sending the weighted feature map corresponding to each target candidate box into a classification and regression module to obtain the target detection boxes. The invention can detect targets in images hierarchically according to their degree of interest, thereby greatly reducing false detections caused by background disturbance when detecting specific image targets, such as personnel in a surveillance system or defects in industrial products.

Description

Image target detection method, system and device based on multi-level attention
Technical Field
The invention relates to the technical field of image target detection, and in particular to an image target detection method, system and device based on multi-level attention.
Background
The task of target detection is to find all targets (objects) of interest in an image and determine their positions and sizes. It is a core research direction in the field of computer vision and has wide application in scenarios such as autonomous driving, security surveillance and industrial manufacturing.
Traditional target detection is mainly based on manually designed features, as in methods such as HOG and DPM: features of the target of interest are designed by hand and then classified with an SVM classifier. Such algorithms are only suitable for single scenes with obvious characteristics; for complex scenes, the corresponding features are difficult to design manually.
In recent years, with the rapid development of deep learning in computer vision, target detection algorithms based on deep learning have also advanced greatly. After the R-CNN (Regions with CNN features) algorithm was proposed by R. Girshick et al. in 2014, the field of target detection entered the deep learning era. R-CNN first uses selective search to generate a series of region proposals, then feeds the proposals into a CNN model to extract features, and finally classifies all proposals with an SVM linear classifier, achieving very good detection performance. Fast R-CNN designs a multi-task loss function that unifies the classification and bounding-box regression tasks in the same network, significantly improving training speed. The Faster R-CNN algorithm proposed by Shaoqing Ren et al. further breaks the speed bottleneck of Fast R-CNN by introducing an RPN network to generate region proposals instead of selective search, truly realizing end-to-end target detection. In addition, there are single-stage (one-stage) detection algorithms, represented by YOLO and SSD, which cast target detection as a regression task. The YOLO algorithm abandons the earlier "extract candidate boxes + verify" paradigm and applies a single neural network to the whole image, achieving a very fast detection speed: the image is divided into a grid, each cell is responsible for detecting targets whose center falls inside it and predicts a fixed number of detection boxes with confidences, and the predicted boxes are finally filtered by non-maximum suppression.
However, the above deep-learning-based detectors are basically aimed at generic objects. In practical application scenarios the scene is often more complex and the features of the target to be detected may not be sufficiently distinctive, so the detector may be disturbed by the background or by other objects with similar features, causing a large number of false detections. For example, in defect detection for reed switches, detecting foreign matter on the reed contact area is easily affected by stains or fine fibers on the glass tube wall, because these have features similar to the foreign matter, leading to misjudgments. Some existing solutions adopt a two-stage detection scheme: first detect the region to which the target object is most likely attached, then detect the target object inside that region, thereby filtering out false-detection samples outside the region of interest. This approach has several drawbacks: it runs the model inference process twice, sacrificing detection efficiency to a certain extent; it requires training two detection models and does not form an end-to-end structure; and it computes the features of the same region twice, wasting computing resources.
Therefore, how to design an end-to-end network structure at the algorithm level, together with a complete detection system and device, to solve the pain points faced in the above practical industrial scenarios is a very valuable research problem.
Through retrieval, Chinese patent CN112686304A discloses a target detection method, device and storage medium based on an attention mechanism and multi-scale feature fusion. That method only considers global attention, ignores the interference of specific background areas on the target to be detected, and can hardly filter out objects in the background whose features resemble those of the target.
Disclosure of Invention
Aiming at the above problems, the invention provides an image target detection method, system and device based on multi-level attention, which weight the feature map at multiple levels through an attention mechanism, so as to filter out hard-to-distinguish samples outside the region of interest to a certain extent and reduce the false detection rate.
In a first aspect of the present invention, there is provided an image object detection method based on a multi-level attention mechanism, comprising:
s1, constructing a feature extractor based on a deep convolutional neural network, and inputting an image into the backbone network to extract the depth features of the image;
s2, constructing a branch network based on a convolutional neural network, and taking the branch network as an attention branch for extracting a multi-stage attention weight graph;
s3, inputting the depth characteristic of the image into the attention branch to obtain a multi-level attention weight graph;
s4, multiplying the multi-level attention weight graph with the depth features of the image to obtain a weighted feature graph;
s5, inputting the weighted feature map into an RPN module to obtain a series of target candidate frames;
and S6, sending the weighted feature map corresponding to the target candidate frame into a classification and regression module, and finally obtaining a target detection frame.
Preferably, the S2 includes:
s21, performing dimension reduction on the depth features of the image in the S1 through convolution operation to obtain an output feature map with the same dimension and 1 channel number;
s22, carrying out convolution operation on the output characteristic diagram obtained in the S21 to obtain a multi-stage attention weight diagram with a value between 0 and 1, and taking the multi-stage attention weight diagram as the output of the attention branch.
Preferably, the step S3 further includes providing supervision information for the attention branches, including:
s31, collecting a large number of images containing the object to be detected, constructing a training data set, marking the training data set, and marking the position, the size and the type information corresponding to the object to be detected, and the position, the size and the type information of the region to which the object to be detected is attached, namely the region of interest;
s32, generating a zero matrix with the same depth characteristic scale as the image in the S1, and carrying out equal-proportion transformation on the position and the size of the region of interest in the training image to obtain transformation coordinates;
s33, in the zero matrix, assigning values to positions corresponding to the transformed region of interest according to the primary region of interest, the secondary region of interest and the non-region of interest respectively according to the transformation coordinates obtained in the S32, wherein the different regions correspond to different values;
s34, taking the matrix assigned in the S33 as the supervision information of the attention weight graph during training, and calculating the Loss of the attention branches through the Loss function, and marking the Loss as the Loss a
Preferably, the attention weight map output in S2 represents the degree of importance of different regions: the weight of the primary region of interest is the largest, and the weight of the secondary region of interest is the second largest. In S4, the weighted feature map is obtained by multiplying the attention weight map by the feature map output in S1.
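The multiplication in S4 is element-wise, with the single-channel weight map broadcast across all feature channels. A minimal pure-Python sketch (function name and toy values are illustrative, not from the patent):

```python
# Apply a single-channel attention weight map W to a multi-channel feature
# map M1 by element-wise multiplication, broadcasting W across channels.
# Nested lists stand in for real tensors in this sketch.

def apply_attention(m1, w):
    """m1: features as [channels][h][w]; w: weight map as [h][w], values in [0, 1]."""
    return [[[m1[c][i][j] * w[i][j] for j in range(len(w[0]))]
             for i in range(len(w))]
            for c in range(len(m1))]

# Two-channel 2x2 feature map; the weights suppress the right column entirely.
m1 = [[[1.0, 2.0], [3.0, 4.0]],
      [[5.0, 6.0], [7.0, 8.0]]]
w = [[1.0, 0.0], [0.5, 0.0]]
m2 = apply_attention(m1, w)
# m2[0] == [[1.0, 0.0], [1.5, 0.0]]
```

The same weight map is reused for every channel, which is why the attention branch only needs a single output channel.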
Preferably, in S6, the classification and regression module includes: a classification network, used for classifying the weighted feature maps corresponding to the target candidate boxes and outputting their specific categories; and a regression network, used for fine-tuning the positions of the target candidate boxes. During training, the classification and regression networks each produce a loss, and these two losses are added to the loss of the attention branch obtained in S3 to form the total loss of the whole network, thereby realizing end-to-end training.
In a second aspect of the present invention, there is provided an image object detection system based on a multi-level attention mechanism, comprising:
the feature extraction module, used for constructing a feature extractor based on a deep convolutional neural network as a backbone network, and inputting an image into the backbone network to extract the depth features of the image;
an attention branching module which constructs a branching network based on the convolutional neural network as an attention branch for extracting a multi-stage attention weight map;
the multi-stage attention weight map acquisition module inputs the depth features of the images obtained by the feature extraction module into the attention branches constructed by the attention branch module to obtain a multi-stage attention weight map;
the weighted feature map acquisition module multiplies the multi-level attention weight map obtained by the multi-level attention weight map acquisition module with the depth feature of the image obtained by the feature extraction module to obtain a weighted feature map;
the target candidate frame acquisition module inputs the weighted feature map obtained by the weighted feature map acquisition module into the RPN module to obtain a series of target candidate frames;
and the classification and regression module is used for classifying and regressing according to the weighted feature images corresponding to the target candidate frames obtained by the target candidate frame obtaining module to obtain target detection frames.
In a third aspect of the present invention, there is provided an image object detection apparatus based on multi-level attention, comprising:
the image acquisition module is used for capturing the target to be detected, and acquiring an image or video containing the target to be detected in a specific scene for subsequent detection;
the detection module is used for detecting the image acquired by the image acquisition module to obtain a specific detection result and displaying or feeding back the detection result to the control module; the detection adopts the image target detection method based on the multi-level attention mechanism.
In a fourth aspect of the present invention, there is provided a computer device, including at least one processor and at least one memory, where the memory stores a computer program that, when executed by the processor, enables the processor to perform the above image target detection method based on a multi-level attention mechanism.
In a fifth aspect of the invention, there is provided a computer readable storage medium storing a computer program which, when executed by a processor within a device, causes the device to perform the above image target detection method based on a multi-level attention mechanism.
Compared with the prior art, the invention has the following advantages:
(1) The invention assigns weights to the extracted depth features through a multi-level attention mechanism, giving different weights to different areas according to their degree of interest, for example a higher weight to the region of interest and a lower weight to regions prone to false detection, which reduces to a certain extent the probability of false detections caused by background disturbance;
(2) By designing the branch structure and adding the loss of the branch to the total loss, end-to-end training is realized, making training and inference of the detection process simpler;
(3) The branch structure introduced in the invention adds only a small amount of computational complexity; compared with a two-stage detection method that performs inference twice, it avoids repeated computation of the feature map, thereby speeding up inference in the detection process;
(4) Based on the method, the invention provides a detection system and device for automatically detecting surface defects during industrial production, thereby replacing manual labor to a certain extent and saving labor cost.
Drawings
Embodiments of the present invention are further described below with reference to the accompanying drawings:
FIG. 1 is a flow chart of a multi-level attention-based image object detection method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the attention branch according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating the supervision information during training of the attention branch according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a multi-level attention-based image object detection system and apparatus according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications could be made by those skilled in the art without departing from the inventive concept. These are all within the scope of the present invention.
Fig. 1 is a flowchart of an image object detection method based on multi-level attention according to an embodiment of the present invention.
Referring to fig. 1, the image object detection method based on the multi-level attention mechanism of the present embodiment includes:
s1, constructing a feature extractor, called a backbone network, based on a deep convolutional neural network, inputting an image into the backbone network, extracting depth features of the image, and marking the depth features as M1;
s2, constructing a branch network based on a convolutional neural network, and taking the branch network as an attention branch for extracting a multi-stage attention weight graph;
s3, taking the depth characteristic M1 of the image obtained in the S1 as the input of an attention branch, outputting a multi-level attention weight graph, and recording as W;
s4, multiplying the output W of the S3 with the output M1 of the S1 to obtain a weighted feature map, and marking the weighted feature map as M2;
s5, taking the weighted characteristic diagram M2 of the S4 as the input of an RPN module to obtain a series of target candidate frames, and marking the target candidate frames as B1;
and S6, sending the weighted feature map corresponding to the target candidate frame output in the S5 into a classification and regression module, and finally obtaining a target detection frame which is marked as B2.
In a preferred embodiment, in S1, feature extraction is performed through a backbone network, where ResNet-50 may be used: each stage reduces the scale of the feature map through convolution and pooling operations while increasing the number of channels, and finally the output feature map of the fourth stage is selected as the depth features M1 of the image, i.e. the output of the backbone network. Of course, in other embodiments other networks may be employed; the backbone is not limited to ResNet-50.
In one embodiment, S1 may refer to the following operations:
s11, after preprocessing such as zooming is carried out on an input image, the input image is sequentially sent to a convolution layer and a pooling layer to obtain characteristics of a first stage, namely shallow characteristics, wherein the size of the shallow characteristics is reduced to half of the size of a preprocessed picture, and the number of channels is 64;
s12, rolling and pooling the shallow layer features to obtain features of a second stage, namely middle layer features, wherein the sizes of the middle layer features are reduced to half of those of the shallow layer features, and the number of channels is 128;
s13, rolling and pooling the middle layer features to obtain features in a third stage, namely deeper features, wherein the size of the deeper features is reduced to half of the size of the middle layer features, and the number of channels is 256;
s14, carrying out rolling and pooling operation on the deep features to obtain features in the fourth stage, namely deep features, wherein the deep features are used as output feature graphs of a backbone network, the sizes of the output feature graphs are reduced to half of the deep features, and the number of channels is 512.
In a preferred embodiment, in S2, to construct a branch network for extracting the multi-level attention weight graph, the method may include:
s21, the output characteristic diagram M1 of the S1 is convolved by 3x3, the dimension is reduced to 1, and the output characteristic diagram with the same dimension as M1 and the channel number of 1 is obtained;
s22, carrying out a convolution operation on the output characteristic diagram of S21 by 3x3 to obtain an attention weight diagram with a value between 0 and 1, wherein the attention weight diagram is taken as the output of an attention branch and is marked as W.
In this embodiment, different regions in the image are divided into different attention levels according to their degree of interest; different levels correspond to different values, and the higher the level, the greater the value. The introduced branch structure avoids repeated computation of the feature map, thereby improving processing efficiency.
In a preferred embodiment, in S3, a multi-level attention weight map is obtained through the attention branch. To guide the attention branch to generate larger weights for the region of interest, corresponding supervision information should be provided for it. For example, in one embodiment the region of interest is the contact area of the reed, since foreign matter, when present, is always attached to the reed contact area. Specifically:
S31, first, the training data set is annotated with the position, size and category of the target to be detected, i.e. the foreign matter on the reed contact area, and at the same time with the position, size and category of the region to which the target is attached, i.e. the region of interest, which in this example is the reed contact area.
S32, a zero matrix with the same scale as the output feature map M1 of S1 is generated, and the position and size of the region of interest, i.e. the reed contact region, in the training picture are transformed proportionally. For example, assume the width and height of the preprocessed picture input to S1 are W and H respectively; the width and height of the output feature map M1 of S1 are W_1 and H_1 respectively; and the position coordinates of the region of interest, i.e. the reed contact area, in the picture are ((x_11, y_11), (x_12, y_12)). The position coordinates of the transformed region are derived from the following formulas:

x_21 = x_11 · W_1 / W
y_21 = y_11 · H_1 / H
x_22 = x_12 · W_1 / W
y_22 = y_12 · H_1 / H

where x_21, y_21, x_22, y_22 represent the upper-left abscissa, upper-left ordinate, lower-right abscissa and lower-right ordinate of the transformed region of interest, respectively. This set of coordinates corresponds to the coordinates of the reed contact area after the proportional transformation.
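The transformation of S32 is a simple rescaling from image coordinates to feature-map coordinates. A direct sketch (function name is illustrative):

```python
# Proportional coordinate transformation of S32: map a box given in image
# coordinates onto the feature map. W, H are the preprocessed image size;
# W1, H1 are the size of the feature map M1.

def transform_roi(box, W, H, W1, H1):
    (x11, y11), (x12, y12) = box
    return ((x11 * W1 / W, y11 * H1 / H),
            (x12 * W1 / W, y12 * H1 / H))

# A 512x512 image mapped onto a 32x32 feature map: scale factor 1/16.
tl, br = transform_roi(((64, 128), (320, 256)), 512, 512, 32, 32)
# tl == (4.0, 8.0), br == (20.0, 16.0)
```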
S33, in the zero matrix, the positions corresponding to the transformed region of interest, i.e. the reed contact region, are assigned the value 1 according to the transformed coordinates obtained in S32; this region is the primary region of interest, because foreign matter on the reed contact area affects the switching performance of the reed and must be detected accurately. The other positions on the same horizontal band as the primary region of interest are assigned 0.5 and called the secondary region of interest: foreign matter on the non-contact area of the reed does not affect the performance of the reed switch for the moment, but may later move onto the contact area, and can therefore be detected with a lower confidence. The remaining area keeps the value 0 and is called the non-interest region, representing other areas; objects prone to false detection on the glass tube wall, such as stains and fibers, should be filtered out there. Of course, other assignment rules may be used in other embodiments; the values above merely distinguish the different regions and are illustrative rather than limiting.
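The assignment rule of S33 can be sketched as follows, assuming the transformed coordinates have already been rounded to integer feature-map indices (the patent does not specify the rounding):

```python
# Build the supervision matrix of S33: 1.0 inside the transformed region of
# interest (primary), 0.5 on the rest of the same horizontal band (secondary),
# 0.0 elsewhere (non-interest). Coordinates are integer feature-map indices.

def build_supervision(h1, w1, x21, y21, x22, y22):
    m = [[0.0] * w1 for _ in range(h1)]
    for i in range(y21, y22):
        for j in range(w1):
            m[i][j] = 1.0 if x21 <= j < x22 else 0.5
    return m

sup = build_supervision(4, 6, 2, 1, 4, 3)
# Rows 0 and 3 stay all 0.0; rows 1-2 read [0.5, 0.5, 1.0, 1.0, 0.5, 0.5].
```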
S34, the matrix assigned in S33 is taken as the supervision information of the attention weight map during training, and the loss of the attention branch is computed through a loss function and denoted Loss_a.
The supervision information obtained in this embodiment guides the attention branch to generate larger weights for the region of interest, increasing the probability that targets are detected there; conversely, regions where targets are unlikely to appear, such as the background, receive relatively smaller weights, which lowers the probability of detections there and reduces false detections caused by background interference.
In the preferred embodiment, the attention weight map W output in S2 represents the degree of importance of different regions: the primary region of interest, i.e. the reed contact region, has the greatest weight; the secondary region of interest, i.e. the reed non-contact region, has the second greatest; and the remaining non-interest regions have a weight of 0. In S4, the attention weight map W is multiplied by the feature map M1 output in S1 to obtain the weighted feature map, denoted M2.
In the preferred embodiment, in S6, the classification and regression networks each produce a loss during training, and these two losses are added to the loss of the attention branch obtained in S3 to form the total loss of the whole network, realizing end-to-end training, as shown in the following formula:

Loss = Loss_a + Loss_cls + Loss_reg

where Loss represents the total loss of the entire network, Loss_a the loss of the attention branch, Loss_cls the loss of the classification network, and Loss_reg the loss of the bounding-box regression network.
In this embodiment, the loss of the branch is added to the total loss, so that end-to-end training is realized and the training and inference of the network are simpler.
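The loss composition can be sketched as below; the mean-squared error used for Loss_a is an assumption, since the patent does not name the attention-branch loss function, and the classification and regression losses are passed in as plain numbers:

```python
# Total loss of S34/S6: attention-branch loss plus classification and
# regression losses. MSE against the supervision matrix is an assumed
# choice for Loss_a; the patent only says "a loss function".

def attention_loss(pred, sup):
    """Mean-squared error between predicted weight map and supervision matrix."""
    n = len(pred) * len(pred[0])
    return sum((p - s) ** 2
               for pr, sr in zip(pred, sup)
               for p, s in zip(pr, sr)) / n

def total_loss(pred_w, sup_w, loss_cls, loss_reg):
    return attention_loss(pred_w, sup_w) + loss_cls + loss_reg

pred = [[0.9, 0.1], [0.4, 0.0]]
sup = [[1.0, 0.0], [0.5, 0.0]]
# attention_loss = (0.01 + 0.01 + 0.01 + 0.0) / 4 = 0.0075,
# so the total is approximately 0.5075.
print(total_loss(pred, sup, loss_cls=0.3, loss_reg=0.2))
```

Because the three terms are summed into one scalar, a single backward pass trains the backbone, the attention branch, and the detection heads together, which is the end-to-end property claimed above.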
Based on the same technical concept, another embodiment of the present invention further provides an image object detection system based on a multi-level attention mechanism, including:
the feature extraction module is used for constructing a feature extractor based on the deep convolutional neural network and used as a backbone network, and inputting the image into the backbone network to extract the depth features of the image;
an attention branching module which constructs a branching network based on the convolutional neural network as an attention branch for extracting a multi-stage attention weight map;
the multi-stage attention weight map acquisition module inputs the depth features of the images obtained by the feature extraction module into attention branches constructed by the attention branching module to obtain a multi-stage attention weight map;
the weighted feature map acquisition module multiplies the multi-level attention weight map obtained by the multi-level attention weight map acquisition module with the depth feature of the image obtained by the feature extraction module to obtain a weighted feature map;
the target candidate frame acquisition module inputs the weighted feature map obtained by the weighted feature map acquisition module into the RPN module to obtain a series of target candidate frames;
and the classification and regression module is used for classifying and regressing according to the weighted feature images corresponding to the target candidate frames obtained by the target candidate frame obtaining module to obtain target detection frames.
The specific implementation technology of each module in the embodiment of the image target detection system based on multi-level attention of the present invention may refer to the corresponding steps of the method, and will not be described herein. The embodiment of the invention can meet the requirement of real-time detection, and is more suitable for application in industrial scenes.
Based on the above detection method and system, in another embodiment of the present invention, an image target detection device based on multi-level attention is provided, where the above image target detection method based on multi-level attention is used to implement a specific target detection task in an image. Specifically, the image object detection device based on the multi-level attention includes: the image acquisition module is used for capturing the target to be detected, and acquiring an image or video containing the target to be detected in a specific scene for subsequent detection; the detection module is used for detecting the image acquired by the image acquisition module to obtain a specific detection result and displaying or feeding back the detection result to the control module; the image object detection method based on the multi-level attention mechanism in any one of the above embodiments is adopted for detection.
Further, in order to make the above technical solution more clearly understood, the following description will be given by taking the case of applying to defect detection in industrial scenes as an example, but it should be understood that this example is not intended to limit the application of the present invention.
Specifically, referring to fig. 4, the image target detection method based on multi-level attention is applied to defect detection in an industrial scene. The detection target is tiny foreign matter on the reed contact area inside the reed switch. Since the glass tube wall of the reed switch often carries stains or tiny fibers whose features are similar to those of the foreign matter on the reed contact area, a traditional target detection algorithm can hardly distinguish them completely; these objects do not affect the function of the reed switch, so a large number of false detections occur during inspection. In view of this, a product defect detection apparatus based on multi-level attention is adopted in this embodiment, comprising: a mechanical transmission module, an image acquisition module, a detection module and a software/hardware communication module. The mechanical transmission module conveys, rotates and grabs the product to be inspected (the reed switch); the image acquisition module acquires images of the conveyed and rotated product while the mechanical transmission module works; the detection module processes and analyzes the acquired images, specifically using the image target detection method based on multi-level attention of the above embodiment, obtains the detection result and feeds it back to the mechanical transmission module; and the mechanical transmission module sorts and grabs the product according to the fed-back detection result.
Further, the apparatus may further comprise a communication module for communication between the mechanical transmission module and the detection module.
Specifically, in a preferred embodiment, the mechanical transmission module performs conveying, rotating and grabbing of the product to be detected, and may include:
A material conveying module: automatically conveys the reed switches in a pipeline manner so that they can be detected and classified.
A material rotating module: grabs and rotates the reed switch; it can rotate the reed switch by an arbitrary angle about its axis and continuously snapshot it during rotation for image acquisition, so that the reed switch can be inspected from all directions without blind spots.
A material sorting module: classifies the reed switches according to the detection result; when the conveying module delivers a reed switch to the end of the conveyor belt, the sorting module grabs and drops it according to the detection result, separating good products from defective ones, and further classifies the defective products by specific defect type.
A programmable logic controller: controls the operation of the whole machine, including the material conveying, rotating and sorting modules, and provides an operator interface through which the mechanical devices can be controlled and parameters can be set.
Specifically, in a preferred embodiment, the image acquisition module may employ an optical image acquisition device and may include: an optical microscope, used to image the surface of the reed switch, which magnifies industrial products at a certain magnification for observation and image acquisition; a light source system, comprising a light source and a light source controller, which provides good illumination conditions for the optical microscope; and an industrial camera, used to capture the optical image produced by the optical microscope and form a series of images or videos for subsequent detection.
Of course, in other embodiments, other optical image acquisition devices, including but not limited to a visual microscope or a monitoring camera, may be used to optically image an object to be inspected, such as an industrial product, and acquire it as a digital image.
Specifically, in a preferred embodiment, the detection module may include two parts, hardware and software. The hardware is a computer, such as a high-performance GPU computer, which runs the detection software, detects the images acquired by the image acquisition module, and feeds the detection result back to the control system, such as an industrial material sorting module or a security alarm linkage module, finally realizing defective product sorting, acousto-optic alarms for abnormal situations, and so on. In this embodiment, the detection result is fed back to the sorting module, finally realizing detection and sorting of the reed switches. The detection software mainly detects the acquired pictures and includes a graphical user interface through which the user can view in real time the pictures acquired by the image acquisition module and the result of each detection.
Specifically, in a preferred embodiment, the software and hardware communication module is a signal conversion and transmission module, including but not limited to a switching value module, used for signal conversion and communication between the control system and the high-performance GPU computer. Signal conversion here means conversion from digital quantities to analog or switching quantities, used to realize signal control of the mechanical devices. In this embodiment, a switching value module may be used for communication between mechanical devices, such as the material rotating module and the material sorting module, and the high-performance GPU computer. Specifically, when the material rotating module starts to rotate, it sends a start-detection signal to the computer through the switching value module, so that the computer begins detecting the current reed switch; after the computer finishes detecting one reed switch, it sends the detection result to the material sorting module through the switching value module, so that the reed switch can be sorted.
In another embodiment of the present invention, there is also provided an image target detection apparatus based on multi-level attention, including at least one processor and at least one memory, where the memory stores a computer program that, when executed by the processor, enables the processor to perform the image target detection method based on multi-level attention of any one of the above embodiments.
In another embodiment of the present invention, there is also provided a computer-readable storage medium storing a computer program which, when executed by a processor within a device, causes the device to perform the image target detection method based on multi-level attention described in any one of the above embodiments.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, or as a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (9)

1. An image target detection method based on a multi-level attention mechanism, comprising:
s1, constructing a feature extractor based on a deep convolutional neural network, and inputting an image into the backbone network to extract the depth features of the image;
s2, constructing a branch network based on a convolutional neural network, and taking the branch network as an attention branch for extracting a multi-stage attention weight graph;
s3, inputting the depth characteristic of the image into the attention branch to obtain a multi-level attention weight graph;
s4, multiplying the multi-level attention weight graph with the depth features of the image to obtain a weighted feature graph;
s5, inputting the weighted feature map into an RPN module to obtain a series of target candidate frames;
s6, sending the weighted feature images corresponding to the target candidate frames to a classification and regression module, and finally obtaining target detection frames;
the S3 further includes providing supervision information for the attention branches, including:
s31, collecting a large number of images containing the object to be detected, constructing a training data set, marking the training data set, and marking the position, the size and the type information corresponding to the object to be detected, and the position, the size and the type information of the region to which the object to be detected is attached, namely the region of interest;
s32, generating a zero matrix with the same depth characteristic scale as the image in the S1, and carrying out equal-proportion transformation on the position and the size of the region of interest in the training image to obtain transformation coordinates;
s33, in the zero matrix, assigning values to positions corresponding to the transformed region of interest according to the primary region of interest, the secondary region of interest and the non-region of interest respectively according to the transformation coordinates obtained in the S32, wherein the different regions correspond to different values;
s34, taking the matrix assigned in the S33 as the supervision information of the attention weight graph during training, and calculating the Loss of the attention branches through the Loss function, and marking the Loss as the Loss a
2. The multi-level attention mechanism based image object detection method of claim 1, wherein S2 comprises:
s21, performing dimension reduction on the depth features of the image in the S1 through convolution operation to obtain an output feature map with the same dimension and 1 channel number;
s22, carrying out convolution operation on the output characteristic diagram obtained in the S21 to obtain a multi-stage attention weight diagram with a value between 0 and 1, and taking the multi-stage attention weight diagram as the output of the attention branch.
3. The multi-level attention mechanism based image object detection method as claimed in claim 1, wherein in S32, assume that the width and height of the original picture input to S1 are W and H, respectively, and the width and height of the feature map output by S1 are W1 and H1, respectively; the position coordinates of the region of interest corresponding to the picture are ((x11, y11), (x12, y12)); the position coordinates of the transformed corresponding region are derived from the following formulas:
x21 = x11 · W1 / W
y21 = y11 · H1 / H
x22 = x12 · W1 / W
y22 = y12 · H1 / H
wherein x21, y21, x22, y22 represent the upper-left abscissa, upper-left ordinate, lower-right abscissa and lower-right ordinate, respectively, of the transformed region of interest.
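The equal-proportion transform above is a plain rescaling; as an illustrative sketch (the function name and example sizes are hypothetical, not from the patent):

```python
def roi_to_feature_coords(box, img_wh, feat_wh):
    """Scale a region-of-interest box from image coordinates to
    feature-map coordinates, per the formulas above.

    box     : ((x11, y11), (x12, y12)) -- top-left / bottom-right corners
    img_wh  : (W, H)   -- input image width and height
    feat_wh : (W1, H1) -- feature-map width and height
    """
    (x11, y11), (x12, y12) = box
    W, H = img_wh
    W1, H1 = feat_wh
    x21 = x11 * W1 / W
    y21 = y11 * H1 / H
    x22 = x12 * W1 / W
    y22 = y12 * H1 / H
    return (x21, y21), (x22, y22)

# Example: a 1600x1200 image mapped onto a 100x75 feature map (1/16 scale).
tl, br = roi_to_feature_coords(((320, 240), (960, 720)), (1600, 1200), (100, 75))
assert tl == (20.0, 15.0) and br == (60.0, 45.0)
```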
4. The method for detecting an image object based on a multi-level attention mechanism as set forth in claim 1, wherein in S33, a larger value in the range 0 to 1 is assigned to the position corresponding to the transformed region of interest, this region being called the primary region of interest; a smaller value in the range 0 to 1 is assigned to the other positions lying on the same horizontal line as the primary region of interest, called the secondary region of interest; the remaining area remains 0 and is called the region of no interest.
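The three-level assignment can be sketched as follows (illustrative only; the concrete values 0.9 and 0.5 for the primary and secondary regions are assumptions, since the claim only requires a "larger" and a "smaller" value in (0, 1)):

```python
import numpy as np

def build_attention_target(feat_hw, roi, primary=0.9, secondary=0.5):
    """Build the supervision matrix described above: start from a zero
    matrix at feature-map scale, assign a larger value inside the
    transformed region of interest (primary), a smaller value to the
    other positions on the same horizontal band (secondary), and leave
    the rest at 0 (region of no interest)."""
    H1, W1 = feat_hw
    x21, y21, x22, y22 = roi          # transformed integer coordinates
    target = np.zeros((H1, W1), dtype=np.float32)
    target[y21:y22, :] = secondary        # rows spanned by the region
    target[y21:y22, x21:x22] = primary    # the region of interest itself
    return target

t = build_attention_target((8, 10), (2, 3, 6, 5))
assert abs(t[4, 3] - 0.9) < 1e-6   # inside the primary region of interest
assert abs(t[4, 8] - 0.5) < 1e-6   # same rows, outside the box: secondary
assert t[0, 0] == 0.0              # region of no interest
```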
5. The method for detecting an image object based on a multi-level attention mechanism according to claim 1, wherein in S6, the classification and regression module includes:
the classification network is used for classifying the weighted feature images corresponding to the target candidate frames and outputting specific categories of the weighted feature images;
the regression network is used for finely adjusting the position of the target candidate frame;
the classification network and the regression network each produce a loss during training, and these two losses are added to the attention-branch loss Loss_a of S3 to form the total loss of the whole network, realizing end-to-end training.
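The end-to-end objective in this claim is a plain sum of the three losses; a sketch with hypothetical scalar values:

```python
# Hypothetical per-step scalar losses from the three heads.
loss_cls = 0.70   # classification network loss
loss_reg = 0.30   # regression network loss
loss_a   = 0.10   # attention-branch loss from S3 (Loss_a)

# Total loss of the whole network, back-propagated for end-to-end training.
loss_total = loss_cls + loss_reg + loss_a
assert abs(loss_total - 1.10) < 1e-12
```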
6. An image object detection system based on a multi-level attention mechanism, comprising:
the feature extraction module is used for constructing a feature extractor, serving as a backbone network, based on a deep convolutional neural network and inputting an image into the backbone network to extract the depth features of the image;
an attention branching module which constructs a branching network based on the convolutional neural network as an attention branch for extracting a multi-stage attention weight map;
the multi-stage attention weight map acquisition module inputs the depth features of the images obtained by the feature extraction module into the attention branches constructed by the attention branch module to obtain a multi-stage attention weight map;
the weighted feature map acquisition module multiplies the multi-level attention weight map obtained by the multi-level attention weight map acquisition module with the depth feature of the image obtained by the feature extraction module to obtain a weighted feature map;
the target candidate frame acquisition module inputs the weighted feature map obtained by the weighted feature map acquisition module into the RPN module to obtain a series of target candidate frames;
the classification and regression module is used for classifying and regressing according to the weighted feature images corresponding to the target candidate frames obtained by the target candidate frame obtaining module to obtain target detection frames;
the multi-level attention weight graph acquisition module further includes providing supervision information for the attention branches, including:
collecting a large number of images containing the object to be detected, constructing a training data set, marking the training data set, and marking the position, the size and the category information corresponding to the object to be detected, and the position, the size and the category information of the area where the object to be detected is attached, namely the area of interest;
generating a zero matrix with the same depth characteristic scale as the image of the characteristic extraction module, and carrying out equal-proportion transformation on the position and the size of the region of interest in the training image to obtain transformation coordinates;
in the zero matrix, according to the obtained transformation coordinates, the positions corresponding to the transformed regions of interest are respectively assigned according to the primary regions of interest, the secondary regions of interest and the non-regions of interest, and the different regions correspond to different values;
the assigned matrix is used as the supervision information of the attention weight graph during training, and the loss of the attention branch is calculated through a loss function and recorded as Loss_a.
7. An image object detection apparatus based on multi-level attention, comprising:
the image acquisition module is used for capturing the target to be detected, and acquiring an image or video containing the target to be detected in a specific scene for subsequent detection;
the detection module is used for detecting the image acquired by the image acquisition module to obtain a specific detection result and displaying or feeding back the detection result to the control module; the detection adopts the image target detection method based on the multi-level attention mechanism as claimed in any one of claims 1 to 5.
8. A computer device comprising at least one processor, and at least one memory, wherein the memory stores a computer program that, when executed by the processor, enables the processor to perform the multi-level attention mechanism based image object detection method of any one of claims 1 to 5.
9. A computer readable storage medium storing a computer program which, when executed by a processor within a device, causes the device to perform the multi-level attention mechanism based image object detection method of any one of claims 1 to 5.
CN202110798192.5A 2021-07-15 2021-07-15 Image target detection method, system and device based on multi-level attention Active CN113642572B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110798192.5A CN113642572B (en) 2021-07-15 2021-07-15 Image target detection method, system and device based on multi-level attention


Publications (2)

Publication Number Publication Date
CN113642572A CN113642572A (en) 2021-11-12
CN113642572B true CN113642572B (en) 2023-10-27

Family

ID=78417381

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110798192.5A Active CN113642572B (en) 2021-07-15 2021-07-15 Image target detection method, system and device based on multi-level attention

Country Status (1)

Country Link
CN (1) CN113642572B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109213868A (en) * 2018-11-21 2019-01-15 中国科学院自动化研究所 Entity level sensibility classification method based on convolution attention mechanism network
CN110135243A (en) * 2019-04-02 2019-08-16 上海交通大学 A kind of pedestrian detection method and system based on two-stage attention mechanism
CN111046871A (en) * 2019-12-11 2020-04-21 厦门大学 Region-of-interest extraction method and system
CN112686304A (en) * 2020-12-29 2021-04-20 山东大学 Target detection method and device based on attention mechanism and multi-scale feature fusion and storage medium
CN112767466A (en) * 2021-01-20 2021-05-07 大连理工大学 Light field depth estimation method based on multi-mode information
CN113065550A (en) * 2021-03-12 2021-07-02 国网河北省电力有限公司 Text recognition method based on self-attention mechanism

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019089192A1 (en) * 2017-11-03 2019-05-09 Siemens Aktiengesellschaft Weakly-supervised semantic segmentation with self-guidance
CN110647794B (en) * 2019-07-12 2023-01-03 五邑大学 Attention mechanism-based multi-scale SAR image recognition method and device


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Small-target detection algorithm based on multi-scale channel attention fusion network; Li Wentao et al.; Journal of Frontiers of Computer Science and Technology; pp. 2390-2400 *


Similar Documents

Publication Publication Date Title
CN110084165B (en) Intelligent identification and early warning method for abnormal events in open scene of power field based on edge calculation
CN111815564B (en) Method and device for detecting silk ingots and silk ingot sorting system
CN111179250B (en) Industrial product defect detection system based on multitask learning
CN110633738B (en) Rapid classification method for industrial part images
CN110009622B (en) Display panel appearance defect detection network and defect detection method thereof
CN106815576B (en) Target tracking method based on continuous space-time confidence map and semi-supervised extreme learning machine
Chung et al. Hand gesture recognition via image processing techniques and deep CNN
Chen et al. An intelligent sewer defect detection method based on convolutional neural network
CN116071315A (en) Product visual defect detection method and system based on machine vision
CN106447656A (en) Rendering flawed image detection method based on image recognition
CN116402769A (en) High-precision intelligent detection method for textile flaws considering size targets
CN112561885B (en) YOLOv 4-tiny-based gate valve opening detection method
CN113469938A (en) Pipe gallery video analysis method and system based on embedded front-end processing server
CN113642572B (en) Image target detection method, system and device based on multi-level attention
CN110111358B (en) Target tracking method based on multilayer time sequence filtering
CN116152722A (en) Video anomaly detection method based on combination of residual attention block and self-selection learning
CN112730437B (en) Spinneret plate surface defect detection method and device based on depth separable convolutional neural network, storage medium and equipment
CN113642473A (en) Mining coal machine state identification method based on computer vision
Chang et al. Deep Learning Approaches for Dynamic Object Understanding and Defect Detection
CN114387564A (en) Head-knocking engine-off pumping-stopping detection method based on YOLOv5
CN114235810A (en) Automatic cloth flaw detection method based on area array industrial camera
Mehta et al. An Analysis of Fabric Defect Detection Techniques for Textile Industry Quality Control
CN112967335A (en) Bubble size monitoring method and device
CN113435542A (en) Coal and gangue real-time detection method based on deep learning
Wu et al. Express parcel detection based on improved faster regions with CNN features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant