Saliency-guided weakly supervised target detection method
Technical Field
The invention relates to the technical field of computer vision and image processing, in particular to a saliency-guided weakly supervised target detection method.
Background
With the development of convolutional neural networks (CNNs), many target detection algorithms have emerged. Although these CNN-based algorithms can achieve high detection accuracy, they all rely on a large number of samples with object-level labels (a bounding box drawn for each target) for training. Moreover, for different detection tasks, entirely different databases must be constructed for training. In practical applications, acquiring a large number of training samples sometimes requires a large expenditure of labor and time, and is sometimes completely infeasible. This has become a bottleneck in applying CNN-based target detection algorithms.
To address the difficulty of obtaining object-level labels, target detection algorithms based on weakly supervised learning have been developed. These algorithms are also based on CNNs, but differ in that object-level labels are no longer used during training; instead, only image-level labels (indicating merely whether an object is present in the image) are required. On the one hand, during manual labeling, image-level annotation is far easier than object-level annotation, so a training data set can be constructed more efficiently. On the other hand, thanks to search engines, samples with specific image-level labels can easily be gathered from the Internet, further reducing the workload of constructing a data set.
Current target detection algorithms can be broadly divided into traditional methods and deep-learning-based methods. The main idea of traditional target detection is to extract features of the target and distinguish it from the background, chiefly through three approaches: image-processing-based methods, visual-saliency-based methods, and machine-learning-based methods. While continually evolving, traditional methods are always limited by the capability of manually designed features. The advent of deep learning has completely changed the landscape of target detection algorithms. Deep-learning-based target detection has developed mainly in two directions: region-proposal-based methods, which offer higher detection accuracy, and regression-based methods, which offer higher detection speed.
Deep-learning-based target detection methods far outperform traditional methods in both accuracy and speed, but share a common problem of CNNs: they require a large number of object-level labeled samples for training. Obtaining such samples is very difficult, demanding extensive manual labeling and a long time, and these time and labor costs severely limit the wide application of deep learning in practical engineering.
Therefore, the problem to be solved by those skilled in the art is how to provide a new method for detecting targets through weakly supervised learning that can complete detection tasks efficiently and accurately without a large-scale object-level data set.
Disclosure of Invention
In view of this, the present invention provides a saliency-guided method for target detection under weak supervision. In the field of target detection, weakly supervised learning mainly refers to training with only image-level labeled samples rather than object-level labeled samples. Unlike object-level annotation, image-level annotation provides only the class labels of the image, i.e. which classes of objects the image contains, without information such as the positions or number of the objects; this labeling mode remarkably reduces the workload of constructing a training data set. The method uses the result of saliency detection as guide information and combines it with image-level labels to form pseudo labels that supervise the training of a deep target detection network. During training, only image-level labeled samples need to be provided to the network, eliminating a large amount of image annotation work at low time and labor cost, making the method suitable for practical engineering environments. The model needs no additional algorithm to provide region proposals; once trained, detection is completed with a single deep network, which is simple to use, fast in operation, and particularly strong in detection accuracy on single-target images.
In order to achieve the above purpose, the invention provides the following technical scheme:
A saliency-guided weakly supervised target detection method comprises the following specific steps:
Step 1: perform saliency detection on the input training image using a learning-based saliency model to obtain the visual saliency detection result of the input training image;
Step 2: segment the visual saliency detection result with an adaptive threshold, converting the grayscale saliency map into a binary saliency map, and remove isolated noise points with morphological operations;
Step 3: take the boundary of the binary saliency map generated in step 2 as the position of the target in the training image and, combining the target category information provided by the image-level label, construct a pseudo label containing both target category and position information, matching and storing it with the training image; check whether pseudo labels have been generated for all training images, i.e. whether the traversal of the training-set list is complete; if so, enter the detection-network training stage and execute step 4, otherwise continue generating pseudo labels and execute step 1;
Step 4: input the training image with its pseudo label into a convolutional neural network for feature extraction to obtain multi-scale feature maps, perform dense anchor sampling on the multi-scale feature maps, and preliminarily refine the prediction boxes using the pseudo label;
Step 5: fuse the multi-scale feature maps through deconvolution, perform fully supervised training on the fused feature maps using the pseudo labels, and classify and regress the refined detection boxes; when the number of training rounds reaches the set number, execute step 6, otherwise execute step 4;
Step 6: input the image to be detected into the trained convolutional neural network to obtain the detection result.
Preferably, in the above saliency-guided weakly supervised target detection method, step 1 obtains the visual saliency detection result of the input training image, and specifically includes the following steps:
S11: for a training image $X_m$, where $m$ denotes the $m$-th image, each pixel is denoted $x_{mn}$, where $n$ denotes the $n$-th pixel in the image; the training image $X_m$ is divided into two regions $C_{m1}$ and $C_{m2}$, representing the salient region and the background region respectively, with $X_m = C_{m1} \cup C_{m2}$;
S12: an embedding function model is constructed using a deep neural network $\varphi$ with parameters $\theta$; each pixel of the input training image is mapped to a $D$-dimensional vector:
$$\varphi_{mn} = \varphi(x_{mn}; \theta) \qquad (1)$$
The salient region and the background region are likewise mapped to $D$-dimensional vectors $\mu_{mk}$ using a deep neural network $\psi$ with parameters $\eta$:
$$\mu_{mk} = \psi(C_{mk}; \eta), \quad k = 1, 2 \qquad (2)$$
$C_{mk}$ is a region of the training image $X_m$, where $k = 1$ denotes the salient region and $k = 2$ denotes the background region;
S13: the probability that pixel $x_{mn}$ falls in region $C_{mk}$ is expressed by a Softmax function over negative embedding distances:
$$P(x_{mn} \in C_{mk}) = \frac{\exp\!\left(-d(\varphi_{mn}, \mu_{mk})\right)}{\sum_{k'=1}^{2} \exp\!\left(-d(\varphi_{mn}, \mu_{mk'})\right)} \qquad (3)$$
where $\varphi_{mn}$ and $\mu_{mk}$ are the projection results computed by equations (1) and (2) respectively, and $d(\cdot)$ denotes the Euclidean distance;
S14: define the loss function:
$$L(\theta, \eta) = -\sum_{m}\sum_{n}\left[t_{mn}\log P(x_{mn} \in C_{m1}) + (1 - t_{mn})\log P(x_{mn} \in C_{m2})\right] \qquad (4)$$
where $t_{mn}$ is an indicator variable: $t_{mn} = 1$ means the pixel belongs to the salient region, i.e. $x_{mn} \in C_{m1}$, and $t_{mn} = 0$ means the pixel belongs to the background region, i.e. $x_{mn} \in C_{m2}$; the loss function is optimized by gradient descent to obtain the saliency detection result.
Preferably, in the above saliency-guided weakly supervised target detection method, step 2 converts the grayscale saliency map into a binary saliency map by threshold segmentation and removes noise points by morphological operations, specifically:
S21: for the grayscale saliency map $I_g$, obtain an adaptive threshold $T$ by maximizing the between-class variance (Otsu's method), and obtain the binary saliency map $I_b$ by threshold segmentation:
$$I_b(n) = \begin{cases} 1, & I_g(n) \ge T \\ 0, & I_g(n) < T \end{cases}$$
S22: remove noise points with an opening operation, i.e. erosion followed by dilation.
Preferably, in the above saliency-guided weakly supervised target detection method, step 3 obtains target position information from the binary saliency map, generates pseudo labels by combining it with the image-level labels, and checks whether all training images have been labeled; the specific steps include:
S31: the morphologically processed binary saliency map $I'_b$ is divided into a salient region $S_1$ and a non-salient region $S_2$, with $I'_b = S_1 \cup S_2$; find the smallest rectangular region $R_{m_0}$ such that $S_1 \subseteq R_{m_0}$, while no other rectangular region $R_{m_k}$ satisfies $S_1 \subseteq R_{m_k} \subset R_{m_0}$, where $m_0$ is the index of the minimal region and $m_k$ is the index of any other region; take the vertex coordinates $\{x_1, y_1, x_2, y_2\}$ of $R_{m_0}$ and, together with the image-level label $C$ of the training image, form the pseudo label of the image $L_m = \{x_1, y_1, x_2, y_2, C\}$;
S32: when all training images have been pseudo-labeled, enter the next training stage; otherwise, continue with step 1 to complete pseudo-label generation.
Preferably, in the above saliency-guided weakly supervised target detection method, step 4 performs feature extraction with a convolutional neural network, proposes anchors for prediction, and refines the anchors with the pseudo labels; the specific steps include:
S41: feature extraction: VGG16 is used as the base network for feature extraction, with several additional convolutional layers appended; the input image is scaled to $H \times H$; in total, 4 feature maps are extracted from Conv4_3, Conv5_3, Conv7 and the additional convolutional layer, with resolutions $\frac{H}{8} \times \frac{H}{8}$, $\frac{H}{16} \times \frac{H}{16}$, $\frac{H}{32} \times \frac{H}{32}$ and $\frac{H}{64} \times \frac{H}{64}$ respectively;
S42: multi-scale dense sampling of anchors: anchors are sampled simultaneously on feature maps of different scales; on the 4 extracted feature maps, a total of
$$N = 3\left[\left(\tfrac{H}{8}\right)^2 + \left(\tfrac{H}{16}\right)^2 + \left(\tfrac{H}{32}\right)^2 + \left(\tfrac{H}{64}\right)^2\right]$$
anchors is obtained;
S43: binary classification and preliminary regression of the obtained anchors: the anchor refining module classifies each anchor as target or background and fine-tunes its position; the refining loss function is defined as
$$L_{ARM} = \frac{1}{N_a}\left(\sum_i L_b(p_i, l_i) + \sum_i [l_i^* \ge 1]\, L_r(x_i, g_i^*)\right)$$
where $L_b$ and $L_r$ are the binary classification loss and the regression loss respectively; $l_i^*$ is the class of the object within the anchor, with $l_i^* = 0$ indicating that the anchor belongs to the background and $l_i^* \ge 1$ indicating that it belongs to some object class; $p_i$ is the probability that the anchor is a target; $x_i$ represents the anchor coordinates; $g_i^*$ represents the position of the target in the pseudo label; $N_a$ is the number of anchors in the anchor refining module;
The binary classification uses an $L_2$ loss function:
$$L_b(p_i, l_i) = (p_i - l_i)^2$$
where $l_i = 1$ if $l_i^* \ge 1$, and $l_i = 0$ otherwise;
The regression uses a smooth $L_1$ function:
$$L_r = \sum_{k \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t_k - t_k^*), \qquad \mathrm{smooth}_{L_1}(z) = \begin{cases} 0.5 z^2, & |z| < 1 \\ |z| - 0.5, & \text{otherwise} \end{cases}$$
where $t_k$ and $t_k^*$ represent the offsets of the refined anchor and of the pseudo label relative to the anchor before refining; $i$ denotes the index, and $k$ ranges over $x$, $y$, $w$, $h$, denoting the abscissa, ordinate, width and height of the anchor and the pseudo label; they are defined as:
$$t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}, \quad t_w = \log\frac{w}{w_a}, \quad t_h = \log\frac{h}{h_a}$$
with $t_k^*$ computed analogously from the pseudo-label coordinates;
where $x_i$, $x_{ai}$ and $x_i^*$ respectively represent the positions of the refined anchor, the anchor before refining, and the pseudo label. The criterion for positive sample selection is an intersection-over-union with the pseudo label of any class of $\mathrm{IoU} > 0.5$:
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
where $TP$ is the region that is target in both the prediction and the annotation, $FP$ is the region that is target in the prediction but background in the annotation, and $FN$ is the region that is background in the prediction but target in the annotation;
The criterion for negative sample selection is to choose the highest-scoring among all negative samples, with their number capped at 3 times the number of positive samples.
Preferably, in the above saliency-guided weakly supervised target detection method, step 5 fuses the multi-scale feature maps by increasing the scale of the deep feature maps through deconvolution and fusing them with the shallow feature maps; the specific steps include:
S51: feature map fusion: for the 4 feature maps extracted from Conv4_3, Conv5_3, Conv7 and the additional convolutional layers, the smaller-scale map is enlarged by deconvolution and linearly added to the feature map of the adjacent larger scale to obtain a fused feature map; 4 fused feature maps are obtained in this way; dense anchor sampling is not performed on the fused feature maps; instead, the anchors refined in step 4 serve as the anchor sampling points in this step;
S52: multi-class classification and regression of the anchors: on the basis of the anchors that have undergone the preliminary classification and regression, finer multi-class classification and regression are performed; the detection loss function is defined as
$$L_{ODM} = \frac{1}{N_o}\left(\sum_i L_m(c_i, l_i^*) + \sum_i [l_i^* \ge 1]\, L_r(t_i, g_i^*)\right)$$
where $L_m$ and $L_r$ are the multi-class classification loss and the regression loss respectively; $l_i^*$ is the class of the object within the anchor, with $l_i^* = 0$ indicating background and $l_i^* \ge 1$ indicating some object class; $c_i$ is the probability that the object belongs to class $C$; $t_i$ represents the coordinates of the predicted position; $g_i^*$ represents the position of the target in the pseudo label; $N_o$ is the number of refined anchors used in this stage;
The classification uses an $L_2$ loss function between the predicted class probabilities and the encoding of $l_i^*$;
The regression uses a smooth $L_1$ function, the same as in step 4;
S53: a multi-task loss function is constructed jointly:
$$L = L_{ARM} + L_{ODM}$$
where $p_i$, $x_i$, $c_i$ and $t_i$ respectively represent the probability that the anchor is a target, the position of the anchor, the probability that the target belongs to class $C$, and the coordinates of the predicted position; cascade training is performed with the pseudo labels; a total number of training rounds $E$ is set, and the training of the network ends when the number of training rounds reaches $E$.
Preferably, in the above saliency-guided weakly supervised target detection method, the step of inputting the image to be detected into the trained convolutional neural network includes: input the image to be detected into the network, which scales it to the specified size; extract 4 feature maps by convolution and densely sample them to obtain anchors; screen the densely sampled anchors through the anchor refining module to remove easy negative samples and fine-tune the anchor positions; fuse the feature maps using deconvolution to obtain 4 deep-fused feature maps, and classify and regress with the refined anchors to obtain the final detection positions; finally, scale the detection result back to the size of the original image, yielding the detected target position.
According to the above technical scheme, compared with the prior art, the saliency-guided weakly supervised target detection method uses the result of saliency detection as guide information and combines it with image-level labels to form pseudo labels for supervising the training of a deep target detection network. During training, only image-level labeled samples need to be provided to the network, eliminating a large amount of image annotation work at low time and labor cost, making the method suitable for practical engineering environments. The model needs no additional algorithm to provide region proposals; once trained, detection is completed with a single deep network, which is simple to use, fast in operation, and particularly strong in detection accuracy on single-target images.
The invention has the following advantages and beneficial effects:
(1) The invention is based on a regression-style deep learning target detection network, uses no additional region proposal algorithm or network, completes the training of the target detection part in one step, and trains quickly.
(2) The invention trains the network in a weakly supervised manner; the training data set is image-level labeled, i.e. only the categories of the targets in the image are labeled, not the category and position of each individual target. The labeling work of the data set is therefore easier to complete, at low labor and time cost.
(3) The invention adopts visual saliency as a guide to provide position information for image-level labeled images and form pseudo labels; this is consistent with human visual judgment and helps improve the accuracy of target detection in images containing different types of targets in different environments.
(4) The method adopts a multi-scale feature-fusion extraction strategy: the shallow feature maps carry high-resolution information, facilitating the detection of small targets, while the deep feature maps carry high-level semantic information, facilitating the detection of large targets; fusing the deep and shallow feature maps by deconvolution lets high-resolution information and strong semantic information complement each other, improving small-target detection accuracy.
(5) The saliency-guided weakly supervised target detection method can accurately detect various targets at different scales and in different environments, with good robustness.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of the saliency-guided weakly supervised target detection algorithm of the present invention;
FIG. 2 is a flow diagram of the saliency-guided pseudo-label generation of the present invention;
FIG. 3 shows the deep feature extraction network based on multi-scale feature fusion in the present invention;
FIG. 4 shows detection results of the saliency-guided weakly supervised target detection algorithm under different backgrounds, illuminations, scales, etc.;
FIG. 5 shows detection results of the saliency-guided weakly supervised target detection algorithm at different scales and degrees of blur.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.
The embodiment of the invention discloses a saliency-guided weakly supervised target detection method, which uses the result of saliency detection as guide information and combines it with image-level labels to form pseudo labels for supervising the training of a deep target detection network. During training, only image-level labeled samples need to be provided to the network, eliminating a large amount of image annotation work at low time and labor cost, making the method suitable for practical engineering environments. The model needs no additional algorithm to provide region proposals; once trained, detection is completed with a single deep network, which is simple to use, fast in operation, and particularly strong in detection accuracy on single-target images.
Referring to FIG. 1, the method trains a target detection network by weakly supervised learning, using saliency to guide the generation of pseudo labels; the specific implementation steps are as follows:
Step 1: perform saliency detection on the input training image using a visual saliency model, such as a fully convolutional network (FCN), to obtain the visual saliency detection result of the input training image;
For a training image $X_m$, where $m$ denotes the $m$-th image, each pixel is denoted $x_{mn}$, where $n$ denotes the $n$-th pixel in the image. The training image $X_m$ is divided into two regions $C_{m1}$ and $C_{m2}$, representing the salient region and the background region respectively, with $X_m = C_{m1} \cup C_{m2}$.
An embedding function model is constructed using a deep neural network (DNN) $\varphi$ with parameters $\theta$; each pixel of the input training image is mapped to a $D$-dimensional vector:
$$\varphi_{mn} = \varphi(x_{mn}; \theta) \qquad (1)$$
The salient region and the background region are likewise mapped to $D$-dimensional vectors $\mu_{mk}$ using a DNN $\psi$ with parameters $\eta$:
$$\mu_{mk} = \psi(C_{mk}; \eta), \quad k = 1, 2 \qquad (2)$$
$C_{mk}$ is a region of the training image $X_m$, where $k = 1$ denotes the salient region and $k = 2$ denotes the background region.
The probability that pixel $x_{mn}$ falls in region $C_{mk}$ can be expressed by a Softmax function over negative embedding distances:
$$P(x_{mn} \in C_{mk}) = \frac{\exp\!\left(-d(\varphi_{mn}, \mu_{mk})\right)}{\sum_{k'=1}^{2} \exp\!\left(-d(\varphi_{mn}, \mu_{mk'})\right)} \qquad (3)$$
where $\varphi_{mn}$ and $\mu_{mk}$ are the projection results computed by equations (1) and (2) respectively, and $d(\cdot)$ denotes the Euclidean distance;
The loss function is defined as:
$$L(\theta, \eta) = -\sum_{m}\sum_{n}\left[t_{mn}\log P(x_{mn} \in C_{m1}) + (1 - t_{mn})\log P(x_{mn} \in C_{m2})\right] \qquad (4)$$
where $t_{mn}$ is an indicator variable: $t_{mn} = 1$ means the pixel belongs to the salient region, i.e. $x_{mn} \in C_{m1}$, and $t_{mn} = 0$ means the pixel belongs to the background region, i.e. $x_{mn} \in C_{m2}$. By optimizing the loss function with gradient descent, the network learns to classify whether each pixel is salient.
During detection, a rough saliency map is first computed by a prior-based method, and the model is then used for iterative optimization to obtain the saliency detection result.
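As a concrete illustration of the Softmax probability over embedding distances described above, here is a minimal NumPy sketch; the 2-D embeddings and prototype vectors are toy values, not the network's actual $D$-dimensional outputs:

```python
import numpy as np

def region_probabilities(pixel_embeddings, region_embeddings):
    """Softmax over negative Euclidean distances: the probability that each
    pixel embedding phi_mn falls in region C_mk (k=1 salient, k=2 background)."""
    # pixel_embeddings: (N, D); region_embeddings: (2, D)
    d = np.linalg.norm(pixel_embeddings[:, None, :] -
                       region_embeddings[None, :, :], axis=2)
    logits = -d
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)       # each row sums to 1

# toy embeddings: one pixel near the salient prototype, one near the background
pixels = np.array([[0.1, 0.0], [0.9, 1.0]])
prototypes = np.array([[0.0, 0.0], [1.0, 1.0]])   # [mu_m1 (salient), mu_m2 (background)]
P = region_probabilities(pixels, prototypes)
```

A pixel whose embedding lies closer to the salient prototype receives a higher salient probability, which is exactly the signal the cross-entropy loss trains on.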
Step 2: the visual saliency detection result is segmented by using a self-adaptive threshold, a grayscale saliency map is converted into a binary saliency map, isolated noise points are removed by using morphological operation, and the quality of the saliency map is improved;
For the grayscale saliency map $I_g$, obtain an adaptive threshold $T$ by maximizing the between-class variance (Otsu's method), and obtain the binary saliency map $I_b$ by threshold segmentation:
$$I_b(n) = \begin{cases} 1, & I_g(n) \ge T \\ 0, & I_g(n) < T \end{cases}$$
Many isolated noise points often exist on the significance map of the binary form, the noise points can influence the generation of pseudo labels, and the small points can be removed through open operation, namely expansion and corrosion.
And step 3: taking the boundary of the significance map in the binary form generated in the step 2 as the position of a target in the training image, combining target category information provided by image-level labels, constructing a pseudo label simultaneously containing target category and position information, and matching and storing the pseudo label with the training image; judging whether the generation of pseudo labels of all training images is finished, namely, checking whether the traversal of the images in the training set list is finished, if so, entering a network detection training stage, executing a step 4, otherwise, continuously generating the pseudo labels, and executing a step 1;
The target to be detected in an image is often salient, so the saliency detection result roughly indicates the position of the target. The morphologically processed binary saliency map $I'_b$ is divided into a salient region $S_1$ and a non-salient region $S_2$, with $I'_b = S_1 \cup S_2$; find the smallest rectangular region $R_{m_0}$ such that $S_1 \subseteq R_{m_0}$, while no other rectangular region $R_{m_k}$ satisfies $S_1 \subseteq R_{m_k} \subset R_{m_0}$, where $m_0$ is the index of the minimal region and $m_k$ is the index of any other region; take the vertex coordinates $\{x_1, y_1, x_2, y_2\}$ of $R_{m_0}$ and, together with the image-level label $C$ of the training image, form the pseudo label of the image $L_m = \{x_1, y_1, x_2, y_2, C\}$. When all training images have been pseudo-labeled, enter the next training stage; otherwise, continue with step 1 to complete pseudo-label generation.
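The pseudo-label construction reduces to taking the tightest axis-aligned bounding rectangle of the salient region and attaching the image-level class. A minimal sketch, with a toy binary map and a hypothetical class name "car":

```python
import numpy as np

def make_pseudo_label(binary_sal, image_label):
    """Smallest axis-aligned rectangle enclosing the salient region S1 of a
    binary saliency map, combined with the image-level class label C to form
    the pseudo label L_m = {x1, y1, x2, y2, C}."""
    ys, xs = np.nonzero(binary_sal)
    if len(xs) == 0:
        return None                       # no salient region found
    return {"x1": int(xs.min()), "y1": int(ys.min()),
            "x2": int(xs.max()), "y2": int(ys.max()),
            "C": image_label}

sal = np.zeros((10, 10), dtype=np.uint8)
sal[2:6, 3:8] = 1                          # salient region S1
label = make_pseudo_label(sal, "car")
```

The resulting dictionary is stored alongside the training image and plays the role of a fully supervised box annotation in the later training stages.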
Step 4: input the training image into the anchor refining module of the target detection network; extract image features with the VGG16 network and take 4 specific feature maps from the network. These 4 feature maps have different scales, i.e. different resolutions: maps after fewer convolutions are shallow, high-resolution feature maps, while maps after more convolutions are deep feature maps rich in semantic information. Densely sample each feature map to obtain anchors of different sizes and aspect ratios, then score and screen the anchors with the pseudo labels generated in steps 1 to 3 to remove easy negative samples;
Feature extraction network VGG16:
VGG16 is used as the base network for feature extraction, with several additional convolutional layers appended; the input image is scaled to $H \times H$, with $H$ generally 320; in total, 4 feature maps are extracted from Conv4_3, Conv5_3, Conv7 and the additional convolutional layer, with resolutions $\frac{H}{8} \times \frac{H}{8}$, $\frac{H}{16} \times \frac{H}{16}$, $\frac{H}{32} \times \frac{H}{32}$ and $\frac{H}{64} \times \frac{H}{64}$; when $H = 320$, the resolutions of the feature maps are 40×40, 20×20, 10×10 and 5×5 respectively;
multi-scale dense sampling of anchor points:
The shallow feature maps are used for detecting small targets and the deep feature maps for detecting large targets; to meet both detection requirements simultaneously, anchors are sampled on feature maps of different scales at the same time. On each of the 4 extracted feature maps, centered on each pixel, 3 anchors with aspect ratios 1:1, 1:2 and 2:1 are taken, giving a total of
$$N = 3\left[\left(\tfrac{H}{8}\right)^2 + \left(\tfrac{H}{16}\right)^2 + \left(\tfrac{H}{32}\right)^2 + \left(\tfrac{H}{64}\right)^2\right]$$
anchors; when $H = 320$, $N = 6375$;
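The anchor count above follows directly from the four feature-map strides; a one-function sketch:

```python
def anchor_count(H, strides=(8, 16, 32, 64), aspect_ratios=3):
    """Dense multi-scale anchor sampling: `aspect_ratios` anchors (1:1, 1:2,
    2:1) centred on every pixel of each feature map of size (H/s) x (H/s)."""
    return aspect_ratios * sum((H // s) ** 2 for s in strides)

n = anchor_count(320)    # 3 * (40**2 + 20**2 + 10**2 + 5**2)
```

For $H = 320$ this reproduces the $N = 6375$ stated in the text.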
Binary classification and preliminary regression of anchors:
The number of densely sampled anchors is large; the anchor refining module classifies each anchor as target or background and fine-tunes its position; the refining loss function is defined as
$$L_{ARM} = \frac{1}{N_a}\left(\sum_i L_b(p_i, l_i) + \sum_i [l_i^* \ge 1]\, L_r(x_i, g_i^*)\right)$$
where $L_b$ and $L_r$ are the binary classification loss and the regression loss respectively; $l_i^*$ is the class of the object within the anchor, with $l_i^* = 0$ indicating that the anchor belongs to the background and $l_i^* \ge 1$ indicating that it belongs to some object class; $p_i$ is the probability that the anchor is a target; $x_i$ represents the anchor coordinates; $g_i^*$ represents the position of the target in the pseudo label; $N_a$ is the number of anchors in the anchor refining module;
The binary classification uses an $L_2$ loss function:
$$L_b(p_i, l_i) = (p_i - l_i)^2$$
where $l_i = 1$ if $l_i^* \ge 1$, and $l_i = 0$ otherwise;
The regression uses a smooth $L_1$ function:
$$L_r = \sum_{k \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t_k - t_k^*), \qquad \mathrm{smooth}_{L_1}(z) = \begin{cases} 0.5 z^2, & |z| < 1 \\ |z| - 0.5, & \text{otherwise} \end{cases}$$
where $t_k$ and $t_k^*$ represent the offsets of the refined anchor and of the pseudo label relative to the anchor before refining; $i$ denotes the index, and $k$ ranges over $x$, $y$, $w$, $h$, denoting the abscissa, ordinate, width and height of the anchor and the pseudo label; they are defined as:
$$t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}, \quad t_w = \log\frac{w}{w_a}, \quad t_h = \log\frac{h}{h_a}$$
where $x_i$, $x_{ai}$ and $x_i^*$ respectively denote the positions of the refined anchor, the anchor before refining, and the pseudo label, with $t_k^*$ computed analogously from the pseudo-label coordinates.
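A minimal sketch of the offset encoding and smooth $L_1$ regression loss, using center-size box tuples; the concrete box values are toy inputs:

```python
import math

def encode_offsets(box, anchor):
    """Offsets (t_x, t_y, t_w, t_h) of a box (cx, cy, w, h) relative to an
    anchor (cx_a, cy_a, w_a, h_a); used for both the refined prediction and
    the pseudo label."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha,
            math.log(w / wa), math.log(h / ha))

def smooth_l1(z):
    """smooth_L1(z) = 0.5 z^2 if |z| < 1, else |z| - 0.5."""
    return 0.5 * z * z if abs(z) < 1.0 else abs(z) - 0.5

def regression_loss(pred_box, pseudo_box, anchor):
    """L_r: smooth L1 summed over the four offset differences t_k - t_k*."""
    t = encode_offsets(pred_box, anchor)
    t_star = encode_offsets(pseudo_box, anchor)
    return sum(smooth_l1(a - b) for a, b in zip(t, t_star))

anchor = (50.0, 50.0, 20.0, 20.0)
perfect = regression_loss((52.0, 50.0, 20.0, 20.0),
                          (52.0, 50.0, 20.0, 20.0), anchor)
```

When the predicted box matches the pseudo label exactly, all four offset differences vanish and the loss is zero; the quadratic region near zero keeps gradients stable for small errors, while the linear region bounds them for outliers.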
The criterion for positive sample selection is an intersection-over-union with the pseudo label of any class of $\mathrm{IoU} > 0.5$:
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
where $TP$ is the region that is target in both the prediction and the annotation, $FP$ is the region that is target in the prediction but background in the annotation, and $FN$ is the region that is background in the prediction but target in the annotation;
The criterion for negative sample selection is to choose the highest-scoring among all negative samples, with their number capped at 3 times the number of positive samples.
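A minimal sketch of this sample-selection rule for axis-aligned boxes `(x1, y1, x2, y2)`; the anchors, scores, and pseudo-label box are toy values, and `scores` stands in for the network's negative-sample (hard-negative) scores:

```python
def iou(a, b):
    """IoU = TP / (TP + FP + FN) for axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def select_samples(anchors, pseudo_box, scores):
    """Positives: IoU > 0.5 with the pseudo label.  Negatives: the highest-
    scoring remaining anchors, capped at 3x the number of positives."""
    pos = [i for i, a in enumerate(anchors) if iou(a, pseudo_box) > 0.5]
    neg = sorted((i for i in range(len(anchors)) if i not in pos),
                 key=lambda i: scores[i], reverse=True)[:3 * len(pos)]
    return pos, neg

pseudo = (0, 0, 10, 10)
anchors = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (5, 5, 30, 30)]
scores = [0.9, 0.8, 0.7, 0.5]
pos, neg = select_samples(anchors, pseudo, scores)
```

Keeping only the hardest negatives at a 3:1 ratio prevents the overwhelming number of easy background anchors from dominating the loss.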
Step 5: increase the scale of the deep feature maps by deconvolution and fuse them with the shallow feature maps to obtain 4 fused feature maps; extract the feature vectors of the refined anchors from the fused feature maps and perform fully supervised learning with the pseudo labels; stop training when the number of learning rounds reaches the set threshold and proceed to step 6, otherwise continue training and execute step 4;
Deconvolution module and fused feature maps:
For the 4 feature maps extracted from Conv4_3, Conv5_3, Conv7 and the additional convolutional layers, the smaller-scale map is enlarged by deconvolution and linearly added to the feature map of the adjacent larger scale to obtain a fused feature map; 4 fused feature maps are obtained in this way; dense anchor sampling is not performed on the fused feature maps; instead, the anchors refined in step 4 serve as the anchor sampling points in this step;
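A minimal NumPy sketch of the fusion step; the fixed 2x2 kernel of ones is a stand-in for the learned deconvolution weights, and the single-channel maps are toy values:

```python
import numpy as np

def deconv_upsample_2x(fmap):
    """Stride-2 transposed convolution with a fixed 2x2 kernel of ones:
    doubles the spatial scale of an (H, W) feature map, each input value
    spreading to a non-overlapping 2x2 output block."""
    h, w = fmap.shape
    out = np.zeros((2 * h, 2 * w))
    for i in range(2):
        for j in range(2):
            out[i::2, j::2] = fmap
    return out

def fuse(shallow, deep):
    """Linear (element-wise) addition of the shallow map and the upsampled
    deep map, producing one fused feature map."""
    return shallow + deconv_upsample_2x(deep)

shallow = np.ones((4, 4))          # high-resolution, shallow features
deep = np.full((2, 2), 2.0)        # low-resolution, semantically strong features
fused = fuse(shallow, deep)
```

The fused map keeps the shallow map's resolution while every location also receives the deep map's semantic response, which is the complementarity the text describes.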
Multi-class classification and regression of the anchors:
On the basis of the anchors that have undergone the preliminary classification and regression, finer multi-class classification and regression are performed; the detection loss function is defined as
$$L_{ODM} = \frac{1}{N_o}\left(\sum_i L_m(c_i, l_i^*) + \sum_i [l_i^* \ge 1]\, L_r(t_i, g_i^*)\right)$$
where $L_m$ and $L_r$ are the multi-class classification loss and the regression loss respectively; $l_i^*$ is the class of the object within the anchor, with $l_i^* = 0$ indicating background and $l_i^* \ge 1$ indicating some object class; $c_i$ is the probability that the object belongs to class $C$; $t_i$ represents the coordinates of the predicted position; $g_i^*$ represents the position of the target in the pseudo label; $N_o$ is the number of refined anchors used in this stage;
The classification uses an $L_2$ loss function between the predicted class probabilities and the encoding of $l_i^*$;
The regression uses a smooth $L_1$ function, the same as in step 4;
Steps 4 and 5 jointly construct a multi-task loss function:

L({p_i}, {x_i}, {c_i}, {t_i}) = L_ref({p_i}, {x_i}) + L_det({c_i}, {t_i})

where L_ref is the binary classification and regression loss of the anchor refinement module in step 4, L_det is the detection loss defined in step 5, and p_i, x_i, c_i, t_i respectively represent the probability of whether the anchor point is a target, the position of the anchor point, the probability of the target belonging to class C, and the coordinates of the predicted position. Cascade training is performed using the pseudo labels: a total number of training epochs E is set, and the training of the network ends when the number of epochs reaches E.
Step 6: after training is finished, saliency detection of the test image is no longer needed; the image is input into the detection network constructed in steps 4 to 5, and the target detection result is obtained directly.
The image to be detected is input into the network, which scales it to the specified input size of side length H = 320 (i.e. 320 × 320) required by the model. 4 feature maps are extracted by convolution and densely sampled to obtain anchor points; the densely sampled anchors are screened by the anchor refinement module to remove easy negative samples, and the anchor positions are finely adjusted. The feature maps are then fused by deconvolution to obtain 4 feature maps incorporating deep semantic information, and classification and regression are performed with the refined anchors to obtain the final detected positions. Finally, the detection results are scaled back to the size of the original image in proportion, giving the detected target positions.
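The final rescaling step can be sketched as follows; the box format (x1, y1, x2, y2) and the image sizes are illustrative assumptions.

```python
import numpy as np

NET_SIZE = 320  # the network's fixed input side length

def rescale_boxes(boxes, orig_w, orig_h):
    """Map boxes detected on the 320x320 network input back to the
    original image size. Boxes are (x1, y1, x2, y2) in network pixels."""
    boxes = np.asarray(boxes, dtype=float)
    sx = orig_w / NET_SIZE
    sy = orig_h / NET_SIZE
    boxes[:, [0, 2]] *= sx  # x coordinates
    boxes[:, [1, 3]] *= sy  # y coordinates
    return boxes

# A box covering the centre of the 320x320 input,
# mapped back to a 640x480 original image.
out = rescale_boxes([[80, 80, 240, 240]], orig_w=640, orig_h=480)
print(out)  # [[160. 120. 480. 360.]]
```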
FIG. 2 is a flow chart of pseudo-annotation generation based on saliency guidance: the pseudo annotation is obtained from the image-level annotation of the input image together with a saliency map that is produced by saliency detection and then refined by threshold segmentation and morphological processing. First, a saliency map of the input image is obtained by saliency detection, as shown in fig. 2(b). The saliency map is a grayscale image in which the more salient portions have larger gray values. In general, the target to be detected is the more salient part of the map, so the region with larger gray values must be extracted from the saliency map. Threshold segmentation converts the grayscale image into a binary image and separates out the regions with large gray values. As shown in fig. 2(c), the binary image obtained by threshold segmentation often contains noise from non-target regions, because saliency detection only judges whether an object in the image is salient, and regions close to the target in position and color are easily considered salient as well. Since this noise would affect the generation of the pseudo labels, morphological operations are used to remove it: isolated points in the binary image are removed first, and an opening operation (erosion followed by dilation) is then applied to the remaining image. This removes fine noise without affecting the size of the target region. After the morphological operations, as shown in fig. 2(d), the noise in the image has been removed, and the rough position of the target can be obtained by taking the outer edge of the region whose value is 1 in the binary image; however, the category of the target is still unknown at this point. The image-level label, i.e. the class information of the image, is input together with the image, so pseudo-label information carrying a class label can be obtained by assigning this class information to the extracted region, as shown in fig. 2(f).
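The threshold-segmentation and morphology pipeline above can be sketched in a few lines of NumPy. This is a toy illustration with a hand-made saliency map; the threshold value, the 3x3 structuring element, and the `pseudo_label` helper are assumptions for the sketch, not the invention's exact parameters.

```python
import numpy as np

def erode(b):
    """3x3 binary erosion: a pixel stays 1 only if its whole
    3x3 neighbourhood is 1 (image border treated as 0)."""
    p = np.pad(b, 1)
    out = np.ones_like(b)
    for di in range(3):
        for dj in range(3):
            out &= p[di:di + b.shape[0], dj:dj + b.shape[1]]
    return out

def dilate(b):
    """3x3 binary dilation: a pixel becomes 1 if any neighbour is 1."""
    p = np.pad(b, 1)
    out = np.zeros_like(b)
    for di in range(3):
        for dj in range(3):
            out |= p[di:di + b.shape[0], dj:dj + b.shape[1]]
    return out

def pseudo_label(saliency, label, thresh=128):
    """Threshold the saliency map, remove fine noise with an opening
    (erosion then dilation), and return the bounding box of the
    remaining foreground together with the image-level class label."""
    binary = (saliency >= thresh).astype(np.uint8)
    cleaned = dilate(erode(binary))
    ys, xs = np.nonzero(cleaned)
    if len(xs) == 0:
        return None
    return (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()), label)

# Toy saliency map: a bright 4x4 target plus a 1-pixel noise speck.
sal = np.zeros((10, 10), dtype=np.uint8)
sal[2:6, 3:7] = 200   # salient target region
sal[8, 8] = 255       # isolated noise point
print(pseudo_label(sal, label="car"))  # (3, 2, 6, 5, 'car')
```

The opening removes the isolated speck while leaving the target's bounding box unchanged, which is exactly the behaviour the flow chart relies on.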
FIG. 3 shows the deep feature extraction network based on multi-scale feature fusion in the present invention. The network is divided into two parts and performs classification and regression twice. The first part is the anchor refinement module, which, as the name implies, refines the large number of anchor points proposed by dense sampling. In total, 6375 anchor points are proposed on the feature maps of 4 scales; predicting directly among such a large number of anchors is difficult, and most of them are negative samples, i.e. there is a serious imbalance between positive and negative samples, which makes it hard for the network to learn the features of the target effectively. The anchors therefore need to be selected and refined. The anchor refinement module uses only the original feature maps that have not been fused with the deep feature maps; it does not distinguish the category of the target and performs only a binary classification of target versus background. This reduces the difficulty of the classification and regression, eliminates a certain number of negative samples, and alleviates the imbalance between positive and negative samples.
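The refinement idea can be sketched as follows; the confidence threshold and the anchor data are illustrative assumptions (in the actual network the objectness scores come from the refinement module's binary classifier and the offsets from its regression branch).

```python
import numpy as np

def refine_anchors(anchors, objectness, offsets, neg_thresh=0.99):
    """Discard easy negative anchors whose background probability
    exceeds neg_thresh, and shift the survivors by the predicted
    coordinate offsets (the refinement module's regression output)."""
    bg_prob = 1.0 - objectness
    keep = bg_prob < neg_thresh          # drop confident background
    return anchors[keep] + offsets[keep]

# 4 dense anchors (x1, y1, x2, y2); two are obvious background.
anchors = np.array([[ 0,  0, 10, 10],
                    [ 5,  5, 15, 15],
                    [20, 20, 30, 30],
                    [40, 40, 50, 50]], dtype=float)
objectness = np.array([0.001, 0.8, 0.005, 0.6])
offsets = np.zeros_like(anchors)
offsets[1] = [1, 1, 1, 1]                # small predicted shift
refined = refine_anchors(anchors, objectness, offsets)
print(len(refined))  # 2 anchors survive
```

Filtering before the multi-class stage is what rebalances positive and negative samples: the detection module in the second part only ever sees the surviving, position-adjusted anchors.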
The second part is the target detection module. Its network uses the feature maps that combine deep and shallow layers, which strengthens both feature extraction and target localization. Since a large number of background anchors have been eliminated by the anchor refinement module, the imbalance between positive and negative samples is alleviated to a certain extent and the network can extract the features of the targets more easily. The network performs multi-class classification and regression on the refined anchors to obtain the final prediction result.
Fig. 4 shows detection results of the saliency-guided weakly supervised target detection algorithm of the invention under different conditions of background, illumination and the like. Fig. 4(a) shows the detection result on an image with a simple background; as can be seen, in an image with a monotonous background the detection accuracy of the present invention is high. Fig. 4(b) shows the detection result on an image with light and shadow interference, a common scene in practice; such interference can change the color and texture of the target and degrade detection, yet the results show that the present invention still maintains high detection accuracy. Fig. 4(c) shows targets that are occluded or incompletely photographed; such targets lose many features and details, which easily causes missed detection or inaccurate localization.
FIG. 5 shows detection results of the saliency-guided weakly supervised target detection algorithm of the invention at different scales and different degrees of blur. FIG. 5(a) shows results on images artificially reduced to different scales; although small targets have fewer texture features and are harder to detect, the detection model of the invention can still detect them accurately. Fig. 5(b) shows results on images with added blur, a common situation in engineering practice; blur makes the textures and edges of the target difficult to extract and reduces detection quality.
The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar the embodiments may be referred to one another. Since the device disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is brief, and the relevant points can be found in the description of the method.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.