CN115620101A - Weakly supervised saliency detection method based on hybrid labels and training strategy - Google Patents

Weakly supervised saliency detection method based on hybrid labels and training strategy

Info

Publication number
CN115620101A
Authority
CN
China
Prior art keywords
net
training
features
label
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211081469.3A
Other languages
Chinese (zh)
Inventor
Cong Runmin
Qin Qi
Xiong Hang
Liu Hongyu
Bai Huihui
Zhao Yao
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University filed Critical Beijing Jiaotong University
Priority to CN202211081469.3A
Publication of CN115620101A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/30: Noise filtering
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07: Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a weakly supervised saliency detection method and training strategy based on hybrid labels. The invention provides a two-stage network that separately refines coarse labels and detects salient objects in RGB images: in the correction network, a mixer module with a guidance and aggregation mechanism is designed to aggregate and correct features at different stages; in addition, a dedicated iterative training strategy is proposed to make full use of the accurate labels. The framework and method of the invention achieve competitive performance on multiple public benchmark datasets. Under challenging conditions such as multiple targets, complex backgrounds, and low contrast, the method produces good prediction results even when only coarse labels are available.

Description

Weakly supervised saliency detection method and training strategy based on hybrid labels
Technical Field
The invention relates to the technical field of image processing, and in particular to a weakly supervised saliency detection method and training strategy based on hybrid labels.
Background
Images are the most direct information that a person can intuitively perceive, and they carry more information than artificially processed text, so people often use images for important activities such as information acquisition, expression, and transmission. When a person views an image, different regions attract attention in succession; the region noticed first best reflects what the human eye attends to most, and saliency detection is the image processing technique that captures this. It detects the objects of greatest interest to the human eye in an RGB color photo or video, helps distill the information in an image quickly, and, in an Internet era of enormous information volume, can greatly facilitate image retrieval and improve its efficiency. Moreover, information transmission on the Internet is complex, and transmitting more information under the same bandwidth often translates into a better real network experience; with saliency detection, images or videos can be processed to extract the main objects and then compressed, reducing their size and improving transmission efficiency. These applications are only a part of salient object detection; the technique is also useful in fields such as object tracking and image/video editing. Early researchers generally used mathematical and statistical methods to detect salient objects, but with the rapid development of deep learning, image processing techniques such as salient object detection have taken a qualitative leap: with the help of deep learning, detection accuracy improved dramatically, and in the field of object detection computers surpassed human accuracy for the first time. However, deep learning usually requires accurate annotations of a class of objects for the neural network to learn from, and accurate annotation is very time-consuming, requires professional software, and consumes a great deal of manpower. On this basis, weakly supervised methods were developed. Weakly supervised salient object detection aims to use simpler annotations (e.g. image-level annotations, scribble annotations, and coarse annotations). The labeling cost of such annotations is relatively low, and each can be completed within a few seconds, so large batches of data can be labeled.
However, the weak supervision adopted by existing methods usually relies on sparse labels, and sparse labels such as scribble annotations provide only partial accurate information; the few labeled pixels must be expanded, which inevitably introduces errors and a large amount of noise. In addition, existing methods usually adopt a single-stage training strategy, which makes the labels difficult to refine.
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide a weakly supervised saliency detection method based on hybrid labels. Specifically, the invention uses a large number of coarse labels and a small number of real labels as supervision, decouples the task into two subtasks, coarse-label refinement and salient object detection, and designs a corresponding correction network (R-Net) and saliency prediction network (S-Net). The R-Net contains a mixer module with a guidance and aggregation mechanism to realize two-stage feature decoding: the guidance stage introduces guiding information (such as the position and integrity of the target) from the RGB image guide branch to ensure a robust baseline, and the aggregation stage dynamically integrates features of different levels according to their corrective or complementary effects.
In order to achieve the above purposes, the technical scheme adopted by the invention is as follows:
A weakly supervised saliency detection method based on hybrid labels, characterized by comprising an R-Net and an S-Net;
the R-Net adopts an encoder-decoder architecture overall and receives information from the main-stream branch and the guide branch to form a dual-stream encoding structure; the R-Net further comprises a mixer (BGA) with a guidance and aggregation mechanism for realizing feature decoding in a guidance stage and an aggregation stage;
the S-Net performs saliency detection on the RGB picture to be predicted under the supervision of real labels;
the R-Net refers to the correction network, and the S-Net refers to the saliency prediction network; the main-stream branch refers to the main-stream refinement branch, whose input comprises the RGB picture and the coarse label; the guide branch refers to an independent RGB picture guide branch;
the guidance stage supplements the main-stream branch with information from the guide branch; the aggregation stage integrates the encoder features of the corresponding layer, the decoder features of the previous layer, and the global features from the top layer of the encoder.
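To make the division of labor concrete, the following is a minimal runnable PyTorch-style sketch of the two-network layout; the tiny convolution stacks stand in for the real encoder-decoder backbones, and all names (TinyRNet, TinySNet) are illustrative assumptions rather than the patent's reference implementation.

```python
# Minimal sketch of the R-Net / S-Net division of labor. The tiny conv
# stacks are stand-ins for the real encoder-decoder backbones.
import torch
import torch.nn as nn

class TinyRNet(nn.Module):
    """R-Net stand-in: refines a coarse label under RGB guidance."""
    def __init__(self):
        super().__init__()
        # main-stream branch sees RGB (3 ch) + coarse label (1 ch);
        # the guide branch sees the RGB picture only
        self.main = nn.Sequential(nn.Conv2d(4, 16, 3, padding=1), nn.ReLU())
        self.guide = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(32, 1, 1)

    def forward(self, rgb, coarse):
        f_m = self.main(torch.cat([rgb, coarse], dim=1))
        f_g = self.guide(rgb)
        return torch.sigmoid(self.head(torch.cat([f_m, f_g], dim=1)))

class TinySNet(nn.Module):
    """S-Net stand-in: predicts saliency from the RGB picture alone."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(16, 1, 1))

    def forward(self, rgb):
        return torch.sigmoid(self.body(rgb))

rgb = torch.rand(2, 3, 64, 64)
coarse = (torch.rand(2, 1, 64, 64) > 0.5).float()
pseudo = TinyRNet()(rgb, coarse)     # refined (pseudo) label from R-Net
saliency = TinySNet()(rgb)           # inference needs only the RGB picture
print(pseudo.shape, saliency.shape)  # torch.Size([2, 1, 64, 64]) twice
```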
On the basis of the above scheme,
in the guidance stage, the encoder features of the corresponding layers in the main-stream branch and the guide branch are first concatenated for supplementation, and channel attention is then used to highlight the necessary channel features for filtering, expressed as equation (1):

$f^i_{ca} = \mathrm{CA}([f^i_m, f^i_g]) \otimes [f^i_m, f^i_g]$ (1)

where $f^i_{ca}$ denotes the supplemented features after channel-attention processing, CA is the channel attention operation, $f^i_m$ and $f^i_g$ denote the encoder features of the i-th layer in the main-stream branch and the guide branch respectively, $[\cdot,\cdot]$ denotes the concatenation operation along the channel dimension, and $\otimes$ denotes element-wise multiplication broadcast along the channel dimension;
a spatial position mask to be emphasized is then generated from the perspective of the RGB information, and the features of the refinement branch are updated accordingly, expressed as equation (2):

$f^i_e = \mathrm{Conv}_{1\times 1}(\mathrm{SA}(f^i_g) \otimes f^i_{ca})$ (2)

where $f^i_e$ is the encoder feature finally output by the guidance stage after spatial-attention enhancement, SA is the spatial attention operation, and $\mathrm{Conv}_{1\times 1}$ denotes a convolution layer with a 1×1 kernel.
On the basis of the above scheme,
in the aggregation stage, the encoder features of the corresponding layer generated in the guidance stage, the global features from the top-layer encoder, and the decoder features of the previous layer are integrated, specifically:
the semantic features of the main-stream branch and the guide branch are combined with the encoder features generated in the guidance stage through an importance weighting strategy, expressed as equation (3):

$f^i_{fuse} = P_i \otimes f^g + (1 - P_i) \otimes f^i_e$ (3)

where $f^g$ denotes the fused semantic features from the two branches, i.e. the global features from the top-level encoder; $f^i_e$ denotes the encoder features of the corresponding layer; and $P_i$ is a learned importance weight that controls the fusion proportion of $f^g$ and $f^i_e$;
then the fused feature $f^i_{fuse}$ containing global semantic information is activated as a semantic mask for modifying the upsampled decoder features, expressed as equation (4):

$\tilde{f}^i_d = \sigma(f^i_{fuse}) \otimes \mathrm{Up}(f^{i+1}_d)$ (4)

where $\tilde{f}^i_d$ is the modified decoder feature of the i-th layer; $f^{i+1}_d$ denotes the original decoder feature of the (i+1)-th layer, i.e. the decoder feature of the previous layer; Up denotes the upsampling operation with bilinear interpolation; and $\sigma$ denotes the sigmoid activation function;
the modified decoder features are further supplemented with the filtered encoder features through a spatial attention mechanism to obtain more comprehensive saliency-related decoder features, expressed as equation (5):

$f^i_d = \tilde{f}^i_d + \mathrm{SA}(f^i_e) \otimes f^i_e$ (5)

where $f^i_d$ is the final decoder feature of the i-th layer and SA denotes the spatial attention operation.
It is another object of the present invention to provide a weakly supervised saliency detection training strategy based on hybrid labels.
In order to achieve this purpose, the invention adopts the following technical scheme:
A weakly supervised saliency detection training strategy based on hybrid labels, characterized by comprising the following steps:
step 1, randomly selecting a certain number of samples in a data set as the training subset with real labels, and generating corresponding coarse labels for all samples in the data set using the minimum barrier method;
step 2, evenly dividing all training samples, including the real labels and the coarse labels obtained in step 1, into n groups, where the real-label samples are all placed in group 1 and the remaining coarse-label samples are placed in groups 2 to n;
step 3, training R-Net and S-Net with each group of training samples obtained in step 2 using an alternating incremental iteration mechanism, until all training samples have been traversed;
and step 4, directly inputting the RGB picture into the S-Net, which outputs the salient object detection result for the input picture.
On the basis of the above scheme, the process of training R-Net and S-Net with the alternating incremental iteration mechanism in step 3 specifically comprises:
step 3-1, training the R-Net with the real-label samples in group 1, testing the samples in group 2 with the trained R-Net to obtain the corresponding pseudo labels, and feeding the real-label samples in group 1 together with the pseudo-label samples in group 2 into the S-Net for training; the retrained S-Net then tests the samples in group 3 to obtain the pseudo labels for R-Net training in the next iteration;
step 3-2, feeding the pseudo-label samples in group k into the R-Net for training, testing the samples in group k+1 with the retrained R-Net to obtain the corresponding pseudo labels, and feeding the pseudo-label samples in group k+1 into the S-Net for retraining; the retrained S-Net then tests the samples in group k+2 to obtain the pseudo labels required for R-Net training in the next iteration;
step 3-3, alternately and iteratively training R-Net and S-Net according to step 3-2 until training is complete;
training is complete when all training samples have been traversed, i.e. all training samples have been fed into R-Net or S-Net for training;
in step 3-2, the initial value of k is 3; with each iteration described in step 3-3, the value of k in step 3-2 increases by 2 over the previous iteration.
On the basis of the above scheme,
after each iteration, a confidence verification mechanism is applied to R-Net and S-Net using the mean absolute error (MAE), specifically:
the MAE scores on a validation set of the R-Net trained after the t-th iteration and the R-Net trained after the (t-1)-th iteration are calculated and compared, and the R-Net with the lower MAE score is selected to test the samples in group 2t and generate the corresponding pseudo labels;
the MAE scores on the validation set of the S-Net trained after the t-th iteration and the S-Net trained after the (t-1)-th iteration are calculated and compared, and the S-Net with the lower MAE score is selected to test the samples in group 2t+1 and generate the corresponding pseudo labels;
the initial value of t is 2, and 2t+1 ≤ n, where n refers to the number of training-sample groups described in step 2.
The weakly supervised saliency detection method and training strategy based on hybrid labels have the following beneficial effects:
1. The invention explores a new weakly supervised saliency detection task based on hybrid labels and provides a two-stage network that separately refines coarse labels and predicts salient objects in RGB images. To this end, the invention designs a mixer module with a guidance and aggregation mechanism in the correction network, aggregating and correcting features at different stages. In addition, the invention provides a dedicated iterative training strategy for the new task, making full use of the accurate labels.
2. The method of the present invention achieves competitive performance on multiple public benchmark datasets. Under conditions such as multiple targets, complex backgrounds, and low contrast, the method produces good prediction results even when only coarse labels are available.
Drawings
The invention has the following drawings:
FIG. 1 is a network architecture diagram of the weakly supervised saliency detection method based on hybrid labels according to the present invention;
FIG. 2 is a diagram of visualization results of the weakly supervised saliency detection method based on hybrid labels.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
The overall network architecture of the weakly supervised saliency detection method based on hybrid labels is shown in FIG. 1 and consists of a correction network (R-Net) and a saliency prediction network (S-Net). The two networks cooperate and are trained alternately. During training, they employ an alternating incremental iteration mechanism to solve the imbalance between real-label data and pseudo-label data, and a confidence verification mechanism ensures that the two networks can provide reliable labels for each other.
The prediction network among the two sub-networks designed by the invention is replaceable, so the invention focuses on the design of the correction network. The correction network consists of a guidance stage and an aggregation stage, adopts an encoder-decoder architecture overall, and uses a dual-stream encoder. Specifically, with the RGB picture and the coarse label as input, and considering the uncertainty and noise of the coarse label, an independent RGB picture guide branch is introduced into the R-Net to form a dual-stream encoding structure, which provides guiding information such as object localization and integrity to the main-stream branch, thereby ensuring a relatively robust performance baseline. In addition, to ensure the effect and efficiency of network training, the invention provides a corresponding training strategy covering quantity allocation, training method, and reliability judgment, designing an alternating incremental iteration mechanism and a confidence verification mechanism respectively.
The encoders of the two data streams of the correction network extract the corresponding multi-level features based on ResNet-50. The invention then proposes a mixer (BGA) with a guidance and aggregation mechanism to implement two-stage feature decoding. The first stage performs guidance, i.e. it supplements the main-stream branch with information from the guide branch, ensuring that the main-stream branch has a relatively robust baseline performance. The second stage performs aggregation, i.e. it integrates the encoder features of the corresponding layer, the decoder features of the previous layer, and the global features from the top layer of the encoder, taking the effect of each kind of feature into account.
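As an illustration of the dual-stream feature extraction, the sketch below builds two ResNet-50 backbones with torchvision and collects the four residual-stage features from each; widening the stem of the main-stream backbone to 4 channels (RGB plus the coarse label) is this sketch's assumption about how the extra input channel is handled, not a detail fixed by the text.

```python
# Dual-stream ResNet-50 feature extraction: one backbone per branch,
# multi-level features collected for the BGA mixer. Channel counts follow
# the standard torchvision ResNet-50.
import torch
import torch.nn as nn
from torchvision.models import resnet50

def multi_level_extractor(in_channels=3):
    net = resnet50(weights=None)
    if in_channels != 3:  # widen the stem for RGB + coarse-label input
        net.conv1 = nn.Conv2d(in_channels, 64, 7, stride=2, padding=3, bias=False)
    stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
    return stem, [net.layer1, net.layer2, net.layer3, net.layer4]

def forward_levels(stem, layers, x):
    feats, x = [], stem(x)
    for layer in layers:              # 256, 512, 1024, 2048 channels
        x = layer(x)
        feats.append(x)
    return feats

main_stem, main_layers = multi_level_extractor(in_channels=4)    # RGB + label
guide_stem, guide_layers = multi_level_extractor(in_channels=3)  # RGB only
f_m = forward_levels(main_stem, main_layers, torch.rand(1, 4, 224, 224))
f_g = forward_levels(guide_stem, guide_layers, torch.rand(1, 3, 224, 224))
print([f.shape[1] for f in f_m])  # [256, 512, 1024, 2048]
```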
In the first stage of the correction network, the guidance stage, the invention expects the RGB branch to provide guiding information (such as object localization and integrity) for the main-stream branch, thereby ensuring its effective learning and a robust performance baseline.
First, to ensure that enough salient information can be transferred to the main-stream branch and to mitigate unreliable noise from the coarse-label input, the invention supplements and filters the features along the channel dimension. Specifically, the encoder features of the corresponding layers in the two branches are first concatenated for supplementation, and channel attention is then used to highlight the necessary channel features for filtering. This process can be expressed as:

$f^i_{ca} = \mathrm{CA}([f^i_m, f^i_g]) \otimes [f^i_m, f^i_g]$ (1)

where $f^i_{ca}$ denotes the supplemented features after channel-attention filtering, CA is the channel attention operation, $f^i_m$ and $f^i_g$ denote the encoder features of the i-th layer in the main-stream branch and the guide branch respectively, $[\cdot,\cdot]$ denotes the concatenation operation along the channel dimension, and $\otimes$ denotes element-wise multiplication broadcast along the channel dimension.
Second, besides the direct supplementation along the channel dimension, the RGB branch can also provide pixel-level spatial guidance information, which both reinforces important regions and suppresses irrelevant noise. Specifically, the invention uses spatial attention to generate, from the perspective of the RGB information, a spatial position mask to be emphasized, and updates the features of the refinement branch accordingly:

$f^i_e = \mathrm{Conv}_{1\times 1}(\mathrm{SA}(f^i_g) \otimes f^i_{ca})$ (2)

where $f^i_e$ is the i-th layer encoder feature finally output by the guidance stage after spatial-attention enhancement, SA is the spatial attention operation, $\otimes$ is the element-wise multiplication operation, and $\mathrm{Conv}_{1\times 1}$ denotes a convolution layer with a 1×1 kernel.
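A minimal sketch of the guidance stage in equations (1) and (2) might look as follows; the text does not fix the internal designs of CA and SA, so the SE-style channel attention and the single-convolution spatial attention used here are assumptions of this sketch.

```python
# Guidance stage of the BGA mixer, eqs. (1)-(2): concatenate the two
# branches' layer-i features, filter with channel attention, then re-weight
# with a spatial mask derived from the RGB guide branch.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):               # CA in eq. (1), SE-style
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(channels, channels // r), nn.ReLU(),
                                nn.Linear(channels // r, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))          # global average pool -> weights
        return w[:, :, None, None]               # broadcast over H and W

class GuidanceStage(nn.Module):
    def __init__(self, c):                       # c = channels of layer i
        super().__init__()
        self.ca = ChannelAttention(2 * c)
        self.sa = nn.Sequential(nn.Conv2d(c, 1, 7, padding=3), nn.Sigmoid())
        self.proj = nn.Conv2d(2 * c, c, 1)       # Conv_1x1 in eq. (2)

    def forward(self, f_m, f_g):
        cat = torch.cat([f_m, f_g], dim=1)       # [f_m^i, f_g^i]
        f_ca = self.ca(cat) * cat                # eq. (1): channel filtering
        mask = self.sa(f_g)                      # spatial mask from RGB guide
        return self.proj(mask * f_ca)            # eq. (2): guidance output

f_e = GuidanceStage(64)(torch.rand(2, 64, 56, 56), torch.rand(2, 64, 56, 56))
print(f_e.shape)                                 # torch.Size([2, 64, 56, 56])
```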
The second stage mainly realizes the aggregation of multi-level features, integrating the encoder features of the corresponding layer generated in the first stage, the global features from the top encoder, and the decoder features of the previous layer. To achieve aggregation more effectively, the roles of the various features need to be analyzed. In general, both the encoder features and the global features should assist the feature decoding stage in obtaining better decoder features. This assistance falls into two aspects: first, refining the decoder features under the guidance of global information; second, supplementing the decoder features under the guidance of the encoder features.
First, the global features from the top encoder layer are critical for distinguishing salient objects, but as decoding progresses the semantic constraint gradually fades. Therefore, to enforce semantic information throughout the decoding process, the invention generates a corresponding semantic guidance mask to refine the decoder features of each layer. Specifically, the invention first combines the semantic features of the two branches with the encoder features generated in the first stage through an importance weighting strategy:

$f^i_{fuse} = P_i \otimes f^g + (1 - P_i) \otimes f^i_e$ (3)

where $f^g$ denotes the fused semantic features from the two branches, i.e. the global features from the top-level encoder; $f^i_e$ denotes the encoder features of the corresponding layer; and $P_i$ is a learned importance weight that controls the fusion proportion of $f^g$ and $f^i_e$. Then the fused feature $f^i_{fuse}$ containing global semantic information is activated as a semantic mask for modifying the upsampled decoder features:

$\tilde{f}^i_d = \sigma(f^i_{fuse}) \otimes \mathrm{Up}(f^{i+1}_d)$ (4)

where $\tilde{f}^i_d$ is the modified decoder feature of the i-th layer; $f^{i+1}_d$ denotes the original decoder feature of the (i+1)-th layer, i.e. the decoder feature of the previous layer; Up denotes the upsampling operation with bilinear interpolation; and $\sigma$ denotes the sigmoid activation function.
Second, since the encoder features contain much valuable information, they can supplement the learning of the decoder features; for example, shallower features carry rich spatial information, so details can be recovered better. Thus, the invention further supplements the modified decoder features with the filtered encoder features through a spatial attention mechanism to obtain more comprehensive saliency-related decoder features. This process can be expressed as:

$f^i_d = \tilde{f}^i_d + \mathrm{SA}(f^i_e) \otimes f^i_e$ (5)

where $f^i_d$ is the final decoder feature of the i-th layer and SA denotes the spatial attention operation.
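The aggregation stage of equations (3) to (5) can be sketched in the same style; the scalar parameterization of the importance weight P_i, the 1×1 projection used to align the global feature, and the residual reading of "supplement" in equation (5) are assumptions of this sketch.

```python
# Aggregation stage of the BGA mixer, eqs. (3)-(5): importance-weighted
# fusion of the global semantic feature with the layer's guidance-stage
# feature, semantic masking of the upsampled decoder feature, then a
# spatially attended encoder supplement.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AggregationStage(nn.Module):
    def __init__(self, c, c_top=2048):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(0.5))     # importance weight P_i
        self.align_g = nn.Conv2d(c_top, c, 1)        # align global feature f^g
        self.sa = nn.Sequential(nn.Conv2d(c, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, f_e, f_g_top, f_d_prev):
        size = f_e.shape[2:]
        g = self.align_g(F.interpolate(f_g_top, size=size, mode='bilinear',
                                       align_corners=False))
        f_fuse = self.p * g + (1 - self.p) * f_e     # eq. (3)
        up = F.interpolate(f_d_prev, size=size, mode='bilinear',
                           align_corners=False)      # Up(f_d^{i+1})
        f_d_mod = torch.sigmoid(f_fuse) * up         # eq. (4): semantic mask
        return f_d_mod + self.sa(f_e) * f_e          # eq. (5): supplement

f_e = torch.rand(2, 64, 56, 56)       # guidance-stage feature of layer i
f_g_top = torch.rand(2, 2048, 7, 7)   # global feature from the top encoder
f_d_prev = torch.rand(2, 64, 28, 28)  # decoder feature of layer i+1
print(AggregationStage(64)(f_e, f_g_top, f_d_prev).shape)  # [2, 64, 56, 56]
```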
The training set used by the invention (e.g. the DUTS-TR dataset) should contain pixel-level real labels and coarse labels, where the coarse labels serve only as input to the R-Net and not as supervision. Meanwhile, during network training, the pseudo labels generated by the networks serve as supervision information. In practice, the invention randomly selects 1000 samples from the DUTS-TR dataset of 10000 samples as the training subset with real labels, and generates the corresponding coarse labels for all samples in the DUTS-TR dataset (including the aforementioned 1000 samples) using the minimum barrier method (J. Zhang, S. Sclaroff, Z. Lin, X. Shen, B. Price, and R. Mech, "Minimum barrier salient object detection at 80 fps," in Proc. ICCV, 2015, pp. 1404-1412). For this hybrid data, to ensure the effect and efficiency of network training, the invention provides two key training mechanisms: an alternating incremental iteration mechanism and a confidence verification mechanism.
Alternating incremental iteration mechanism. Specifically, the S-Net of the current iteration is trained with the pseudo labels generated by the R-Net of the current iteration, and the pseudo labels generated by the trained S-Net are in turn used for R-Net training in the next iteration. Furthermore, in the weakly supervised salient object detection framework with hybrid labels, another important issue is the sample imbalance caused by the difference in the numbers of real-label samples and coarse-label samples. If the unbalanced training samples were used directly for network training, the networks might collapse. Therefore, the two networks of the invention are trained in an alternating manner, and training samples are gradually added to the two networks in an incremental manner until all training samples have been traversed; this is referred to as the alternating incremental iteration mechanism.
The process of alternating incremental iteration specifically comprises:
first, as described above, 1000 samples in the training set (i.e. the data set, wherein the total number of samples is 10000) are randomly selected as the training subset of the true labels, and the corresponding coarse labels of all the samples are generated by using the minimum grid method.
All training samples, including the real labels and the coarse labels, are divided into equal groups. For example, the 10000 samples can be evenly divided into 10 groups, named group 1 to group 10, each containing 1000 samples, with the 1000 samples of group 1 being the real-label samples.
The R-Net is trained with the real-label samples in group 1; the trained R-Net tests the samples in group 2 to obtain the corresponding pseudo labels; and the real-label samples in group 1 together with the pseudo-label samples in group 2 are fed into the S-Net for training. The retrained S-Net then tests the samples in group 3 to obtain the pseudo labels for R-Net training in the next iteration.
In turn, the pseudo-label samples in group k are fed into the R-Net for training; the retrained R-Net tests the samples in group k+1 to obtain the corresponding pseudo labels; and the pseudo-label samples in group k+1 are fed into the S-Net for retraining. The retrained S-Net then tests the samples in group k+2 to obtain the pseudo labels required for R-Net training in the next iteration.
The initial value of k is 3; with each iteration, the value of k increases by 2 over the previous iteration, specifically:
When k takes its initial value 3, corresponding to the second iteration: the pseudo-label samples in group 3 are fed into the R-Net for training; the trained R-Net tests the samples in group 4 to obtain the corresponding pseudo labels; the pseudo-label samples in group 4 are fed into the S-Net for retraining; and the retrained S-Net tests the samples in group 5 to obtain the pseudo labels for R-Net training in the next iteration.
In the third iteration, k increases by 2 over the second iteration, i.e. k = 5, specifically: the pseudo-label samples in group 5 are fed into the R-Net for training; the trained R-Net tests the samples in group 6 to obtain the corresponding pseudo labels; the pseudo-label samples in group 6 are fed into the S-Net for retraining; and the retrained S-Net tests the samples in group 7 to obtain the pseudo labels for R-Net training in the next iteration.
The fourth and subsequent iterations proceed analogously until training is complete; in this example 5 iterations are required.
Training is complete when all training samples have been added to the networks for training, i.e. all samples in groups 1 to 10 have been used to train R-Net or S-Net.
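The group arithmetic of this schedule can be summarized in a short script, shown below; training and testing are abbreviated to print statements, since the grouping arithmetic (in iteration t, R-Net trains on group 2t-1 and tests group 2t, and S-Net tests group 2t+1) is the point of the sketch.

```python
# Alternating incremental iteration schedule for the 10-group example:
# iteration 1 seeds both networks from the true-label group, and every later
# iteration consumes the next two groups of coarse-label samples.
def alternating_incremental_schedule(n_groups=10):
    print("iter 1: R-Net trains on group 1 (true labels), tests group 2;")
    print("        S-Net trains on groups 1-2, tests group 3")
    t = 2
    while 2 * t <= n_groups:
        k = 2 * t - 1            # pseudo-label group fed to R-Net: 3, 5, 7, ...
        line = (f"iter {t}: R-Net trains on group {k}, tests group {k + 1}; "
                f"S-Net trains on group {k + 1}")
        if k + 2 <= n_groups:    # the final iteration has no group left to test
            line += f", tests group {k + 2}"
        print(line)
        t += 1
    return t - 1                 # number of iterations actually run

print("iterations:", alternating_incremental_schedule(10))  # -> 5
```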
In addition, since both R-Net and S-Net are trained with hybrid labels, to ensure the effectiveness and stability of training, within each training cycle the invention first trains the network on the pseudo-label samples and then fine-tunes it on the real-label samples.
It should be noted that, with the alternating incremental iteration mechanism, the number of sample groups must be chosen appropriately according to the number of samples in the training set.
Confidence verification mechanism. The purpose of training the two networks alternately is to provide better pseudo labels to each other, so the invention introduces a confidence verification mechanism from the second iteration onward, testing on a validation set of 100 pictures to ensure the validity of the provided labels. Only when the current model outperforms the previous best model on the validation set does the invention use the current model to generate the corresponding group of pseudo labels and participate in the next round of training, specifically:
the MAE scores on the validation set of the R-Net trained after the t-th iteration and the R-Net trained after the (t-1)-th iteration are compared, and the R-Net with the lower MAE score is selected to test the samples in group 2t and generate the corresponding pseudo labels;
the MAE scores on the validation set of the S-Net trained after the t-th iteration and the S-Net trained after the (t-1)-th iteration are compared, and the S-Net with the lower MAE score is selected to test the samples in group 2t+1 and generate the corresponding pseudo labels;
the initial value of t is 2, and 2t+1 ≤ n. MAE refers to the mean absolute error, which measures the error between the prediction and the ground truth; a smaller value is better.
The detection process of the invention is as follows: the RGB picture is directly input into the fully trained saliency prediction network S-Net, and the S-Net automatically detects the salient objects in the picture according to the learned knowledge and outputs them in the form of a mask.
After the RGB picture is input into the S-Net, the aggregation interaction modules of the S-Net exploit adjacent features through mutual learning, and self-interaction modules adaptively extract multi-scale information from the picture to accommodate salient targets of different physical sizes; the S-Net then outputs a saliency map. The detection process is described in detail in the prior art: L. Wang, H. Lu, Y. Wang, M. Feng, D. Wang, B. Yin, and X. Ruan, "Learning to detect salient objects with image-level supervision," in Proc. CVPR, 2017, pp. 136-145.
FIG. 2 gives visualization examples of the technique of the invention. The first column shows color images, the second column the ground-truth maps for RGB salient object detection, the third column the coarse saliency maps generated by a conventional method, and the fourth column the saliency maps predicted by the invention. The results show that the method achieves better visual results in many challenging scenes.
Matters not described in detail in this specification are within the common knowledge of those skilled in the art.

Claims (6)

1. A weakly supervised saliency detection method based on hybrid labels, characterized by comprising an R-Net and an S-Net;
the R-Net adopts an encoder-decoder architecture overall and receives information from the main-stream branch and the guide branch to form a dual-stream encoding structure; the R-Net further comprises a mixer (BGA) with a guidance and aggregation mechanism for realizing feature decoding in a guidance stage and an aggregation stage;
the S-Net performs saliency detection on the RGB picture to be predicted under the supervision of real labels;
the R-Net refers to the correction network, and the S-Net refers to the saliency prediction network; the main-stream branch refers to the main-stream refinement branch, whose input comprises the RGB picture and the coarse label; the guide branch refers to an independent RGB picture guide branch;
the guidance stage supplements the main-stream branch with information from the guide branch; the aggregation stage integrates the encoder features of the corresponding layer, the decoder features of the previous layer, and the global features from the top layer of the encoder.
2. The weakly supervised saliency detection method based on hybrid labels according to claim 1, characterized in that:
in the guidance stage, the encoder features of the corresponding layers in the main-stream branch and the guide branch are first concatenated for supplementation, and channel attention is then used to highlight the necessary channel features for filtering, expressed as equation (1):

$f^i_{ca} = \mathrm{CA}([f^i_m, f^i_g]) \otimes [f^i_m, f^i_g]$ (1)

where $f^i_{ca}$ denotes the supplemented features after channel-attention processing, CA is the channel attention operation, $f^i_m$ and $f^i_g$ denote the encoder features of the i-th layer in the main-stream branch and the guide branch respectively, $[\cdot,\cdot]$ denotes the concatenation operation along the channel dimension, and $\otimes$ denotes element-wise multiplication broadcast along the channel dimension;
a spatial position mask to be emphasized is then generated from the perspective of the RGB information, and the features of the refinement branch are updated accordingly, expressed as equation (2):

$f^i_e = \mathrm{Conv}_{1\times 1}(\mathrm{SA}(f^i_g) \otimes f^i_{ca})$ (2)

where $f^i_e$ is the encoder feature finally output by the guidance stage after spatial-attention enhancement, SA is the spatial attention operation, $\otimes$ is the element-wise multiplication operation, and $\mathrm{Conv}_{1\times 1}$ denotes a convolution layer with a 1×1 kernel.
3. The weakly supervised saliency detection method based on hybrid labels according to claim 1, characterized in that:
in the aggregation stage, the encoder features of the corresponding layer generated in the guidance stage, the global features from the top-layer encoder, and the decoder features of the previous layer are integrated, specifically:
the semantic features of the main-stream branch and the guide branch are combined with the encoder features generated in the guidance stage through an importance weighting strategy, expressed as equation (3):

$f^i_{fuse} = P_i \otimes f^g + (1 - P_i) \otimes f^i_e$ (3)

where $f^g$ denotes the fused semantic features from the two branches, i.e. the global features from the top-level encoder; $f^i_e$ denotes the encoder features of the corresponding layer; and $P_i$ is a learned importance weight that controls the fusion proportion of $f^g$ and $f^i_e$;
then the fused feature $f^i_{fuse}$ containing global semantic information is activated as a semantic mask for modifying the upsampled decoder features, expressed as equation (4):

$\tilde{f}^i_d = \sigma(f^i_{fuse}) \otimes \mathrm{Up}(f^{i+1}_d)$ (4)

where $\tilde{f}^i_d$ is the modified decoder feature of the i-th layer; $f^{i+1}_d$ denotes the original decoder feature of the (i+1)-th layer, i.e. the decoder feature of the previous layer; Up denotes the upsampling operation with bilinear interpolation; and $\sigma$ denotes the sigmoid activation function;
the modified decoder features are further supplemented with the filtered encoder features through a spatial attention mechanism to obtain more comprehensive saliency-related decoder features, expressed as equation (5):

$f^i_d = \tilde{f}^i_d + \mathrm{SA}(f^i_e) \otimes f^i_e$ (5)

where $f^i_d$ is the final decoder feature of the i-th layer and SA denotes the spatial attention operation.
4. A weakly supervised saliency detection training strategy based on hybrid labels, characterized by comprising the following steps:
step 1, randomly selecting a certain number of samples in a data set as the training subset with real labels, and generating corresponding coarse labels for all samples in the data set using the minimum barrier method;
step 2, evenly dividing all training samples, including the real labels and the coarse labels obtained in step 1, into n groups, where the real-label samples are all placed in group 1 and the remaining coarse-label samples are placed in groups 2 to n;
step 3, training R-Net and S-Net with each group of training samples obtained in step 2 using an alternating incremental iteration mechanism, until all training samples have been traversed;
and step 4, directly inputting the RGB picture into the S-Net, which outputs the saliency detection result for the input picture.
5. The weakly supervised saliency detection training strategy based on hybrid labels according to claim 4, characterized in that the process of training R-Net and S-Net with the alternating incremental iteration mechanism in step 3 specifically comprises:
step 3-1, training the R-Net with the real-label samples in group 1, testing the samples in group 2 with the trained R-Net to obtain the corresponding pseudo labels, and feeding the real-label samples in group 1 together with the pseudo-label samples in group 2 into the S-Net for training; the retrained S-Net then tests the samples in group 3 to obtain the pseudo labels for R-Net training in the next iteration;
step 3-2, feeding the pseudo-label samples in group k into the R-Net for training, testing the samples in group k+1 with the retrained R-Net to obtain the corresponding pseudo labels, and feeding the pseudo-label samples in group k+1 into the S-Net for retraining; the retrained S-Net then tests the samples in group k+2 to obtain the pseudo labels required for R-Net training in the next iteration;
step 3-3, alternately and iteratively training R-Net and S-Net according to step 3-2 until training is complete;
training is complete when all training samples have been traversed, i.e. all training samples have been fed into R-Net or S-Net for training;
in step 3-2, the initial value of k is 3; with each iteration described in step 3-3, the value of k in step 3-2 increases by 2 over the previous iteration.
6. The weakly supervised saliency detection training strategy based on hybrid labels according to claim 5, characterized in that:
after each iteration, a confidence verification mechanism is applied to R-Net and S-Net using the mean absolute error (MAE), specifically:
the MAE scores on a validation set of the R-Net trained after the t-th iteration and the R-Net trained after the (t-1)-th iteration are calculated and compared, and the R-Net with the lower MAE score is selected to test the samples in group 2t and generate the corresponding pseudo labels;
the MAE scores on the validation set of the S-Net trained after the t-th iteration and the S-Net trained after the (t-1)-th iteration are calculated and compared, and the S-Net with the lower MAE score is selected to test the samples in group 2t+1 and generate the corresponding pseudo labels;
the initial value of t is 2, and 2t+1 ≤ n, where n refers to the number of training-sample groups described in step 2.
CN202211081469.3A 2022-09-06 2022-09-06 Weakly supervised saliency detection method based on hybrid labels and training strategy Pending CN115620101A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211081469.3A CN115620101A (en) Weakly supervised saliency detection method based on hybrid labels and training strategy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211081469.3A CN115620101A (en) Weakly supervised saliency detection method based on hybrid labels and training strategy

Publications (1)

Publication Number Publication Date
CN115620101A (en) 2023-01-17

Family

ID=84858477

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211081469.3A Pending CN115620101A (en) Weakly supervised saliency detection method based on hybrid labels and training strategy

Country Status (1)

Country Link
CN (1) CN115620101A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116189058A (en) * 2023-03-03 2023-05-30 北京信息科技大学 Video saliency target detection method and system based on unsupervised deep learning
CN116189058B (en) * 2023-03-03 2023-10-03 北京信息科技大学 Video saliency target detection method and system based on unsupervised deep learning

Similar Documents

Publication Publication Date Title
Zhou et al. Edge-guided recurrent positioning network for salient object detection in optical remote sensing images
CN108256562B (en) Salient target detection method and system based on weak supervision time-space cascade neural network
Xie et al. Pyramid grafting network for one-stage high resolution saliency detection
Ji et al. Encoder-decoder with cascaded CRFs for semantic segmentation
CN110879959B (en) Method and device for generating data set, and testing method and testing device using same
Cong et al. A weakly supervised learning framework for salient object detection via hybrid labels
CN115100235A (en) Target tracking method, system and storage medium
CN111932431B (en) Visible watermark removing method based on watermark decomposition model and electronic equipment
CN112150450B (en) Image tampering detection method and device based on dual-channel U-Net model
Liao et al. A deep ordinal distortion estimation approach for distortion rectification
CN110827265A (en) Image anomaly detection method based on deep learning
CN111723852B (en) Robust training method for target detection network
CN116363489A (en) Copy-paste tampered image data detection method, device, computer and computer-readable storage medium
CN115908789A (en) Cross-modal feature fusion and asymptotic decoding saliency target detection method and device
Wang et al. Thermal images-aware guided early fusion network for cross-illumination RGB-T salient object detection
CN115410081A (en) Multi-scale aggregated cloud and cloud shadow identification method, system, equipment and storage medium
CN112529862A (en) Significance image detection method for interactive cycle characteristic remodeling
Liu et al. Multi-scale iterative refinement network for RGB-D salient object detection
CN115331024A (en) Intestinal polyp detection method based on deep supervision and gradual learning
CN115620101A (en) Weak supervision significance detection method based on mixed label and training strategy
Ye et al. Enhanced feature pyramid network for semantic segmentation
Gu et al. FBI-Net: Frequency-based image forgery localization via multitask learning With self-attention
Yuan et al. Recurrent structure attention guidance for depth super-resolution
Zhou et al. STI-Net: Spatiotemporal integration network for video saliency detection
Zhang et al. Diffusionengine: Diffusion model is scalable data engine for object detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination