CN115620101A - Weakly supervised saliency detection method based on hybrid labels and training strategy - Google Patents

Weakly supervised saliency detection method based on hybrid labels and training strategy

Info

Publication number
CN115620101A
Authority
CN
China
Prior art keywords
net
training
features
label
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211081469.3A
Other languages
Chinese (zh)
Inventor
Cong Runmin
Qin Qi
Xiong Hang
Liu Hongyu
Bai Huihui
Zhao Yao
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University filed Critical Beijing Jiaotong University
Priority to CN202211081469.3A
Publication of CN115620101A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/30: Noise filtering
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07: Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a weakly supervised saliency detection method and training strategy based on hybrid labels. The invention provides a two-stage network that separately refines coarse labels and detects salient objects in RGB images: in the correction network, a mixer module with a guidance and aggregation mechanism is designed to aggregate and correct features at different stages; in addition, a dedicated iterative training strategy is proposed to make full use of the accurate labels. The framework and method of the invention achieve competitive performance on multiple public benchmark datasets. Under challenging conditions such as multiple targets, complex backgrounds, and low contrast, the method produces good prediction results even when only coarse labels are available.

Description

Weakly supervised saliency detection method and training strategy based on hybrid labels
Technical Field
The invention relates to the technical field of image processing, and in particular to a weakly supervised saliency detection method and training strategy based on hybrid labels.
Background
Images are the most direct information that a person can intuitively perceive, and they carry more information than artificially processed text, so people often use images for important activities such as information acquisition, expression, and transmission. When a person views an image, different regions attract attention in succession; the region noticed first best reflects what the human eye attends to most, and saliency detection is the image processing technique that captures this. It detects the objects of greatest interest to the human eye in an RGB color photo or video, helps distill the information in an image quickly, and, in an Internet era of enormous information volume, can greatly facilitate image retrieval and improve its efficiency. Moreover, information transmission on the Internet is complex, and transmitting more information under the same bandwidth often translates into a better real network experience; with saliency detection, images or videos can be processed to extract the main objects and then compressed, reducing their size and improving transmission efficiency. These applications are only a part of salient object detection; the technique is also useful in fields such as object tracking and image/video editing. Early researchers generally used mathematical and statistical methods to detect salient objects, but with the rapid development of deep learning, image processing techniques such as salient object detection have taken a qualitative leap: with the help of deep learning, detection accuracy improved dramatically, and in the field of object detection computers surpassed human accuracy for the first time. However, deep learning usually requires accurate annotations of a class of objects for the neural network to learn from, and accurate annotation is very time-consuming, requires professional software, and consumes a great deal of manpower. On this basis, weakly supervised methods were developed. Weakly supervised salient object detection aims to use simpler annotations (e.g. image-level annotations, scribble annotations, and coarse annotations). The labeling cost of such annotations is relatively low, and each can be completed within a few seconds, so large batches of data can be labeled.
However, the weak supervision adopted by existing methods usually relies on sparse labels, and sparse labels such as scribble annotations provide only partial accurate information; the few labeled pixels must be expanded, which inevitably introduces errors and a large amount of noise. In addition, existing methods usually adopt a single-stage training strategy, which makes the labels difficult to refine.
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide a weakly supervised saliency detection method based on hybrid labels. Specifically, the invention uses a large number of coarse labels and a small number of real labels as supervision, decouples the task into two subtasks, coarse-label refinement and salient object detection, and designs a corresponding correction network (R-Net) and saliency prediction network (S-Net). The R-Net contains a mixer module with a guidance and aggregation mechanism to realize two-stage feature decoding: the guidance stage introduces guiding information (such as the position and integrity of the target) from the RGB image guide branch to ensure a robust baseline, and the aggregation stage dynamically integrates features of different levels according to their corrective or complementary effects.
In order to achieve the above purposes, the technical scheme adopted by the invention is as follows:
A weakly supervised saliency detection method based on hybrid labels, characterized by comprising an R-Net and an S-Net;
the R-Net adopts an encoder-decoder architecture overall and receives information from the main-stream branch and the guide branch to form a dual-stream encoding structure; the R-Net further comprises a mixer (BGA) with a guidance and aggregation mechanism for realizing feature decoding in a guidance stage and an aggregation stage;
the S-Net performs saliency detection on the RGB picture to be predicted under the supervision of real labels;
the R-Net refers to the correction network, and the S-Net refers to the saliency prediction network; the main-stream branch refers to the main-stream refinement branch, whose input comprises the RGB picture and the coarse label; the guide branch refers to an independent RGB picture guide branch;
the guidance stage supplements the main-stream branch with information from the guide branch; the aggregation stage integrates the encoder features of the corresponding layer, the decoder features of the previous layer, and the global features from the top layer of the encoder.
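To make the division of labor concrete, the following is a minimal runnable PyTorch-style sketch of the two-network layout; the tiny convolution stacks stand in for the real encoder-decoder backbones, and all names (TinyRNet, TinySNet) are illustrative assumptions rather than the patent's reference implementation.

```python
# Minimal sketch of the R-Net / S-Net division of labor. The tiny conv
# stacks are stand-ins for the real encoder-decoder backbones.
import torch
import torch.nn as nn

class TinyRNet(nn.Module):
    """R-Net stand-in: refines a coarse label under RGB guidance."""
    def __init__(self):
        super().__init__()
        # main-stream branch sees RGB (3 ch) + coarse label (1 ch);
        # the guide branch sees the RGB picture only
        self.main = nn.Sequential(nn.Conv2d(4, 16, 3, padding=1), nn.ReLU())
        self.guide = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(32, 1, 1)

    def forward(self, rgb, coarse):
        f_m = self.main(torch.cat([rgb, coarse], dim=1))
        f_g = self.guide(rgb)
        return torch.sigmoid(self.head(torch.cat([f_m, f_g], dim=1)))

class TinySNet(nn.Module):
    """S-Net stand-in: predicts saliency from the RGB picture alone."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(16, 1, 1))

    def forward(self, rgb):
        return torch.sigmoid(self.body(rgb))

rgb = torch.rand(2, 3, 64, 64)
coarse = (torch.rand(2, 1, 64, 64) > 0.5).float()
pseudo = TinyRNet()(rgb, coarse)     # refined (pseudo) label from R-Net
saliency = TinySNet()(rgb)           # inference needs only the RGB picture
print(pseudo.shape, saliency.shape)  # torch.Size([2, 1, 64, 64]) twice
```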
On the basis of the above scheme,
in the guidance stage, the encoder features of the corresponding layers in the main-stream branch and the guide branch are first concatenated for supplementation, and channel attention is then used to highlight the necessary channel features for filtering, expressed as equation (1):

$f^i_{ca} = \mathrm{CA}([f^i_m, f^i_g]) \otimes [f^i_m, f^i_g]$ (1)

where $f^i_{ca}$ denotes the supplemented features after channel-attention processing, CA is the channel attention operation, $f^i_m$ and $f^i_g$ denote the encoder features of the i-th layer in the main-stream branch and the guide branch respectively, $[\cdot,\cdot]$ denotes the concatenation operation along the channel dimension, and $\otimes$ denotes element-wise multiplication broadcast along the channel dimension;
a spatial position mask to be emphasized is then generated from the perspective of the RGB information, and the features of the refinement branch are updated accordingly, expressed as equation (2):

$f^i_e = \mathrm{Conv}_{1\times 1}(\mathrm{SA}(f^i_g) \otimes f^i_{ca})$ (2)

where $f^i_e$ is the encoder feature finally output by the guidance stage after spatial-attention enhancement, SA is the spatial attention operation, and $\mathrm{Conv}_{1\times 1}$ denotes a convolution layer with a 1×1 kernel.
On the basis of the above scheme,
in the aggregation stage, the encoder features of the corresponding layer generated in the guidance stage, the global features from the top-layer encoder, and the decoder features of the previous layer are integrated, specifically:
the semantic features of the main-stream branch and the guide branch are combined with the encoder features generated in the guidance stage through an importance weighting strategy, expressed as equation (3):

$f^i_{fuse} = P_i \otimes f^g + (1 - P_i) \otimes f^i_e$ (3)

where $f^g$ denotes the fused semantic features from the two branches, i.e. the global features from the top-level encoder; $f^i_e$ denotes the encoder features of the corresponding layer; and $P_i$ is a learned importance weight that controls the fusion proportion of $f^g$ and $f^i_e$;
then the fused feature $f^i_{fuse}$ containing global semantic information is activated as a semantic mask for modifying the upsampled decoder features, expressed as equation (4):

$\tilde{f}^i_d = \sigma(f^i_{fuse}) \otimes \mathrm{Up}(f^{i+1}_d)$ (4)

where $\tilde{f}^i_d$ is the modified decoder feature of the i-th layer; $f^{i+1}_d$ denotes the original decoder feature of the (i+1)-th layer, i.e. the decoder feature of the previous layer; Up denotes the upsampling operation with bilinear interpolation; and $\sigma$ denotes the sigmoid activation function;
the modified decoder features are further supplemented with the filtered encoder features through a spatial attention mechanism to obtain more comprehensive saliency-related decoder features, expressed as equation (5):

$f^i_d = \tilde{f}^i_d + \mathrm{SA}(f^i_e) \otimes f^i_e$ (5)

where $f^i_d$ is the final decoder feature of the i-th layer and SA denotes the spatial attention operation.
It is another object of the present invention to provide a weakly supervised saliency detection training strategy based on hybrid labels.
In order to achieve this purpose, the invention adopts the following technical scheme:
A weakly supervised saliency detection training strategy based on hybrid labels, characterized by comprising the following steps:
step 1, randomly selecting a certain number of samples in a data set as the training subset with real labels, and generating corresponding coarse labels for all samples in the data set using the minimum barrier method;
step 2, evenly dividing all training samples, including the real labels and the coarse labels obtained in step 1, into n groups, where the real-label samples are all placed in group 1 and the remaining coarse-label samples are placed in groups 2 to n;
step 3, training R-Net and S-Net with each group of training samples obtained in step 2 using an alternating incremental iteration mechanism, until all training samples have been traversed;
and step 4, directly inputting the RGB picture into the S-Net, which outputs the salient object detection result for the input picture.
On the basis of the above scheme, the process of training R-Net and S-Net with the alternating incremental iteration mechanism in step 3 specifically comprises:
step 3-1, training the R-Net with the real-label samples in group 1, testing the samples in group 2 with the trained R-Net to obtain the corresponding pseudo labels, and feeding the real-label samples in group 1 together with the pseudo-label samples in group 2 into the S-Net for training; the retrained S-Net then tests the samples in group 3 to obtain the pseudo labels for R-Net training in the next iteration;
step 3-2, feeding the pseudo-label samples in group k into the R-Net for training, testing the samples in group k+1 with the retrained R-Net to obtain the corresponding pseudo labels, and feeding the pseudo-label samples in group k+1 into the S-Net for retraining; the retrained S-Net then tests the samples in group k+2 to obtain the pseudo labels required for R-Net training in the next iteration;
step 3-3, alternately and iteratively training R-Net and S-Net according to step 3-2 until training is complete;
training is complete when all training samples have been traversed, i.e. all training samples have been fed into R-Net or S-Net for training;
in step 3-2, the initial value of k is 3; with each iteration described in step 3-3, the value of k in step 3-2 increases by 2 over the previous iteration.
On the basis of the above scheme,
after each iteration, a confidence verification mechanism is applied to R-Net and S-Net using the mean absolute error (MAE), specifically:
the MAE scores on a validation set of the R-Net trained after the t-th iteration and the R-Net trained after the (t-1)-th iteration are calculated and compared, and the R-Net with the lower MAE score is selected to test the samples in group 2t and generate the corresponding pseudo labels;
the MAE scores on the validation set of the S-Net trained after the t-th iteration and the S-Net trained after the (t-1)-th iteration are calculated and compared, and the S-Net with the lower MAE score is selected to test the samples in group 2t+1 and generate the corresponding pseudo labels;
the initial value of t is 2, and 2t+1 ≤ n, where n refers to the number of training-sample groups described in step 2.
The weakly supervised saliency detection method and training strategy based on hybrid labels have the following beneficial effects:
1. The invention explores a new weakly supervised saliency detection task based on hybrid labels and provides a two-stage network that separately refines coarse labels and predicts salient objects in RGB images. To this end, the invention designs a mixer module with a guidance and aggregation mechanism in the correction network, aggregating and correcting features at different stages. In addition, the invention provides a dedicated iterative training strategy for the new task, making full use of the accurate labels.
2. The method of the present invention achieves competitive performance on multiple public benchmark datasets. Under conditions such as multiple targets, complex backgrounds, and low contrast, the method produces good prediction results even when only coarse labels are available.
Drawings
The invention has the following drawings:
FIG. 1 is a network architecture diagram of the weakly supervised saliency detection method based on hybrid labels according to the present invention;
FIG. 2 is a diagram of visualization results of the weakly supervised saliency detection method based on hybrid labels.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
The overall network architecture of the weakly supervised saliency detection method based on hybrid labels is shown in FIG. 1 and consists of a correction network (R-Net) and a saliency prediction network (S-Net). The two networks cooperate and are trained alternately. During training, they employ an alternating incremental iteration mechanism to solve the imbalance between real-label data and pseudo-label data, and a confidence verification mechanism ensures that the two networks can provide reliable labels for each other.
The prediction network among the two sub-networks designed by the invention is replaceable, so the invention focuses on the design of the correction network. The correction network consists of a guidance stage and an aggregation stage, adopts an encoder-decoder architecture overall, and uses a dual-stream encoder. Specifically, with the RGB picture and the coarse label as input, and considering the uncertainty and noise of the coarse label, an independent RGB picture guide branch is introduced into the R-Net to form a dual-stream encoding structure, which provides guiding information such as object localization and integrity to the main-stream branch, thereby ensuring a relatively robust performance baseline. In addition, to ensure the effect and efficiency of network training, the invention provides a corresponding training strategy covering quantity allocation, training method, and reliability judgment, designing an alternating incremental iteration mechanism and a confidence verification mechanism respectively.
The encoders of the two data streams of the correction network extract the corresponding multi-level features based on ResNet-50. The invention then proposes a mixer (BGA) with a guidance and aggregation mechanism to implement two-stage feature decoding. The first stage performs guidance, i.e. it supplements the main-stream branch with information from the guide branch, ensuring that the main-stream branch has a relatively robust baseline performance. The second stage performs aggregation, i.e. it integrates the encoder features of the corresponding layer, the decoder features of the previous layer, and the global features from the top layer of the encoder, taking the effect of each kind of feature into account.
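As an illustration of the dual-stream feature extraction, the sketch below builds two ResNet-50 backbones with torchvision and collects the four residual-stage features from each; widening the stem of the main-stream backbone to 4 channels (RGB plus the coarse label) is this sketch's assumption about how the extra input channel is handled, not a detail fixed by the text.

```python
# Dual-stream ResNet-50 feature extraction: one backbone per branch,
# multi-level features collected for the BGA mixer. Channel counts follow
# the standard torchvision ResNet-50.
import torch
import torch.nn as nn
from torchvision.models import resnet50

def multi_level_extractor(in_channels=3):
    net = resnet50(weights=None)
    if in_channels != 3:  # widen the stem for RGB + coarse-label input
        net.conv1 = nn.Conv2d(in_channels, 64, 7, stride=2, padding=3, bias=False)
    stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
    return stem, [net.layer1, net.layer2, net.layer3, net.layer4]

def forward_levels(stem, layers, x):
    feats, x = [], stem(x)
    for layer in layers:              # 256, 512, 1024, 2048 channels
        x = layer(x)
        feats.append(x)
    return feats

main_stem, main_layers = multi_level_extractor(in_channels=4)    # RGB + label
guide_stem, guide_layers = multi_level_extractor(in_channels=3)  # RGB only
f_m = forward_levels(main_stem, main_layers, torch.rand(1, 4, 224, 224))
f_g = forward_levels(guide_stem, guide_layers, torch.rand(1, 3, 224, 224))
print([f.shape[1] for f in f_m])  # [256, 512, 1024, 2048]
```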
In the first stage of the correction network, the guidance stage, the invention expects the RGB branch to provide guiding information (such as object localization and integrity) for the main-stream branch, thereby ensuring its effective learning and a robust performance baseline.
First, to ensure that enough salient information can be transferred to the main-stream branch and to mitigate unreliable noise from the coarse-label input, the invention supplements and filters the features along the channel dimension. Specifically, the encoder features of the corresponding layers in the two branches are first concatenated for supplementation, and channel attention is then used to highlight the necessary channel features for filtering. This process can be expressed as:

$f^i_{ca} = \mathrm{CA}([f^i_m, f^i_g]) \otimes [f^i_m, f^i_g]$ (1)

where $f^i_{ca}$ denotes the supplemented features after channel-attention filtering, CA is the channel attention operation, $f^i_m$ and $f^i_g$ denote the encoder features of the i-th layer in the main-stream branch and the guide branch respectively, $[\cdot,\cdot]$ denotes the concatenation operation along the channel dimension, and $\otimes$ denotes element-wise multiplication broadcast along the channel dimension.
Second, besides the direct supplementation along the channel dimension, the RGB branch can also provide pixel-level spatial guidance information, which both reinforces important regions and suppresses irrelevant noise. Specifically, the invention uses spatial attention to generate, from the perspective of the RGB information, a spatial position mask to be emphasized, and updates the features of the refinement branch accordingly:

$f^i_e = \mathrm{Conv}_{1\times 1}(\mathrm{SA}(f^i_g) \otimes f^i_{ca})$ (2)

where $f^i_e$ is the i-th layer encoder feature finally output by the guidance stage after spatial-attention enhancement, SA is the spatial attention operation, $\otimes$ is the element-wise multiplication operation, and $\mathrm{Conv}_{1\times 1}$ denotes a convolution layer with a 1×1 kernel.
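A minimal sketch of the guidance stage in equations (1) and (2) might look as follows; the text does not fix the internal designs of CA and SA, so the SE-style channel attention and the single-convolution spatial attention used here are assumptions of this sketch.

```python
# Guidance stage of the BGA mixer, eqs. (1)-(2): concatenate the two
# branches' layer-i features, filter with channel attention, then re-weight
# with a spatial mask derived from the RGB guide branch.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):               # CA in eq. (1), SE-style
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(channels, channels // r), nn.ReLU(),
                                nn.Linear(channels // r, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))          # global average pool -> weights
        return w[:, :, None, None]               # broadcast over H and W

class GuidanceStage(nn.Module):
    def __init__(self, c):                       # c = channels of layer i
        super().__init__()
        self.ca = ChannelAttention(2 * c)
        self.sa = nn.Sequential(nn.Conv2d(c, 1, 7, padding=3), nn.Sigmoid())
        self.proj = nn.Conv2d(2 * c, c, 1)       # Conv_1x1 in eq. (2)

    def forward(self, f_m, f_g):
        cat = torch.cat([f_m, f_g], dim=1)       # [f_m^i, f_g^i]
        f_ca = self.ca(cat) * cat                # eq. (1): channel filtering
        mask = self.sa(f_g)                      # spatial mask from RGB guide
        return self.proj(mask * f_ca)            # eq. (2): guidance output

f_e = GuidanceStage(64)(torch.rand(2, 64, 56, 56), torch.rand(2, 64, 56, 56))
print(f_e.shape)                                 # torch.Size([2, 64, 56, 56])
```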
The second stage mainly realizes the aggregation of multi-level features, integrating the encoder features of the corresponding layer generated in the first stage, the global features from the top encoder, and the decoder features of the previous layer. To achieve aggregation more effectively, the roles of the various features need to be analyzed. In general, both the encoder features and the global features should assist the feature decoding stage in obtaining better decoder features. This assistance falls into two aspects: first, refining the decoder features under the guidance of global information; second, supplementing the decoder features under the guidance of the encoder features.
First, the global features from the top encoder layer are critical for distinguishing salient objects, but as decoding progresses the semantic constraint gradually fades. Therefore, to enforce semantic information throughout the decoding process, the invention generates a corresponding semantic guidance mask to refine the decoder features of each layer. Specifically, the invention first combines the semantic features of the two branches with the encoder features generated in the first stage through an importance weighting strategy:

$f^i_{fuse} = P_i \otimes f^g + (1 - P_i) \otimes f^i_e$ (3)

where $f^g$ denotes the fused semantic features from the two branches, i.e. the global features from the top-level encoder; $f^i_e$ denotes the encoder features of the corresponding layer; and $P_i$ is a learned importance weight that controls the fusion proportion of $f^g$ and $f^i_e$. Then the fused feature $f^i_{fuse}$ containing global semantic information is activated as a semantic mask for modifying the upsampled decoder features:

$\tilde{f}^i_d = \sigma(f^i_{fuse}) \otimes \mathrm{Up}(f^{i+1}_d)$ (4)

where $\tilde{f}^i_d$ is the modified decoder feature of the i-th layer; $f^{i+1}_d$ denotes the original decoder feature of the (i+1)-th layer, i.e. the decoder feature of the previous layer; Up denotes the upsampling operation with bilinear interpolation; and $\sigma$ denotes the sigmoid activation function.
Second, since the encoder features contain much valuable information, they can supplement the learning of the decoder features; for example, shallower features carry rich spatial information, so details can be recovered better. Thus, the invention further supplements the modified decoder features with the filtered encoder features through a spatial attention mechanism to obtain more comprehensive saliency-related decoder features. This process can be expressed as:

$f^i_d = \tilde{f}^i_d + \mathrm{SA}(f^i_e) \otimes f^i_e$ (5)

where $f^i_d$ is the final decoder feature of the i-th layer and SA denotes the spatial attention operation.
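The aggregation stage of equations (3) to (5) can be sketched in the same style; the scalar parameterization of the importance weight P_i, the 1×1 projection used to align the global feature, and the residual reading of "supplement" in equation (5) are assumptions of this sketch.

```python
# Aggregation stage of the BGA mixer, eqs. (3)-(5): importance-weighted
# fusion of the global semantic feature with the layer's guidance-stage
# feature, semantic masking of the upsampled decoder feature, then a
# spatially attended encoder supplement.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AggregationStage(nn.Module):
    def __init__(self, c, c_top=2048):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(0.5))     # importance weight P_i
        self.align_g = nn.Conv2d(c_top, c, 1)        # align global feature f^g
        self.sa = nn.Sequential(nn.Conv2d(c, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, f_e, f_g_top, f_d_prev):
        size = f_e.shape[2:]
        g = self.align_g(F.interpolate(f_g_top, size=size, mode='bilinear',
                                       align_corners=False))
        f_fuse = self.p * g + (1 - self.p) * f_e     # eq. (3)
        up = F.interpolate(f_d_prev, size=size, mode='bilinear',
                           align_corners=False)      # Up(f_d^{i+1})
        f_d_mod = torch.sigmoid(f_fuse) * up         # eq. (4): semantic mask
        return f_d_mod + self.sa(f_e) * f_e          # eq. (5): supplement

f_e = torch.rand(2, 64, 56, 56)       # guidance-stage feature of layer i
f_g_top = torch.rand(2, 2048, 7, 7)   # global feature from the top encoder
f_d_prev = torch.rand(2, 64, 28, 28)  # decoder feature of layer i+1
print(AggregationStage(64)(f_e, f_g_top, f_d_prev).shape)  # [2, 64, 56, 56]
```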
The training set used by the invention (e.g. the DUTS-TR dataset) should contain pixel-level real labels and coarse labels, where the coarse labels serve only as input to the R-Net and not as supervision. Meanwhile, during network training, the pseudo labels generated by the networks serve as supervision information. In practice, the invention randomly selects 1000 samples from the DUTS-TR dataset of 10000 samples as the training subset with real labels, and generates the corresponding coarse labels for all samples in the DUTS-TR dataset (including the aforementioned 1000 samples) using the minimum barrier method (J. Zhang, S. Sclaroff, Z. Lin, X. Shen, B. Price, and R. Mech, "Minimum barrier salient object detection at 80 fps," in Proc. ICCV, 2015, pp. 1404-1412). For this hybrid data, to ensure the effect and efficiency of network training, the invention provides two key training mechanisms: an alternating incremental iteration mechanism and a confidence verification mechanism.
Alternating incremental iteration mechanism. Specifically, the S-Net of the current iteration is trained with the pseudo labels generated by the R-Net of the current iteration, and the pseudo labels generated by the trained S-Net are in turn used for R-Net training in the next iteration. Furthermore, in the weakly supervised salient object detection framework with hybrid labels, another important issue is the sample imbalance caused by the difference in the numbers of real-label samples and coarse-label samples. If the unbalanced training samples were used directly for network training, the networks might collapse. Therefore, the two networks of the invention are trained in an alternating manner, and training samples are gradually added to the two networks in an incremental manner until all training samples have been traversed; this is referred to as the alternating incremental iteration mechanism.
The process of alternating incremental iteration specifically comprises:
first, as described above, 1000 samples in the training set (i.e. the data set, wherein the total number of samples is 10000) are randomly selected as the training subset of the true labels, and the corresponding coarse labels of all the samples are generated by using the minimum grid method.
All training samples, including the real labels and the coarse labels, are divided into equal groups. For example, the 10000 samples can be evenly divided into 10 groups, named group 1 to group 10, each containing 1000 samples, with the 1000 samples of group 1 being the real-label samples.
The R-Net is trained with the real-label samples in group 1; the trained R-Net tests the samples in group 2 to obtain the corresponding pseudo labels; and the real-label samples in group 1 together with the pseudo-label samples in group 2 are fed into the S-Net for training. The retrained S-Net then tests the samples in group 3 to obtain the pseudo labels for R-Net training in the next iteration.
In turn, the pseudo-label samples in group k are fed into the R-Net for training; the retrained R-Net tests the samples in group k+1 to obtain the corresponding pseudo labels; and the pseudo-label samples in group k+1 are fed into the S-Net for retraining. The retrained S-Net then tests the samples in group k+2 to obtain the pseudo labels required for R-Net training in the next iteration.
The initial value of k is 3; with each iteration, the value of k increases by 2 over the previous iteration, specifically:
When k takes its initial value 3, corresponding to the second iteration: the pseudo-label samples in group 3 are fed into the R-Net for training; the trained R-Net tests the samples in group 4 to obtain the corresponding pseudo labels; the pseudo-label samples in group 4 are fed into the S-Net for retraining; and the retrained S-Net tests the samples in group 5 to obtain the pseudo labels for R-Net training in the next iteration.
In the third iteration, k increases by 2 over the second iteration, i.e. k = 5, specifically: the pseudo-label samples in group 5 are fed into the R-Net for training; the trained R-Net tests the samples in group 6 to obtain the corresponding pseudo labels; the pseudo-label samples in group 6 are fed into the S-Net for retraining; and the retrained S-Net tests the samples in group 7 to obtain the pseudo labels for R-Net training in the next iteration.
The fourth and subsequent iterations proceed analogously until training is complete; in this example 5 iterations are required.
Training is complete when all training samples have been added to the networks for training, i.e. all samples in groups 1 to 10 have been used to train R-Net or S-Net.
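The group arithmetic of this schedule can be summarized in a short script, shown below; training and testing are abbreviated to print statements, since the grouping arithmetic (in iteration t, R-Net trains on group 2t-1 and tests group 2t, and S-Net tests group 2t+1) is the point of the sketch.

```python
# Alternating incremental iteration schedule for the 10-group example:
# iteration 1 seeds both networks from the true-label group, and every later
# iteration consumes the next two groups of coarse-label samples.
def alternating_incremental_schedule(n_groups=10):
    print("iter 1: R-Net trains on group 1 (true labels), tests group 2;")
    print("        S-Net trains on groups 1-2, tests group 3")
    t = 2
    while 2 * t <= n_groups:
        k = 2 * t - 1            # pseudo-label group fed to R-Net: 3, 5, 7, ...
        line = (f"iter {t}: R-Net trains on group {k}, tests group {k + 1}; "
                f"S-Net trains on group {k + 1}")
        if k + 2 <= n_groups:    # the final iteration has no group left to test
            line += f", tests group {k + 2}"
        print(line)
        t += 1
    return t - 1                 # number of iterations actually run

print("iterations:", alternating_incremental_schedule(10))  # -> 5
```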
In addition, since both R-Net and S-Net are trained with hybrid labels, to ensure the effectiveness and stability of training, within each training cycle the invention first trains the network on the pseudo-label samples and then fine-tunes it on the real-label samples.
It should be noted that, with the alternating incremental iteration mechanism, the number of sample groups must be chosen appropriately according to the number of samples in the training set.
Confidence verification mechanism. The purpose of training the two networks alternately is to provide better pseudo labels to each other, so the invention introduces a confidence verification mechanism from the second iteration onward, testing on a validation set of 100 pictures to ensure the validity of the provided labels. Only when the current model outperforms the previous best model on the validation set does the invention use the current model to generate the corresponding group of pseudo labels and participate in the next round of training, specifically:
the MAE scores on the validation set of the R-Net trained after the t-th iteration and the R-Net trained after the (t-1)-th iteration are compared, and the R-Net with the lower MAE score is selected to test the samples in group 2t and generate the corresponding pseudo labels;
the MAE scores on the validation set of the S-Net trained after the t-th iteration and the S-Net trained after the (t-1)-th iteration are compared, and the S-Net with the lower MAE score is selected to test the samples in group 2t+1 and generate the corresponding pseudo labels;
the initial value of t is 2, and 2t+1 ≤ n. MAE refers to the mean absolute error, which measures the error between the prediction and the ground truth; a smaller value is better.
The detection process of the invention is as follows: the RGB picture is directly input into the fully trained saliency prediction network S-Net, and the S-Net automatically detects the salient objects in the picture according to the learned knowledge and outputs them in the form of a mask.
After the RGB picture is input into the S-Net, the aggregation interaction modules of the S-Net exploit adjacent features through mutual learning, and self-interaction modules adaptively extract multi-scale information from the picture to accommodate salient targets of different physical sizes; the S-Net then outputs a saliency map. The detection process is described in detail in the prior art: L. Wang, H. Lu, Y. Wang, M. Feng, D. Wang, B. Yin, and X. Ruan, "Learning to detect salient objects with image-level supervision," in Proc. CVPR, 2017, pp. 136-145.
FIG. 2 gives visualization examples of the technique of the invention. The first column shows color images, the second column the ground-truth maps for RGB salient object detection, the third column the coarse saliency maps generated by a conventional method, and the fourth column the saliency maps predicted by the invention. The results show that the method achieves better visual results in many challenging scenes.
Matters not described in detail in this specification are within the common knowledge of those skilled in the art.

Claims (6)

1. A weakly supervised saliency detection method based on hybrid labels, characterized by comprising an R-Net and an S-Net;
the R-Net adopts an encoder-decoder architecture overall and receives information from the main-stream branch and the guide branch to form a dual-stream encoding structure; the R-Net further comprises a mixer (BGA) with a guidance and aggregation mechanism for realizing feature decoding in a guidance stage and an aggregation stage;
the S-Net performs saliency detection on the RGB picture to be predicted under the supervision of real labels;
the R-Net refers to the correction network, and the S-Net refers to the saliency prediction network; the main-stream branch refers to the main-stream refinement branch, whose input comprises the RGB picture and the coarse label; the guide branch refers to an independent RGB picture guide branch;
the guidance stage supplements the main-stream branch with information from the guide branch; the aggregation stage integrates the encoder features of the corresponding layer, the decoder features of the previous layer, and the global features from the top layer of the encoder.
2. The weakly supervised saliency detection method based on hybrid labels according to claim 1, characterized in that:
in the guidance stage, the encoder features of the corresponding layers in the main-stream branch and the guide branch are first concatenated for supplementation, and channel attention is then used to highlight the necessary channel features for filtering, expressed as equation (1):

$f^i_{ca} = \mathrm{CA}([f^i_m, f^i_g]) \otimes [f^i_m, f^i_g]$ (1)

where $f^i_{ca}$ denotes the supplemented features after channel-attention processing, CA is the channel attention operation, $f^i_m$ and $f^i_g$ denote the encoder features of the i-th layer in the main-stream branch and the guide branch respectively, $[\cdot,\cdot]$ denotes the concatenation operation along the channel dimension, and $\otimes$ denotes element-wise multiplication broadcast along the channel dimension;
a spatial position mask to be emphasized is then generated from the perspective of the RGB information, and the features of the refinement branch are updated accordingly, expressed as equation (2):

$f^i_e = \mathrm{Conv}_{1\times 1}(\mathrm{SA}(f^i_g) \otimes f^i_{ca})$ (2)

where $f^i_e$ is the encoder feature finally output by the guidance stage after spatial-attention enhancement, SA is the spatial attention operation, $\otimes$ is the element-wise multiplication operation, and $\mathrm{Conv}_{1\times 1}$ denotes a convolution layer with a 1×1 kernel.
3. The weakly supervised saliency detection method based on hybrid labels according to claim 1, characterized in that:
in the aggregation stage, the encoder features of the corresponding layer generated in the guidance stage, the global features from the top-layer encoder, and the decoder features of the previous layer are integrated, specifically:
the semantic features of the main-stream branch and the guide branch are combined with the encoder features generated in the guidance stage through an importance weighting strategy, expressed as equation (3):

$f^i_{fuse} = P_i \otimes f^g + (1 - P_i) \otimes f^i_e$ (3)

where $f^g$ denotes the fused semantic features from the two branches, i.e. the global features from the top-level encoder; $f^i_e$ denotes the encoder features of the corresponding layer; and $P_i$ is a learned importance weight that controls the fusion proportion of $f^g$ and $f^i_e$;
then the fused feature $f^i_{fuse}$ containing global semantic information is activated as a semantic mask for modifying the upsampled decoder features, expressed as equation (4):

$\tilde{f}^i_d = \sigma(f^i_{fuse}) \otimes \mathrm{Up}(f^{i+1}_d)$ (4)

where $\tilde{f}^i_d$ is the modified decoder feature of the i-th layer; $f^{i+1}_d$ denotes the original decoder feature of the (i+1)-th layer, i.e. the decoder feature of the previous layer; Up denotes the upsampling operation with bilinear interpolation; and $\sigma$ denotes the sigmoid activation function;
the modified decoder features are further supplemented with the filtered encoder features through a spatial attention mechanism to obtain more comprehensive saliency-related decoder features, expressed as equation (5):

$f^i_d = \tilde{f}^i_d + \mathrm{SA}(f^i_e) \otimes f^i_e$ (5)

where $f^i_d$ is the final decoder feature of the i-th layer and SA denotes the spatial attention operation.
4. A weakly supervised saliency detection training strategy based on hybrid labels, characterized by comprising the following steps:
step 1, randomly selecting a certain number of samples in a data set as the training subset with real labels, and generating corresponding coarse labels for all samples in the data set using the minimum barrier method;
step 2, evenly dividing all training samples, including the real labels and the coarse labels obtained in step 1, into n groups, where the real-label samples are all placed in group 1 and the remaining coarse-label samples are placed in groups 2 to n;
step 3, training R-Net and S-Net with each group of training samples obtained in step 2 using an alternating incremental iteration mechanism, until all training samples have been traversed;
and step 4, directly inputting the RGB picture into the S-Net, which outputs the saliency detection result for the input picture.
5. The weakly supervised saliency detection training strategy based on hybrid labels according to claim 4, characterized in that the process of training R-Net and S-Net with the alternating incremental iteration mechanism in step 3 specifically comprises:
step 3-1, training the R-Net with the real-label samples in group 1, testing the samples in group 2 with the trained R-Net to obtain the corresponding pseudo labels, and feeding the real-label samples in group 1 together with the pseudo-label samples in group 2 into the S-Net for training; the retrained S-Net then tests the samples in group 3 to obtain the pseudo labels for R-Net training in the next iteration;
step 3-2, feeding the pseudo-label samples in group k into the R-Net for training, testing the samples in group k+1 with the retrained R-Net to obtain the corresponding pseudo labels, and feeding the pseudo-label samples in group k+1 into the S-Net for retraining; the retrained S-Net then tests the samples in group k+2 to obtain the pseudo labels required for R-Net training in the next iteration;
step 3-3, alternately and iteratively training R-Net and S-Net according to step 3-2 until training is complete;
training is complete when all training samples have been traversed, i.e. all training samples have been fed into R-Net or S-Net for training;
in step 3-2, the initial value of k is 3; with each iteration described in step 3-3, the value of k in step 3-2 increases by 2 over the previous iteration.
6. The weakly supervised saliency detection training strategy based on hybrid labels according to claim 5, characterized in that:
after each iteration, a confidence verification mechanism is applied to R-Net and S-Net using the mean absolute error (MAE), specifically:
the MAE scores on a validation set of the R-Net trained after the t-th iteration and the R-Net trained after the (t-1)-th iteration are calculated and compared, and the R-Net with the lower MAE score is selected to test the samples in group 2t and generate the corresponding pseudo labels;
the MAE scores on the validation set of the S-Net trained after the t-th iteration and the S-Net trained after the (t-1)-th iteration are calculated and compared, and the S-Net with the lower MAE score is selected to test the samples in group 2t+1 and generate the corresponding pseudo labels;
the initial value of t is 2, and 2t+1 ≤ n, where n refers to the number of training-sample groups described in step 2.
CN202211081469.3A 2022-09-06 2022-09-06 Weakly supervised saliency detection method based on hybrid labels and training strategy Pending CN115620101A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211081469.3A CN115620101A (en) Weakly supervised saliency detection method based on hybrid labels and training strategy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211081469.3A CN115620101A (en) Weakly supervised saliency detection method based on hybrid labels and training strategy

Publications (1)

Publication Number Publication Date
CN115620101A (en) 2023-01-17

Family

ID=84858477

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211081469.3A Pending CN115620101A (en) Weakly supervised saliency detection method based on hybrid labels and training strategy

Country Status (1)

Country Link
CN (1) CN115620101A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116189058A (en) * 2023-03-03 2023-05-30 北京信息科技大学 Video saliency target detection method and system based on unsupervised deep learning
CN116189058B (en) * 2023-03-03 2023-10-03 北京信息科技大学 Video saliency target detection method and system based on unsupervised deep learning

Similar Documents

Publication Publication Date Title
Zhou et al. Edge-guided recurrent positioning network for salient object detection in optical remote sensing images
CN108256562B (en) Salient target detection method and system based on weak supervision time-space cascade neural network
Xie et al. Pyramid grafting network for one-stage high resolution saliency detection
Ji et al. Encoder-decoder with cascaded CRFs for semantic segmentation
CN110879959B (en) Method and device for generating data set, and testing method and testing device using same
Cong et al. A weakly supervised learning framework for salient object detection via hybrid labels
CN115100235A (en) Target tracking method, system and storage medium
CN111932431B (en) Visible watermark removing method based on watermark decomposition model and electronic equipment
CN112150450B (en) Image tampering detection method and device based on dual-channel U-Net model
Liao et al. A deep ordinal distortion estimation approach for distortion rectification
CN110827265A (en) Image anomaly detection method based on deep learning
CN111723852B (en) Robust training method for target detection network
CN116363489A (en) Copy-paste tampered image data detection method, device, computer and computer-readable storage medium
CN115908789A (en) Cross-modal feature fusion and asymptotic decoding saliency target detection method and device
Wang et al. Thermal images-aware guided early fusion network for cross-illumination RGB-T salient object detection
CN115410081A (en) Multi-scale aggregated cloud and cloud shadow identification method, system, equipment and storage medium
CN112529862A (en) Significance image detection method for interactive cycle characteristic remodeling
Liu et al. Multi-scale iterative refinement network for RGB-D salient object detection
CN115331024A (en) Intestinal polyp detection method based on deep supervision and gradual learning
CN115620101A (en) Weak supervision significance detection method based on mixed label and training strategy
Ye et al. Enhanced feature pyramid network for semantic segmentation
Gu et al. FBI-Net: Frequency-based image forgery localization via multitask learning With self-attention
Yuan et al. Recurrent structure attention guidance for depth super-resolution
Zhou et al. STI-Net: Spatiotemporal integration network for video saliency detection
Zhang et al. Diffusionengine: Diffusion model is scalable data engine for object detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination