WO2019136946A1

WO2019136946A1 - Deep learning-based weakly supervised salient object detection method and system

Info

Publication number: WO2019136946A1
Application number: PCT/CN2018/095057
Authority: WO
Inventors: 李冠彬; 谢圆; 成慧; 林倞; 王青
Original assignee: 中山大学
Priority date: 2018-01-15
Filing date: 2018-07-10
Publication date: 2019-07-18
Also published as: CN108399406A; CN108399406B

Abstract

Disclosed are a deep learning-based weakly supervised salient object detection method and system. The method comprises: generating saliency maps of all training images by using an unsupervised saliency detection method; training a multi-task fully convolutional neural network by using the saliency maps and the corresponding image-level category labels as noisy supervision information for the initial iteration, and generating new class activation maps and salient object prediction maps after the training process converges; adjusting the class activation maps and the salient object prediction maps by using a conditional random field model; updating the saliency annotation information for the next iteration by using a label update strategy; repeating the training process over multiple iterations until a stop condition is met; and performing generalization training on a data set comprising images of unknown categories to obtain a final model. According to the present invention, noise information is automatically eliminated during optimization, and good prediction performance can be achieved using only image-level annotation information, thereby avoiding a complex and time-consuming pixel-level manual labeling process.

Description

Method and system for weakly supervised salient object detection based on deep learning

Technical field

The present invention relates to the field of computer vision based on deep learning, and in particular to a method and system for weakly supervised salient object detection based on deep learning.

Background art

Salient object detection refers to accurately locating the regions of an image that most attract human visual attention. In recent years, this technology has been used in many different vision applications, which has stimulated a great deal of research in computer vision and cognitive science.

In recent years, the successful application of convolutional neural networks has brought significant breakthroughs in saliency detection, for example G. Li et al., "Visual saliency based on multiscale deep features" (IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015), and N. Liu et al., "Deep hierarchical saliency network for salient object detection" (In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 678-686, 2016). However, to achieve good performance, these deep learning methods based on convolutional neural networks require a sufficient amount of high-quality pixel-level annotation as training samples. For saliency detection, pixel-level annotation is very difficult: even an experienced annotator needs several minutes to label a single image. In addition, because the definition of saliency is subjective, the annotations must be further screened after manual labeling to remove controversial labels and ensure training quality, so the whole annotation effort consumes a great deal of labor and time. This limits the amount of pixel-level training data available, and this limitation has in turn become a bottleneck for further improving the performance of such methods.

On the other hand, there are also a large number of unsupervised methods in this field, such as the earlier work of Y. Wei, F. Wen, W. Zhu, and J. Sun, "Geodesic saliency using background priors" (In European Conference on Computer Vision, pages 29–42. Springer, 2012), and the more recent work of M.-M. Cheng et al., "Global contrast based salient region detection" (IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3): 569–582, 2015). These methods are usually based on low-level features such as color, position, and background priors, so each of them tends to work only for specific categories of images and cannot predict well for all images. Methods based on low-level features share a common disadvantage: most of their detection errors stem from a lack of spatial coherence and of image semantics.

Summary of the invention

In order to overcome the deficiencies of the prior art described above, the object of the present invention is to provide a weakly supervised salient object detection method and system based on deep learning, which effectively combines supervised and unsupervised saliency detection methods. During optimization, noise information is automatically removed, and good prediction performance can be achieved using only image-level annotation information, thereby avoiding the cumbersome and time-consuming pixel-level manual labeling process.

To achieve the above and other objects, the present invention provides a weakly supervised salient object detection method based on deep learning, comprising the following steps:

Step S1, generating saliency maps S_anno of all training images through the multi-task fully convolutional neural network by using an unsupervised saliency detection method;

Step S2, training the multi-task fully convolutional neural network by using the saliency maps and the corresponding image-level category labels as the noisy supervision information of the initial iteration, and, after the training process converges, generating a new class activation map S_cam and a salient object prediction map S_predict;

Step S3, adjusting the class activation map and the salient object prediction map by using a conditional random field model;

Step S4, updating the saliency annotation information for the next iteration by using a label update strategy;

Step S5, repeating the training process of steps S2-S4 over multiple iterations until the stop condition is met;

Step S6, performing generalization training on a data set containing images of unknown categories to obtain the final model.

Preferably, in step S1, a data set containing image category information is selected as the training data, an unsupervised saliency detection method is selected, and a pixel-level saliency map is generated for all training samples through the multi-task fully convolutional neural network.

Preferably, any deep neural network model is selected as the pre-trained model of the fully convolutional neural network; the final linear classification layer of the deep neural network model is replaced by a linear convolutional layer, the last two downsampling layers in the network are removed, and dilated convolution is used in the last two convolutional layers to increase the dilation rate.

Preferably, in the multi-task fully convolutional neural network, the fully convolutional neural network is replicated three times, each sub-network corresponds to the image input at one scale, and the three networks share weights; the outputs of the three networks are rescaled to the original image size by linear interpolation, added pixel by pixel, and fed into a softmax layer to generate the final probability map.

Preferably, step S2 further comprises:

training the multi-task fully convolutional neural network by using the saliency maps generated in step S1 and the corresponding manually labeled category information as the saliency map pseudo-labels and the category labels, respectively;

after the training process converges, generating a new salient object prediction map with the trained fully convolutional neural network, and generating the class activation map with the multi-task fully convolutional neural network in combination with the class activation mapping technique.

Preferably, the feature maps at the three scales of the multi-task fully convolutional neural network are concatenated and passed through a global average pooling layer to obtain further processed features, which are then fed into a fully connected layer to obtain the category distribution output.

Preferably, in step S3, the conditional random field model is used to process the saliency map S_anno generated in step S1 and to adjust the class activation map S_cam and the saliency map S_predict generated in step S2, producing prediction maps with stronger spatial coherence and better edge preservation, denoted C_anno, C_cam, and C_predict.

Preferably, in step S4, the label update strategy generates new saliency map pseudo-labels by using the class activation map for guidance together with suitable threshold decisions.

Preferably, the label update strategy is specifically as follows:

If MAE(C_anno, C_predict) ≤ α, then S_update = CRF((S_anno + S_predict) / 2);

Otherwise, if MAE(C_anno, C_cam) > β and MAE(C_predict, C_cam) > β, the training sample is removed in the next training iteration;

Otherwise, if MAE(C_anno, C_cam) ≤ MAE(C_predict, C_cam), then S_update = C_anno;

Otherwise, S_update = C_predict;

where MAE is the mean absolute error, CRF is the conditional random field algorithm, and α and β are preset thresholds.

To achieve the above object, the present invention also provides a weakly supervised salient object detection system based on deep learning, comprising:

a saliency map generation unit, configured to generate saliency maps S_anno of all training images through the multi-task fully convolutional neural network by using an unsupervised saliency detection method;

a training unit, configured to train the multi-task fully convolutional neural network by using the saliency maps and the corresponding image-level category labels as the noisy supervision information of the initial iteration, and to generate a new class activation map S_cam and a salient object prediction map S_predict after the training process converges;

an adjustment unit, configured to adjust the class activation map and the salient object prediction map by using a conditional random field model;

an update unit, configured to update the saliency annotation information for the next iteration by using a label update strategy;

an iterative training unit, configured to repeat the training process of the training unit, the adjustment unit, and the update unit over multiple iterations until the stop condition is met; and

a second-stage training unit, configured to perform generalization training on a data set containing images of unknown categories after the first-stage training stops, so as to obtain the final model.

Compared with the prior art, the weakly supervised salient object detection method and system based on deep learning of the present invention generate saliency maps of all training images by using an unsupervised saliency detection method and use them, together with the corresponding image-level category labels, as the noisy supervision information of the initial iteration to train the multi-task fully convolutional neural network. After the training process converges, new class activation maps and salient object prediction maps are generated by the multi-task neural network and adjusted with the conditional random field model, and the label update strategy then updates the annotation information for the next iteration. This training process is repeated over multiple iterations until the stop condition is met, and finally generalization training is performed on a data set containing images of unknown categories to obtain the final model. In the absence of pixel-level labels, the proposed method effectively exploits and corrects the ambiguity of the salient object prediction maps generated by traditional unsupervised methods, and the final results exceed all existing unsupervised methods in the field of salient object detection.

DRAWINGS

FIG. 1 is a flow chart showing the steps of a weakly supervised salient object detection method based on deep learning according to the present invention;

FIG. 2 is a structural diagram of the multi-task fully convolutional neural network in a specific embodiment of the present invention;

FIG. 3 is a schematic diagram of the iterative training process in a specific embodiment of the present invention;

FIG. 4 is a system architecture diagram of a weakly supervised salient object detection system based on deep learning according to the present invention.

Detailed description

The embodiments of the present invention are described below by way of specific examples with reference to the accompanying drawings, and those skilled in the art can readily understand the advantages and effects of the present invention from this disclosure. The present invention may also be embodied or applied in other specific embodiments, and various modifications and changes may be made without departing from the spirit and scope of the invention.

FIG. 1 is a flow chart of the steps of a weakly supervised salient object detection method based on deep learning according to the present invention. As shown in FIG. 1, the method includes the following steps:

In step S1, saliency maps of all training images are generated through the multi-task fully convolutional neural network by using an unsupervised saliency detection method. Specifically, in step S1, a data set containing image category information, typically one used for image detection, is selected as the training data of the first stage, and an unsupervised saliency detection method is selected; a pixel-level saliency map is generated for all training samples through the multi-task fully convolutional neural network and denoted S_anno.

The present invention can select any well-performing deep neural network model, such as ResNet (residual network) or GoogleNet, as the pre-trained model of the fully convolutional neural network. FIG. 2 is a structural diagram of the multi-task fully convolutional neural network in a specific embodiment of the present invention. In this embodiment, a 101-layer ResNet (residual network) is used, and the network structure is modified as needed, although the invention is not limited thereto. Specifically:

First, the linear classification layer with 1000 outputs at the end of the residual network is replaced by a linear convolutional layer that outputs a two-channel feature map. In addition, in order to obtain a higher-resolution feature map, following L.-C. Chen et al., "Semantic image segmentation with deep convolutional nets and fully connected crfs" (arXiv preprint arXiv:1412.7062, 2014), the last two downsampling layers in the network are removed, and dilated convolution is used in the last two convolutional layers to increase the dilation rate and thereby enlarge the receptive field. After this processing, the resolution of the final network output is 1/8 of the original resolution.
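By way of illustration only, the 1/8 resolution figure can be checked with a short sketch. It assumes the standard ResNet layout with five stride-2 downsampling steps (overall stride 32) and the DeepLab-style convention of doubling the dilation rate for each removed stride; neither detail is spelled out in the text, and the function name is hypothetical.

```python
def output_stride(num_stride2_steps: int, removed: int) -> int:
    """Overall downsampling factor after removing `removed` stride-2 steps."""
    return 2 ** (num_stride2_steps - removed)

# Assumption: a standard ResNet halves the resolution five times (stride 32).
stride = output_stride(num_stride2_steps=5, removed=2)

# Assumption (DeepLab-style): each removed stride is compensated by doubling
# the dilation rate, giving rate 2 and then rate 4 in the last two stages.
dilations = [2 ** (i + 1) for i in range(2)]

print(stride, dilations)
```

Under these assumptions the output stride is 8, which matches the stated 1/8 resolution.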

Since the scale of salient objects varies widely, in order to detect objects at different scales more accurately, the present invention replicates the 101-layer residual network three times. Each sub-network corresponds to the input at one scale, and the three networks share weights. The outputs of the three networks are rescaled to the original image size by linear interpolation, added pixel by pixel, and then fed into a softmax layer to generate the final probability map, that is, the saliency map of the training image.
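The multi-scale fusion described above can be sketched as follows. This is a minimal NumPy illustration, not the embodiment's implementation: the helper names, the use of bilinear interpolation with aligned corners, and the convention that channel 1 is the "salient" channel are all assumptions, and random arrays stand in for the two-channel outputs of the three weight-sharing sub-networks.

```python
import numpy as np

def resize_linear(x, out_h, out_w):
    """Bilinear resize of a (C, H, W) array (aligned-corner convention, assumed)."""
    c, h, w = x.shape
    ys = np.linspace(0.0, h - 1.0, out_h)
    xs = np.linspace(0.0, w - 1.0, out_w)
    y0 = np.floor(ys).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    wy = ys - y0
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    wx = xs - x0
    # Interpolate along height first, then along width.
    rows = x[:, y0, :] * (1 - wy)[None, :, None] + x[:, y1, :] * wy[None, :, None]
    return rows[:, :, x0] * (1 - wx) + rows[:, :, x1] * wx

def fuse_scales(outputs_per_scale, out_h, out_w):
    """Rescale each two-channel output, add pixel by pixel, apply softmax."""
    total = sum(resize_linear(o, out_h, out_w) for o in outputs_per_scale)
    total = total - total.max(axis=0, keepdims=True)  # numerical stability
    e = np.exp(total)
    prob = e / e.sum(axis=0, keepdims=True)
    return prob[1]  # assumption: channel 1 holds the "salient" class

# Stand-ins for the outputs of the three sub-networks at different resolutions.
rng = np.random.default_rng(0)
outputs = [rng.normal(size=(2, s, s)) for s in (4, 6, 8)]
sal = fuse_scales(outputs, 32, 32)
print(sal.shape)
```

Sharing weights means the same network is simply evaluated on three rescaled copies of the image; only the fusion step differs from a single-scale forward pass.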

In step S2, the multi-task fully convolutional neural network is trained by using the saliency maps and the corresponding image-level category labels as the noisy supervision information of the initial iteration, and, after the training process converges, a new class activation map and a salient object prediction map are generated.

Specifically, step S2 further includes:

Step S201, training the multi-task fully convolutional neural network by using the saliency maps generated in step S1 and the corresponding manually labeled category information as the saliency map pseudo-labels and the category labels, respectively;

Step S202, after the training process of step S201 converges, using the trained fully convolutional neural network to generate a new salient object prediction map, denoted S_predict, and using the network in combination with the class activation mapping technique to generate a class activation map, denoted S_cam.

As shown in FIG. 2, for the image classification task, following B. Zhou et al., "Learning deep features for discriminative localization" (In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921-2929, 2016), the feature maps at the three scales are concatenated and passed through a global average pooling layer to obtain further processed features, which are then fed into a fully connected layer to obtain the category distribution output.

Let f_k(x, y) denote the activation value at spatial position (x, y) in the k-th channel of the concatenated feature map.

Let w_k^c denote the weight of unit k for category c (after the global pooling operation, each channel of the concatenated feature map becomes a single unit activation value). Defining M_c as the class activation map for the c-th category, its value at each position is given by the following formula:

$$M_c(x, y) = \sum_k w_k^c \, f_k(x, y)$$
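As a hedged illustration of class activation mapping in the sense of Zhou et al., the sketch below computes the map for one category as the weighted sum over channels of the concatenated features, using the fully connected weights that follow global average pooling. The array shapes, function names, and the omission of a bias term are assumptions made for the example.

```python
import numpy as np

def class_scores(features, fc_weights):
    """Global average pooling followed by a fully connected layer (no bias).

    features:   (K, H, W) concatenated feature maps f_k(x, y)
    fc_weights: (num_classes, K) weights of unit k for category c
    """
    pooled = features.mean(axis=(1, 2))  # one unit activation per channel
    return fc_weights @ pooled

def class_activation_map(features, fc_weights, c):
    """M_c(x, y) = sum over k of w_k^c * f_k(x, y), returned with shape (H, W)."""
    return np.tensordot(fc_weights[c], features, axes=([0], [0]))

rng = np.random.default_rng(0)
f = rng.random((6, 4, 5))   # K=6 channels on a 4x5 grid (toy sizes)
w = rng.random((3, 6))      # 3 categories
M1 = class_activation_map(f, w, c=1)
print(M1.shape)
```

With average pooling and no bias, the spatial mean of M_c equals the classification score of category c, which is why the map can be read as the per-position contribution to that score.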

In step S3, the class activation map and the salient object prediction map are adjusted by using the conditional random field model. Specifically, in step S3, the conditional random field model is used to process the saliency map S_anno generated in step S1 and to adjust the class activation map S_cam and the saliency map S_predict generated in step S2, producing prediction maps with stronger spatial coherence and better edge preservation, denoted C_anno, C_cam, and C_predict respectively.

In a specific embodiment of the present invention, a graphical model is embedded to fine-tune the saliency maps. Specifically, the graphical model is based on a conditional random field, which improves the spatial coherence and edge preservation of the predicted maps.

In particular, the proposed model solves a binary pixel-level labeling problem using the following energy function:

$$E(L) = -\sum_i \log P(l_i) + \sum_{i,j} \theta_{ij}(l_i, l_j)$$

where L denotes the saliency label assignment over all pixels, l_i = 1 indicates that the i-th pixel is salient, and l_i = 0 indicates that it is not. P(l_i) is the probability that pixel x_i takes label l_i. At initialization, P(1) = S_i and P(0) = 1 - S_i, where S is the saliency map to be processed and S_i is its saliency score at position x_i. θ_ij(l_i, l_j) is the pairwise potential between positions, computed by the following formula:

$$\theta_{ij}(l_i, l_j) = \mu(l_i, l_j)\left[ w_1 \exp\left(-\frac{\lVert p_i - p_j \rVert^2}{2\sigma_\alpha^2} - \frac{\lVert I_i - I_j \rVert^2}{2\sigma_\beta^2}\right) + w_2 \exp\left(-\frac{\lVert p_i - p_j \rVert^2}{2\sigma_\gamma^2}\right) \right]$$

where p is the position vector, I is the color vector, w_1 and w_2 are the weights of the linear combination, and σ_α, σ_β, σ_γ are hyperparameters that control the degree of proximity and similarity.

Here μ(l_i, l_j) = 1 when l_i ≠ l_j, and 0 otherwise. θ_ij consists of two kernels. The first kernel depends on the pixel positions and the color values at those positions, encouraging adjacent pixels with similar colors to receive similar saliency scores. The second kernel depends only on the spatial relationship between pixels and tends to remove small isolated regions.
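To make the two energy terms concrete, the following sketch evaluates the unary term -log P(l_i) and a pairwise term for a single pair of pixels. It assumes the common two-kernel form used with fully connected CRFs (an appearance kernel over position and color plus a smoothness kernel over position only); the function names and all default weights and bandwidths are hypothetical values chosen for the example, and no inference over the full pixel graph is attempted.

```python
import numpy as np

def unary_potential(S_i, l_i, eps=1e-12):
    """-log P(l_i), with P(1) = S_i and P(0) = 1 - S_i."""
    p = S_i if l_i == 1 else 1.0 - S_i
    return float(-np.log(np.clip(p, eps, 1.0)))

def pairwise_potential(l_i, l_j, p_i, p_j, I_i, I_j,
                       w1=1.0, w2=1.0,
                       sigma_alpha=50.0, sigma_beta=13.0, sigma_gamma=3.0):
    """Pairwise term; zero when the two labels agree (mu = 0)."""
    if l_i == l_j:
        return 0.0
    d_pos = np.sum((p_i - p_j) ** 2)   # squared distance between positions
    d_col = np.sum((I_i - I_j) ** 2)   # squared distance between colors
    appearance = w1 * np.exp(-d_pos / (2 * sigma_alpha ** 2)
                             - d_col / (2 * sigma_beta ** 2))
    smoothness = w2 * np.exp(-d_pos / (2 * sigma_gamma ** 2))
    return float(appearance + smoothness)

p = np.array([0.0, 0.0])
q = np.array([1.0, 0.0])
red = np.array([200.0, 10.0, 10.0])
dark = np.array([10.0, 10.0, 10.0])
# Neighbouring pixels with the same color are penalised more for taking
# different labels than neighbouring pixels with very different colors.
print(pairwise_potential(0, 1, p, q, dark, dark) > pairwise_potential(0, 1, p, q, dark, red))
```

In practice the energy is minimised over all pixel pairs with an approximate inference routine; the sketch only shows how each term responds to label disagreement, distance, and color difference.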

The output of the entire graphical model is a probability map in which the value at each position indicates the probability that the pixel at that position is salient. Preferably, the probability map can be converted into a binary map with a suitable threshold and used as the pseudo-label during training.

In step S4, the label update strategy is used to update the saliency annotation information for the next iteration. Specifically, using the label update strategy, the saliency annotation for the next iteration, denoted S_update, is generated from S_anno, S_cam, S_predict, C_anno, C_cam, and C_predict produced in the above steps.

In a specific embodiment of the present invention, the label update strategy uses the class activation map for guidance, together with suitable threshold decisions, to generate new saliency map pseudo-labels. The specific label update strategy is as follows:

If MAE(C_anno, C_predict) ≤ α,

then S_update = CRF((S_anno + S_predict) / 2);

otherwise, if MAE(C_anno, C_cam) > β and MAE(C_predict, C_cam) > β,

then the training sample is removed in the next training iteration;

otherwise, if MAE(C_anno, C_cam) ≤ MAE(C_predict, C_cam),

then S_update = C_anno;

otherwise,

S_update = C_predict.
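The decision cascade of the label update strategy can be sketched as a small function. The sketch takes the first branch to be S_update = CRF((S_anno + S_predict) / 2) and the third to be S_update = C_anno; the CRF is passed in as a hypothetical callable (`crf`), `None` is used here to signal that the sample is dropped, and the threshold values in the usage lines are arbitrary.

```python
import numpy as np

def mae(a, b):
    """Mean absolute error between two maps of equal shape."""
    return float(np.mean(np.abs(a - b)))

def update_label(S_anno, S_predict, C_anno, C_predict, C_cam, alpha, beta, crf):
    """Return the pseudo-label for the next iteration, or None to drop the sample."""
    if mae(C_anno, C_predict) <= alpha:
        # Annotation and prediction agree: refine their average.
        return crf((S_anno + S_predict) / 2.0)
    if mae(C_anno, C_cam) > beta and mae(C_predict, C_cam) > beta:
        # Both disagree with the class activation map: discard the sample.
        return None
    if mae(C_anno, C_cam) <= mae(C_predict, C_cam):
        # The current annotation is closer to the class activation map.
        return C_anno
    return C_predict

identity_crf = lambda m: m   # placeholder for a real CRF refinement step
zeros = np.zeros((4, 4))
ones = np.ones((4, 4))
cam = np.full((4, 4), 0.2)
chosen = update_label(zeros, ones, zeros, ones, cam, alpha=0.1, beta=0.9, crf=identity_crf)
print(chosen is zeros)
```

The class activation map acts as an arbiter only when the annotation and the prediction disagree by more than α, which keeps stable samples from being rewritten unnecessarily.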

In step S5, the training process of steps S2-S4 is repeated over multiple iterations until the stop condition is met. Specifically, steps S2, S3, and S4 are performed in turn, and the first-stage training stops when the set stop condition is satisfied.

Preferably, after step S5, the weakly supervised salient object detection method based on deep learning of the present invention further comprises the following step:

In step S6, generalization training is performed on a data set containing images of unknown categories to obtain the final model. Specifically, one or two saliency detection data sets are selected as the training data of the second stage. Unlike the first stage, the data of this stage contain objects of unknown categories; the fully convolutional neural network is fine-tuned on these data, and the final model is obtained when the training process converges.

FIG. 3 is a schematic diagram of the iterative training process in a specific embodiment of the present invention. In this embodiment, the entire weakly supervised saliency detection training is divided into two stages, both based on an iterative training strategy; the process of each iteration is shown in FIG. 3.

In the first stage, the present invention selects the Microsoft COCO data set for training. It is a large data set widely used for object detection, and each training image has one or more category labels. First, a well-performing unsupervised saliency detection model is selected to generate an initial saliency map for every training sample, which serves as the saliency pseudo-label for the first round of training. These pseudo-labels, together with the corresponding image-level category labels, are then used as supervision information to train the multi-task fully convolutional neural network. When the training process converges, the model that performs best on the validation set is selected as the final model of that round and is used to generate new saliency maps and class activation maps for the entire training data set. In a specific embodiment of the invention, the model is optimized using the following loss functions:

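(1) Euclidean distance loss function:

$$E = \frac{1}{2N} \sum_{n=1}^{N} \lVert \hat{y}_n - y_n \rVert_2^2$$

where \hat{y}_n denotes the label of the n-th sample and y_n denotes the predicted value for the n-th sample;

(2) Sigmoid cross-entropy loss function:

$$E = -\frac{1}{N} \sum_{n=1}^{N} \left[ p_n \log \hat{p}_n + (1 - p_n) \log(1 - \hat{p}_n) \right]$$

where N is the total number of samples, p_n is the label of the n-th sample, and \hat{p}_n denotes the predicted value for the n-th sample.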
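For reference, the two losses can be written in a few lines of NumPy. The sketch assumes the conventional normalisations (a factor of 1/(2N) for the Euclidean loss and 1/N for the sigmoid cross-entropy loss) and assumes that the cross-entropy receives raw scores to which the sigmoid is applied internally; the function names are illustrative, and the text does not state which task each loss is attached to.

```python
import numpy as np

def euclidean_loss(labels, preds):
    """1/(2N) * sum of squared differences; arrays have shape (N, D)."""
    n = labels.shape[0]
    return float(np.sum((labels - preds) ** 2) / (2.0 * n))

def sigmoid_cross_entropy_loss(labels, scores):
    """-1/N * sum of [p*log(p_hat) + (1-p)*log(1-p_hat)], p_hat = sigmoid(scores).

    Uses the algebraically equivalent stable form
    max(x, 0) - x*p + log(1 + exp(-|x|)) to avoid overflow.
    """
    n = labels.shape[0]
    x = scores
    per_element = np.maximum(x, 0.0) - x * labels + np.log1p(np.exp(-np.abs(x)))
    return float(np.sum(per_element) / n)

labels = np.array([[1.0, 0.0], [0.0, 1.0]])
scores = np.array([[2.0, -1.0], [-3.0, 0.5]])
print(euclidean_loss(labels, scores), sigmoid_cross_entropy_loss(labels, scores))
```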

Second, a new training tuple (image, saliency map, and image category label) is generated for the next iteration using the saliency label update strategy. The above training process is repeated until the stop condition is met. After each training round, the MAE (mean absolute error) between the pseudo-labels used in that round and the new saliency maps generated by the fully convolutional neural network is computed on the validation set; when this error falls below a threshold (which may be preset), the model is considered to have reached the desired fit and training can end.
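The outer loop and its stop condition can be sketched as below. Training and label updating are represented by hypothetical callables (`train_and_predict`, `update_labels`), the MAE is computed over whatever list of maps is passed in rather than over a separate validation split, and `max_rounds` is a safeguard added for the example; none of these names come from the embodiment.

```python
import numpy as np

def mae(a, b):
    """Mean absolute error between two maps of equal shape."""
    return float(np.mean(np.abs(a - b)))

def iterate_training(pseudo_labels, train_and_predict, update_labels,
                     threshold, max_rounds=10):
    """Repeat train -> predict -> update until the mean MAE drops below threshold."""
    predictions = pseudo_labels
    for round_idx in range(1, max_rounds + 1):
        predictions = train_and_predict(pseudo_labels)
        error = np.mean([mae(l, p) for l, p in zip(pseudo_labels, predictions)])
        if error < threshold:
            return predictions, round_idx
        pseudo_labels = update_labels(pseudo_labels, predictions)
    return predictions, max_rounds

# Toy stand-ins: "training" halves every map, and the update adopts the prediction.
start = [np.ones((2, 2))]
final_maps, rounds = iterate_training(
    start,
    train_and_predict=lambda labels: [0.5 * l for l in labels],
    update_labels=lambda labels, preds: preds,
    threshold=0.1,
)
print(rounds)
```

The stop test compares the labels a round was trained on with what the trained model now predicts, so the loop ends once the pseudo-labels and the model have become mutually consistent.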

In the second training stage, in order to improve the generalization ability of the model so that it can also perform saliency detection on images containing unknown categories, further fine-tuning is performed on saliency detection data sets (MSRA-B, HKU-IS). At this stage, the average of the five class activation maps with the highest response values is used as the guide map.
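A possible reading of the guide-map construction is sketched below. The text does not say how the "response value" of a class activation map is measured, so the sketch assumes it is the peak value of each map; the function name and the toy data are illustrative only.

```python
import numpy as np

def guide_map(cams, top_k=5):
    """Average the top_k class activation maps ranked by peak response.

    cams: (num_classes, H, W) array of class activation maps.
    """
    response = cams.reshape(cams.shape[0], -1).max(axis=1)  # assumed ranking score
    top = np.argsort(response)[::-1][:top_k]
    return cams[top].mean(axis=0)

# Toy data: map i is constant with value i, so the top five are maps 3..7.
cams = np.stack([np.full((4, 4), float(i)) for i in range(8)])
g = guide_map(cams)
print(g[0, 0])
```

Averaging several maps is useful when no category label is available, since no single known class can be trusted to cover the salient object.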

FIG. 4 is a system architecture diagram of a weakly supervised salient object detection system based on deep learning according to the present invention. As shown in FIG. 4, the system includes:

A saliency map generation unit 401, configured to generate saliency maps of all training images through the multi-task fully convolutional neural network by using an unsupervised saliency detection method. Specifically, the saliency map generation unit 401 selects a data set containing image category information, typically one used for image detection, as the training data of the first stage, and selects an unsupervised saliency detection method; a pixel-level saliency map is generated for all training samples through the multi-task fully convolutional neural network and denoted S_anno.

The present invention can select any deep neural network model, such as ResNet (residual network) or GoogleNet, as the pre-trained model of the fully convolutional neural network. In a specific embodiment of the present invention, as shown in FIG. 2, a 101-layer residual network is selected as the pre-trained model of the fully convolutional neural network, and the network structure is modified as needed. Specifically:

Since the scale of salient objects varies widely, in order to detect objects at different scales more accurately, the present invention replicates the 101-layer residual network three times. Each sub-network corresponds to the input at one scale, and the three networks share weights. The outputs of the three networks are rescaled to the original image size by linear interpolation, added pixel by pixel, and then fed into a softmax layer to generate the final probability map.

The training unit 402 is configured to train the multi-task fully convolutional neural network by using the saliency maps and the corresponding image-level category labels as the noisy supervision information of the initial iteration, and to generate a new class activation map and a salient object prediction map after the training process converges.

Specifically, the training unit 402 is configured to:

train the multi-task fully convolutional neural network by using the saliency maps generated by the saliency map generation unit 401 and the corresponding manually labeled category information as the saliency map pseudo-labels and the category labels, respectively;

after the training process converges, use the trained fully convolutional neural network to generate a new salient object prediction map, denoted S_predict, and use the network in combination with the class activation mapping technique to generate a class activation map, denoted S_cam.

The adjustment unit 403 is configured to adjust the class activation map and the salient object prediction map by using a conditional random field model. Specifically, the adjustment unit 403 uses the conditional random field model to process the saliency map S_anno generated by the saliency map generation unit 401 and to adjust the class activation map S_cam and the saliency map S_predict generated by the training unit 402, producing prediction maps with stronger spatial coherence and better edge preservation, denoted C_anno, C_cam, and C_predict respectively.

The update unit 404 is configured to update the annotation information for the next iteration by using the label update strategy. Specifically, the update unit 404 uses the label update strategy to generate the saliency map annotation for the next iteration, denoted S_update, from S_anno, S_cam, S_predict, C_anno, C_cam, and C_predict produced above.

The iterative training unit 405 is configured to repeat the training process of the training unit 402, the adjustment unit 403, and the update unit 404 over multiple iterations until the stop condition is met. Specifically, the training unit 402, the adjustment unit 403, and the update unit 404 operate in turn, and the first-stage training stops when the set stop condition is satisfied.

The second-stage training unit 406 is configured to perform generalization training on a data set containing images of unknown categories after the first-stage training stops, so as to obtain the final model. Specifically, the second-stage training unit 406 selects one or two saliency detection data sets as the training data of the second stage. Unlike the first stage, the data of this stage contain objects of unknown categories; the fully convolutional neural network is fine-tuned on these data, and the final model is obtained when the training process converges.

In summary, the weakly supervised salient object detection method and system based on deep learning of the present invention generate saliency maps of all training images by using an unsupervised saliency detection method and use them, together with the corresponding image-level category labels, as the noisy supervision information of the initial iteration to train the multi-task fully convolutional neural network. After the training process converges, new class activation maps and salient object prediction maps are generated by the multi-task neural network and adjusted with the conditional random field model, and the label update strategy then updates the annotation information for the next iteration. This training process is repeated over multiple iterations until the stop condition is met, and finally generalization training is performed on a data set containing images of unknown categories to obtain the final model. In the absence of pixel-level labels, the proposed method effectively exploits and corrects the ambiguity of the salient object prediction maps generated by traditional unsupervised methods, and the final results exceed all existing unsupervised methods in the field of salient object detection.

The above-described embodiments are merely illustrative of the principles of the invention and its effects, and are not intended to limit the invention. Modifications and variations of the above-described embodiments can be made by those skilled in the art without departing from the spirit and scope of the invention. Therefore, the scope of protection of the invention should be as set forth in the claims.

Claims

A weakly supervised salient object detection method based on deep learning, comprising the following steps:

Step S1, generating a saliency map S_anno for each training image by using an unsupervised saliency detection method through a multi-task fully convolutional neural network;

Step S2, using the saliency maps and the corresponding image-level category labels as the noisy supervision information of the initial iteration to train the multi-task fully convolutional neural network, and, after the training process converges, generating a new class activation map S_cam and a salient object prediction map S_predict;

Step S3, adjusting the class activation map and the salient object prediction map by using a conditional random field model;

Step S4, updating the saliency annotation information for the next iteration by using a label update strategy;

Step S5, repeating the training process of steps S2-S4 over multiple iterations until the stop condition is met;

Step S6, performing generalization training on a data set containing images of unknown categories to obtain the final model.
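By way of non-limiting illustration, the iteration of steps S2-S5 can be sketched as the control loop below. All names (`iterate_training`, `train_and_predict`, `refine`, `update_label`, `should_stop`) are hypothetical placeholders introduced here and do not appear in the original text; the network training, the conditional random field and the stop condition are passed in as callables because their implementations are not fixed by the claim. The sketch also assumes that the updated labels of one iteration serve as the annotation maps of the next.

```python
from typing import Callable, Dict, List, Optional, Tuple

Map = List[List[float]]  # a 2-D saliency map stored as nested lists


def iterate_training(
    pseudo_labels: Dict[str, Map],
    train_and_predict: Callable[[Dict[str, Map]], Dict[str, Tuple[Map, Map]]],
    refine: Callable[[Map], Map],
    update_label: Callable[[Map, Map, Map, Map, Map], Optional[Map]],
    should_stop: Callable[[int, Dict[str, Map]], bool],
) -> Dict[str, Map]:
    """Repeat steps S2-S4 until the stop condition of step S5 holds."""
    iteration = 0
    while not should_stop(iteration, pseudo_labels):
        # Step S2: train on the current noisy labels, then predict
        # (S_cam, S_predict) for every remaining training image.
        outputs = train_and_predict(pseudo_labels)
        next_labels: Dict[str, Map] = {}
        for name, (s_cam, s_predict) in outputs.items():
            s_anno = pseudo_labels[name]
            # Step S3: refine each map (a CRF in the described method).
            c_anno, c_cam, c_predict = refine(s_anno), refine(s_cam), refine(s_predict)
            # Step S4: choose the label for the next iteration;
            # None means the sample is dropped from the next iteration.
            updated = update_label(s_anno, s_predict, c_anno, c_cam, c_predict)
            if updated is not None:
                next_labels[name] = updated
        pseudo_labels = next_labels
        iteration += 1
    return pseudo_labels
```

Passing the stages as callables keeps the loop independent of any particular network, CRF library or stopping rule.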
The deep learning-based weakly supervised salient object detection method according to claim 1, wherein in step S1, a training data set containing image category information is selected, an unsupervised saliency detection method is selected, and a pixel-level saliency map is generated for every training sample by the multi-task fully convolutional neural network.
The deep learning-based weakly supervised salient object detection method according to claim 1, wherein any deep neural network model is selected as the pre-trained model of the fully convolutional neural network, the final linear classification layer of the deep neural network model is replaced by a linear convolutional layer, the last two downsampling layers in the network are removed, and a dilated convolution algorithm is used to increase the dilation rate in the last two convolutional layers.
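As a non-limiting illustration of why dilated convolution is paired with the removal of downsampling layers, the following one-dimensional sketch shows that increasing the dilation rate widens the receptive field of a kernel while the output keeps the resolution of the input. The function name `dilated_conv1d` and the zero-padding scheme are assumptions made for this example only; the actual network operates on two-dimensional feature maps.

```python
from typing import List, Sequence


def dilated_conv1d(signal: Sequence[float], kernel: Sequence[float],
                   dilation: int = 1) -> List[float]:
    """Same-size 1-D cross-correlation with zero padding and a dilation rate.

    With dilation d, a kernel of k taps covers d * (k - 1) + 1 input samples,
    so the receptive field grows without any downsampling of the signal.
    """
    k = len(kernel)
    span = dilation * (k - 1)        # receptive field minus one
    pad_left = span // 2
    padded = [0.0] * pad_left + list(signal) + [0.0] * (span - pad_left)
    return [
        sum(kernel[j] * padded[i + j * dilation] for j in range(k))
        for i in range(len(signal))
    ]


signal = [1, 2, 3, 4, 5]
dense = dilated_conv1d(signal, [1, 1, 1], dilation=1)   # each output sees 3 samples
wide = dilated_conv1d(signal, [1, 1, 1], dilation=2)    # each output sees 5 samples
```

Both `dense` and `wide` have the same length as `signal`, which is the property that lets the modified network produce denser prediction maps than the original downsampled backbone.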
The deep learning-based weakly supervised salient object detection method according to claim 3, wherein in the multi-task fully convolutional neural network, the fully convolutional neural network is replicated three times, each sub-network corresponds to the image input at one scale, the three networks share weights, and the outputs of the three networks are rescaled to the original size of the image by bilinear interpolation, added pixel by pixel, and input to a softmax layer to generate the final probability map.
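The multi-scale fusion described above can be sketched, purely as an illustration, as follows. The sketch assumes that each sub-network outputs two score maps (background and foreground) and that the bilinear resize maps corner pixels onto corner pixels; neither detail is specified in the original text, and the names `resize_bilinear` and `fuse_scales` are hypothetical.

```python
import math
from typing import List

Map = List[List[float]]


def resize_bilinear(img: Map, out_h: int, out_w: int) -> Map:
    """Bilinearly resize a 2-D map (corner-aligned sampling, an assumption)."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for y in range(out_h):
        fy = y * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(fy)
        y1 = min(y0 + 1, in_h - 1)
        wy = fy - y0
        row = []
        for x in range(out_w):
            fx = x * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(fx)
            x1 = min(x0 + 1, in_w - 1)
            wx = fx - x0
            top = img[y0][x0] * (1 - wx) + img[y0][x1] * wx
            bottom = img[y1][x0] * (1 - wx) + img[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def fuse_scales(score_maps: List[List[Map]], out_h: int, out_w: int) -> Map:
    """Resize per-scale [background, foreground] scores, add them pixel by
    pixel, and apply a two-class softmax to obtain a foreground probability map."""
    resized = [[resize_bilinear(ch, out_h, out_w) for ch in scale]
               for scale in score_maps]
    prob = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            bg = sum(scale[0][y][x] for scale in resized)
            fg = sum(scale[1][y][x] for scale in resized)
            m = max(bg, fg)                      # subtract max for stability
            e_bg, e_fg = math.exp(bg - m), math.exp(fg - m)
            row.append(e_fg / (e_bg + e_fg))
        prob.append(row)
    return prob
```

Because the three sub-networks share weights, only the input scale differs between the entries of `score_maps`; the fusion itself has no trainable parameters in this sketch.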
The method of claim 1, wherein step S2 further comprises:

training the multi-task fully convolutional neural network by using the saliency maps generated in step S1 and the corresponding manually labeled category information as the saliency map pseudo-labels and the category labels, respectively;

after the training process converges, generating a new salient object prediction map by using the trained fully convolutional neural network, and generating the class activation map by using the multi-task fully convolutional neural network in combination with the class activation mapping technique.
The deep learning-based weakly supervised salient object detection method according to claim 5, wherein the feature maps of the three scales of the multi-task fully convolutional neural network are concatenated, then processed by a global average pooling layer, and then input to a fully connected layer to obtain the category distribution output.
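The classification branch and the class activation mapping technique referred to above can be illustrated by the following minimal sketch. It assumes that the feature maps of the three scales have already been brought to a common size and concatenated into one list of channels, and it returns raw class scores because the original text does not state which normalization is applied to the category output. The function names are hypothetical.

```python
from typing import List

Map = List[List[float]]


def global_average_pool(features: List[Map]) -> List[float]:
    """Reduce each of the K feature maps to its spatial mean."""
    return [sum(map(sum, f)) / (len(f) * len(f[0])) for f in features]


def class_scores(features: List[Map], weights: List[List[float]],
                 biases: List[float]) -> List[float]:
    """Fully connected layer applied to the pooled features (one row per class)."""
    pooled = global_average_pool(features)
    return [sum(w * p for w, p in zip(row, pooled)) + b
            for row, b in zip(weights, biases)]


def class_activation_map(features: List[Map], class_weights: List[float]) -> Map:
    """Class activation mapping: weight every feature map by the fully
    connected weight of the chosen class and sum over channels."""
    h, w = len(features[0]), len(features[0][0])
    return [[sum(wk * f[y][x] for wk, f in zip(class_weights, features))
             for x in range(w)]
            for y in range(h)]
```

With a zero bias, the spatial mean of the class activation map of a class equals the score of that class, which is why the map can be read as a per-pixel contribution to the classification decision.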
The deep learning-based weakly supervised salient object detection method according to claim 1, wherein in step S3, the conditional random field model is used to process the saliency map S_anno generated in step S1 and to adjust the class activation map S_cam and the saliency map S_predict generated in step S2, so as to generate prediction maps with greater spatial consistency and better edge preservation, denoted as C_anno, C_cam and C_predict.
The deep learning-based weakly supervised salient object detection method according to claim 7, wherein in step S4, the label update strategy uses the class activation map for guidance, together with appropriate threshold decisions, to generate new saliency map pseudo-labels.
The method of claim 8, wherein the label update strategy is as follows:

If MAE(C_anno, C_predict) ≤ α, then [formula not reproduced in this text];

Otherwise, if MAE(C_anno, C_cam) > β and MAE(C_predict, C_cam) > β, the training sample is removed from the next training iteration;

Otherwise, if MAE(C_anno, C_cam) ≤ MAE(C_predict, C_cam), then [formula not reproduced in this text];

Otherwise, S_update = C_predict;

where MAE is the mean absolute error, CRF is the conditional random field algorithm, and α and β are preset thresholds.
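The decision logic of the label update strategy can be sketched as below, for illustration only. Two of the update formulas are not reproduced in this text, so the sketch fills them with explicitly labeled assumptions: the first branch applies the CRF to the pixel-wise average of S_anno and S_predict, and the third branch keeps C_anno by symmetry with the final branch. The function names, the `crf` callable and the threshold values in the usage lines are likewise hypothetical.

```python
from typing import Callable, List, Optional

Map = List[List[float]]


def mae(a: Map, b: Map) -> float:
    """Mean absolute error between two maps of identical size."""
    n = len(a) * len(a[0])
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb)) / n


def update_label(s_anno: Map, s_predict: Map, c_anno: Map, c_cam: Map,
                 c_predict: Map, alpha: float, beta: float,
                 crf: Callable[[Map], Map]) -> Optional[Map]:
    """Return the pseudo-label for the next iteration, or None to drop the sample."""
    if mae(c_anno, c_predict) <= alpha:
        # ASSUMPTION: the source formula is missing; here the CRF is applied
        # to the pixel-wise average of the annotation and the prediction.
        averaged = [[(p + q) / 2 for p, q in zip(ra, rb)]
                    for ra, rb in zip(s_anno, s_predict)]
        return crf(averaged)
    if mae(c_anno, c_cam) > beta and mae(c_predict, c_cam) > beta:
        return None  # both maps disagree with the class activation map
    if mae(c_anno, c_cam) <= mae(c_predict, c_cam):
        # ASSUMPTION: the source formula is missing; keep the refined annotation.
        return c_anno
    return c_predict


# Hypothetical usage with an identity function standing in for the CRF.
identity = lambda m: m
kept = update_label([[0.0, 0.0]], [[1.0, 1.0]], [[0.0, 0.0]], [[0.75, 0.75]],
                    [[1.0, 1.0]], alpha=0.1, beta=0.4, crf=identity)
```

In this usage the annotation and the prediction disagree strongly, and the class activation map lies closer to the prediction, so `kept` is the refined prediction map.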
A weakly supervised salient object detection system based on deep learning, characterized in that it comprises:

a saliency map generation unit, configured to generate a saliency map S_anno for each training image by using an unsupervised saliency detection method through a multi-task fully convolutional neural network;

a training unit, configured to use the saliency maps and the corresponding image-level category labels as the noisy supervision information of the initial iteration to train the multi-task fully convolutional neural network, and to generate a new class activation map S_cam and a salient object prediction map S_predict after the training process converges;

an adjustment unit, configured to adjust the class activation map and the salient object prediction map by using a conditional random field model;

an update unit, configured to update the saliency annotation information for the next iteration by using a label update strategy;

an iterative training unit, configured to repeat the training process of the training unit, the adjustment unit and the update unit over multiple iterations until the stop condition is met;

a second-stage training unit, configured to perform, after the above training has stopped, generalization training on a data set containing images of unknown categories to obtain the final model.