CN112651940B - Collaborative visual saliency detection method based on a dual-encoder generative adversarial network - Google Patents

Collaborative visual saliency detection method based on a dual-encoder generative adversarial network

Info

Publication number
CN112651940B
CN112651940B (application CN202011558989.XA)
Authority
CN
China
Prior art keywords
significance
encoder
training
saliency
group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011558989.XA
Other languages
Chinese (zh)
Other versions
CN112651940A (en)
Inventor
钱晓亮
成曦
岳伟超
赵艺芳
曾黎
程塨
姚西文
吴青娥
任航丽
刘向龙
王芳
刘玉翠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou University of Light Industry
Original Assignee
Zhengzhou University of Light Industry
Application filed by Zhengzhou University of Light Industry filed Critical Zhengzhou University of Light Industry
Priority to CN202011558989.XA priority Critical patent/CN112651940B/en
Publication of CN112651940A publication Critical patent/CN112651940A/en
Application granted granted Critical
Publication of CN112651940B publication Critical patent/CN112651940B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS; G06 COMPUTING, CALCULATING OR COUNTING
    • G06T 7/00 Image analysis; G06T 7/0002 Inspection of images, e.g. flaw detection
    • G06F 18/00 Pattern recognition; G06F 18/24 Classification techniques; G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/00 Computing arrangements based on biological models; G06N 3/02 Neural networks; G06N 3/045 Combinations of networks; G06N 3/08 Learning methods
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]


Abstract

The invention provides a collaborative visual saliency detection method based on a dual-encoder generative adversarial network, which comprises the following steps: constructing a dual-encoder generative adversarial network model and pre-training it; using the pre-trained parameters to initialize the generative adversarial network model; inputting a group of images of the co-saliency dataset into a classification network module, extracting multi-scale group-level image semantic category features, and fusing them into an inter-group saliency feature with a multi-scale semantic fusion module; inputting the grouped images one by one into a saliency encoder to obtain the single-image saliency feature of each image; adding each single-image saliency feature and the inter-group saliency feature pixel by pixel to obtain the co-saliency feature of each image, and feeding it into a decoder to obtain the detection map; finally, testing the trained generative adversarial network model on the test set of the co-saliency dataset. The invention has fewer model parameters, simple training and detection operations, higher detection accuracy and improved efficiency.

Description

Collaborative visual saliency detection method based on a dual-encoder generative adversarial network
Technical Field
The invention relates to the technical field of co-saliency detection, and in particular to a collaborative visual saliency detection method based on a dual-encoder generative adversarial network.
Background
With the continuous development of the internet and multimedia, a great deal of image and video data accompanies our daily lives, and quickly and effectively extracting useful information with existing multimedia technology has become very important. Co-saliency detection is a popular computer vision technique that simulates the human visual attention mechanism: given a group of correlated images that contain similar salient objects, it finds the salient object common to every image in the group. It can effectively extract the information people want and filter out redundant information in the images, thereby reducing storage requirements and improving computational efficiency.
Co-saliency detection has two key steps: extracting good single-image saliency features and mining the similarity among multiple images. Existing co-saliency methods can be divided into two categories: traditional hand-crafted methods and deep learning methods. Traditional hand-crafted methods obtain the inter-group similarity of images through manual features, but such features are strongly subjective and cannot capture the salient target well. Popular deep learning methods use neural network models to obtain deep features that describe images well and, with end-to-end models, better mine the similarity between images, which has considerably improved the accuracy of co-saliency detection. However, an end-to-end model requires a group of input images to mine the inter-group image similarity, so it needs a large amount of data, and the co-saliency detection effect is limited by the dataset labels.
Moreover, the features extracted by current end-to-end neural network models describe the whole image rather than the common salient target region, so the semantic consistency of the inter-group images (i.e., the inter-group saliency features) cannot be mined well.
Disclosure of Invention
Aiming at the technical problems of insufficient sample labels, poor inter-group saliency features and the inability to mine the semantic consistency of inter-group images well in traditional co-saliency detection methods, the invention provides a collaborative visual saliency detection method based on a dual-encoder generative adversarial network.
In order to achieve the above purpose, the technical scheme of the invention is realized as follows: a collaborative visual saliency detection method based on a dual-encoder generative adversarial network comprises the following steps:
Step one: constructing a dual-encoder generative adversarial network model: the dual-encoder generative adversarial network model comprises a generator and a discriminator, wherein the generator comprises two encoders and a decoder, the two encoders being a saliency encoder and an inter-group semantic encoder; the saliency encoder, the decoder and the discriminator form a saliency generative adversarial network;
Step two: pre-training: on the one hand, pre-training the saliency generative adversarial network module with an existing single-image saliency dataset; on the other hand, dividing the co-saliency dataset into two parts, a training set and a test set, and pre-training the classification network module of the inter-group semantic encoder with the class labels of the training set, thereby obtaining the pre-trained parameters of the saliency generative adversarial network module and of the classification network module.
Step three: performing cooperative significance training on the dual-coder generated confrontation network model by using a training set of the cooperative significance data set: using the pre-trained parameters in the step two as parameter settings for initializing the dual-encoder generation type confrontation network model; inputting a group of images in the collaborative significance data set into a classification network module, extracting multi-scale group-level image semantic category features by the classification network module, and fusing the multi-scale group-level image semantic category features into inter-group significance features by the multi-scale semantic fusion module; sequentially inputting each image of the grouped input images into a significance encoder to obtain a single significance characteristic, respectively carrying out pixel-level addition on the single significance characteristic and the inter-group significance characteristic to obtain a synergistic significance characteristic of each image, inputting the synergistic significance characteristic into a decoder to decode, generating a detection image of each image, and judging by a discriminator to form antagonistic training;
Step four: detecting with the generative adversarial network model obtained after the training of step three, using the test set of the co-saliency dataset, to realize co-saliency detection.
The generative adversarial network model in step one is built from fully convolutional networks, and the whole saliency generative adversarial network adopts a U-Net structure; the saliency encoder and the decoder of the saliency generative adversarial network form the U-Net structure and obtain more image information through short (skip) connections; the generator has 17 fully convolutional layers in total, of which the saliency encoder contributes 8 layers and the decoder 9 layers. The discriminator adopts a patch-level discrimination structure with 5 fully convolutional layers in total; it converts the whole image into a 28 × 28 map, which is compared element-wise against a 28 × 28 label matrix to compute the loss. The generator uses the F-loss and an l1-norm loss function, and the discriminator uses the F-loss function.
The inter-group semantic encoder consists of a classification network module and a multi-scale semantic fusion module. The classification network module adopts a Resnet50 model trained on ImageNet; in pre-training, its last fully connected layer is changed to the number of classes of the pre-training dataset, and it is optimized with BCE-loss. The multi-scale semantic fusion module splices the group features at each of the four scales, unifies their channels by convolution, resamples them to 28 × 28 × 512 and adds them pixel by pixel; its structure is given in Table 3 of the detailed description.
In the pre-training of step two, on the one hand, the saliency generative adversarial network module is trained on a public, popular saliency dataset. In the adversarial training of this module, the generator is first fixed and the discriminator is trained so that the discriminator parameters are updated; the discriminator is then fixed and the generator is trained so that the generator parameters are updated; this process is repeated, cyclically optimizing the generator and the discriminator, and the model parameters of the saliency generative adversarial network are finally determined. On the other hand, for the classification network module, the pre-training labels are the class labels of the training set, and the model parameters of the classification network module are determined after pre-training.
The loss function of the saliency generative adversarial network module is:

(G_S*, D_S*) = arg min_{G_S} max_{D_S} [ L_A(G_S, D_S) + λ·L_L1(G_S) ]

L_A(G_S, D_S) = F(D_S(x ⊕ y), Real) + F(D_S(x ⊕ G_S(x)), Fake)

L_L1(G_S) = E(||G_S(x) - y||_1)

wherein G_S and D_S denote the generator and the discriminator of the saliency generative adversarial network module, and G_S* and D_S* denote their parameters after training; D_S(·) and G_S(·) denote the outputs of the discriminator and of the generator; x and y denote the input image and its corresponding ground-truth map; F(·) and E(·) denote the F-loss function and the BCE-loss function; L_A(·) and L_L1(·) denote the adversarial F-loss term and the L1 loss term, respectively; the coefficient λ of the L1 loss is 100; ⊕ denotes the splicing (concatenation) operator along the channel dimension; Real and Fake denote label matrices of all ones and all zeros, respectively; and ||·||_1 denotes the l1 norm.
The generative adversarial network model in step three inherits all the pre-trained parameters and loss functions of step two, and the dual-encoder generative adversarial network model then performs the joint adversarial co-saliency training: firstly, a group of 5 images is input into the pre-trained classification network module to obtain the classification semantic features of each image at 4 scales, of sizes 56 × 56 × 256, 28 × 28 × 512, 14 × 14 × 1024 and 7 × 7 × 2048; the features of the group are spliced at each scale into 4 group-level features, which are input into the multi-scale semantic fusion module; through up-sampling, down-sampling, convolution and pixel-level addition inside the multi-scale semantic fusion module, a feature of uniform size 28 × 28 × 512 is obtained, which has robust class semantic consistency and serves as the inter-group saliency feature. The operation is specifically given by the following formulas:

f'_i(X) = f_i(x_1) ⊕ f_i(x_2) ⊕ … ⊕ f_i(x_N)

f_1(X) = Dn(Conv(f'_1(X)))
f_2(X) = Conv(f'_2(X))
f_3(X) = Up(Conv(f'_3(X)))
f_4(X) = Up(Conv(f'_4(X)))

F_inter(X) = f_1(X) + f_2(X) + f_3(X) + f_4(X)

wherein X = {x_1, x_2, … x_N}, with N = 5 the number of input images; f_i(x_j) denotes the feature of image x_j extracted by the classification network module at the i-th scale, with j ranging over {1,2,3,4,5}; the sizes of f_1(x_j)~f_4(x_j) are 56 × 56 × 256, 28 × 28 × 512, 14 × 14 × 1024 and 7 × 7 × 2048, respectively; f'_i(X) denotes the spliced group feature at the original i-th scale; after processing, f_1(X)~f_4(X) are all of size 28 × 28 × 512; Conv(·), Up(·), Dn(·) and + denote convolution, up-sampling, down-sampling and pixel-level addition, respectively; F_inter(X) denotes the finally obtained inter-group saliency feature.
In step three, each image in the group of 5 images is input one by one into the pre-trained saliency encoder to obtain its single-image saliency feature; the single-image saliency feature of each of the 5 images is added pixel by pixel to the inter-group saliency feature to obtain the co-saliency feature of each image, which is input into the decoder for the joint adversarial training of the dual-encoder generative adversarial network model. The loss function of the dual-encoder generative adversarial network is:

(G_TS*, D_TS*) = arg min_{G_TS} max_{D_TS} [ L_TA(G_TS, D_TS) + λ·L_TL1(G_TS) ]

L_TA(G_TS, D_TS) = F(D_TS(x_j ⊕ y_j), Real) + F(D_TS(x_j ⊕ G_TS(x_j, X)), Fake)

L_TL1(G_TS) = E(||G_TS(x_j, X) - y_j||_1)

wherein G_TS and D_TS denote the generator and the discriminator of the dual-encoder generative adversarial network model; G_TS* and D_TS* denote their parameters after training; L_TA(·) and L_TL1(·) denote the adversarial F-loss term and the L1 loss term of the dual-encoder generative adversarial network; y_j denotes the ground-truth map of the input image x_j.
In step four, the saliency encoder and the inter-group (group-level) semantic encoder finally obtained after the training of step three are used to detect the remaining 50% of the co-saliency dataset, i.e., the part whose class labels were not used to train the classification network module in step two, according to the formula:

M_co-saliency = G_TS*(x_j, X);

wherein M_co-saliency denotes the finally detected co-saliency result map, X denotes a group of related images, and G_TS*(·) denotes the output of the generator of the trained dual-encoder generative adversarial network.
Compared with the prior art, the invention has the following beneficial effects. A dual-encoder generative adversarial network model is first constructed and then trained progressively in two stages. In the first stage, part of the network is pre-trained, namely the saliency generative adversarial network module and the classification network module, so that the saliency encoder and the classification network module respectively acquire the ability to learn single-image saliency and to recognize class semantics. In the second stage, the parameters of the first stage are inherited; on the one hand, the classification network module is used to obtain multi-scale group-level class semantic features, which are input into the multi-scale semantic fusion module to obtain a better inter-group saliency feature; on the other hand, the saliency encoder of the saliency generative adversarial network module is used to obtain the single-image saliency features; the inter-group saliency feature and the single-image saliency features are then fused and fed into the decoder, and joint adversarial training is carried out. Finally, the two trained encoders are used for co-saliency detection. The method has fewer model parameters, simple training and detection operations, better universality, higher detection accuracy and higher efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of the present invention.
Fig. 2 is a comparison of the subjective results of the present invention and existing algorithms on the Cosal2015 database.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
As shown in fig. 1, a collaborative visual saliency detection method based on a dual-encoder generative adversarial network includes the following steps:
Step one: constructing a dual-encoder generative adversarial network model: the dual-encoder generative adversarial network model comprises a generator and a discriminator; the generator comprises two encoders and a decoder, the two encoders being a saliency encoder and an inter-group semantic encoder; the saliency encoder, the decoder and the discriminator form a saliency generative adversarial network.
According to the characteristics of the co-saliency detection task and of existing generative adversarial network models, the basic framework of the existing generative adversarial network is improved so as to build a model that suits co-saliency detection; the generative adversarial network model is based on the progressive training of a dual encoder.
The generative adversarial network model constructed by the invention comprises two parts: a generator and a discriminator. The generator comprises two encoders, a saliency encoder and an inter-group semantic encoder, and a decoder, and the whole network is built from fully convolutional networks. In addition, the generator part adopts the U-Net structure. For the loss function, the generator uses the F-loss and an l1-norm loss function, and the discriminator uses the F-loss function. The F-loss function follows the design of K. Zhao, S. Gao, W. Wang et al., "Optimizing the F-Measure for Threshold-Free Salient Object Detection," in Proc. IEEE Int. Conf. Comput. Vision, Oct. 2019, pp. 8849-8857.
The saliency encoder and the decoder constitute a U-Net structure, whose design concept is described in O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation," in Proc. Medical Image Computing and Computer-Assisted Intervention, 2015, pp. 234-241. In addition, several layers of the saliency encoder and decoder use dropout during training, with the dropout value set to 0.5, so that the model generalizes better; the specific settings are shown in Table 1. The generator of the saliency generative adversarial network module has 17 fully convolutional layers in total (8 in the saliency encoder and 9 in the decoder).
Table 1 Structural composition of the saliency encoder and decoder:
(The layer-by-layer configuration is given as a table image in the original publication and is not reproduced in this text; an illustrative sketch follows below.)
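Since the exact configuration of Table 1 is not reproduced here, the following PyTorch sketch only illustrates the kind of structure described above: an 8-layer fully convolutional saliency encoder whose output is a 28 × 28 × 512 single-image saliency feature, a 9-layer decoder with U-Net short connections and dropout 0.5, and an optional pixel-level addition of an inter-group feature at the bottleneck. The 224 × 224 input size, kernel sizes, channel widths and the placement of dropout are assumptions of this sketch, not the patented configuration.

```python
import torch
import torch.nn as nn

def enc_block(cin, cout, stride):
    # convolution + BatchNorm + LeakyReLU (an assumed block layout; Table 1 is not reproduced)
    return nn.Sequential(nn.Conv2d(cin, cout, 4 if stride == 2 else 3, stride, 1),
                         nn.BatchNorm2d(cout), nn.LeakyReLU(0.2, inplace=True))

def dec_block(cin, cout, up=False, dropout=False):
    conv = nn.ConvTranspose2d(cin, cout, 4, 2, 1) if up else nn.Conv2d(cin, cout, 3, 1, 1)
    layers = [conv, nn.BatchNorm2d(cout), nn.ReLU(inplace=True)]
    if dropout:
        layers.append(nn.Dropout(0.5))          # dropout value 0.5 as stated in the text
    return nn.Sequential(*layers)

class SaliencyGenerator(nn.Module):
    """Sketch: 8-layer saliency encoder + 9-layer decoder with U-Net skip connections."""
    def __init__(self):
        super().__init__()
        # encoder: 224x224x3 -> 28x28x512 (the single-image saliency feature)
        self.e1 = enc_block(3, 64, 2)      # 112x112
        self.e2 = enc_block(64, 128, 2)    # 56x56
        self.e3 = enc_block(128, 256, 2)   # 28x28
        self.e4 = enc_block(256, 512, 1)
        self.e5 = enc_block(512, 512, 1)
        self.e6 = enc_block(512, 512, 1)
        self.e7 = enc_block(512, 512, 1)
        self.e8 = enc_block(512, 512, 1)   # 28x28x512 bottleneck
        # decoder: 9 layers, 28x28 -> 224x224x1
        self.d1 = dec_block(512, 512, dropout=True)
        self.d2 = dec_block(1024, 512, dropout=True)
        self.d3 = dec_block(1024, 512, dropout=True)
        self.d4 = dec_block(1024, 512)
        self.d5 = dec_block(1024, 256)
        self.d6 = dec_block(512, 128, up=True)   # 56x56
        self.d7 = dec_block(256, 64, up=True)    # 112x112
        self.d8 = dec_block(128, 64, up=True)    # 224x224
        self.d9 = nn.Sequential(nn.Conv2d(64, 1, 3, 1, 1), nn.Sigmoid())

    def forward(self, x, inter_feat=None):
        f1 = self.e1(x); f2 = self.e2(f1); f3 = self.e3(f2)
        f4 = self.e4(f3); f5 = self.e5(f4); f6 = self.e6(f5)
        f7 = self.e7(f6); f8 = self.e8(f7)
        if inter_feat is not None:               # pixel-level addition with the inter-group feature
            f8 = f8 + inter_feat
        d = self.d1(f8)
        d = self.d2(torch.cat([d, f7], 1))
        d = self.d3(torch.cat([d, f6], 1))
        d = self.d4(torch.cat([d, f5], 1))
        d = self.d5(torch.cat([d, f4], 1))
        d = self.d6(torch.cat([d, f3], 1))
        d = self.d7(torch.cat([d, f2], 1))
        d = self.d8(torch.cat([d, f1], 1))
        return self.d9(d)                         # 224x224 saliency detection map
```

During single-image saliency pre-training the inter-group feature is simply omitted (inter_feat=None); in step three the fused 28 × 28 × 512 inter-group feature is added at the bottleneck.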
the discriminator adopts a patch-level discrimination structure, which has a total of 5 full convolution layers, the size of the patch is 28 × 28, the whole Image is converted into the size, and the size is compared with each element on a 28 × 28 label matrix for loss, the design idea of the patch-level discrimination structure is referred to as p.isola, j.
Table 2 Structural composition of the discriminator
(The layer-by-layer configuration is given as a table image in the original publication and is not reproduced in this text; an illustrative sketch follows below.)
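As with Table 1, only the facts stated in the text are kept in this sketch: a fully convolutional, 5-layer patch-level discriminator whose output is a 28 × 28 score map that is compared element-wise against an all-ones (Real) or all-zeros (Fake) label matrix. Treating the input as the image spliced channel-wise with a saliency map, the 224 × 224 resolution and the kernel/channel choices are assumptions.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Sketch of the 5-layer patch-level discriminator producing a 28x28 score map."""
    def __init__(self, in_ch=4):          # RGB image (3) + saliency map (1), spliced channel-wise
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, 2, 1), nn.LeakyReLU(0.2, True),                      # 224 -> 112
            nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2, True),   # 112 -> 56
            nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2, True),  # 56 -> 28
            nn.Conv2d(256, 512, 3, 1, 1), nn.BatchNorm2d(512), nn.LeakyReLU(0.2, True),  # 28 -> 28
            nn.Conv2d(512, 1, 3, 1, 1), nn.Sigmoid(),                                    # 28x28 patch scores
        )

    def forward(self, image, saliency_map):
        # each element of the 28x28 output is compared against the Real / Fake label matrix
        return self.net(torch.cat([image, saliency_map], dim=1))
```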
For the construction of the inter-group semantic encoder: it consists of a classification network module and a multi-scale semantic fusion module. The classification network module adopts a Resnet50 model trained on ImageNet; during pre-training of the network, the last fully connected layer is changed to the number of classes of the pre-training dataset, and BCE-loss is used to optimize the model. The specific operation of the multi-scale semantic fusion module is shown in Table 3.
Table 3 Structural composition of the multi-scale semantic fusion module
Input (four spliced group-level features): 56 × 56 × 1280, 28 × 28 × 2560, 14 × 14 × 5120, 7 × 7 × 10240
Convolution (channel unification): 56 × 56 × 256, 28 × 28 × 256, 14 × 14 × 256, 7 × 7 × 256
Down-/up-sampling: four features of 28 × 28 × 512
Pixel-level addition: one fused feature of 28 × 28 × 512
Step two: pre-training: on the one hand, the saliency generative adversarial network module is pre-trained with existing single-image saliency datasets; on the other hand, the co-saliency dataset is divided into two parts, a training set and a test set, and the classification network module is pre-trained with the class labels of the training set; the pre-trained parameters of the two network modules are thus obtained.
The saliency generative adversarial network module is pre-trained with existing single-image saliency datasets, namely the public, popular saliency datasets HKU-IS and PASCAL1500, so that the saliency encoder of the module acquires the ability to perform single-image saliency detection. In the adversarial training process, the generator is first fixed and the discriminator is trained, so that the discriminator parameters are updated; the discriminator is then fixed and the generator is trained, so that the generator parameters are updated; these operations are repeated continuously, optimizing the generator (saliency encoder and decoder) and the discriminator. After continuous parameter tuning, the learning rate, number of training epochs and batch size are set to 0.0002, 300 and 1, which gives a good detection effect. The loss function of the saliency generative adversarial network module can be expressed as:
(G_S*, D_S*) = arg min_{G_S} max_{D_S} [ L_A(G_S, D_S) + λ·L_L1(G_S) ]

L_A(G_S, D_S) = F(D_S(x ⊕ y), Real) + F(D_S(x ⊕ G_S(x)), Fake)

L_L1(G_S) = E(||G_S(x) - y||_1)

wherein G_S and D_S denote the generator and the discriminator of the saliency generative adversarial network module, and G_S* and D_S* denote their parameters after training; D_S(·) and G_S(·) denote the outputs of the discriminator and of the generator; x and y denote the input image and its corresponding ground-truth map; F(·) and E(·) denote the F-loss function and the BCE-loss function; L_A(·) and L_L1(·) denote the adversarial F-loss term and the L1 loss term, respectively; the coefficient λ of the L1 loss is 100; ⊕ denotes the splicing (concatenation) operator along the channel dimension; Real and Fake denote label matrices of all ones and all zeros, respectively; and ||·||_1 denotes the l1 norm.
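A minimal PyTorch sketch of this alternating pre-training loop is given below, using the generator and discriminator sketches above. Because the F-loss of Zhao et al. is only referenced, not defined, in this text, a standard binary cross-entropy term is substituted for the adversarial F-loss here; the Adam optimizer and its betas are likewise assumptions, while the learning rate 0.0002, 300 epochs, batch size 1 and λ = 100 follow the text.

```python
import torch
import torch.nn as nn
import torch.optim as optim

def pretrain_saliency_gan(G, D, loader, device, epochs=300, lr=2e-4, lam=100.0):
    """Alternating optimization: fix G and update D, then fix D and update G."""
    adv = nn.BCELoss()   # stand-in for the F-loss adversarial term (assumption)
    l1 = nn.L1Loss()
    opt_g = optim.Adam(G.parameters(), lr=lr, betas=(0.5, 0.999))
    opt_d = optim.Adam(D.parameters(), lr=lr, betas=(0.5, 0.999))
    G.to(device).train(); D.to(device).train()
    for _ in range(epochs):
        for x, y in loader:                       # image x and its ground-truth saliency map y
            x, y = x.to(device), y.to(device)
            real = torch.ones(x.size(0), 1, 28, 28, device=device)   # all-ones label matrix
            fake = torch.zeros_like(real)                            # all-zeros label matrix

            # (1) fix the generator, train the discriminator
            with torch.no_grad():
                y_hat = G(x)
            loss_d = adv(D(x, y), real) + adv(D(x, y_hat), fake)
            opt_d.zero_grad(); loss_d.backward(); opt_d.step()

            # (2) fix the discriminator, train the generator: adversarial term + 100 * L1 term
            y_hat = G(x)
            loss_g = adv(D(x, y_hat), real) + lam * l1(y_hat, y)
            opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```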
In order to better extract the consistency of the inter-group image features (the inter-group saliency features) in subsequent training, the essential property of co-saliency datasets is fully exploited: each group of co-saliency images shares the same class. For the classification network module, the classification network used is the Resnet-50 model trained on ImageNet. For the data, 50% of the co-saliency sets (iCoseg and Cosal2015), using only their class labels, is taken for pre-training, and before pre-training the last layer is modified to the number of classes of this training set. The classification network module trained on co-saliency classes thus acquires the ability to recognize the classes of the dataset and a good class-semantic discrimination ability, and it prepares for obtaining good semantic consistency in the next training stage. After tuning, the classification precision is good when the learning rate, number of training iterations and batch size of the classification network module are set to 0.0002, 1000 and 8, respectively.
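A sketch of this pre-training step, assuming torchvision's ImageNet-pretrained ResNet-50 and one-hot class targets for the BCE-loss (the target encoding and optimizer are assumptions; only the replaced fully connected layer, the BCE optimization and the hyperparameters 0.0002 / 1000 iterations / batch size 8 come from the text):

```python
import torch
import torch.nn as nn
from torchvision import models

def build_classifier(num_classes):
    """ImageNet-pretrained ResNet-50 with its last fully connected layer replaced."""
    net = models.resnet50(pretrained=True)
    net.fc = nn.Linear(net.fc.in_features, num_classes)
    return net

def pretrain_classifier(net, loader, device, lr=2e-4, iters=1000):
    """loader yields (image, class_index) batches of size 8, per the text."""
    crit = nn.BCEWithLogitsLoss()                    # BCE-loss on one-hot class targets
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    net.to(device).train()
    done = 0
    while done < iters:
        for img, label in loader:
            target = torch.eye(net.fc.out_features, device=device)[label.to(device)]
            loss = crit(net(img.to(device)), target)
            opt.zero_grad(); loss.backward(); opt.step()
            done += 1
            if done >= iters:
                break
```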
Step three: performing co-saliency training of the whole dual-encoder generative adversarial network model with the co-saliency training set: the pre-trained parameters of step two are used to initialize the dual-encoder generative adversarial network model; a group of 5 images of the training set is input into the classification network module, which extracts multi-scale group-level image semantic category features, and the multi-scale semantic fusion module fuses them into an inter-group saliency feature; each image of the group is input in turn into the saliency encoder to obtain its single-image saliency feature; each single-image saliency feature is added pixel by pixel to the inter-group saliency feature to obtain the co-saliency feature of each image, which is input into the decoder and decoded to generate the detection map of each image, judged by the discriminator.
Parameter adjustment: for the parameter tuning of the dual-encoder generative adversarial network, the training starts from the following setting: except for the multi-scale fusion module, whose parameters are randomly initialized, all other modules use the parameters obtained from the first-stage training. Training then proceeds by first fixing the parameters of the generator, which contains the two encoders, and optimizing and updating the parameters of the discriminator; then the parameters of the discriminator are fixed and the parameters of the generator containing the two encoders are optimized and updated. These two processes are repeated continuously to obtain the final parameters of the dual-encoder generative adversarial network. Note that step two uses the class labels of the co-saliency training set to train the classification network, whereas step three uses the ground-truth (pixel-level) labels of the co-saliency training set: two different types of labels of the same dataset are used. A sketch of this parameter setup is given below.
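The sketch below illustrates this initialization and the alternating freeze/update pattern; the module arguments, checkpoint file names and state-dict keys are placeholders of this sketch, not names from the patent.

```python
import torch
import torch.nn as nn

def set_requires_grad(module: nn.Module, flag: bool):
    """Freeze or unfreeze a module when alternating generator / discriminator updates."""
    for p in module.parameters():
        p.requires_grad = flag

def init_stage_two(saliency_encoder, decoder, discriminator, classifier,
                   gan_ckpt="stage1_saliency_gan.pth", cls_ckpt="stage1_classifier.pth"):
    """Inherit the stage-one parameters; the multi-scale semantic fusion module is the
    only part left at its random initialization."""
    ckpt = torch.load(gan_ckpt)
    saliency_encoder.load_state_dict(ckpt["encoder"])
    decoder.load_state_dict(ckpt["decoder"])
    discriminator.load_state_dict(ckpt["discriminator"])
    classifier.load_state_dict(torch.load(cls_ckpt))

# one alternating round during joint training:
#   set_requires_grad(generator, False); set_requires_grad(discriminator, True)
#   ... back-propagate the discriminator loss ...
#   set_requires_grad(generator, True);  set_requires_grad(discriminator, False)
#   ... back-propagate the generator loss ...
```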
The generative adversarial network model of the invention inherits all the parameters and loss-function settings (L1-loss and F-loss) of the pre-training of step two, and the dual-encoder generative adversarial network then performs the joint adversarial co-saliency training. Firstly, a group of images (5 images) is input into the pre-trained classification network module to obtain the discriminative classification semantic features of each image at 4 scales, of sizes 56 × 56 × 256, 28 × 28 × 512, 14 × 14 × 1024 and 7 × 7 × 2048; the features of the group are spliced at each scale into 4 group-level features, which are then input into the multi-scale semantic fusion module; through up-sampling, down-sampling, convolution and pixel-level addition inside the multi-scale semantic fusion module, a feature of uniform size 28 × 28 × 512 is obtained, which has robust class semantic consistency and serves as the inter-group saliency feature. The operation is specifically given by the following formulas:

f'_i(X) = f_i(x_1) ⊕ f_i(x_2) ⊕ … ⊕ f_i(x_N)

f_1(X) = Dn(Conv(f'_1(X)))
f_2(X) = Conv(f'_2(X))
f_3(X) = Up(Conv(f'_3(X)))
f_4(X) = Up(Conv(f'_4(X)))

F_inter(X) = f_1(X) + f_2(X) + f_3(X) + f_4(X)

wherein X = {x_1, x_2, … x_N}, with N = 5 the number of input images; f_i(x_j) denotes the feature of image x_j extracted by the classification network module at the i-th scale, with j ranging over {1,2,3,4,5}; the sizes of f_1(x_j)~f_4(x_j) are 56 × 56 × 256, 28 × 28 × 512, 14 × 14 × 1024 and 7 × 7 × 2048, respectively; f'_i(X) denotes the spliced group feature at the original i-th scale; after processing, f_1(X)~f_4(X) are all of size 28 × 28 × 512; Conv(·), Up(·), Dn(·) and + denote convolution, up-sampling, down-sampling and pixel-level addition, respectively; F_inter(X) denotes the finally obtained inter-group saliency feature.
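The formulas above can be sketched in PyTorch as follows. The helper that collects the four ResNet-50 stage features of a group and splices them channel-wise, the 3 × 3 convolutions, average-pool down-sampling and nearest-neighbour up-sampling are assumptions of this sketch; only the scale/channel sizes and the Conv/Up/Dn/pixel-addition pattern come from the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def group_features(resnet, images):
    """Run the classification module on a group of N images (N=5) and splice the
    features of each ResNet-50 stage along the channel dimension (the ⊕ above)."""
    x = resnet.maxpool(resnet.relu(resnet.bn1(resnet.conv1(images))))
    f1 = resnet.layer1(x)     # N x 256  x 56 x 56
    f2 = resnet.layer2(f1)    # N x 512  x 28 x 28
    f3 = resnet.layer3(f2)    # N x 1024 x 14 x 14
    f4 = resnet.layer4(f3)    # N x 2048 x 7  x 7
    return [f.flatten(0, 1).unsqueeze(0) for f in (f1, f2, f3, f4)]   # 1 x (C_i*N) x S_i x S_i

class MultiScaleSemanticFusion(nn.Module):
    """Sketch of F_inter(X): convolve each spliced group feature, resample to 28x28x512,
    and add the four results pixel by pixel."""
    def __init__(self, group_size=5):
        super().__init__()
        in_chs = [256 * group_size, 512 * group_size, 1024 * group_size, 2048 * group_size]
        self.convs = nn.ModuleList(nn.Conv2d(c, 512, 3, padding=1) for c in in_chs)
        self.down = nn.AvgPool2d(2)                                   # Dn: 56x56 -> 28x28

    def forward(self, feats):
        f1 = self.down(self.convs[0](feats[0]))                       # Dn(Conv(f'_1))
        f2 = self.convs[1](feats[1])                                  # Conv(f'_2)
        f3 = F.interpolate(self.convs[2](feats[2]), scale_factor=2)   # Up(Conv(f'_3))
        f4 = F.interpolate(self.convs[3](feats[3]), scale_factor=4)   # Up(Conv(f'_4))
        return f1 + f2 + f3 + f4                                      # F_inter(X): 1 x 512 x 28 x 28
```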
The invention uses the inter-group semantic encoder to extract features with good inter-group image semantic consistency (i.e., the inter-group saliency features): the grouped co-saliency images (5 images form a group of related images) are input into the pre-trained classification network module to obtain multi-scale group-level image semantic classification features, and the multi-scale semantic fusion module fuses them to generate a feature with inter-group image semantic consistency. Each co-saliency image of the group is in turn input into the saliency encoder, which inherits all the parameters of the encoder of the pre-trained saliency generative adversarial network module, so that the network can better extract the single-image saliency feature of each co-saliency image.
At the same time, each image in the group of 5 images is input one by one into the pre-trained saliency encoder to obtain its single-image saliency feature; the single-image saliency feature of each of the 5 images is added pixel by pixel to the inter-group saliency feature to obtain the co-saliency feature of each image, which is input into the decoder, and the dual-encoder generative adversarial network carries out the joint adversarial co-saliency training effectively. The loss function of the dual-encoder generative adversarial network is similar to that of the saliency generative adversarial network module in step two and can be expressed as:

(G_TS*, D_TS*) = arg min_{G_TS} max_{D_TS} [ L_TA(G_TS, D_TS) + λ·L_TL1(G_TS) ]

L_TA(G_TS, D_TS) = F(D_TS(x_j ⊕ y_j), Real) + F(D_TS(x_j ⊕ G_TS(x_j, X)), Fake)

L_TL1(G_TS) = E(||G_TS(x_j, X) - y_j||_1)

wherein G_TS and D_TS denote the generator and the discriminator of the dual-encoder generative adversarial network model; G_TS* and D_TS* denote their parameters after training; L_TA(·) and L_TL1(·) denote the adversarial F-loss term and the L1 loss term of the dual-encoder generative adversarial network; y_j denotes the ground-truth map of the input image x_j.
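One joint forward pass over a group, combining the illustrative pieces above (SaliencyGenerator, group_features, MultiScaleSemanticFusion), might look as follows; all names are taken from those sketches, not from the patent, and the training step then applies the loss above exactly as in the pre-training loop.

```python
import torch

def cosaliency_forward(generator, classifier, fusion, group_images):
    """group_images: tensor of shape (5, 3, 224, 224) holding one group of related images."""
    f_inter = fusion(group_features(classifier, group_images))       # 1 x 512 x 28 x 28
    maps = []
    for x_j in group_images:                                         # images are fed one by one
        # inside the generator the bottleneck (single-image saliency feature) is added
        # pixel by pixel to f_inter, and the decoder produces the detection map of x_j
        maps.append(generator(x_j.unsqueeze(0), inter_feat=f_inter))
    return torch.cat(maps, dim=0)                                    # 5 x 1 x 224 x 224
```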
Step four: detecting with the generative adversarial network model obtained after the training of step three, using the other part of the co-saliency dataset, to realize co-saliency detection.
The saliency encoder and the group-level (inter-group) semantic encoder finally obtained after the training of step three are used to detect the remaining 50% of the co-saliency datasets (iCoseg and Cosal2015), i.e., the part whose class labels were not used in step two to train the classification network module. The formula can be expressed as:

M_co-saliency = G_TS*(x_j, X)    (4)

wherein M_co-saliency denotes the finally detected co-saliency result map, X denotes a group of related images, and G_TS*(·) denotes the output of the generator of the trained dual-encoder generative adversarial network. Detecting a group of images of the same class that share co-salient objects amounts to completing one co-saliency detection task.
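A minimal inference sketch for this step, reusing cosaliency_forward from above (the group tensors, device handling and result collection are assumptions of the sketch):

```python
import torch

@torch.no_grad()
def detect(generator, classifier, fusion, test_groups, device):
    """test_groups: iterable of (5, 3, 224, 224) tensors, one per group of related images."""
    generator.eval(); classifier.eval(); fusion.eval()
    results = []
    for group in test_groups:
        maps = cosaliency_forward(generator, classifier, fusion, group.to(device))
        results.append(maps.cpu())        # the co-saliency maps M_co-saliency of the group
    return results
```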
Hardware configuration used to implement the invention: a workstation with two Intel(R) Xeon E5-2650 v4 CPUs (12 cores each, 2.2 GHz), 8 NVIDIA TITAN RTX GPUs (24 GB each) and 512 GB of memory was used for the tests. Software platform configuration: Ubuntu 16.04, Python 3.6.5, PyTorch 0.4.1.
In addition, to better demonstrate the performance and efficiency of the invention, subjective comparisons and per-image detection-time comparisons were carried out on the existing popular public dataset Cosal2015. Eight algorithms are compared, namely ESMG, CBCS, AUW, SACS, SACS-R, LDAW, GW and RCAN, of which four (ESMG, CBCS, SACS and SACS-R) have publicly released code. Under the same hardware configuration, the detection-time comparison experiment was carried out against these four public-code algorithms, and the results are shown in Table 4.
Table 4 Comparison of per-image detection time on the Cosal2015 dataset with 4 popular algorithms

Comparison algorithm    Code      Detection time per image (s)
SACS-R                  MATLAB    8.873
SACS                    MATLAB    2.652
ESMG                    MATLAB    1.723
CBCS                    MATLAB    1.688
The invention           Python    0.785
According to the experimental comparisons of fig. 2 and Table 4, the co-saliency prediction maps of the invention in fig. 2 are closer to the corresponding ground-truth maps and clearly better than the prediction maps of the other algorithms; in addition, in the per-image detection-time comparison with the public-code algorithms in Table 4, the detection time of the invention is the shortest. The performance and effect of the invention are therefore better than those of the other popular algorithms.
The invention adopts a generative adversarial network with two encoders based on a progressive training scheme; the first-stage pre-training makes full use of the labels of single-image saliency data and of the class labels of the co-saliency data, which alleviates the problem of insufficient data labels for an end-to-end model. On the one hand, existing single-image saliency datasets are fully used to pre-train the saliency generative adversarial network module, providing it with well-initialized parameters so that the saliency encoder has a better ability to extract single-image saliency features; on the other hand, the class labels of the co-saliency dataset are fully used to train the classification network module, so that it has a better ability to recognize class semantics. In the second stage, the co-saliency training, the invention inherits all the parameters learned in the previous stage, proposes a multi-scale semantic fusion module that integrates the multi-scale group-level semantic features of the classification network module into an inter-group saliency feature with robust class semantic consistency, obtains the single-image saliency feature of each image with the saliency encoder, and then inputs the inter-group saliency feature and the single-image saliency features into the decoder for joint adversarial training. Meanwhile, the training and detection of the invention are simple, the model has fewer parameters and better universality, and the detection accuracy and efficiency are higher.
The inter-group semantic encoder, which comprises the classification network module and the multi-scale semantic fusion module, is trained progressively. In the first stage, the classification network module is pre-trained with the co-saliency dataset using only its class labels, and this class training gives the classification network module the ability to recognize the classes of the co-saliency dataset (i.e., the class semantic features of the common target). In the second stage, the grouped co-saliency images are first input into the pre-trained classification module to obtain multi-scale group-level image category semantic features; the multi-scale semantic fusion module then fuses them and finally generates a feature with good inter-group image semantic consistency, with which the co-saliency training is carried out.
Alleviating the shortage of co-saliency ground-truth samples involves two aspects, determined by the two key characteristics of co-saliency: one is the single-image saliency feature and the other is the inter-group image saliency feature. Therefore, on the one hand, the saliency generative adversarial network module is pre-trained with existing single-image saliency datasets so that the saliency encoder can extract better saliency features; on the other hand, the essential property of co-saliency datasets is that each group of co-saliency images belongs to the same class, so the classification network module is pre-trained with the co-saliency dataset using only its class labels, giving the class-trained classification network module the ability to recognize the classes of the dataset and laying the foundation for extracting the inter-group image semantic consistency in the next stage. In the second stage, the co-saliency training, the same data used to pre-train the classification network module is used again with its ground-truth map labels, and the grouped images are input to the inter-group semantic encoder so that features with good inter-group image semantic consistency (the inter-group image saliency features) are extracted.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (7)

1. A collaborative visual saliency detection method based on a dual-encoder generative adversarial network, characterized by comprising the following steps:
Step one: constructing a dual-encoder generative adversarial network model: the dual-encoder generative adversarial network model comprises a generator and a discriminator, wherein the generator comprises two encoders and a decoder, the two encoders being a saliency encoder and an inter-group semantic encoder; the saliency encoder, the decoder and the discriminator form a saliency generative adversarial network;
Step two: pre-training: on the one hand, pre-training the saliency generative adversarial network module with an existing single-image saliency dataset; on the other hand, dividing the co-saliency dataset into two parts, a training set and a test set, and pre-training the classification network module of the inter-group semantic encoder with the class labels of the training set, thereby obtaining the pre-trained parameters of the saliency generative adversarial network module and of the classification network module;
Step three: performing co-saliency training of the dual-encoder generative adversarial network model with the training set of the co-saliency dataset: using the pre-trained parameters of step two to initialize the dual-encoder generative adversarial network model; inputting a group of images of the co-saliency dataset into the classification network module, which extracts multi-scale group-level image semantic category features, and fusing them into an inter-group saliency feature with the multi-scale semantic fusion module; inputting each image of the group into the saliency encoder in turn to obtain its single-image saliency feature; adding each single-image saliency feature and the inter-group saliency feature pixel by pixel to obtain the co-saliency feature of each image; inputting the co-saliency feature into the decoder to generate the detection map of each image, which is judged by the discriminator to form the adversarial training;
the loss function of the significance generation countermeasure network module is:
Figure FDA0003204871180000011
Figure FDA0003204871180000012
LL1(GS)=E(||GS(x)-y||1)
wherein G isSAnd DSGenerators and discriminators, G, representing significance generating countermeasure network modules, respectivelySSum DSRepresenting parameters obtained after the generator and the discriminator are trained; dS(. and G)S(. cndot.) respectively represents the output of the generator of the significance generating confrontation network module and the output of the discriminator, x and y respectively represent the input image and the corresponding true value graph, and F (-) and E (-) respectively represent the F-loss function and the BCE-loss function; l isA(. and L)L1(. cndot.) denotes the opposing F-loss function and the L1 loss function, respectively; the magnitude of the coefficient lambda of L1-loss is 100,
Figure FDA0003204871180000013
representing a splice operator along the path; real and Fake, respectivelyA label matrix representing all 1 s and all 0 s; i | · | purple wind1Is represented by1A norm;
Step four: detecting with the generative adversarial network model obtained after the training of step three, using the test set of the co-saliency dataset, to realize co-saliency detection.
2. The collaborative visual saliency detection method based on a dual-encoder generative adversarial network of claim 1, characterized in that the generative adversarial network model in step one is built from fully convolutional networks, and the whole saliency generative adversarial network adopts a U-Net structure; the saliency encoder and the decoder of the saliency generative adversarial network form the U-Net structure and obtain more image information through short connections; the generator has 17 fully convolutional layers in total, of which the saliency encoder comprises 8 layers and the decoder 9 layers; the discriminator adopts a patch-level discrimination structure with 5 fully convolutional layers in total, converts the whole image into a 28 × 28 map and compares it element-wise against each element of a 28 × 28 label matrix to compute the loss; the generator uses the F-loss and an l1-norm loss function, and the discriminator uses the F-loss function.
3. The collaborative visual saliency detection method based on a dual-encoder generative adversarial network of claim 1 or 2, characterized in that the inter-group semantic encoder consists of a classification module and a multi-scale semantic fusion module, wherein the classification network module adopts a Resnet50 model trained on ImageNet, its last fully connected layer is changed during pre-training to the number of classes of the pre-training dataset, and it is optimized with BCE-loss; the structure of the multi-scale semantic fusion module is as follows: the four features of different scales output by the classification network module, of sizes 56 × 56 × 256 × 5, 28 × 28 × 512 × 5, 14 × 14 × 1024 × 5 and 7 × 7 × 2048 × 5, are respectively spliced as input to obtain four features of different scales of sizes 56 × 56 × 1280, 28 × 28 × 2560, 14 × 14 × 5120 and 7 × 7 × 10240; convolution operations are then respectively carried out to unify the number of channels, giving four features of different scales of sizes 56 × 56 × 256, 28 × 28 × 256, 14 × 14 × 256 and 7 × 7 × 256; down-sampling or up-sampling operations are then respectively carried out to obtain four features of 28 × 28 × 512; finally, the four 28 × 28 × 512 features are added at the pixel level to obtain one 28 × 28 × 512 feature.
4. The collaborative visual saliency detection method based on a dual-encoder generative adversarial network of claim 3, characterized in that in the pre-training of step two, on the one hand, for the saliency generative adversarial network module, the training dataset is a public, popular saliency dataset; in the adversarial training of the saliency generative adversarial network module, the generator is first fixed and the discriminator is trained so that the discriminator parameters are updated; the discriminator is then fixed and the generator is trained so that the generator parameters are updated; this process is repeated, cyclically optimizing the generator and the discriminator, and the model parameters of the saliency generative adversarial network are finally determined; on the other hand, for the classification network module, the pre-training labels are the class labels of the training set, and the model parameters of the classification network module are determined after pre-training.
5. The collaborative visual saliency detection method based on a dual-encoder generative adversarial network of claim 4, characterized in that the generative adversarial network model in step three inherits all the parameters and loss-function settings of the pre-training of step two, and the dual-encoder generative adversarial network model then performs the joint adversarial co-saliency training: firstly, a group of 5 images is input into the pre-trained classification network module to obtain the classification semantic features of each image at 4 scales, of sizes 56 × 56 × 256, 28 × 28 × 512, 14 × 14 × 1024 and 7 × 7 × 2048; the features of the group are spliced at each scale into 4 group-level features, which are input into the multi-scale semantic fusion module; through up-sampling, down-sampling, convolution and pixel-level addition inside the multi-scale semantic fusion module, a feature of uniform size 28 × 28 × 512 is obtained, which has robust class semantic consistency and serves as the inter-group saliency feature, specifically according to the following formulas:

f'_i(X) = f_i(x_1) ⊕ f_i(x_2) ⊕ … ⊕ f_i(x_N)

f_1(X) = Dn(Conv(f'_1(X)))
f_2(X) = Conv(f'_2(X))
f_3(X) = Up(Conv(f'_3(X)))
f_4(X) = Up(Conv(f'_4(X)))

F_inter(X) = f_1(X) + f_2(X) + f_3(X) + f_4(X)

wherein X = {x_1, x_2, … x_N}, with N = 5 the number of input images; f_i(x_j) denotes the feature of image x_j extracted by the classification network module at the i-th scale, with j ranging over {1,2,3,4,5}; the sizes of f_1(x_j)~f_4(x_j) are 56 × 56 × 256, 28 × 28 × 512, 14 × 14 × 1024 and 7 × 7 × 2048, respectively; f'_i(X) denotes the spliced group feature at the original i-th scale; after processing, f_1(X)~f_4(X) are all of size 28 × 28 × 512; Conv(·), Up(·), Dn(·) and + denote convolution, up-sampling, down-sampling and pixel-level addition, respectively; F_inter(X) denotes the finally obtained inter-group saliency feature.
6. The collaborative visual saliency detection method based on a dual-encoder generative adversarial network of claim 5, characterized in that in step three each image in the group of 5 images is input one by one into the pre-trained saliency encoder to obtain its single-image saliency feature; the single-image saliency feature of each of the 5 images is added pixel by pixel to the inter-group saliency feature to obtain the co-saliency feature of each image, which is input into the decoder for the joint adversarial training of the dual-encoder generative adversarial network model; the loss function of the dual-encoder generative adversarial network is:

(G_TS*, D_TS*) = arg min_{G_TS} max_{D_TS} [ L_TA(G_TS, D_TS) + λ·L_TL1(G_TS) ]

L_TA(G_TS, D_TS) = F(D_TS(x_j ⊕ y_j), Real) + F(D_TS(x_j ⊕ G_TS(x_j, X)), Fake)

L_TL1(G_TS) = E(||G_TS(x_j, X) - y_j||_1)

wherein G_TS and D_TS denote the generator and the discriminator of the dual-encoder generative adversarial network model; G_TS* and D_TS* denote their parameters after training; L_TA(·) and L_TL1(·) denote the adversarial F-loss term and the L1 loss term of the dual-encoder generative adversarial network; y_j denotes the ground-truth map of the input image x_j.
7. The collaborative visual saliency detection method based on a dual-encoder generative adversarial network of claim 6, characterized in that in step four the saliency encoder and the group-level semantic encoder finally obtained after the training of step three are used to detect the remaining 50% of the co-saliency dataset, i.e., the part whose class labels were not used in step two to train the classification network module, according to the formula:

M_co-saliency = G_TS*(x_j, X);

wherein M_co-saliency denotes the finally detected co-saliency result map, X denotes a group of related images, and G_TS*(·) denotes the output of the generator of the trained dual-encoder generative adversarial network.
CN202011558989.XA 2020-12-25 2020-12-25 Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network Active CN112651940B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011558989.XA CN112651940B (en) 2020-12-25 2020-12-25 Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011558989.XA CN112651940B (en) 2020-12-25 2020-12-25 Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network

Publications (2)

Publication Number Publication Date
CN112651940A CN112651940A (en) 2021-04-13
CN112651940B true CN112651940B (en) 2021-09-17

Family

ID=75362887

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011558989.XA Active CN112651940B (en) 2020-12-25 2020-12-25 Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network

Country Status (1)

Country Link
CN (1) CN112651940B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113780241B (en) * 2021-09-29 2024-02-06 北京航空航天大学 Acceleration method and device for detecting remarkable object
CN114743027B (en) * 2022-04-11 2023-01-31 郑州轻工业大学 Weak supervision learning-guided cooperative significance detection method
CN115331012B (en) * 2022-10-14 2023-03-24 山东建筑大学 Joint generation type image instance segmentation method and system based on zero sample learning
CN116994006B (en) * 2023-09-27 2023-12-08 江苏源驶科技有限公司 Collaborative saliency detection method and system for fusing image saliency information

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106845471A (en) * 2017-02-20 2017-06-13 深圳市唯特视科技有限公司 A kind of vision significance Forecasting Methodology based on generation confrontation network
US20190108448A1 (en) * 2017-10-09 2019-04-11 VAIX Limited Artificial intelligence framework
US10664999B2 (en) * 2018-02-15 2020-05-26 Adobe Inc. Saliency prediction for a mobile user interface
CN110310343B (en) * 2019-05-28 2023-10-03 西安万像电子科技有限公司 Image processing method and device
CN110689599B (en) * 2019-09-10 2023-05-19 上海大学 3D visual saliency prediction method based on non-local enhancement generation countermeasure network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10395403B1 (en) * 2014-12-22 2019-08-27 Altia Systems, Inc. Cylindrical panorama
US10650238B2 (en) * 2018-03-28 2020-05-12 Boohma Technologies Llc Opportunity to view an object in image processing
CN109829391A (en) * 2019-01-10 2019-05-31 哈尔滨工业大学 Conspicuousness object detection method based on concatenated convolutional network and confrontation study
CN111027576A (en) * 2019-12-26 2020-04-17 郑州轻工业大学 Cooperative significance detection method based on cooperative significance generation type countermeasure network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Chung-Chi Tsai et al.; "Deep Co-Saliency Detection via Stacked Autoencoder-Enabled Fusion and Self-Trained CNNs"; IEEE Transactions on Multimedia; Apr. 2020; vol. 22, no. 4; pp. 1016-1031. *
Bo Li et al.; "Detecting Robust Co-Saliency with Recurrent Co-Attention Neural Network"; Proceedings of the International Joint Conference on Artificial Intelligence; Aug. 2019; pp. 818-825. *
Zheng-Jun Zha et al.; "Robust Deep Co-Saliency Detection With Group Semantic and Pyramid Attention"; IEEE Transactions on Neural Networks and Learning Systems; Jul. 2020; vol. 31, no. 7; pp. 2398-2408; abstract and fig. 2 cited. *
Liu Lianqiu et al.; "Image recognition algorithm based on deep convolutional generative adversarial network" (基于深度卷积生成对抗网络的图像识别算法); Chinese Journal of Liquid Crystals and Displays; Apr. 2020; vol. 35, no. 4; pp. 383-388. *

Also Published As

Publication number Publication date
CN112651940A (en) 2021-04-13


Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant