CN112651940B - Collaborative visual saliency detection method based on a dual-encoder generative adversarial network - Google Patents

Collaborative visual saliency detection method based on a dual-encoder generative adversarial network

Info

Publication number
CN112651940B
CN112651940B (application CN202011558989.XA)
Authority
CN
China
Prior art keywords
significance
encoder
training
saliency
group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011558989.XA
Other languages
Chinese (zh)
Other versions
CN112651940A (en)
Inventor
钱晓亮
成曦
岳伟超
赵艺芳
曾黎
程塨
姚西文
吴青娥
任航丽
刘向龙
王芳
刘玉翠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou University of Light Industry
Original Assignee
Zhengzhou University of Light Industry
Application filed by Zhengzhou University of Light Industry filed Critical Zhengzhou University of Light Industry
Priority to CN202011558989.XA priority Critical patent/CN112651940B/en
Publication of CN112651940A publication Critical patent/CN112651940A/en
Application granted granted Critical
Publication of CN112651940B publication Critical patent/CN112651940B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS; G06 COMPUTING, CALCULATING OR COUNTING
    • G06T 7/00 Image analysis; G06T 7/0002 Inspection of images, e.g. flaw detection
    • G06F 18/00 Pattern recognition; G06F 18/24 Classification techniques; G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/00 Computing arrangements based on biological models; G06N 3/02 Neural networks; G06N 3/045 Combinations of networks; G06N 3/08 Learning methods
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]


Abstract

The invention provides a collaborative visual saliency detection method based on a dual-encoder generative adversarial network, which comprises the following steps: constructing a dual-encoder generative adversarial network model and pre-training it; using the pre-trained parameters to initialize the generative adversarial network model; inputting a group of images of the co-saliency dataset into a classification network module, extracting multi-scale group-level image semantic category features, and fusing them into an inter-group saliency feature with a multi-scale semantic fusion module; inputting the grouped images one by one into a saliency encoder to obtain the single-image saliency feature of each image; adding each single-image saliency feature and the inter-group saliency feature pixel by pixel to obtain the co-saliency feature of each image, and feeding it into a decoder to obtain the detection map; finally, testing the trained generative adversarial network model on the test set of the co-saliency dataset. The invention has fewer model parameters, simple training and detection operations, higher detection accuracy and improved efficiency.

Description

Collaborative visual saliency detection method based on a dual-encoder generative adversarial network
Technical Field
The invention relates to the technical field of co-saliency detection, and in particular to a collaborative visual saliency detection method based on a dual-encoder generative adversarial network.
Background
With the continuous development of the internet and multimedia, a great deal of image and video data accompanies our daily lives, and quickly and effectively extracting useful information with existing multimedia technology has become very important. Co-saliency detection is a popular computer vision technique that simulates the human visual attention mechanism: given a group of correlated images that contain similar salient objects, it finds the salient object common to every image in the group. It can effectively extract the information people want and filter out redundant information in the images, thereby reducing storage requirements and improving computational efficiency.
Co-saliency detection has two key steps: extracting good single-image saliency features and mining the similarity among multiple images. Existing co-saliency methods can be divided into two categories: traditional hand-crafted methods and deep learning methods. Traditional hand-crafted methods obtain the inter-group similarity of images through manual features, but such features are strongly subjective and cannot capture the salient target well. Popular deep learning methods use neural network models to obtain deep features that describe images well and, with end-to-end models, better mine the similarity between images, which has considerably improved the accuracy of co-saliency detection. However, an end-to-end model requires a group of input images to mine the inter-group image similarity, so it needs a large amount of data, and the co-saliency detection effect is limited by the dataset labels.
Moreover, the features extracted by current end-to-end neural network models describe the whole image rather than the common salient target region, so the semantic consistency of the inter-group images (i.e., the inter-group saliency features) cannot be mined well.
Disclosure of Invention
Aiming at the technical problems of insufficient sample labels, poor inter-group saliency features and the inability to mine the semantic consistency of inter-group images well in traditional co-saliency detection methods, the invention provides a collaborative visual saliency detection method based on a dual-encoder generative adversarial network.
In order to achieve the above purpose, the technical scheme of the invention is realized as follows: a collaborative visual saliency detection method based on a dual-encoder generative adversarial network comprises the following steps:
Step one: constructing a dual-encoder generative adversarial network model: the dual-encoder generative adversarial network model comprises a generator and a discriminator, wherein the generator comprises two encoders and a decoder, the two encoders being a saliency encoder and an inter-group semantic encoder; the saliency encoder, the decoder and the discriminator form a saliency generative adversarial network;
Step two: pre-training: on the one hand, pre-training the saliency generative adversarial network module with an existing single-image saliency dataset; on the other hand, dividing the co-saliency dataset into two parts, a training set and a test set, and pre-training the classification network module of the inter-group semantic encoder with the class labels of the training set, thereby obtaining the pre-trained parameters of the saliency generative adversarial network module and of the classification network module.
Step three: performing cooperative significance training on the dual-coder generated confrontation network model by using a training set of the cooperative significance data set: using the pre-trained parameters in the step two as parameter settings for initializing the dual-encoder generation type confrontation network model; inputting a group of images in the collaborative significance data set into a classification network module, extracting multi-scale group-level image semantic category features by the classification network module, and fusing the multi-scale group-level image semantic category features into inter-group significance features by the multi-scale semantic fusion module; sequentially inputting each image of the grouped input images into a significance encoder to obtain a single significance characteristic, respectively carrying out pixel-level addition on the single significance characteristic and the inter-group significance characteristic to obtain a synergistic significance characteristic of each image, inputting the synergistic significance characteristic into a decoder to decode, generating a detection image of each image, and judging by a discriminator to form antagonistic training;
Step four: detecting with the generative adversarial network model obtained after the training of step three, using the test set of the co-saliency dataset, to realize co-saliency detection.
The generative adversarial network model in step one is built from fully convolutional networks, and the whole saliency generative adversarial network adopts a U-Net structure; the saliency encoder and the decoder of the saliency generative adversarial network form the U-Net structure and obtain more image information through short (skip) connections; the generator has 17 fully convolutional layers in total, of which the saliency encoder contributes 8 layers and the decoder 9 layers. The discriminator adopts a patch-level discrimination structure with 5 fully convolutional layers in total; it converts the whole image into a 28 × 28 map, which is compared element-wise against a 28 × 28 label matrix to compute the loss. The generator uses the F-loss and an l1-norm loss function, and the discriminator uses the F-loss function.
The inter-group semantic encoder consists of a classification network module and a multi-scale semantic fusion module. The classification network module adopts a Resnet50 model trained on ImageNet; in pre-training, its last fully connected layer is changed to the number of classes of the pre-training dataset, and it is optimized with BCE-loss. The multi-scale semantic fusion module splices the group features at each of the four scales, unifies their channels by convolution, resamples them to 28 × 28 × 512 and adds them pixel by pixel; its structure is given in Table 3 of the detailed description.
In the pre-training of step two, on the one hand, the saliency generative adversarial network module is trained on a public, popular saliency dataset. In the adversarial training of this module, the generator is first fixed and the discriminator is trained so that the discriminator parameters are updated; the discriminator is then fixed and the generator is trained so that the generator parameters are updated; this process is repeated, cyclically optimizing the generator and the discriminator, and the model parameters of the saliency generative adversarial network are finally determined. On the other hand, for the classification network module, the pre-training labels are the class labels of the training set, and the model parameters of the classification network module are determined after pre-training.
The loss function of the saliency generative adversarial network module is:

(G_S*, D_S*) = arg min_{G_S} max_{D_S} [ L_A(G_S, D_S) + λ·L_L1(G_S) ]

L_A(G_S, D_S) = F(D_S(x ⊕ y), Real) + F(D_S(x ⊕ G_S(x)), Fake)

L_L1(G_S) = E(||G_S(x) - y||_1)

wherein G_S and D_S denote the generator and the discriminator of the saliency generative adversarial network module, and G_S* and D_S* denote their parameters after training; D_S(·) and G_S(·) denote the outputs of the discriminator and of the generator; x and y denote the input image and its corresponding ground-truth map; F(·) and E(·) denote the F-loss function and the BCE-loss function; L_A(·) and L_L1(·) denote the adversarial F-loss term and the L1 loss term, respectively; the coefficient λ of the L1 loss is 100; ⊕ denotes the splicing (concatenation) operator along the channel dimension; Real and Fake denote label matrices of all ones and all zeros, respectively; and ||·||_1 denotes the l1 norm.
The generative adversarial network model in step three inherits all the pre-trained parameters and loss functions of step two, and the dual-encoder generative adversarial network model then performs the joint adversarial co-saliency training: firstly, a group of 5 images is input into the pre-trained classification network module to obtain the classification semantic features of each image at 4 scales, of sizes 56 × 56 × 256, 28 × 28 × 512, 14 × 14 × 1024 and 7 × 7 × 2048; the features of the group are spliced at each scale into 4 group-level features, which are input into the multi-scale semantic fusion module; through up-sampling, down-sampling, convolution and pixel-level addition inside the multi-scale semantic fusion module, a feature of uniform size 28 × 28 × 512 is obtained, which has robust class semantic consistency and serves as the inter-group saliency feature. The operation is specifically given by the following formulas:

f'_i(X) = f_i(x_1) ⊕ f_i(x_2) ⊕ … ⊕ f_i(x_N)

f_1(X) = Dn(Conv(f'_1(X)))
f_2(X) = Conv(f'_2(X))
f_3(X) = Up(Conv(f'_3(X)))
f_4(X) = Up(Conv(f'_4(X)))

F_inter(X) = f_1(X) + f_2(X) + f_3(X) + f_4(X)

wherein X = {x_1, x_2, … x_N}, with N = 5 the number of input images; f_i(x_j) denotes the feature of image x_j extracted by the classification network module at the i-th scale, with j ranging over {1,2,3,4,5}; the sizes of f_1(x_j)~f_4(x_j) are 56 × 56 × 256, 28 × 28 × 512, 14 × 14 × 1024 and 7 × 7 × 2048, respectively; f'_i(X) denotes the spliced group feature at the original i-th scale; after processing, f_1(X)~f_4(X) are all of size 28 × 28 × 512; Conv(·), Up(·), Dn(·) and + denote convolution, up-sampling, down-sampling and pixel-level addition, respectively; F_inter(X) denotes the finally obtained inter-group saliency feature.
In step three, each image in the group of 5 images is input one by one into the pre-trained saliency encoder to obtain its single-image saliency feature; the single-image saliency feature of each of the 5 images is added pixel by pixel to the inter-group saliency feature to obtain the co-saliency feature of each image, which is input into the decoder for the joint adversarial training of the dual-encoder generative adversarial network model. The loss function of the dual-encoder generative adversarial network is:

(G_TS*, D_TS*) = arg min_{G_TS} max_{D_TS} [ L_TA(G_TS, D_TS) + λ·L_TL1(G_TS) ]

L_TA(G_TS, D_TS) = F(D_TS(x_j ⊕ y_j), Real) + F(D_TS(x_j ⊕ G_TS(x_j, X)), Fake)

L_TL1(G_TS) = E(||G_TS(x_j, X) - y_j||_1)

wherein G_TS and D_TS denote the generator and the discriminator of the dual-encoder generative adversarial network model; G_TS* and D_TS* denote their parameters after training; L_TA(·) and L_TL1(·) denote the adversarial F-loss term and the L1 loss term of the dual-encoder generative adversarial network; y_j denotes the ground-truth map of the input image x_j.
In step four, the saliency encoder and the inter-group (group-level) semantic encoder finally obtained after the training of step three are used to detect the remaining 50% of the co-saliency dataset, i.e., the part whose class labels were not used to train the classification network module in step two, according to the formula:

M_co-saliency = G_TS*(x_j, X);

wherein M_co-saliency denotes the finally detected co-saliency result map, X denotes a group of related images, and G_TS*(·) denotes the output of the generator of the trained dual-encoder generative adversarial network.
Compared with the prior art, the invention has the following beneficial effects. A dual-encoder generative adversarial network model is first constructed and then trained progressively in two stages. In the first stage, part of the network is pre-trained, namely the saliency generative adversarial network module and the classification network module, so that the saliency encoder and the classification network module respectively acquire the ability to learn single-image saliency and to recognize class semantics. In the second stage, the parameters of the first stage are inherited; on the one hand, the classification network module is used to obtain multi-scale group-level class semantic features, which are input into the multi-scale semantic fusion module to obtain a better inter-group saliency feature; on the other hand, the saliency encoder of the saliency generative adversarial network module is used to obtain the single-image saliency features; the inter-group saliency feature and the single-image saliency features are then fused and fed into the decoder, and joint adversarial training is carried out. Finally, the two trained encoders are used for co-saliency detection. The method has fewer model parameters, simple training and detection operations, better universality, higher detection accuracy and higher efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of the present invention.
Fig. 2 is a comparison of the subjective results of the present invention and existing algorithms on the Cosal2015 database.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
As shown in fig. 1, a collaborative visual saliency detection method based on a dual-encoder generative adversarial network includes the following steps:
Step one: constructing a dual-encoder generative adversarial network model: the dual-encoder generative adversarial network model comprises a generator and a discriminator; the generator comprises two encoders and a decoder, the two encoders being a saliency encoder and an inter-group semantic encoder; the saliency encoder, the decoder and the discriminator form a saliency generative adversarial network.
According to the characteristics of the co-saliency detection task and of existing generative adversarial network models, the basic framework of the existing generative adversarial network is improved so as to build a model that suits co-saliency detection; the generative adversarial network model is based on the progressive training of a dual encoder.
The generative adversarial network model constructed by the invention comprises two parts: a generator and a discriminator. The generator comprises two encoders, a saliency encoder and an inter-group semantic encoder, and a decoder, and the whole network is built from fully convolutional networks. In addition, the generator part adopts the U-Net structure. For the loss function, the generator uses the F-loss and an l1-norm loss function, and the discriminator uses the F-loss function. The F-loss function follows the design of K. Zhao, S. Gao, W. Wang et al., "Optimizing the F-Measure for Threshold-Free Salient Object Detection," in Proc. IEEE Int. Conf. Comput. Vision, Oct. 2019, pp. 8849-8857.
The saliency encoder and the decoder constitute a U-Net structure, whose design concept is described in O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation," in Proc. Medical Image Computing and Computer-Assisted Intervention, 2015, pp. 234-241. In addition, several layers of the saliency encoder and decoder use dropout during training, with the dropout value set to 0.5, so that the model generalizes better; the specific settings are shown in Table 1. The generator of the saliency generative adversarial network module has 17 fully convolutional layers in total (8 in the saliency encoder and 9 in the decoder).
Table 1 Structural composition of the saliency encoder and decoder:
(The layer-by-layer configuration is given as a table image in the original publication and is not reproduced in this text; an illustrative sketch follows below.)
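Since the exact configuration of Table 1 is not reproduced here, the following PyTorch sketch only illustrates the kind of structure described above: an 8-layer fully convolutional saliency encoder whose output is a 28 × 28 × 512 single-image saliency feature, a 9-layer decoder with U-Net short connections and dropout 0.5, and an optional pixel-level addition of an inter-group feature at the bottleneck. The 224 × 224 input size, kernel sizes, channel widths and the placement of dropout are assumptions of this sketch, not the patented configuration.

```python
import torch
import torch.nn as nn

def enc_block(cin, cout, stride):
    # convolution + BatchNorm + LeakyReLU (an assumed block layout; Table 1 is not reproduced)
    return nn.Sequential(nn.Conv2d(cin, cout, 4 if stride == 2 else 3, stride, 1),
                         nn.BatchNorm2d(cout), nn.LeakyReLU(0.2, inplace=True))

def dec_block(cin, cout, up=False, dropout=False):
    conv = nn.ConvTranspose2d(cin, cout, 4, 2, 1) if up else nn.Conv2d(cin, cout, 3, 1, 1)
    layers = [conv, nn.BatchNorm2d(cout), nn.ReLU(inplace=True)]
    if dropout:
        layers.append(nn.Dropout(0.5))          # dropout value 0.5 as stated in the text
    return nn.Sequential(*layers)

class SaliencyGenerator(nn.Module):
    """Sketch: 8-layer saliency encoder + 9-layer decoder with U-Net skip connections."""
    def __init__(self):
        super().__init__()
        # encoder: 224x224x3 -> 28x28x512 (the single-image saliency feature)
        self.e1 = enc_block(3, 64, 2)      # 112x112
        self.e2 = enc_block(64, 128, 2)    # 56x56
        self.e3 = enc_block(128, 256, 2)   # 28x28
        self.e4 = enc_block(256, 512, 1)
        self.e5 = enc_block(512, 512, 1)
        self.e6 = enc_block(512, 512, 1)
        self.e7 = enc_block(512, 512, 1)
        self.e8 = enc_block(512, 512, 1)   # 28x28x512 bottleneck
        # decoder: 9 layers, 28x28 -> 224x224x1
        self.d1 = dec_block(512, 512, dropout=True)
        self.d2 = dec_block(1024, 512, dropout=True)
        self.d3 = dec_block(1024, 512, dropout=True)
        self.d4 = dec_block(1024, 512)
        self.d5 = dec_block(1024, 256)
        self.d6 = dec_block(512, 128, up=True)   # 56x56
        self.d7 = dec_block(256, 64, up=True)    # 112x112
        self.d8 = dec_block(128, 64, up=True)    # 224x224
        self.d9 = nn.Sequential(nn.Conv2d(64, 1, 3, 1, 1), nn.Sigmoid())

    def forward(self, x, inter_feat=None):
        f1 = self.e1(x); f2 = self.e2(f1); f3 = self.e3(f2)
        f4 = self.e4(f3); f5 = self.e5(f4); f6 = self.e6(f5)
        f7 = self.e7(f6); f8 = self.e8(f7)
        if inter_feat is not None:               # pixel-level addition with the inter-group feature
            f8 = f8 + inter_feat
        d = self.d1(f8)
        d = self.d2(torch.cat([d, f7], 1))
        d = self.d3(torch.cat([d, f6], 1))
        d = self.d4(torch.cat([d, f5], 1))
        d = self.d5(torch.cat([d, f4], 1))
        d = self.d6(torch.cat([d, f3], 1))
        d = self.d7(torch.cat([d, f2], 1))
        d = self.d8(torch.cat([d, f1], 1))
        return self.d9(d)                         # 224x224 saliency detection map
```

During single-image saliency pre-training the inter-group feature is simply omitted (inter_feat=None); in step three the fused 28 × 28 × 512 inter-group feature is added at the bottleneck.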
the discriminator adopts a patch-level discrimination structure, which has a total of 5 full convolution layers, the size of the patch is 28 × 28, the whole Image is converted into the size, and the size is compared with each element on a 28 × 28 label matrix for loss, the design idea of the patch-level discrimination structure is referred to as p.isola, j.
Table 2 Structural composition of the discriminator
(The layer-by-layer configuration is given as a table image in the original publication and is not reproduced in this text; an illustrative sketch follows below.)
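As with Table 1, only the facts stated in the text are kept in this sketch: a fully convolutional, 5-layer patch-level discriminator whose output is a 28 × 28 score map that is compared element-wise against an all-ones (Real) or all-zeros (Fake) label matrix. Treating the input as the image spliced channel-wise with a saliency map, the 224 × 224 resolution and the kernel/channel choices are assumptions.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Sketch of the 5-layer patch-level discriminator producing a 28x28 score map."""
    def __init__(self, in_ch=4):          # RGB image (3) + saliency map (1), spliced channel-wise
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, 2, 1), nn.LeakyReLU(0.2, True),                      # 224 -> 112
            nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2, True),   # 112 -> 56
            nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2, True),  # 56 -> 28
            nn.Conv2d(256, 512, 3, 1, 1), nn.BatchNorm2d(512), nn.LeakyReLU(0.2, True),  # 28 -> 28
            nn.Conv2d(512, 1, 3, 1, 1), nn.Sigmoid(),                                    # 28x28 patch scores
        )

    def forward(self, image, saliency_map):
        # each element of the 28x28 output is compared against the Real / Fake label matrix
        return self.net(torch.cat([image, saliency_map], dim=1))
```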
For the construction of the inter-group semantic encoder: it consists of a classification network module and a multi-scale semantic fusion module. The classification network module adopts a Resnet50 model trained on ImageNet; during pre-training of the network, the last fully connected layer is changed to the number of classes of the pre-training dataset, and BCE-loss is used to optimize the model. The specific operation of the multi-scale semantic fusion module is shown in Table 3.
Table 3 Structural composition of the multi-scale semantic fusion module
Input (four spliced group-level features): 56 × 56 × 1280, 28 × 28 × 2560, 14 × 14 × 5120, 7 × 7 × 10240
Convolution (channel unification): 56 × 56 × 256, 28 × 28 × 256, 14 × 14 × 256, 7 × 7 × 256
Down-/up-sampling: four features of 28 × 28 × 512
Pixel-level addition: one fused feature of 28 × 28 × 512
Step two: pre-training: on the one hand, the saliency generative adversarial network module is pre-trained with existing single-image saliency datasets; on the other hand, the co-saliency dataset is divided into two parts, a training set and a test set, and the classification network module is pre-trained with the class labels of the training set; the pre-trained parameters of the two network modules are thus obtained.
The saliency generative adversarial network module is pre-trained with existing single-image saliency datasets, namely the public, popular saliency datasets HKU-IS and PASCAL1500, so that the saliency encoder of the module acquires the ability to perform single-image saliency detection. In the adversarial training process, the generator is first fixed and the discriminator is trained, so that the discriminator parameters are updated; the discriminator is then fixed and the generator is trained, so that the generator parameters are updated; these operations are repeated continuously, optimizing the generator (saliency encoder and decoder) and the discriminator. After continuous parameter tuning, the learning rate, number of training epochs and batch size are set to 0.0002, 300 and 1, which gives a good detection effect. The loss function of the saliency generative adversarial network module can be expressed as:
(G_S*, D_S*) = arg min_{G_S} max_{D_S} [ L_A(G_S, D_S) + λ·L_L1(G_S) ]

L_A(G_S, D_S) = F(D_S(x ⊕ y), Real) + F(D_S(x ⊕ G_S(x)), Fake)

L_L1(G_S) = E(||G_S(x) - y||_1)

wherein G_S and D_S denote the generator and the discriminator of the saliency generative adversarial network module, and G_S* and D_S* denote their parameters after training; D_S(·) and G_S(·) denote the outputs of the discriminator and of the generator; x and y denote the input image and its corresponding ground-truth map; F(·) and E(·) denote the F-loss function and the BCE-loss function; L_A(·) and L_L1(·) denote the adversarial F-loss term and the L1 loss term, respectively; the coefficient λ of the L1 loss is 100; ⊕ denotes the splicing (concatenation) operator along the channel dimension; Real and Fake denote label matrices of all ones and all zeros, respectively; and ||·||_1 denotes the l1 norm.
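A minimal PyTorch sketch of this alternating pre-training loop is given below, using the generator and discriminator sketches above. Because the F-loss of Zhao et al. is only referenced, not defined, in this text, a standard binary cross-entropy term is substituted for the adversarial F-loss here; the Adam optimizer and its betas are likewise assumptions, while the learning rate 0.0002, 300 epochs, batch size 1 and λ = 100 follow the text.

```python
import torch
import torch.nn as nn
import torch.optim as optim

def pretrain_saliency_gan(G, D, loader, device, epochs=300, lr=2e-4, lam=100.0):
    """Alternating optimization: fix G and update D, then fix D and update G."""
    adv = nn.BCELoss()   # stand-in for the F-loss adversarial term (assumption)
    l1 = nn.L1Loss()
    opt_g = optim.Adam(G.parameters(), lr=lr, betas=(0.5, 0.999))
    opt_d = optim.Adam(D.parameters(), lr=lr, betas=(0.5, 0.999))
    G.to(device).train(); D.to(device).train()
    for _ in range(epochs):
        for x, y in loader:                       # image x and its ground-truth saliency map y
            x, y = x.to(device), y.to(device)
            real = torch.ones(x.size(0), 1, 28, 28, device=device)   # all-ones label matrix
            fake = torch.zeros_like(real)                            # all-zeros label matrix

            # (1) fix the generator, train the discriminator
            with torch.no_grad():
                y_hat = G(x)
            loss_d = adv(D(x, y), real) + adv(D(x, y_hat), fake)
            opt_d.zero_grad(); loss_d.backward(); opt_d.step()

            # (2) fix the discriminator, train the generator: adversarial term + 100 * L1 term
            y_hat = G(x)
            loss_g = adv(D(x, y_hat), real) + lam * l1(y_hat, y)
            opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```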
In order to better extract the consistency of the inter-group image features (the inter-group saliency features) in subsequent training, the essential property of co-saliency datasets is fully exploited: each group of co-saliency images shares the same class. For the classification network module, the classification network used is the Resnet-50 model trained on ImageNet. For the data, 50% of the co-saliency sets (iCoseg and Cosal2015), using only their class labels, is taken for pre-training, and before pre-training the last layer is modified to the number of classes of this training set. The classification network module trained on co-saliency classes thus acquires the ability to recognize the classes of the dataset and a good class-semantic discrimination ability, and it prepares for obtaining good semantic consistency in the next training stage. After tuning, the classification precision is good when the learning rate, number of training iterations and batch size of the classification network module are set to 0.0002, 1000 and 8, respectively.
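A sketch of this pre-training step, assuming torchvision's ImageNet-pretrained ResNet-50 and one-hot class targets for the BCE-loss (the target encoding and optimizer are assumptions; only the replaced fully connected layer, the BCE optimization and the hyperparameters 0.0002 / 1000 iterations / batch size 8 come from the text):

```python
import torch
import torch.nn as nn
from torchvision import models

def build_classifier(num_classes):
    """ImageNet-pretrained ResNet-50 with its last fully connected layer replaced."""
    net = models.resnet50(pretrained=True)
    net.fc = nn.Linear(net.fc.in_features, num_classes)
    return net

def pretrain_classifier(net, loader, device, lr=2e-4, iters=1000):
    """loader yields (image, class_index) batches of size 8, per the text."""
    crit = nn.BCEWithLogitsLoss()                    # BCE-loss on one-hot class targets
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    net.to(device).train()
    done = 0
    while done < iters:
        for img, label in loader:
            target = torch.eye(net.fc.out_features, device=device)[label.to(device)]
            loss = crit(net(img.to(device)), target)
            opt.zero_grad(); loss.backward(); opt.step()
            done += 1
            if done >= iters:
                break
```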
Step three: performing co-saliency training of the whole dual-encoder generative adversarial network model with the co-saliency training set: the pre-trained parameters of step two are used to initialize the dual-encoder generative adversarial network model; a group of 5 images of the training set is input into the classification network module, which extracts multi-scale group-level image semantic category features, and the multi-scale semantic fusion module fuses them into an inter-group saliency feature; each image of the group is input in turn into the saliency encoder to obtain its single-image saliency feature; each single-image saliency feature is added pixel by pixel to the inter-group saliency feature to obtain the co-saliency feature of each image, which is input into the decoder and decoded to generate the detection map of each image, judged by the discriminator.
Parameter adjustment: for the parameter tuning of the dual-encoder generative adversarial network, the training starts from the following setting: except for the multi-scale fusion module, whose parameters are randomly initialized, all other modules use the parameters obtained from the first-stage training. Training then proceeds by first fixing the parameters of the generator, which contains the two encoders, and optimizing and updating the parameters of the discriminator; then the parameters of the discriminator are fixed and the parameters of the generator containing the two encoders are optimized and updated. These two processes are repeated continuously to obtain the final parameters of the dual-encoder generative adversarial network. Note that step two uses the class labels of the co-saliency training set to train the classification network, whereas step three uses the ground-truth (pixel-level) labels of the co-saliency training set: two different types of labels of the same dataset are used. A sketch of this parameter setup is given below.
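The sketch below illustrates this initialization and the alternating freeze/update pattern; the module arguments, checkpoint file names and state-dict keys are placeholders of this sketch, not names from the patent.

```python
import torch
import torch.nn as nn

def set_requires_grad(module: nn.Module, flag: bool):
    """Freeze or unfreeze a module when alternating generator / discriminator updates."""
    for p in module.parameters():
        p.requires_grad = flag

def init_stage_two(saliency_encoder, decoder, discriminator, classifier,
                   gan_ckpt="stage1_saliency_gan.pth", cls_ckpt="stage1_classifier.pth"):
    """Inherit the stage-one parameters; the multi-scale semantic fusion module is the
    only part left at its random initialization."""
    ckpt = torch.load(gan_ckpt)
    saliency_encoder.load_state_dict(ckpt["encoder"])
    decoder.load_state_dict(ckpt["decoder"])
    discriminator.load_state_dict(ckpt["discriminator"])
    classifier.load_state_dict(torch.load(cls_ckpt))

# one alternating round during joint training:
#   set_requires_grad(generator, False); set_requires_grad(discriminator, True)
#   ... back-propagate the discriminator loss ...
#   set_requires_grad(generator, True);  set_requires_grad(discriminator, False)
#   ... back-propagate the generator loss ...
```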
The generative adversarial network model of the invention inherits all the parameters and loss-function settings (L1-loss and F-loss) of the pre-training of step two, and the dual-encoder generative adversarial network then performs the joint adversarial co-saliency training. Firstly, a group of images (5 images) is input into the pre-trained classification network module to obtain the discriminative classification semantic features of each image at 4 scales, of sizes 56 × 56 × 256, 28 × 28 × 512, 14 × 14 × 1024 and 7 × 7 × 2048; the features of the group are spliced at each scale into 4 group-level features, which are then input into the multi-scale semantic fusion module; through up-sampling, down-sampling, convolution and pixel-level addition inside the multi-scale semantic fusion module, a feature of uniform size 28 × 28 × 512 is obtained, which has robust class semantic consistency and serves as the inter-group saliency feature. The operation is specifically given by the following formulas:

f'_i(X) = f_i(x_1) ⊕ f_i(x_2) ⊕ … ⊕ f_i(x_N)

f_1(X) = Dn(Conv(f'_1(X)))
f_2(X) = Conv(f'_2(X))
f_3(X) = Up(Conv(f'_3(X)))
f_4(X) = Up(Conv(f'_4(X)))

F_inter(X) = f_1(X) + f_2(X) + f_3(X) + f_4(X)

wherein X = {x_1, x_2, … x_N}, with N = 5 the number of input images; f_i(x_j) denotes the feature of image x_j extracted by the classification network module at the i-th scale, with j ranging over {1,2,3,4,5}; the sizes of f_1(x_j)~f_4(x_j) are 56 × 56 × 256, 28 × 28 × 512, 14 × 14 × 1024 and 7 × 7 × 2048, respectively; f'_i(X) denotes the spliced group feature at the original i-th scale; after processing, f_1(X)~f_4(X) are all of size 28 × 28 × 512; Conv(·), Up(·), Dn(·) and + denote convolution, up-sampling, down-sampling and pixel-level addition, respectively; F_inter(X) denotes the finally obtained inter-group saliency feature.
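The formulas above can be sketched in PyTorch as follows. The helper that collects the four ResNet-50 stage features of a group and splices them channel-wise, the 3 × 3 convolutions, average-pool down-sampling and nearest-neighbour up-sampling are assumptions of this sketch; only the scale/channel sizes and the Conv/Up/Dn/pixel-addition pattern come from the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def group_features(resnet, images):
    """Run the classification module on a group of N images (N=5) and splice the
    features of each ResNet-50 stage along the channel dimension (the ⊕ above)."""
    x = resnet.maxpool(resnet.relu(resnet.bn1(resnet.conv1(images))))
    f1 = resnet.layer1(x)     # N x 256  x 56 x 56
    f2 = resnet.layer2(f1)    # N x 512  x 28 x 28
    f3 = resnet.layer3(f2)    # N x 1024 x 14 x 14
    f4 = resnet.layer4(f3)    # N x 2048 x 7  x 7
    return [f.flatten(0, 1).unsqueeze(0) for f in (f1, f2, f3, f4)]   # 1 x (C_i*N) x S_i x S_i

class MultiScaleSemanticFusion(nn.Module):
    """Sketch of F_inter(X): convolve each spliced group feature, resample to 28x28x512,
    and add the four results pixel by pixel."""
    def __init__(self, group_size=5):
        super().__init__()
        in_chs = [256 * group_size, 512 * group_size, 1024 * group_size, 2048 * group_size]
        self.convs = nn.ModuleList(nn.Conv2d(c, 512, 3, padding=1) for c in in_chs)
        self.down = nn.AvgPool2d(2)                                   # Dn: 56x56 -> 28x28

    def forward(self, feats):
        f1 = self.down(self.convs[0](feats[0]))                       # Dn(Conv(f'_1))
        f2 = self.convs[1](feats[1])                                  # Conv(f'_2)
        f3 = F.interpolate(self.convs[2](feats[2]), scale_factor=2)   # Up(Conv(f'_3))
        f4 = F.interpolate(self.convs[3](feats[3]), scale_factor=4)   # Up(Conv(f'_4))
        return f1 + f2 + f3 + f4                                      # F_inter(X): 1 x 512 x 28 x 28
```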
The invention uses the inter-group semantic encoder to extract features with good inter-group image semantic consistency (i.e., the inter-group saliency features): the grouped co-saliency images (5 images form a group of related images) are input into the pre-trained classification network module to obtain multi-scale group-level image semantic classification features, and the multi-scale semantic fusion module fuses them to generate a feature with inter-group image semantic consistency. Each co-saliency image of the group is in turn input into the saliency encoder, which inherits all the parameters of the encoder of the pre-trained saliency generative adversarial network module, so that the network can better extract the single-image saliency feature of each co-saliency image.
At the same time, each image in the group of 5 images is input one by one into the pre-trained saliency encoder to obtain its single-image saliency feature; the single-image saliency feature of each of the 5 images is added pixel by pixel to the inter-group saliency feature to obtain the co-saliency feature of each image, which is input into the decoder, and the dual-encoder generative adversarial network carries out the joint adversarial co-saliency training effectively. The loss function of the dual-encoder generative adversarial network is similar to that of the saliency generative adversarial network module in step two and can be expressed as:

(G_TS*, D_TS*) = arg min_{G_TS} max_{D_TS} [ L_TA(G_TS, D_TS) + λ·L_TL1(G_TS) ]

L_TA(G_TS, D_TS) = F(D_TS(x_j ⊕ y_j), Real) + F(D_TS(x_j ⊕ G_TS(x_j, X)), Fake)

L_TL1(G_TS) = E(||G_TS(x_j, X) - y_j||_1)

wherein G_TS and D_TS denote the generator and the discriminator of the dual-encoder generative adversarial network model; G_TS* and D_TS* denote their parameters after training; L_TA(·) and L_TL1(·) denote the adversarial F-loss term and the L1 loss term of the dual-encoder generative adversarial network; y_j denotes the ground-truth map of the input image x_j.
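One joint forward pass over a group, combining the illustrative pieces above (SaliencyGenerator, group_features, MultiScaleSemanticFusion), might look as follows; all names are taken from those sketches, not from the patent, and the training step then applies the loss above exactly as in the pre-training loop.

```python
import torch

def cosaliency_forward(generator, classifier, fusion, group_images):
    """group_images: tensor of shape (5, 3, 224, 224) holding one group of related images."""
    f_inter = fusion(group_features(classifier, group_images))       # 1 x 512 x 28 x 28
    maps = []
    for x_j in group_images:                                         # images are fed one by one
        # inside the generator the bottleneck (single-image saliency feature) is added
        # pixel by pixel to f_inter, and the decoder produces the detection map of x_j
        maps.append(generator(x_j.unsqueeze(0), inter_feat=f_inter))
    return torch.cat(maps, dim=0)                                    # 5 x 1 x 224 x 224
```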
Step four: detecting with the generative adversarial network model obtained after the training of step three, using the other part of the co-saliency dataset, to realize co-saliency detection.
The saliency encoder and the group-level (inter-group) semantic encoder finally obtained after the training of step three are used to detect the remaining 50% of the co-saliency datasets (iCoseg and Cosal2015), i.e., the part whose class labels were not used in step two to train the classification network module. The formula can be expressed as:

M_co-saliency = G_TS*(x_j, X)    (4)

wherein M_co-saliency denotes the finally detected co-saliency result map, X denotes a group of related images, and G_TS*(·) denotes the output of the generator of the trained dual-encoder generative adversarial network. Detecting a group of images of the same class that share co-salient objects amounts to completing one co-saliency detection task.
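A minimal inference sketch for this step, reusing cosaliency_forward from above (the group tensors, device handling and result collection are assumptions of the sketch):

```python
import torch

@torch.no_grad()
def detect(generator, classifier, fusion, test_groups, device):
    """test_groups: iterable of (5, 3, 224, 224) tensors, one per group of related images."""
    generator.eval(); classifier.eval(); fusion.eval()
    results = []
    for group in test_groups:
        maps = cosaliency_forward(generator, classifier, fusion, group.to(device))
        results.append(maps.cpu())        # the co-saliency maps M_co-saliency of the group
    return results
```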
Hardware configuration used to implement the invention: a workstation with two Intel(R) Xeon E5-2650 v4 CPUs (12 cores each, 2.2 GHz), 8 NVIDIA TITAN RTX GPUs (24 GB each) and 512 GB of memory was used for the tests. Software platform configuration: Ubuntu 16.04, Python 3.6.5, PyTorch 0.4.1.
In addition, to better demonstrate the performance and efficiency of the invention, subjective comparisons and per-image detection-time comparisons were carried out on the existing popular public dataset Cosal2015. Eight algorithms are compared, namely ESMG, CBCS, AUW, SACS, SACS-R, LDAW, GW and RCAN, of which four (ESMG, CBCS, SACS and SACS-R) have publicly released code. Under the same hardware configuration, the detection-time comparison experiment was carried out against these four public-code algorithms, and the results are shown in Table 4.
Table 4 Comparison of per-image detection time on the Cosal2015 dataset with 4 popular algorithms

Comparison algorithm    Code      Detection time per image (s)
SACS-R                  MATLAB    8.873
SACS                    MATLAB    2.652
ESMG                    MATLAB    1.723
CBCS                    MATLAB    1.688
The invention           Python    0.785
According to the experimental comparisons of fig. 2 and Table 4, the co-saliency prediction maps of the invention in fig. 2 are closer to the corresponding ground-truth maps and clearly better than the prediction maps of the other algorithms; in addition, in the per-image detection-time comparison with the public-code algorithms in Table 4, the detection time of the invention is the shortest. The performance and effect of the invention are therefore better than those of the other popular algorithms.
The invention adopts a generative adversarial network with two encoders based on a progressive training scheme; the first-stage pre-training makes full use of the labels of single-image saliency data and of the class labels of the co-saliency data, which alleviates the problem of insufficient data labels for an end-to-end model. On the one hand, existing single-image saliency datasets are fully used to pre-train the saliency generative adversarial network module, providing it with well-initialized parameters so that the saliency encoder has a better ability to extract single-image saliency features; on the other hand, the class labels of the co-saliency dataset are fully used to train the classification network module, so that it has a better ability to recognize class semantics. In the second stage, the co-saliency training, the invention inherits all the parameters learned in the previous stage, proposes a multi-scale semantic fusion module that integrates the multi-scale group-level semantic features of the classification network module into an inter-group saliency feature with robust class semantic consistency, obtains the single-image saliency feature of each image with the saliency encoder, and then inputs the inter-group saliency feature and the single-image saliency features into the decoder for joint adversarial training. Meanwhile, the training and detection of the invention are simple, the model has fewer parameters and better universality, and the detection accuracy and efficiency are higher.
The inter-group semantic encoder, which comprises the classification network module and the multi-scale semantic fusion module, is trained progressively. In the first stage, the classification network module is pre-trained with the co-saliency dataset using only its class labels, and this class training gives the classification network module the ability to recognize the classes of the co-saliency dataset (i.e., the class semantic features of the common target). In the second stage, the grouped co-saliency images are first input into the pre-trained classification module to obtain multi-scale group-level image category semantic features; the multi-scale semantic fusion module then fuses them and finally generates a feature with good inter-group image semantic consistency, with which the co-saliency training is carried out.
Alleviating the shortage of co-saliency ground-truth samples involves two aspects, determined by the two key characteristics of co-saliency: one is the single-image saliency feature and the other is the inter-group image saliency feature. Therefore, on the one hand, the saliency generative adversarial network module is pre-trained with existing single-image saliency datasets so that the saliency encoder can extract better saliency features; on the other hand, the essential property of co-saliency datasets is that each group of co-saliency images belongs to the same class, so the classification network module is pre-trained with the co-saliency dataset using only its class labels, giving the class-trained classification network module the ability to recognize the classes of the dataset and laying the foundation for extracting the inter-group image semantic consistency in the next stage. In the second stage, the co-saliency training, the same data used to pre-train the classification network module is used again with its ground-truth map labels, and the grouped images are input to the inter-group semantic encoder so that features with good inter-group image semantic consistency (the inter-group image saliency features) are extracted.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (7)

1. A collaborative visual saliency detection method based on a dual-encoder generative adversarial network, characterized by comprising the following steps:
Step one: constructing a dual-encoder generative adversarial network model: the dual-encoder generative adversarial network model comprises a generator and a discriminator, wherein the generator comprises two encoders and a decoder, the two encoders being a saliency encoder and an inter-group semantic encoder; the saliency encoder, the decoder and the discriminator form a saliency generative adversarial network;
Step two: pre-training: on the one hand, pre-training the saliency generative adversarial network module with an existing single-image saliency dataset; on the other hand, dividing the co-saliency dataset into two parts, a training set and a test set, and pre-training the classification network module of the inter-group semantic encoder with the class labels of the training set, thereby obtaining the pre-trained parameters of the saliency generative adversarial network module and of the classification network module;
Step three: performing co-saliency training of the dual-encoder generative adversarial network model with the training set of the co-saliency dataset: using the pre-trained parameters of step two to initialize the dual-encoder generative adversarial network model; inputting a group of images of the co-saliency dataset into the classification network module, which extracts multi-scale group-level image semantic category features, and fusing them into an inter-group saliency feature with the multi-scale semantic fusion module; inputting each image of the group into the saliency encoder in turn to obtain its single-image saliency feature; adding each single-image saliency feature and the inter-group saliency feature pixel by pixel to obtain the co-saliency feature of each image; inputting the co-saliency feature into the decoder to generate the detection map of each image, which is judged by the discriminator to form the adversarial training;
the loss function of the significance generation countermeasure network module is:
Figure FDA0003204871180000011
Figure FDA0003204871180000012
LL1(GS)=E(||GS(x)-y||1)
wherein G isSAnd DSGenerators and discriminators, G, representing significance generating countermeasure network modules, respectivelySSum DSRepresenting parameters obtained after the generator and the discriminator are trained; dS(. and G)S(. cndot.) respectively represents the output of the generator of the significance generating confrontation network module and the output of the discriminator, x and y respectively represent the input image and the corresponding true value graph, and F (-) and E (-) respectively represent the F-loss function and the BCE-loss function; l isA(. and L)L1(. cndot.) denotes the opposing F-loss function and the L1 loss function, respectively; the magnitude of the coefficient lambda of L1-loss is 100,
Figure FDA0003204871180000013
representing a splice operator along the path; real and Fake, respectivelyA label matrix representing all 1 s and all 0 s; i | · | purple wind1Is represented by1A norm;
Step four: detecting with the generative adversarial network model obtained after the training of step three, using the test set of the co-saliency dataset, to realize co-saliency detection.
2. The collaborative visual saliency detection method based on a dual-encoder generative adversarial network of claim 1, characterized in that the generative adversarial network model in step one is built from fully convolutional networks, and the whole saliency generative adversarial network adopts a U-Net structure; the saliency encoder and the decoder of the saliency generative adversarial network form the U-Net structure and obtain more image information through short connections; the generator has 17 fully convolutional layers in total, of which the saliency encoder comprises 8 layers and the decoder 9 layers; the discriminator adopts a patch-level discrimination structure with 5 fully convolutional layers in total, converts the whole image into a 28 × 28 map and compares it element-wise against each element of a 28 × 28 label matrix to compute the loss; the generator uses the F-loss and an l1-norm loss function, and the discriminator uses the F-loss function.
3. The collaborative visual saliency detection method based on a dual-encoder generative adversarial network of claim 1 or 2, characterized in that the inter-group semantic encoder consists of a classification module and a multi-scale semantic fusion module, wherein the classification network module adopts a Resnet50 model trained on ImageNet, its last fully connected layer is changed during pre-training to the number of classes of the pre-training dataset, and it is optimized with BCE-loss; the structure of the multi-scale semantic fusion module is as follows: the four features of different scales output by the classification network module, of sizes 56 × 56 × 256 × 5, 28 × 28 × 512 × 5, 14 × 14 × 1024 × 5 and 7 × 7 × 2048 × 5, are respectively spliced as input to obtain four features of different scales of sizes 56 × 56 × 1280, 28 × 28 × 2560, 14 × 14 × 5120 and 7 × 7 × 10240; convolution operations are then respectively carried out to unify the number of channels, giving four features of different scales of sizes 56 × 56 × 256, 28 × 28 × 256, 14 × 14 × 256 and 7 × 7 × 256; down-sampling or up-sampling operations are then respectively carried out to obtain four features of 28 × 28 × 512; finally, the four 28 × 28 × 512 features are added at the pixel level to obtain one 28 × 28 × 512 feature.
4. The collaborative visual saliency detection method based on a dual-encoder generative adversarial network of claim 3, characterized in that in the pre-training of step two, on the one hand, for the saliency generative adversarial network module, the training dataset is a public, popular saliency dataset; in the adversarial training of the saliency generative adversarial network module, the generator is first fixed and the discriminator is trained so that the discriminator parameters are updated; the discriminator is then fixed and the generator is trained so that the generator parameters are updated; this process is repeated, cyclically optimizing the generator and the discriminator, and the model parameters of the saliency generative adversarial network are finally determined; on the other hand, for the classification network module, the pre-training labels are the class labels of the training set, and the model parameters of the classification network module are determined after pre-training.
5. The collaborative visual saliency detection method based on a dual-encoder generative adversarial network of claim 4, characterized in that the generative adversarial network model in step three inherits all the parameters and loss-function settings of the pre-training of step two, and the dual-encoder generative adversarial network model then performs the joint adversarial co-saliency training: firstly, a group of 5 images is input into the pre-trained classification network module to obtain the classification semantic features of each image at 4 scales, of sizes 56 × 56 × 256, 28 × 28 × 512, 14 × 14 × 1024 and 7 × 7 × 2048; the features of the group are spliced at each scale into 4 group-level features, which are input into the multi-scale semantic fusion module; through up-sampling, down-sampling, convolution and pixel-level addition inside the multi-scale semantic fusion module, a feature of uniform size 28 × 28 × 512 is obtained, which has robust class semantic consistency and serves as the inter-group saliency feature, specifically according to the following formulas:

f'_i(X) = f_i(x_1) ⊕ f_i(x_2) ⊕ … ⊕ f_i(x_N)

f_1(X) = Dn(Conv(f'_1(X)))
f_2(X) = Conv(f'_2(X))
f_3(X) = Up(Conv(f'_3(X)))
f_4(X) = Up(Conv(f'_4(X)))

F_inter(X) = f_1(X) + f_2(X) + f_3(X) + f_4(X)

wherein X = {x_1, x_2, … x_N}, with N = 5 the number of input images; f_i(x_j) denotes the feature of image x_j extracted by the classification network module at the i-th scale, with j ranging over {1,2,3,4,5}; the sizes of f_1(x_j)~f_4(x_j) are 56 × 56 × 256, 28 × 28 × 512, 14 × 14 × 1024 and 7 × 7 × 2048, respectively; f'_i(X) denotes the spliced group feature at the original i-th scale; after processing, f_1(X)~f_4(X) are all of size 28 × 28 × 512; Conv(·), Up(·), Dn(·) and + denote convolution, up-sampling, down-sampling and pixel-level addition, respectively; F_inter(X) denotes the finally obtained inter-group saliency feature.
6. The collaborative visual saliency detection method based on a dual-encoder generative adversarial network of claim 5, characterized in that in step three each image in the group of 5 images is input one by one into the pre-trained saliency encoder to obtain its single-image saliency feature; the single-image saliency feature of each of the 5 images is added pixel by pixel to the inter-group saliency feature to obtain the co-saliency feature of each image, which is input into the decoder for the joint adversarial training of the dual-encoder generative adversarial network model; the loss function of the dual-encoder generative adversarial network is:

(G_TS*, D_TS*) = arg min_{G_TS} max_{D_TS} [ L_TA(G_TS, D_TS) + λ·L_TL1(G_TS) ]

L_TA(G_TS, D_TS) = F(D_TS(x_j ⊕ y_j), Real) + F(D_TS(x_j ⊕ G_TS(x_j, X)), Fake)

L_TL1(G_TS) = E(||G_TS(x_j, X) - y_j||_1)

wherein G_TS and D_TS denote the generator and the discriminator of the dual-encoder generative adversarial network model; G_TS* and D_TS* denote their parameters after training; L_TA(·) and L_TL1(·) denote the adversarial F-loss term and the L1 loss term of the dual-encoder generative adversarial network; y_j denotes the ground-truth map of the input image x_j.
7. The collaborative visual saliency detection method based on a dual-encoder generative adversarial network of claim 6, characterized in that in step four the saliency encoder and the group-level semantic encoder finally obtained after the training of step three are used to detect the remaining 50% of the co-saliency dataset, i.e., the part whose class labels were not used in step two to train the classification network module, according to the formula:

M_co-saliency = G_TS*(x_j, X);

wherein M_co-saliency denotes the finally detected co-saliency result map, X denotes a group of related images, and G_TS*(·) denotes the output of the generator of the trained dual-encoder generative adversarial network.
CN202011558989.XA 2020-12-25 2020-12-25 Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network Active CN112651940B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011558989.XA CN112651940B (en) 2020-12-25 2020-12-25 Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011558989.XA CN112651940B (en) 2020-12-25 2020-12-25 Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network

Publications (2)

Publication Number Publication Date
CN112651940A CN112651940A (en) 2021-04-13
CN112651940B true CN112651940B (en) 2021-09-17

Family

ID=75362887

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011558989.XA Active CN112651940B (en) 2020-12-25 2020-12-25 Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network

Country Status (1)

Country Link
CN (1) CN112651940B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113780241B (en) * 2021-09-29 2024-02-06 北京航空航天大学 Acceleration method and device for detecting remarkable object
CN114743027B (en) * 2022-04-11 2023-01-31 郑州轻工业大学 Weak supervision learning-guided cooperative significance detection method
CN115331012B (en) * 2022-10-14 2023-03-24 山东建筑大学 Joint generation type image instance segmentation method and system based on zero sample learning
CN116994006B (en) * 2023-09-27 2023-12-08 江苏源驶科技有限公司 Collaborative saliency detection method and system for fusing image saliency information

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106845471A (en) * 2017-02-20 2017-06-13 深圳市唯特视科技有限公司 A kind of vision significance Forecasting Methodology based on generation confrontation network
US20190108448A1 (en) * 2017-10-09 2019-04-11 VAIX Limited Artificial intelligence framework
US10664999B2 (en) * 2018-02-15 2020-05-26 Adobe Inc. Saliency prediction for a mobile user interface
CN110310343B (en) * 2019-05-28 2023-10-03 西安万像电子科技有限公司 Image processing method and device
CN110689599B (en) * 2019-09-10 2023-05-19 上海大学 3D visual saliency prediction method based on non-local enhancement generation countermeasure network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10395403B1 (en) * 2014-12-22 2019-08-27 Altia Systems, Inc. Cylindrical panorama
US10650238B2 (en) * 2018-03-28 2020-05-12 Boohma Technologies Llc Opportunity to view an object in image processing
CN109829391A (en) * 2019-01-10 2019-05-31 哈尔滨工业大学 Conspicuousness object detection method based on concatenated convolutional network and confrontation study
CN111027576A (en) * 2019-12-26 2020-04-17 郑州轻工业大学 Cooperative significance detection method based on cooperative significance generation type countermeasure network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Chung-Chi Tsai et al.; "Deep Co-Saliency Detection via Stacked Autoencoder-Enabled Fusion and Self-Trained CNNs"; IEEE Transactions on Multimedia; Apr. 2020; vol. 22, no. 4; pp. 1016-1031. *
Bo Li et al.; "Detecting Robust Co-Saliency with Recurrent Co-Attention Neural Network"; Proceedings of the International Joint Conference on Artificial Intelligence; Aug. 2019; pp. 818-825. *
Zheng-Jun Zha et al.; "Robust Deep Co-Saliency Detection With Group Semantic and Pyramid Attention"; IEEE Transactions on Neural Networks and Learning Systems; Jul. 2020; vol. 31, no. 7; pp. 2398-2408; abstract and fig. 2 cited. *
Liu Lianqiu et al.; "Image recognition algorithm based on deep convolutional generative adversarial network" (基于深度卷积生成对抗网络的图像识别算法); Chinese Journal of Liquid Crystals and Displays; Apr. 2020; vol. 35, no. 4; pp. 383-388. *

Also Published As

Publication number Publication date
CN112651940A (en) 2021-04-13


Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant