WO2020108165A1 - Image description information generation method and apparatus, and electronic apparatus - Google Patents

Image description information generation method and apparatus, and electronic apparatus

Info

Publication number
WO2020108165A1
Authority
WO
WIPO (PCT)
Prior art keywords
network
description information
image
sample
image description
Prior art date
Application number
PCT/CN2019/111946
Other languages
English (en)
French (fr)
Inventor
陈宸
牟帅
肖万鹏
鞠奇
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Priority to EP19891662.9A (published as EP3889836A4)
Publication of WO2020108165A1
Priority to US17/082,002 (published as US11783199B2)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2132 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on discrimination criteria, e.g. discriminant analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Definitions

  • the present application relates to the field of computers, and in particular, to a method and device for generating image description information and an electronic device.
  • In the related art, image description information is generated with an Encoder-Decoder structure in which a Convolutional Neural Network (CNN) encodes the image and a Recurrent Neural Network (RNN) decodes it into a sentence.
  • Although the image description information generated by this structure can express the content in the image, the quality of the sentences cannot be guaranteed; for example, they may read poorly or be inconsistent with daily oral expression habits.
  • That is, the image description information generation method provided by the related art has the problem of poor generation quality.
  • Embodiments of the present application provide an image description information generation method and device, and an electronic device, so as to at least solve the technical problem of poor generation quality of the image description information generation method provided by the related art.
  • A method for generating image description information is provided, including: acquiring a target image to be processed; and inputting the target image into a target image description information generation network, where the target image description information generation network is a generation network for generating image description information, obtained after performing adversarial training using multiple sample images.
  • The adversarial training is performed alternately, based on an initial image description information generation network that matches the target image description information generation network and an initial discrimination network, the discrimination network being used to discriminate the output result of the image description information generation network.
  • Based on the output result of the target image description information generation network, target image description information for describing the target image is generated.
  • An image description information generating apparatus is also provided, including: an obtaining unit for obtaining a target image to be processed; an input unit for inputting the target image into a target image description information generation network, where the target image description information generation network is a generation network for generating image description information, obtained after performing adversarial training using multiple sample images.
  • The adversarial training is performed alternately, based on an initial image description information generation network that matches the target image description information generation network and an initial discrimination network, the discrimination network being used to discriminate the output result of the image description information generation network; and a generation unit for generating, based on the output result of the target image description information generation network, target image description information for describing the target image.
  • The training unit further includes: a determination module for determining the sample discrimination probability value output by the current discrimination network, after the current image description information generation network is adjusted according to the sample discrimination probability value to obtain a trained image description information generation network and before the current discrimination network is adjusted according to the trained generation network to obtain a trained discrimination network; an acquisition module for acquiring, through a language model, the first matching degree between the sample image description generation information and the sample image, where the language model includes one or more parameters for evaluating the description information of the sample image; and a weighted average processing module for performing weighted average processing on the sample discrimination probability value and the first matching degree to obtain the sample feedback coefficient.
  • The training unit adjusts the current image description information generation network according to the sample discrimination probability value to obtain the trained image description information generation network through the following step: adjusting, according to the sample discrimination probability value, the parameters in at least one of the following structures in the current image description information generation network: the current region-based convolutional neural network, the current attention serialized language model, and the current two-layer long short-term memory network.
  • The training unit adjusts the current discrimination network based on the trained image description information generation network to obtain the trained discrimination network through the following steps: obtaining the training sample image description generation information output by the trained image description information generation network, or the training sample image reference description information; and using sample description information selected from the sample image description information, the training sample image description generation information, or the training sample image reference description information to adjust the parameters in the convolutional neural network structure of the current discrimination network, to obtain the trained discrimination network.
  • Alternatively, when the discrimination network is built on a recurrent neural network, the same selected sample description information is used to adjust the parameters in the recurrent neural network structure of the current discrimination network, to obtain the trained discrimination network.
  • An electronic device is also provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the above image description information generation method through the computer program.
  • In the embodiments of the present application, after the target image to be processed is acquired, it is input into a target image description information generation network obtained through adversarial training, and that network is used to generate target image description information matching the target image.
  • That is, instead of the CNN-RNN structure provided by the related art, a target image description information generation network obtained through adversarial training is used to generate the image description information.
  • A discrimination network is introduced to discriminate the output of the image description information generation network, and the two are trained alternately, so that the finally obtained target image description information generation network benefits from reinforcement learning; the evaluation indices of the image description information it generates are thereby comprehensively optimized, further improving the generation quality of the image description information and overcoming the technical problem of poor generation quality of the image description information generation method provided by the related art.
  • FIG. 1 is a schematic diagram of a hardware environment of an optional image description information generation method according to an embodiment of the present application
  • FIG. 2 is a schematic flowchart of an optional image description information generation method according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of an optional image description information generation method according to an embodiment of the present application.
  • FIG. 4 is a schematic diagram of another optional image description information generation method according to an embodiment of the present application.
  • FIG. 5 is a schematic diagram of yet another alternative image description information generation method according to an embodiment of the present application.
  • FIG. 6 is a schematic diagram of yet another optional image description information generation method according to an embodiment of the present application.
  • FIG. 7 is a schematic diagram of yet another optional image description information generation method according to an embodiment of the present application.
  • FIG. 8 is a schematic diagram of yet another alternative image description information generation method according to an embodiment of the present application.
  • FIG. 9 is a schematic diagram of yet another optional image description information generation method according to an embodiment of the present application.
  • FIG. 10 is a schematic diagram of an evaluation index of an optional image description information generation method according to an embodiment of the present application.
  • FIG. 11 is a schematic diagram of the effect of an optional image description information generation method according to an embodiment of the present application.
  • FIG. 12 is a schematic diagram of the effect of another optional image description information generation method according to an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of an optional image description information generating device according to an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of another optional image description information generating device according to an embodiment of the present application.
  • FIG. 15 is a schematic structural diagram of an optional electronic device according to an embodiment of the present application.
  • CNN: Convolutional Neural Networks, used to extract image features from images.
  • RNN: Recurrent Neural Networks, used for language modeling and learning context features.
  • R-CNN: Region-based CNN, used for target detection and positioning.
  • RPN: Region Proposal Networks, a module in Faster R-CNN used to extract feature vectors of boxes where objects may exist.
  • LSTM: Long Short-Term Memory networks, which can learn long-range dependencies; the most widely used type of RNN.
  • CNN-RNN structure: CNN as encoder and RNN as decoder; the general framework for image description algorithms.
  • Attention mechanism: weighted calculation over input features in RNN modeling.
  • policy gradient: a method in reinforcement learning that directly optimizes the policy at each update.
  • GANs: Generative Adversarial Networks (Generative Adversarial Nets), networks trained through a game, without the need to pre-set the sample probability distribution.
  • generator: the generating network in a generative adversarial network.
  • discriminator: the discriminating network in a generative adversarial network.
  • BLEU: Bilingual Evaluation Understudy, an auxiliary tool mainly used for quality evaluation of machine translation.
  • METEOR: a quality evaluation standard for translation in any language.
  • CIDEr: Consensus-based Image Description Evaluation, a quality evaluation standard for image description.
  • SPICE: Semantic Propositional Image Caption Evaluation, a semantics-based image description quality evaluation standard.
  • MSCOCO: Microsoft Common Objects in Context data set, used for key point detection, target detection, image description, etc.
  • Genome: a data set with densely labeled images.
  • MLE: Maximum Likelihood Estimation, used to estimate the parameters of a probability model; a standard training method for RNNs.
  • In an embodiment of the present application, a method for generating image description information is provided.
  • The method may be, but is not limited to, applied in the hardware environment shown in FIG. 1.
  • the user device 102 acquires a target image to be processed, where the target image includes a character object A and a wall object B.
  • the memory 104 in the user device 102 stores the target image, and the processor 106 sends the target image to the server 110 through the network 108, as in steps S104-S106.
  • The server 110 executes step S108 through the processing engine 114: in step S1082, the received target image is input into the target image description information generation network, and in step S1084, target image description information for describing the target image is generated.
  • The target image description information generation network is a generation network for generating image description information, obtained after performing adversarial training using a plurality of sample images obtained from the database 112; the adversarial training is performed alternately on an initial image description information generation network that matches the target image description information generation network and an initial discrimination network.
  • the above discriminating network is used to discriminate the output result of the image description information generating network.
  • the above target image description information may be “the character object A has crossed the wall object B”.
  • the server 110 sends the generated target image description information to the user device 102 through the network 108 for display, as shown in steps S112-S114.
  • In the image description information generation method, after the target image to be processed is acquired, it is input into the target image description information generation network obtained through adversarial training, and that network is used to generate target image description information matching the target image, where the adversarial training is performed alternately on an initial image description information generation network matching the target image description information generation network and an initial discrimination network. That is, instead of the CNN-RNN structure provided by the related art, a target image description information generation network obtained through adversarial training is used.
  • A discrimination network is introduced to discriminate the output of the image description information generation network, and the two are trained alternately, so that the finally obtained target image description information generation network benefits from reinforcement learning; the evaluation indices of the image description information it generates are thereby comprehensively optimized, improving the generation quality of the image description information and overcoming the poor generation quality of the related art.
  • the above image description information generation method may be, but not limited to, applied to a terminal device having functions such as image acquisition, image recognition, or image processing.
  • the above terminal device may be user equipment, such as a mobile phone, a tablet computer, a notebook computer, a PC and other terminals, or a server, such as a data processing server, a distributed processing server, and so on.
  • The above image description information generation method can be completed on an independent terminal device; that is, the terminal device directly obtains the target image to be processed and uses the target image description information generation network to generate its description information, which reduces the generation delay caused by data transmission and improves generation efficiency.
  • the above image description information generation method can also be completed by data interaction on at least two terminal devices.
  • For example, the target image to be processed is obtained on the user device 102 and then sent to the server 110 through the network 108; the target image description information generation network in the server generates the target image description information of the target image, which is then returned to the user device 102, thereby implementing image description information generation through data interaction.
  • the aforementioned network 108 may include but is not limited to a wireless network or a wired network.
  • the wireless network includes: Bluetooth, WIFI and other networks that realize wireless communication.
  • the aforementioned wired network may include, but is not limited to, a wide area network, a metropolitan area network, and a local area network.
  • The above method for generating image description information includes:
  • S202 Acquire a target image to be processed;
  • S204 Input the target image into a target image description information generation network, where the target image description information generation network is a generation network for generating image description information, obtained after performing adversarial training using multiple sample images; the adversarial training is performed alternately, based on an initial image description information generation network that matches the target image description information generation network and an initial discrimination network, the discrimination network being used to discriminate the output results of the image description information generation network;
  • S206 Generate target image description information for describing the target image according to the output result of the target image description information generation network.
  • The above image description information generation method may be, but is not limited to, applied to image recognition scenes, image retrieval scenes, image verification scenes, and any other scene in which image description information matching the content presented in an image needs to be obtained.
  • Taking an image verification scene as an example, the target image is input into the target image description information generation network obtained through adversarial training, and that network is used to generate target image description information matching the target image. Information verification is then performed on this description information, whose generation quality has been improved, to determine whether the target image passes verification, thereby ensuring the accuracy of image verification.
  • the above scenario is only an example, which is not limited in this embodiment.
  • In this embodiment, the target image is input into the target image description information generation network to generate target image description information matching the target image, where the target image description information generation network is a generation network for generating image description information, obtained after adversarial training with the newly introduced discrimination network.
  • the target image description information generated by the target image description information generation network may be as follows: "character object A", "over”, "wall object B". This is only an example, and there is no limitation on this in this embodiment.
  • In an optional solution, before the target image to be processed is acquired, the method further includes: constructing an initial image description information generation network and an initial discrimination network; and performing adversarial training on the initial image description information generation network and the initial discrimination network to obtain the target image description information generation network.
  • The network training framework for adversarial training constructed in this embodiment may be, but is not limited to, the one shown in FIG. 4: sample images are input into the framework in turn; the image description information generation network G generates the sample image description generation information corresponding to each sample image and sends it both to the discrimination network D for discrimination and to the language model Q to obtain the corresponding evaluation score.
  • The evaluation score s of the language model Q is used, together with the discrimination result, to obtain the feedback coefficient r for adjusting the image description information generation network G, so that G is trained and optimized according to r; the trained and optimized G is then used to train and optimize the discrimination network D, and so on, alternately training G and D until a converged target image description information generation network is obtained. (A rough sketch of this loop is given below.)
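  • As a rough illustration only: the following Python sketch (all names are hypothetical; `generator`, `discriminator`, and `language_model_score` stand in for G, D, and Q) shows one plausible form of this alternating loop, not the exact procedure of the application.

```python
def adversarial_step(generator, discriminator, language_model_score,
                     image, reference, lam=0.5):
    """One alternating round: update G using the feedback coefficient r,
    then update D. All objects are hypothetical stand-ins for the networks
    G and D and the language model Q; `lam` is the weighted average coefficient."""
    # Train G: sample a caption, score it with D and Q, feed back r.
    caption = generator.sample(image)             # sample image description generation information
    p = discriminator.prob(image, caption)        # sample discrimination probability value
    s = language_model_score(caption, reference)  # evaluation score from Q
    r = lam * p + (1.0 - lam) * s                 # sample feedback coefficient
    generator.reinforce(image, caption, reward=r) # e.g. a policy-gradient update

    # Train D: the real pair (I, x_1:T) is a positive sample,
    # a freshly generated pair is a negative sample.
    fake = generator.sample(image)
    discriminator.fit(positive=(image, reference), negative=(image, fake))
    return r
```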
  • The language model may include, but is not limited to, one or more index parameters for evaluating the generation quality of the image description generation information, such as BLEU, ROUGE, METEOR, CIDEr, and SPICE.
  • These parameters correlate with human subjective judgment of image description generation information, so a comprehensive evaluation score over them can objectively reflect the generation quality of the image description generation information. (A hedged sketch of such a combined score follows.)
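  • As a sketch of such a score, the snippet below combines per-metric scores with fixed weights; `bleu1` is only a simplified stand-in, and in practice full BLEU/ROUGE/METEOR/CIDEr/SPICE implementations from standard toolkits would be plugged in.

```python
from collections import Counter

def bleu1(candidate, reference):
    """Unigram precision: a deliberately simplified stand-in for full BLEU."""
    cand = candidate.split()
    ref_counts = Counter(reference.split())
    hits = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    return hits / max(len(cand), 1)

def language_score(candidate, reference, metrics=(bleu1,), weights=None):
    """Weighted combination of caption metrics, one plausible form of the
    evaluation score s produced by the language model Q."""
    weights = weights or [1.0 / len(metrics)] * len(metrics)
    return sum(w * m(candidate, reference) for w, m in zip(weights, metrics))

# Example: language_score("a man jumps over a wall",
#                         "a person jumps over the wall")
```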
  • the initial image description information generating network constructed in this embodiment may include but is not limited to: a convolutional neural network CNN, an attention serialized language model Attention, and a recurrent neural network RNN.
  • CNN is used to extract image features in an image
  • Attention is a mechanism for updating weights in a serialized language model
  • RNN is used to learn context features.
  • Assume the sample image is image I and the corresponding sample image description information is x_{1:T}.
  • The image I is input into the CNN, which extracts local feature vectors {v_1, v_2, ..., v_k}, with k ∈ {10, 11, 12, ..., 100}, and a global feature vector v̄.
  • The local feature vectors are input into Attention to obtain the weighted average feature vector v̂_t, which is related to time t.
  • v̂_t is input into the RNN, and x_{1:T} is input into the RNN through the word embedding matrix Embedding.
  • The word embedding matrix Embedding is a model for linear transformation.
  • The convolutional neural network CNN may be, but is not limited to, an improved version of the region-based convolutional neural network R-CNN (i.e., Faster R-CNN).
  • Its backbone network is ResNet-101, which may be pre-trained on the MSCOCO and Genome data sets.
  • The attention serialized language model Attention adopts a soft attention strategy and performs weighted average processing on the image vectors of each image.
  • The recurrent neural network RNN may be, but is not limited to, a two-layer long short-term memory (LSTM) structure, as sketched below.
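  • The PyTorch sketch below shows one step of such a soft-attention, two-layer-LSTM decoder; all dimensions and the exact wiring are illustrative assumptions, not the patented architecture, and region features from Faster R-CNN are assumed to be computed upstream.

```python
import torch
import torch.nn as nn

class CaptionGeneratorStep(nn.Module):
    """One decoding step: soft attention over local features {v_1..v_k},
    then two stacked LSTM cells, then a projection to vocabulary logits."""

    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, vocab=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed_dim)        # word embedding matrix
        self.att = nn.Linear(feat_dim + hidden_dim, 1)     # soft-attention scorer
        self.lstm1 = nn.LSTMCell(2 * feat_dim + embed_dim, hidden_dim)
        self.lstm2 = nn.LSTMCell(feat_dim + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab)

    def attend(self, v_local, h):
        """Weighted average feature v̂_t from local features and a hidden state."""
        scores = self.att(torch.cat([v_local, h.expand(v_local.size(0), -1)], 1))
        alpha = torch.softmax(scores, dim=0)               # attention weights at step t
        return (alpha * v_local).sum(dim=0)                # v̂_t

    def forward(self, v_local, v_global, word, state1, state2):
        w = self.embed(word).squeeze(0)                    # word feature vector
        v_hat = self.attend(v_local, state2[0])
        h1, c1 = self.lstm1(torch.cat([v_global, v_hat, w]).unsqueeze(0), state1)
        h2, c2 = self.lstm2(torch.cat([v_hat.unsqueeze(0), h1], dim=1), state2)
        return self.out(h2), (h1, c1), (h2, c2)            # logits over the vocabulary
```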
  • The initial discrimination network constructed in this embodiment may include, but is not limited to, one of the following:
  • the first initialization discrimination network, built on a convolutional neural network structure: the feature vector output by the convolutional neural network is input into a first multi-layer perceptron (MLP) and a first classification network (such as softmax) to be converted into a probability value indicating the discrimination result.
  • The above convolutional neural network may include, but is not limited to, M layers of convolution kernels, where the i-th layer of convolution kernels performs a convolution operation on the sample image vector of the sample image according to the i-th size, i being a positive integer less than or equal to M.
  • the second initialization discrimination network, built on a recurrent neural network structure: the feature vector output by the recurrent neural network is input into a second multi-layer perceptron (MLP) and a second classification network (such as softmax) to be converted into a probability value indicating the discrimination result.
  • The above recurrent neural network may include, but is not limited to, a standard N-layer LSTM.
  • The image description information generation network G generates the image description generation information corresponding to the image and sends it both to the discrimination network D for discrimination and to the language model Q to obtain the corresponding evaluation score; then, according to the discrimination result p of D and the evaluation score s of Q, the feedback coefficient r for adjusting the image description information generation network G is obtained, so that G is trained and optimized according to r.
  • The calculation of the feedback coefficient r may include, but is not limited to: r = λ·p + (1 − λ)·s, where λ is the weighted average coefficient.
  • The target image is input into the image description information generation network to be trained; according to the value of the feedback coefficient r obtained after discrimination, the generation network is adjusted and optimized, then the output of the adjusted generation network is used in turn to adjust and optimize the discrimination network, and convergence is finally reached through alternating training, thereby obtaining the target image description information generation network.
  • In this embodiment, a target image description information generation network obtained through adversarial training with the network training framework shown in the figures above is used to process the target image, so as to generate target image description information matching the target image whose generation quality has been improved and optimized.
  • That is, a target image description information generation network obtained through adversarial training is used.
  • A discrimination network is introduced to discriminate the output of the image description information generation network, and the two are trained alternately, so that the finally obtained target image description information generation network benefits from reinforcement learning; the evaluation indices of the image description information it generates are thereby comprehensively optimized, improving the generation quality of the image description information.
  • In an optional solution, before the target image to be processed is acquired, the method further includes:
  • S1 construct an initial image description information generation network and an initial discrimination network
  • Before the target image to be processed is acquired, an initial image description information generation network and an initial discrimination network need to be constructed; both networks are then pre-trained, and adversarial training is performed on the pre-trained image description information generation network and discrimination network.
  • The initial image description information generation network may be, but is not limited to, built from a region-based convolutional neural network, an attention serialized language model, and a recurrent neural network with a two-layer long short-term memory structure.
  • the frame of the initial image description information generation network constructed can refer to the image description information generation network G shown in FIG. 5.
  • The initial discrimination network may include, but is not limited to: a CNN-type discrimination network and an RNN-type discrimination network.
  • The CNN-type discrimination network may be, but is not limited to, the first initialization discrimination network constructed from a convolutional neural network, the first multi-layer perceptron, and the first classification network; the RNN-type discrimination network may be, but is not limited to, the second initialization discrimination network constructed from a recurrent neural network, the second multi-layer perceptron, and the second classification network.
  • The two are pre-trained; the steps may be as follows:
  • First pre-train the generation network to obtain G_θ; use G_θ to generate description information that, together with the real image-description pairs, forms the pre-training set S_D; then train the initial discrimination network D_0 on S_D to obtain the pre-trained D_φ.
  • Here θ and φ are the parameters determined by training in the image description information generation network G and the discrimination network D, respectively.
  • The pre-trained G_θ and D_φ are then used to start the alternating training, realizing the adversarial training of the two networks and thereby optimizing the generation quality of the image description information generation network G. (A minimal MLE pre-training sketch is given below.)
  • the construction of the initial discrimination network includes:
  • The convolutional neural network includes M layers of convolution kernels, where the i-th layer of convolution kernels performs a convolution operation on the sample image vector of the sample image according to the i-th size, i being a positive integer less than or equal to M; the sample image vector is determined from the image feature vector of the sample image and the word feature vectors contained in the sample image description information corresponding to the sample image.
  • The multi-layer perceptron MLP may be, but is not limited to, a feed-forward neural network structure, in which the nodes of two adjacent layers are fully connected, while there are no connections between nodes of the same layer and no cross-layer connections.
  • the first initialization discriminant network includes a convolutional neural network structure with M layers of convolution kernels, a first multi-layer perceptron (MLP) and a first classification network (such as softmax).
  • Each of the M layers of convolution kernels corresponds to one size used for the convolution operation: the i-th layer convolves with the i-th size, and there are n_i convolution kernels of that size.
  • The first MLP and the first classification network (such as softmax) are used to convert the output result of the M layers of convolution kernels into a probability value indicating the discrimination result.
  • Assume the sample image is image I, with corresponding sample image description information x_{1:T}.
  • Image I is passed through the CNN to obtain a d-dimensional image feature vector; the sample image description information x_{1:T} is passed through the word embedding matrix Embedding to obtain T d-dimensional word feature vectors.
  • The resulting T+1 feature vectors are concatenated to obtain a feature matrix of size (T+1) × d.
  • The M layers of convolution kernels have M different sizes, with n_i convolution kernels of the i-th size.
  • The parameters W_T, b_T, σ, W_H, and b_H are parameters to be determined during the training process. (A sketch of this discriminator follows.)
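  • A minimal sketch of this CNN-based discrimination network (PyTorch; kernel sizes, counts, and the use of a sigmoid in place of a two-way softmax are illustrative assumptions):

```python
import torch
import torch.nn as nn

class CNNDiscriminator(nn.Module):
    """Stack the image feature vector and T word vectors into a (T+1) x d
    feature matrix, convolve with kernels of M different sizes, max-pool
    over positions, and map through an MLP to a discrimination probability."""

    def __init__(self, d=512, kernel_sizes=(1, 2, 3), n_kernels=100, hidden=256):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(1, n_kernels, kernel_size=(k, d)) for k in kernel_sizes)
        self.mlp = nn.Sequential(
            nn.Linear(n_kernels * len(kernel_sizes), hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, image_vec, word_vecs):
        # image_vec: (d,), word_vecs: (T, d) -> feature matrix (T+1, d)
        feat = torch.cat([image_vec.unsqueeze(0), word_vecs], dim=0)
        x = feat.unsqueeze(0).unsqueeze(0)                 # (1, 1, T+1, d)
        pooled = [conv(x).relu().amax(dim=2).view(-1) for conv in self.convs]
        return torch.sigmoid(self.mlp(torch.cat(pooled)))  # probability value
```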
  • The structure of the recurrent neural network includes N layers of long short-term memory networks, where N is determined according to the sample image vector of the sample image, and the sample image vector is determined from the image feature vector of the sample image and the word feature vectors contained in the corresponding sample image description information.
  • The second initialization discrimination network includes a recurrent neural network with an N-layer LSTM, a second multi-layer perceptron (MLP), and a second classification network (such as softmax).
  • The second MLP and the second classification network softmax are used to convert the output result of the N-layer LSTM into a probability value indicating the discrimination result.
  • Assume the sample image is image I, with corresponding sample image description information x_{1:T}.
  • Image I is passed through the CNN to obtain a d-dimensional image feature vector, which is fed into the LSTM as the input of the first step; after that, the word feature vectors corresponding to the sample image description information x_{1:T} are input in turn, each step producing a corresponding hidden vector h_i.
  • The parameters W_R, b_R, and σ are parameters to be determined during the training process. (A sketch of this discriminator follows.)
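  • A corresponding sketch of the RNN-based discrimination network (again with illustrative sizes, and sigmoid standing in for the two-way softmax):

```python
import torch
import torch.nn as nn

class RNNDiscriminator(nn.Module):
    """Feed the image feature vector into the LSTM first, then the word
    feature vectors of x_1:T; map the final hidden vector to a
    discrimination probability via an MLP."""

    def __init__(self, d=512, hidden=512, n_layers=1):
        super().__init__()
        self.lstm = nn.LSTM(d, hidden, num_layers=n_layers, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, image_vec, word_vecs):
        # Sequence: [image feature, word_1, ..., word_T], each d-dimensional.
        seq = torch.cat([image_vec.unsqueeze(0), word_vecs], dim=0).unsqueeze(0)
        _, (h, _) = self.lstm(seq)             # h: (n_layers, 1, hidden)
        return torch.sigmoid(self.mlp(h[-1]))  # probability from last hidden vector
```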
  • In this embodiment, adversarial training is carried out by introducing a discrimination network alongside the image description information generation network, so as to improve the generation quality of the generation network; two construction structures are provided for the discrimination network, one based on a convolutional neural network (CNN) and one based on a recurrent neural network (RNN).
  • Discrimination networks with different structures make the adversarial training process more diversified and help improve the training effect.
  • building an initial image description information generating network includes:
  • S1 Use a region-based convolutional neural network, an attention serialized language model, and a two-layer long short-term memory network to build the initial image description information generation network, where the region-based convolutional neural network extracts local feature vectors and global feature vectors from the sample image; the attention serialized language model performs weighted average processing on the local feature vectors to obtain an average feature vector; and the two-layer long short-term memory network obtains the object vector to be discriminated from the average feature vector and the global feature vector and inputs it into the initial discrimination network.
  • The RNN may, but is not limited to, use a top-down model.
  • The model uses a two-layer long short-term memory network LSTM whose inputs and outputs are crossed during training.
  • The object vectors to be discriminated may include, but are not limited to, the hidden vectors output by the two-layer LSTM.
  • Image I is input into Faster R-CNN, which extracts local feature vectors {v_1, v_2, ..., v_k}, with k ∈ {10, 11, 12, ..., 100}, and a global feature vector v̄.
  • The local feature vectors are input into Soft Attention to obtain the weighted average feature vector v̂_t, which is related to time t; v̂_t is input into the first layer LSTM1 of the RNN, and x_{1:T} is input into LSTM1 through the word embedding matrix Embedding.
  • The word embedding matrix Embedding is a model for linear transformation.
  • In this embodiment, the initial image description information generation network is constructed from a region-based convolutional neural network, an attention serialized language model, and a two-layer long short-term memory network.
  • Introducing a discrimination network for alternating training on top of this initial generation network helps optimize and improve the image description information generation network, thereby overcoming the poor quality of image description information generated by the CNN-RNN structure in the related art.
  • Performing adversarial training on the initial image description information generation network and the initial discrimination network to obtain the target image description information generation network includes:
  • S18 Determine the sample description information to be discriminated from the sample image description information, the sample image description generation information, or the sample image reference description information;
  • The current image description information generation network is adjusted according to the sample discrimination probability value to obtain the trained image description information generation network, and the current discrimination network is adjusted according to the trained generation network to obtain the trained discrimination network; the trained generation network then serves as the current generation network, and the trained discrimination network as the current discrimination network.
  • When the sample feedback coefficient indicates that the sample discrimination probability value has reached the convergence condition, the current image description information generation network is used as the target image description information generation network.
  • A specific description is given with the example shown in FIG. 9. Assume the acquired sample image is image I and the corresponding sample image description information is x_{1:T}.
  • The network frameworks of the current image description information generation network and the current discrimination network are those constructed in the examples above.
  • Image I is input into the Faster R-CNN of the current image description information generation network, which extracts local feature vectors {v_1, v_2, ..., v_k}, with k ∈ {10, 11, 12, ..., 100}, and the global feature vector v̄.
  • The local feature vectors are input into Soft Attention to obtain the weighted average feature vector v̂_t, which is related to time t.
  • The global feature vector v̄, serving as the image feature vector of image I, is input into the two-layer LSTM and the discrimination network D, respectively.
  • The above image feature vector and the word feature vectors constitute an image vector identifying the features of image I.
  • The current discrimination network D obtains positive samples {(I, x_{1:T})} and negative samples {(I, y_{1:T})} and {(I, ŷ_{1:T})}: the positive sample {(I, x_{1:T})} is formed from image I and the sample image description information x_{1:T}; the negative sample {(I, y_{1:T})} is formed from image I and the sample image description generation information y_{1:T} generated by the current generation network G; and {(I, ŷ_{1:T})} is formed from image I and the sample image reference description information ŷ_{1:T} generated by G.
  • The sample image reference description information ŷ_{1:T} is image description information generated by the current generation network G whose description quality differs from that of the sample image description generation information y_{1:T}: its order of expression differs from y_{1:T}, or its description quality differs from y_{1:T}.
  • The current discrimination network D randomly selects one sample from the above positive and negative samples as the sample description information to be discriminated, and discriminates it to obtain the sample discrimination probability value p; the language model Q also calculates the corresponding evaluation score s. The sample discrimination probability value p and the evaluation score s are used to calculate the sample feedback coefficient r, and the parameters in the current image description information generation network G are adjusted and optimized according to r, thereby training the current generation network.
  • The current image description information generation network G_k is adjusted according to the sample discrimination probability value p to obtain the trained generation network G_{k+1}, and the current discrimination network D_k is adjusted according to G_{k+1} to obtain the trained discrimination network D_{k+1}; then G_{k+1} serves as the current generation network and D_{k+1} as the current discrimination network, and the above steps are repeated to continue training.
  • When the convergence condition is reached, the current image description information generation network G_k is used as the target image description information generation network G_target. (A sketch of the generator update appears below.)
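  • One plausible reading of "adjust and optimize the parameters of G according to r" is a plain REINFORCE-style step, sketched below; this is an assumption, not the application's exact update rule.

```python
import torch

def policy_gradient_update(log_probs, r, optimizer):
    """REINFORCE-style update: log_probs is a list of log-probabilities of
    the sampled words y_1..y_T under the current generation network G_k,
    and r is the scalar sample feedback coefficient."""
    loss = -r * torch.stack(log_probs).sum()  # larger r reinforces the sampled caption
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```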
  • In this embodiment, before the current image description information generation network is adjusted according to the sample discrimination probability value to obtain the trained generation network, and the current discrimination network is adjusted according to the trained generation network to obtain the trained discrimination network, the method further includes:
  • The language model may include, but is not limited to, one or more index parameters for evaluating the generation quality of the image description generation information, such as BLEU, ROUGE, METEOR, CIDEr, and SPICE.
  • These parameters correlate with human subjective judgment of image description generation information; therefore, the comprehensive evaluation score over them can be used to indicate the correlation between the sample image description generation information and the sample image, such as a matching degree, which objectively reflects the generation quality of the image description generation information.
  • The image description information generation network G generates the image description generation information y_{1:T} corresponding to image I and sends it both to the discrimination network D for discrimination and to the language model Q to obtain the corresponding evaluation score; then, according to the discrimination result p of D and the evaluation score s of Q, the sample feedback coefficient r for adjusting G is obtained, so that G is trained and optimized according to r.
  • The calculation of the sample feedback coefficient r may include, but is not limited to: r = λ·p + (1 − λ)·s, where λ is the weighted average coefficient.
  • In an optional solution, adjusting the current image description information generation network according to the sample discrimination probability value to obtain the trained generation network includes:
  • adjusting the parameters in at least one of the following structures: the current region-based convolutional neural network, the current attention serialized language model, and the current two-layer long short-term memory network. That is, during adversarial training, the parameters in at least one of these structures may be adjusted and optimized to ensure that the trained image description information generation network has better generation quality.
  • In an optional solution, adjusting the current discrimination network based on the trained image description information generation network to obtain the trained discrimination network includes:
  • when the discrimination network is constructed based on a convolutional neural network structure, during adversarial training, sample description information randomly selected from the sample image description information, the trained sample image description generation information, or the trained sample image reference description information may be used, but is not limited to being used, to adjust and optimize the parameters in the convolutional neural network structure of the discrimination network, so as to achieve joint training of the discrimination network and the image description information generation network.
  • In an optional solution, adjusting the current discrimination network based on the trained image description information generation network to obtain the trained discrimination network includes:
  • when the discrimination network is constructed based on a recurrent neural network structure, during adversarial training, sample description information randomly selected from the sample image description information, the trained sample image description generation information, or the trained sample image reference description information may be used, but is not limited to being used, to adjust and optimize the parameters in the recurrent neural network structure of the discrimination network, so as to achieve joint training of the discrimination network and the image description information generation network.
  • FIG. 11-12 show the comparison results on each evaluation index.
  • In FIG. 11, each column represents a different objective evaluation criterion: BLEU, METEOR, ROUGE, CIDEr, and SPICE; CNN-D and RNN-D in the last two columns are the CNN-based and RNN-based discriminators proposed in the embodiments of this application.
  • CNN-GAN and RNN-GAN are the results of training with CNN discriminator and RNN discriminator, respectively.
  • Ensemble is the result of integrating four CNN-GAN models and four RNN-GAN models. The comparison in FIG. 11 shows that the training method of the embodiment of the present application effectively increases all objective indicators, with gains ranging from 1.28% to 13.93%.
  • FIG. 12 shows the test results of various algorithms on the MSCOCO leaderboard; as can be seen from the last row, the generation quality of the solution provided by the embodiment of the present application is comprehensively optimized.
  • An image description information generation device for implementing the above image description information generation method is further provided.
  • the above image description information generating device may be, but not limited to, applied to the hardware environment shown in FIG. 1.
  • the device may include:
  • the obtaining unit 1302 is used to obtain the target image to be processed
  • an input unit 1304 for inputting the target image into a target image description information generation network, where the target image description information generation network is a generation network for generating image description information, obtained after performing adversarial training using multiple sample images; the adversarial training is performed alternately, based on an initial image description information generation network that matches the target image description information generation network and an initial discrimination network, the discrimination network being used to discriminate the output result of the image description information generation network;
  • a generation unit 1306 configured to generate, according to the output result of the target image description information generation network, target image description information for describing the target image.
  • The above image description information generating device may be, but is not limited to, applied to image recognition scenes, image retrieval scenes, image verification scenes, and any other scene in which image description information matching the content presented in an image needs to be obtained.
  • the above device further includes:
  • the construction unit 1402 is configured to construct the initialized image description information generation network and the initialized discrimination network before acquiring the target image to be processed;
  • a training unit 1404 configured to perform adversarial training on the initialized image description information generation network and the initialized discriminant network to obtain the target image description information generation network.
  • The construction unit 1402 includes:
  • a first building module for constructing the first initialization discrimination network based on a convolutional neural network structure, a first multi-layer perceptron, and a first classification network, where the classification network is used to convert the feature vector output by the convolutional neural network structure into a probability value; the convolutional neural network structure includes M layers of convolution kernels, where the i-th layer of convolution kernels performs a convolution operation on the sample image vector of the sample image according to the i-th size, i being a positive integer less than or equal to M, and the sample image vector is determined from the image feature vector of the sample image and the word feature vectors contained in the corresponding sample image description information;
  • The multi-layer perceptron MLP may be, but is not limited to, a feed-forward neural network structure, in which the nodes of two adjacent layers are fully connected, while there are no connections between nodes of the same layer and no cross-layer connections.
  • the recurrent neural network structure includes an N-layer long short-term memory network, where N is determined according to the sample image vector of the sample image, and the sample image vector is determined from the image feature vector of the sample image and the word feature vectors contained in the corresponding sample image description information.
  • The construction unit 1402 further includes:
  • a third building module for constructing the initial image description information generation network using a region-based convolutional neural network, an attention serialized language model, and a two-layer long short-term memory network, where the region-based convolutional neural network is used to extract local feature vectors and global feature vectors from the sample image; the attention serialized language model is used to perform weighted average processing on the local feature vectors to obtain an average feature vector; and the two-layer long short-term memory network is used to obtain the object vector to be discriminated from the average feature vector and the global feature vector, the object vector to be discriminated being input into the initialized discrimination network.
  • The training unit 1404 includes:
  • a processing module, used to repeatedly perform the following steps until the target image description information generation network is obtained:
  • S3: Input the sample image and the sample image description information into the current image description information generation network to obtain sample image description generation information matching the sample image, or sample image reference description information matching the sample image, where the first matching degree between the sample image description generation information and the sample image is greater than the second matching degree between the sample image reference description information and the sample image;
  • S4: Determine the sample description information to be discriminated from the sample image description information, the sample image description generation information, or the sample image reference description information;
  • The training unit 1404 further includes:
  • a determining module, configured to determine the sample discrimination probability value output by the current discriminant network before the trained discriminant network is obtained;
  • an acquisition module, configured to acquire, through a language model, the first matching degree between the sample image description generation information and the sample image, where the language model includes one or more parameters for evaluating the sample image description generation information;
  • a weighted average processing module, configured to perform weighted average processing on the sample discrimination probability value and the first matching degree to obtain the sample feedback coefficient.
  • The training unit implements, through the following steps, the adjusting of the current image description information generation network according to the sample discrimination probability value to obtain the trained image description information generation network:
  • S1: Adjust, according to the sample discrimination probability value, the parameters in at least one of the following structures in the current image description information generation network: the current region-based convolutional neural network, the current attention-serialized language model, and the current double-layer long short-term memory network.
  • The training unit implements, through the following steps, the adjusting of the current discriminant network according to the trained image description information generation network to obtain the trained discriminant network:
  • S1: Obtain the trained sample image description generation information, or the trained sample image reference description information, output by the trained image description information generation network; S2: Adjust the parameters in the convolutional neural network structure, or in the recurrent neural network structure, of the current discriminant network using the sample image description information, the trained sample image description generation information, or the trained sample image reference description information, to obtain the trained discriminant network.
  • An electronic device for implementing the above image description information generation method is further provided.
  • The electronic device includes a memory 1502 and a processor 1504; a computer program is stored in the memory 1502, and the processor 1504 is configured to execute the steps in any one of the foregoing method embodiments through the computer program.
  • The above electronic device may be located in at least one network device among multiple network devices of a computer network.
  • The foregoing processor may be configured to perform the following steps through a computer program: obtain the target image to be processed; input the target image into the target image description information generation network; and generate, according to the output result of the target image description information generation network, target image description information for describing the target image.
  • The target image description information generation network is a generation network for generating image description information obtained after adversarial training with multiple sample images; the adversarial training is alternating training performed on an initialized image description information generation network matching the target image description information generation network and an initialized discriminant network; the discriminant network is used to discriminate the output results of the image description information generation network.
  • The structure shown in FIG. 15 is only illustrative; the electronic device may also be a terminal device such as a smartphone (for example, an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (MID), or a PAD.
  • FIG. 15 does not limit the structure of the above electronic device.
  • The electronic device may further include more or fewer components than those shown in FIG. 15 (such as a network interface), or have a configuration different from that shown in FIG. 15.
  • The memory 1502 can be used to store software programs and modules, such as the program instructions/modules corresponding to the image description information generation method and apparatus in the embodiments of the present application; the processor 1504 runs the software programs and modules stored in the memory 1502, thereby executing various functional applications and data processing, that is, implementing the above image description information generation method.
  • The memory 1502 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
  • The memory 1502 may further include memories remotely located relative to the processor 1504, and these remote memories may be connected to the terminal through a network.
  • The memory 1502 may specifically, but not exclusively, store information such as sample features of items and target virtual resource accounts.
  • The memory 1502 may include, but is not limited to, the acquisition unit 1302, the input unit 1304, the generation unit 1306, the construction unit 1402, and the training unit 1404 of the image description information generation apparatus.
  • It may also include, but is not limited to, other module units in the above image description information generation apparatus, which are not repeated in this example.
  • The above transmission device 1506 is used to receive or send data via a network.
  • The aforementioned network may include wired networks and wireless networks.
  • In one example, the transmission device 1506 includes a network interface controller (NIC), which can be connected to other network devices and routers through a network cable to communicate with the Internet or a local area network.
  • In one example, the transmission device 1506 is a radio frequency (RF) module, which is used to communicate with the Internet wirelessly.
  • The above electronic device further includes: a display 1508 for displaying the target image to be processed and the target image description information; and a connection bus 1510 for connecting the module components in the above electronic device.
  • A storage medium is further provided, in which a computer program is stored, where the computer program is configured to execute the steps in any one of the above method embodiments at runtime.
  • The above storage medium may be configured to store a computer program for performing the following steps: obtain the target image to be processed; input the target image into the target image description information generation network; and generate, according to the output result of the target image description information generation network, target image description information for describing the target image.
  • The target image description information generation network is a generation network for generating image description information obtained after adversarial training with multiple sample images; the adversarial training is alternating training performed on an initialized image description information generation network matching the target image description information generation network and an initialized discriminant network; the discriminant network is used to discriminate the output results of the image description information generation network.
  • The storage medium may include: a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
  • If the integrated unit in the above embodiments is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in the above computer-readable storage medium.
  • Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the existing technology, or all or part of the technical solution, may be embodied in the form of a software product, and the computer software product is stored in a storage medium.
  • Several instructions are included to enable one or more computer devices (which may be personal computers, servers, network devices, etc.) to perform all or some of the steps of the methods described in the embodiments of the present application.
  • The disclosed client may be implemented in other ways.
  • The device embodiments described above are only illustrative.
  • The division of units is only a logical function division.
  • In actual implementation there may be other division manners; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • The mutual coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or in other forms.
  • The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • Each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
  • The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

An image description information generation method and apparatus, and an electronic device. The method includes: obtaining a target image to be processed (S202); inputting the target image into a target image description information generation network, where the target image description information generation network is a generation network for generating image description information obtained after adversarial training with multiple sample images (S204), the adversarial training being alternating training performed on an initialized image description information generation network and an initialized discriminant network, and the discriminant network being used to discriminate the output results of the image description information generation network; and generating, according to the output result of the target image description information generation network, target image description information for describing the target image (S206). The method solves the technical problem that image description information generation methods provided in the related art suffer from poor generation quality.

Description

Image description information generation method and apparatus, and electronic device
This application claims priority to Chinese Patent Application No. CN201811460241.9, entitled "Image description information generation method and apparatus, and electronic device", filed with the State Intellectual Property Office on November 30, 2018, which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of computers, and specifically to an image description information generation method and apparatus, and an electronic device.
Background
To accurately recognize the content contained in an image, image description generation algorithms are currently often used to automatically generate image description information that matches the content in the image. A common generation approach uses an Encoder-Decoder structure: a convolutional neural network (CNN) is used as the encoder to encode the image information from pixel space into a hidden space, and a recurrent neural network (RNN) is used as the decoder to decode the encoded image information from the hidden space into language space.
However, although the image description information generated with the above structure can express the content in the image, the quality of the sentences used in the description cannot be guaranteed; for example, they may read poorly or fail to conform to everyday spoken-language habits. In other words, the image description information generation methods provided in the related art suffer from poor generation quality.
No effective solution to the above problem has yet been proposed.
Summary
Embodiments of this application provide an image description information generation method and apparatus, and an electronic device, to at least solve the technical problem that image description information generation methods provided in the related art suffer from poor generation quality.
According to one aspect of the embodiments of this application, an image description information generation method is provided, including: obtaining a target image to be processed; inputting the target image into a target image description information generation network, where the target image description information generation network is a generation network for generating image description information obtained after adversarial training with multiple sample images, the adversarial training being alternating training performed on an initialized image description information generation network matching the target image description information generation network and an initialized discriminant network, the discriminant network being used to discriminate the output results of the image description information generation network; and generating, according to the output result of the target image description information generation network, target image description information for describing the target image.
According to another aspect of the embodiments of this application, an image description information generation apparatus is further provided, including: an obtaining unit, configured to obtain a target image to be processed; an input unit, configured to input the target image into a target image description information generation network, where the target image description information generation network is a generation network for generating image description information obtained after adversarial training with multiple sample images, the adversarial training being alternating training performed on an initialized image description information generation network matching the target image description information generation network and an initialized discriminant network, the discriminant network being used to discriminate the output results of the image description information generation network; and a generation unit, configured to generate, according to the output result of the target image description information generation network, target image description information for describing the target image.
As an optional example, the training unit further includes: a determining module, configured to determine the sample discrimination probability value output by the current discriminant network before the current image description information generation network is adjusted according to the sample discrimination probability value to obtain a trained image description information generation network and the current discriminant network is adjusted according to the trained image description information generation network to obtain a trained discriminant network; an acquisition module, configured to acquire, through a language model, the first matching degree between the sample image description generation information and the sample image, where the language model includes one or more parameters for evaluating the sample image description generation information; and a weighted average processing module, configured to perform weighted average processing on the sample discrimination probability value and the first matching degree to obtain the sample feedback coefficient.
As an optional example, the training unit implements, through the following steps, the adjusting of the current image description information generation network according to the sample discrimination probability value to obtain a trained image description information generation network: adjusting, according to the sample discrimination probability value, the parameters in at least one of the following structures in the current image description information generation network: the current region-based convolutional neural network, the current attention-serialized language model, and the current double-layer long short-term memory network.
As an optional example, the training unit implements, through the following steps, the adjusting of the current discriminant network according to the trained image description information generation network to obtain a trained discriminant network: acquiring the trained sample image description generation information, or the trained sample image reference description information, output by the trained image description information generation network; and adjusting the parameters in the convolutional neural network structure of the current discriminant network using the sample image description information, the trained sample image description generation information, or the trained sample image reference description information, to obtain the trained discriminant network.
As an optional example, the training unit implements, through the following steps, the adjusting of the current discriminant network according to the trained image description information generation network to obtain a trained discriminant network: acquiring the trained sample image description generation information, or the trained sample image reference description information, output by the trained image description information generation network; and adjusting the parameters in the recurrent neural network structure of the current discriminant network using the sample image description information, the trained sample image description generation information, or the trained sample image reference description information, to obtain the trained discriminant network.
According to yet another aspect of the embodiments of this application, an electronic device is further provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the above image description information generation method through the computer program.
In the embodiments of this application, after the target image to be processed is obtained, it is input into a target image description information generation network obtained through adversarial training, and this network is used to generate target image description information matching the target image. In other words, instead of the CNN-RNN structure provided in the related art, a target image description information generation network obtained through adversarial training is used to generate the image description information of an image. A discriminant network is introduced in the adversarial training process to discriminate the output results of the image description information generation network, and the two networks are trained alternately, so that the finally generated target image description information generation network benefits from reinforcement learning. The evaluation indicators of the image description information generated by the target image description information generation network are thus comprehensively optimized, improving the generation quality of the image description information and overcoming the technical problem that image description information generation methods provided in the related art suffer from poor generation quality.
Brief Description of the Drawings
The drawings described here are intended to provide a further understanding of this application and form a part of this application. The exemplary embodiments of this application and their descriptions are used to explain this application and do not constitute an improper limitation on this application. In the drawings:
FIG. 1 is a schematic diagram of the hardware environment of an optional image description information generation method according to an embodiment of this application;
FIG. 2 is a schematic flowchart of an optional image description information generation method according to an embodiment of this application;
FIG. 3 is a schematic diagram of an optional image description information generation method according to an embodiment of this application;
FIG. 4 is a schematic diagram of another optional image description information generation method according to an embodiment of this application;
FIG. 5 is a schematic diagram of yet another optional image description information generation method according to an embodiment of this application;
FIG. 6 is a schematic diagram of yet another optional image description information generation method according to an embodiment of this application;
FIG. 7 is a schematic diagram of yet another optional image description information generation method according to an embodiment of this application;
FIG. 8 is a schematic diagram of yet another optional image description information generation method according to an embodiment of this application;
FIG. 9 is a schematic diagram of yet another optional image description information generation method according to an embodiment of this application;
FIG. 10 is a schematic diagram of the evaluation indicators of an optional image description information generation method according to an embodiment of this application;
FIG. 11 is a schematic diagram of the effect of an optional image description information generation method according to an embodiment of this application;
FIG. 12 is a schematic diagram of the effect of another optional image description information generation method according to an embodiment of this application;
FIG. 13 is a schematic structural diagram of an optional image description information generation apparatus according to an embodiment of this application;
FIG. 14 is a schematic structural diagram of another optional image description information generation apparatus according to an embodiment of this application;
FIG. 15 is a schematic structural diagram of an optional electronic device according to an embodiment of this application.
Detailed Description
To help those skilled in the art better understand the solutions of this application, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are only some rather than all of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort shall fall within the protection scope of this application.
It should be noted that the terms "first", "second", and so on in the specification, claims, and drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of this application described here can be implemented in orders other than those illustrated or described here. In addition, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to those steps or units clearly listed, but may include other steps or units not clearly listed or inherent to the process, method, product, or device.
To describe the above image description information generation method, the following technical terms are involved in the embodiments of this application:
CNN: Convolutional Neural Networks, used to extract image features from images.
RNN: Recurrent Neural Networks, used for language modeling and learning contextual features.
R-CNN: Region-based CNN, used for object detection and localization.
Faster R-CNN: an improved version of R-CNN that is faster and performs better.
RPN: Region Proposal Networks, a module in Faster R-CNN used to extract feature vectors of boxes that may contain objects.
LSTM: Long Short-Term Memory Networks, able to learn long-term dependencies; the most widely used type of RNN.
CNN-RNN structure: CNN as encoder and RNN as decoder; the general framework for image captioning algorithms.
Attention mechanism: weighted computation over input features in RNN modeling.
self-critical: a reinforcement learning method based on the policy gradient.
policy gradient: a method in reinforcement learning that directly learns the update policy for each action.
GANs: Generative Adversarial Nets, a game-style generative network that requires no pre-set sample probability distribution. Here, generator refers to the generator in a generative adversarial network, and discriminator refers to the discriminator in a generative adversarial network.
BLEU: Bilingual Evaluation Understudy, mainly used for quality evaluation of machine translation.
ROUGE: Recall-Oriented Understudy for Gisting Evaluation, a quality evaluation standard for text summarization.
METEOR: a quality evaluation standard for translation between any languages.
CIDEr: Consensus-based Image Description Evaluation, a quality evaluation standard for image captioning.
SPICE: Semantic Propositional Image Caption Evaluation, a semantics-based quality evaluation standard for image captioning.
MSCOCO: the Microsoft Common Objects in Context dataset, used for keypoint detection, object detection, image captioning, and so on.
Genome: a densely annotated image dataset.
MLE: Maximum Likelihood Estimation, used to estimate the parameters of a probability model; a training method for RNNs.
According to one aspect of the embodiments of this application, an image description information generation method is provided. Optionally, as an optional implementation, the method may be, but is not limited to being, applied to the hardware environment shown in FIG. 1. In step S102, the user equipment 102 obtains a target image to be processed, where the target image contains a character object A and a wall object B. The memory 104 in the user equipment 102 stores the target image, and the processor 106 sends the target image to the server 110 through the network 108, as in steps S104-S106. The server 110 performs step S108 through the processing engine 114: in step S1082, the received target image is input into the target image description information generation network; in step S1084, target image description information for describing the target image is generated. The target image description information generation network is a generation network for generating image description information obtained after adversarial training with multiple sample images obtained from the database 112; the adversarial training is alternating training performed on an initialized image description information generation network matching the target image description information generation network and an initialized discriminant network, the discriminant network being used to discriminate the output results of the image description information generation network. For the target image shown in FIG. 1, the target image description information may be "character object A has climbed over wall object B". The server 110 then sends the generated target image description information to the user equipment 102 through the network 108 for display, as in steps S112-S114.
It should be noted that, with the image description information generation method provided in this embodiment, after the target image to be processed is obtained, it is input into the target image description information generation network obtained through adversarial training, and this network is used to generate target image description information matching the target image, where the adversarial training is alternating training performed on an initialized image description information generation network matching the target image description information generation network and an initialized discriminant network. In other words, instead of the CNN-RNN structure provided in the related art, a target image description information generation network obtained through adversarial training is used to generate the image description information of an image. A discriminant network is introduced in the adversarial training process to discriminate the output results of the generation network, and the two are trained alternately, so that the finally generated target image description information generation network benefits from reinforcement learning; the evaluation indicators of the generated image description information are thus comprehensively optimized, improving the generation quality and overcoming the poor generation quality of the related art.
Optionally, the above image description information generation method may be, but is not limited to being, applied to terminal devices with functions such as image capture, image recognition, or image processing. The terminal device may be user equipment, such as a mobile phone, a tablet computer, a laptop, or a PC, or a server, such as a data processing server or a distributed processing server. Further, the method may be completed on a single terminal device; that is, the terminal device directly obtains the target image to be processed and uses the target image description information generation network to generate the target image description information, thereby reducing generation delays caused by data transmission and improving generation efficiency. Alternatively, the method may be completed through data interaction between at least two terminal devices. As shown in FIG. 1, the target image to be processed is obtained on the user equipment 102 and then sent to the server 110 through the network 108; the target image description information generation network in the server generates the target image description information of the target image, and the generated information is returned to the user equipment 102, so that the generation process is completed through data interaction and the processing burden on the user equipment is reduced. The network 108 may include, but is not limited to, a wireless network or a wired network, where the wireless network includes Bluetooth, WIFI, and other networks implementing wireless communication, and the wired network may include, but is not limited to, wide area networks, metropolitan area networks, and local area networks.
Optionally, as an optional implementation, as shown in FIG. 2, the image description information generation method includes:
S202: Obtain a target image to be processed.
S204: Input the target image into a target image description information generation network, where the target image description information generation network is a generation network for generating image description information obtained after adversarial training with multiple sample images; the adversarial training is alternating training performed on an initialized image description information generation network matching the target image description information generation network and an initialized discriminant network; the discriminant network is used to discriminate the output results of the image description information generation network.
S206: Generate, according to the output result of the target image description information generation network, target image description information for describing the target image.
Optionally, in this embodiment, the image description information generation method may be, but is not limited to being, applied to image recognition scenarios, image retrieval scenarios, image verification scenarios, and other scenarios in which image description information matching the image content presented in an image needs to be obtained.
Taking an image verification scenario as an example, after the target image to be verified is obtained, it is input into the target image description information generation network obtained through adversarial training, and this network is used to generate target image description information matching the target image. Further, information verification is performed on the target image description information, whose generation quality has been improved, to determine whether the target image passes verification, thereby ensuring the accuracy of image verification. The above scenario is merely an example, and no limitation is imposed in this embodiment.
For example, as shown in FIG. 3, after the target image is obtained, it is input into the target image description information generation network to generate target image description information matching the target image, where the target image description information generation network is a generation network for generating image description information obtained after adversarial training with a newly introduced discriminant network. In the example shown in FIG. 3, the target image description information generated by the target image description information generation network may be as follows: "character object A", "climbs over", "wall object B". This is merely an example, and no limitation is imposed in this embodiment.
Optionally, in this embodiment, before the target image to be processed is obtained, the method further includes: constructing an initialized image description information generation network and an initialized discriminant network; and performing adversarial training on the initialized image description information generation network and the initialized discriminant network to obtain the target image description information generation network.
It should be noted that the network training framework constructed in this embodiment for adversarial training may be, but is not limited to, as shown in FIG. 4. Sample images are input into the network training framework in sequence; the image description information generation network G generates sample image description generation information corresponding to each sample image and sends it both to the discriminant network D for discrimination and to the language model Q to obtain the corresponding evaluation score. Based on the discrimination result p of the discriminant network D and the evaluation score s of the language model Q, a feedback coefficient r for adjusting the image description information generation network G is obtained, so that the generation network is trained and optimized according to r; the trained and optimized generation network is further used to train and optimize the discriminant network D, and the image description information generation network G and the discriminant network D are trained alternately in this way until the final, converged target image description information generation network is obtained.
It should be noted that the language model may include, but is not limited to, one or more indicator parameters for evaluating the generation quality of image description generation information, such as BLEU, ROUGE, METEOR, CIDEr, and SPICE. These parameters correlate with human subjective judgments of image description generation information; therefore, the comprehensive evaluation score over these parameters can objectively reflect the generation quality of the image description generation information.
Optionally, the initialized image description information generation network constructed in this embodiment may include, but is not limited to: a convolutional neural network CNN, an attention-serialized language model Attention, and a recurrent neural network RNN, where the CNN is used to extract image features from images, Attention is the mechanism used for weight updating in the serialized language model, and the RNN is used to learn contextual features.
For example, as shown in FIG. 5, assume the sample image is image I and the corresponding sample image description information is x_{1:T}. Image I is input into the CNN, and the CNN extracts the local feature vectors of image I, for example {v_1, v_2, …, v_k | k = 10, 11, 12, …, 100}, as well as the global feature vector v̄. The local feature vectors are input into Attention for weighted average processing to obtain v̂_t, where v̂_t is related to the time step t. v̂_t is input into the RNN, and x_{1:T} is input into the RNN through the word embedding matrix Embedding. The output of the RNN is then taken as the image description generation information y_{1:T} matching image I generated by the image description information generation network, and is input into the discriminant network D and the language model Q, so that the image description information generation network G can be adjusted and optimized through alternating training. In this embodiment, the word embedding matrix Embedding is a model used for linear transformation.
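As an illustration of the weighted-average step above, the following is a minimal soft-attention sketch in PyTorch. It is a sketch under stated assumptions rather than the exact parameterization used here: the tensor shapes, the additive scoring layers, and the dimension names (feat_dim, hidden_dim, attn_dim) are all introduced for illustration.

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Weighted average of local features {v_1, ..., v_k}, conditioned on a
    decoder hidden state h_t, producing the time-dependent vector v̂_t."""
    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 512):
        super().__init__()
        self.proj_v = nn.Linear(feat_dim, attn_dim)    # projects local features
        self.proj_h = nn.Linear(hidden_dim, attn_dim)  # projects hidden state
        self.score = nn.Linear(attn_dim, 1)            # scalar attention score

    def forward(self, v: torch.Tensor, h_t: torch.Tensor) -> torch.Tensor:
        # v: (batch, k, feat_dim) local features; h_t: (batch, hidden_dim)
        e = self.score(torch.tanh(self.proj_v(v) + self.proj_h(h_t).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)   # (batch, k, 1) attention weights
        return (alpha * v).sum(dim=1)     # (batch, feat_dim), i.e. v̂_t
```

Because the weights alpha are recomputed from the hidden state at every step t, the averaged vector v̂_t is time-dependent, matching the description above.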
Optionally, in this embodiment, the convolutional neural network CNN may be, but is not limited to, an improved version of the region-based convolutional neural network R-CNN (namely Faster R-CNN), whose backbone network is ResNet-101 and which can be pre-trained on the MSCOCO and Genome datasets. The attention-serialized language model Attention adopts a soft-attention strategy to perform weighted average processing on the image vectors of each image. The recurrent neural network RNN may be, but is not limited to, a double-layer long short-term memory network LSTM structure.
Optionally, the initialized discriminant network constructed in this embodiment may include, but is not limited to, one of the following:
1) A first initialized discriminant network based on a convolutional neural network structure. In this first initialized discriminant network, the feature vector output by the convolutional neural network is input into a first multi-layer perceptron (MLP) and a first classification network (such as softmax) and converted into a probability value indicating the discrimination result. The convolutional neural network may include, but is not limited to, M layers of convolution kernels, where the i-th layer of convolution kernels is used to perform a convolution operation on the sample image vector of the sample image according to the i-th size, i being a positive integer less than or equal to M.
2) A second initialized discriminant network based on a recurrent neural network structure. In this second initialized discriminant network, the feature vector output by the recurrent neural network is input into a second multi-layer perceptron (MLP) and a second classification network (such as softmax) and converted into a probability value indicating the discrimination result. The recurrent neural network may include, but is not limited to, a standard N-layer LSTM.
Optionally, in this embodiment, during adversarial training the image description information generation network G generates image description generation information corresponding to an image and sends it both to the discriminant network D for discrimination and to the language model Q to obtain the corresponding evaluation score; then, based on the discrimination result p of the discriminant network D and the evaluation score s of the language model Q, the feedback coefficient r for adjusting the image description information generation network G is obtained, so that the generation network is trained and optimized according to r. The feedback coefficient r may be computed as, but is not limited to:
r = λ·p + (1-λ)·s       (1)
where λ is the weighted-average coefficient.
Through the above formula (1), the feedback coefficient r, obtained after the target image is input into the image description information generation network and the discriminant network to be trained, is used, according to its value, to adjust and optimize the image description information generation network; the output results of the adjusted generation network are then used to adjust and optimize the discriminant network, and alternating training finally reaches convergence, yielding the target image description information generation network.
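As a concrete reading of formula (1), a minimal sketch of the feedback computation follows; the function name and example values are illustrative only, and λ = 0.3 is the value reported later in this document:

```python
def feedback_coefficient(p: float, s: float, lam: float = 0.3) -> float:
    """r = lam * p + (1 - lam) * s, i.e. formula (1).

    p:   discrimination probability output by the discriminant network D
    s:   evaluation score output by the language model Q
    lam: weighted-average coefficient (lambda)
    """
    return lam * p + (1.0 - lam) * s

# Example: a caption that D rates 0.8 and Q rates 0.6
r = feedback_coefficient(p=0.8, s=0.6)  # 0.3*0.8 + 0.7*0.6 = 0.66
```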
Optionally, in this embodiment, the target image description information generation network obtained through adversarial training under the network training framework shown in the above figures is used to learn from the target image to be processed, so as to generate improved and optimized target image description information matching the target image, thereby achieving the purpose of improving the generation quality of the image description information.
Through the embodiments provided in this application, a target image description information generation network obtained through adversarial training is used. A discriminant network is introduced in the adversarial training process to discriminate the output results of the image description information generation network, and the two are trained alternately, so that the finally generated target image description information generation network benefits from reinforcement learning; the evaluation indicators of the image description information generated by the target network are thus comprehensively optimized, improving the generation quality of the image description information.
As an optional solution, before the target image to be processed is obtained, the method further includes:
S1: Construct an initialized image description information generation network and an initialized discriminant network.
S2: Perform adversarial training on the initialized image description information generation network and the initialized discriminant network to obtain the target image description information generation network.
Optionally, in this embodiment, before the target image to be processed is obtained, an initialized image description information generation network and an initialized discriminant network need to be constructed first. The initialized networks then need to be pre-trained, after which adversarial training is performed on the pre-trained image description information generation network and discriminant network.
As an optional construction approach, the initialized image description information generation network may be, but is not limited to being, constructed from a region-based convolutional neural network, an attention-serialized language model, and a recurrent neural network with a double-layer long short-term memory network. For example, the framework of the constructed initialized image description information generation network may refer to the image description information generation network G shown in FIG. 5.
As an optional construction approach, the initialized discriminant network may include, but is not limited to: a CNN-type discriminant network and an RNN-type discriminant network. The CNN-type network may be, but is not limited to, a first initialized discriminant network constructed from a convolutional neural network, a first multi-layer perceptron, and a first classification network; the RNN-type discriminant network may be, but is not limited to, a second initialized discriminant network constructed from a recurrent neural network, a second multi-layer perceptron, and a second classification network.
Further, in this embodiment, after the above initialized image description information generation network and initialized discriminant network are constructed, the two are pre-trained; the steps may be as follows:
For example, assume the initialized image description information generation network G_0, the initialized discriminant network D_0, and the pre-training set S are obtained, where S = {(I, x_{1:T})}. G_0 is pre-trained on the training set S with the maximum likelihood method MLE to obtain the pre-trained G_θ. G_θ is then used to generate a pre-training set S_D (the formula defining S_D is an image in the source; it consists of pairs of sample images and the captions generated for them by G_θ). D_0 is then pre-trained on S_D to obtain D_φ. θ and φ are the parameters of the image description information generation network G and the discriminant network D, respectively, that are determined through training.
Further, alternating training begins with the pre-trained G_θ and the pre-trained D_φ, realizing adversarial training of the two neural networks and thereby achieving the purpose of optimizing the generation quality of the image description information generation network G.
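For the MLE pre-training of G_0, a minimal teacher-forcing sketch in PyTorch might look as follows; the generator's call signature (image features plus shifted caption tokens in, per-step vocabulary logits out), the padding index 0, and the optimizer settings are assumptions of this sketch:

```python
import torch
import torch.nn as nn

def pretrain_generator_mle(generator, data_loader, epochs=1, lr=5e-4):
    """Pre-train G with maximum likelihood: maximize the probability of the
    reference caption x_{1:T} given image I (standard teacher forcing)."""
    criterion = nn.CrossEntropyLoss(ignore_index=0)   # 0 assumed to be padding
    optimizer = torch.optim.Adam(generator.parameters(), lr=lr)
    for _ in range(epochs):
        for images, captions in data_loader:          # captions: (batch, T) ids
            logits = generator(images, captions[:, :-1])  # predict next token
            loss = criterion(logits.reshape(-1, logits.size(-1)),
                             captions[:, 1:].reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```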
Through the embodiments provided in this application, an initialized image description information generation network and an initialized discriminant network are constructed, and adversarial training is performed on the constructed networks so that they constrain each other during training, thereby optimizing and improving the generation quality of the image description information generation network.
As an optional solution, constructing the initialized discriminant network includes:
1) Constructing a first initialized discriminant network based on a convolutional neural network, a first multi-layer perceptron, and a first classification network, where the first multi-layer perceptron and the first classification network are used to convert the feature vector output by the convolutional neural network structure into a probability value; the convolutional neural network includes M layers of convolution kernels, where the i-th layer of convolution kernels is used to perform a convolution operation on the sample image vector of the sample image according to the i-th size, i being a positive integer less than or equal to M; the sample image vector is determined according to the image feature vector of the sample image and the word feature vectors contained in the sample image description information corresponding to the sample image.
It should be noted that the multi-layer perceptron MLP may be, but is not limited to, a feed-forward neural network structure in which nodes of two adjacent layers are fully connected, nodes within the same layer are not connected, and there are no cross-layer connections.
This is described with reference to FIG. 6. The first initialized discriminant network includes a convolutional neural network structure with M layers of convolution kernels, a first multi-layer perceptron (MLP), and a first classification network (such as softmax). Each of the M layers of convolution kernels indicates one size used for the convolution operation; for example, the i-th layer of convolution kernels convolves according to the i-th size, with a corresponding number of kernels n_i. The first MLP and the first classification network (such as softmax) are used to convert the output results of the M layers of convolution kernels into a probability value indicating the discrimination result.
For example, assume the sample image is image I with corresponding sample image description information x_{1:T}. Passing image I through the CNN yields a d-dimensional image feature vector v̄. At the same time, the sample image description information x_{1:T} is input into the word embedding matrix Embedding to obtain T d-dimensional word feature vectors. The above T+1 feature vectors are then concatenated to obtain the feature matrix (formula (2) is reconstructed here from the surrounding description, with E·x_t denoting the word feature vector of the t-th word):
ε = [v̄, E·x_1, …, E·x_T]   (2)
where ε ∈ R^{d×(T+1)}. Convolution kernels w of different sizes are then applied to ε to obtain new feature vectors:
c = [c_1, c_2, …, c_{T-l+2}]   (3)
where
c_i = ReLU(w ∗ ε_{i:i+l-1} + b)   (4)
The M layers of convolution kernels have M different sizes, with n_i kernels of the i-th size; that is, there are Σ_{i=1}^{M} n_i convolution kernels w of different sizes in total. Assuming T = 16, the kernel window sizes and counts can be as shown in Table 1 (Table 1 is reproduced as an image in the source, and its exact values are not recoverable here).
Further, after the new feature vectors c are obtained, max-pooling is applied to each c, and all results are concatenated to obtain a new feature vector c̃, which is then passed through a multi-layer perceptron MLP with a highway structure; the structure (reconstructed here from the standard highway formulation and the parameters listed below) is:
τ = σ(W_T·c̃ + b_T),  H = ReLU(W_H·c̃ + b_H),  c̃′ = τ⊙H + (1-τ)⊙c̃   (5)
where the parameters W_T, b_T, σ, W_H, and b_H are parameters to be determined during training.
Finally, a fully connected layer followed by a sigmoid function outputs the probability value used to discriminate whether the image description generation information generated by the image description information generation network for image I is real or fake (with W and b denoting the fully connected layer's weight and bias; the notation is assumed here):
p = σ(W·c̃′ + b)   (6)
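A minimal PyTorch sketch of this CNN-type discriminant network follows. The window sizes and kernel counts are placeholders (the exact values of Table 1 are not recoverable from the source), and the highway layer follows the standard formulation assumed for formula (5):

```python
import torch
import torch.nn as nn

class CNNDiscriminator(nn.Module):
    """CNN-type discriminator: concatenate the d-dim image vector with the T
    word vectors into a (T+1) x d matrix, convolve with kernels of several
    window sizes, max-pool, pass a highway layer, then sigmoid -> p."""
    def __init__(self, vocab_size: int, d: int = 512,
                 windows=(1, 2, 3, 4), n_kernels: int = 100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, n_kernels, kernel_size=(l, d)) for l in windows])
        total = n_kernels * len(windows)
        self.gate = nn.Linear(total, total)       # transform gate tau
        self.transform = nn.Linear(total, total)  # candidate H
        self.out = nn.Linear(total, 1)            # final fully connected layer

    def forward(self, img_feat: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # img_feat: (batch, d) image vector; captions: (batch, T) token ids
        eps = torch.cat([img_feat.unsqueeze(1), self.embed(captions)], dim=1)
        eps = eps.unsqueeze(1)                          # (batch, 1, T+1, d)
        pooled = [torch.relu(conv(eps)).squeeze(3).max(dim=2).values
                  for conv in self.convs]               # max-pool each window size
        c = torch.cat(pooled, dim=1)                    # concatenated feature c~
        tau = torch.sigmoid(self.gate(c))               # highway layer, formula (5)
        c = tau * torch.relu(self.transform(c)) + (1 - tau) * c
        return torch.sigmoid(self.out(c)).squeeze(1)    # probability p, formula (6)
```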
2) Constructing a second initialized discriminant network based on a recurrent neural network, a second multi-layer perceptron, and a second classification network, where the second multi-layer perceptron and the second classification network are used to convert the feature vector output by the recurrent neural network structure into a probability value; the recurrent neural network structure includes an N-layer long short-term memory network, where N is determined according to the sample image vector of the sample image, and the sample image vector is determined according to the image feature vector of the sample image and the word feature vectors contained in the sample image description information corresponding to the sample image.
This is described with reference to FIG. 7. The second initialized discriminant network includes a recurrent neural network with N layers of LSTM, a second multi-layer perceptron (MLP), and a second classification network (such as softmax). The second MLP and the second classification network softmax are used to convert the output results of the N-layer LSTM into a probability value indicating the discrimination result.
For example, assume the sample image is image I with corresponding sample image description information x_{1:T}. Passing image I through the CNN yields a d-dimensional image feature vector v̄, which serves as the input to the first LSTM layer; each subsequent LSTM step is then fed the corresponding word feature vector from the sample image description information x_{1:T} to obtain the corresponding hidden vector h_i (formula (7) is reconstructed here from the surrounding description):
h_i = LSTM(E·x_i, h_{i-1})   (7)
Finally, the output passes through a fully connected layer and a sigmoid layer to yield the probability value used to discriminate whether the image description generation information generated for image I is real or fake:
p = σ(W_R·h_{t+1} + b_R)   (8)
where the parameters W_R, b_R, and σ are parameters to be determined during training.
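A minimal PyTorch sketch of this RNN-type discriminant network follows; using a single LSTM layer and a shared dimension d is a simplification of this sketch (the description above allows a standard N-layer LSTM):

```python
import torch
import torch.nn as nn

class RNNDiscriminator(nn.Module):
    """RNN-type discriminator: the d-dim image vector is the first input step,
    followed by one word vector per step; the final hidden state is mapped
    through a linear layer and a sigmoid to the probability p (formula (8))."""
    def __init__(self, vocab_size: int, d: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.lstm = nn.LSTM(d, d, batch_first=True)
        self.out = nn.Linear(d, 1)   # parameters W_R, b_R

    def forward(self, img_feat: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # img_feat: (batch, d); captions: (batch, T) token ids
        steps = torch.cat([img_feat.unsqueeze(1), self.embed(captions)], dim=1)
        _, (h_last, _) = self.lstm(steps)               # h_last: (1, batch, d)
        return torch.sigmoid(self.out(h_last[-1])).squeeze(1)  # p, formula (8)
```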
Through the embodiments provided in this application, a discriminant network is introduced for adversarial training with the image description information generation network to improve the generation quality of the latter. Two construction structures are provided for the discriminant network in this embodiment: one based on the convolutional neural network CNN structure and one based on the recurrent neural network RNN structure. Discriminant networks of different structures make the adversarial training process more diverse, which helps improve the training effect.
As an optional solution, constructing the initialized image description information generation network includes:
S1: Construct the initialized image description information generation network using a region-based convolutional neural network, an attention-serialized language model, and a double-layer long short-term memory network, where the region-based convolutional neural network is used to extract local feature vectors and a global feature vector from the sample image; the attention-serialized language model is used to perform weighted average processing on the local feature vectors to obtain an average feature vector; and the double-layer long short-term memory network is used to obtain the object vector to be discriminated from the average feature vector and the global feature vector, and to input the object vector to be discriminated into the initialized discriminant network.
It should be noted that, in this embodiment, the RNN may adopt, but is not limited to, the top-down model, which uses a double-layer long short-term memory network LSTM whose inputs and outputs are crossed during training. Optionally, in this embodiment, the object vector to be discriminated may include, but is not limited to, the hidden vector output by the double-layer long short-term memory network LSTM.
This is described with reference to FIG. 8. Assume the sample image is image I and the corresponding sample image description information is x_{1:T}. Image I is input into Faster R-CNN, which extracts the local feature vectors of image I, for example {v_1, v_2, …, v_k | k = 10, 11, 12, …, 100}, as well as the global feature vector v̄. The local feature vectors are input into Soft Attention for weighted average processing to obtain v̂_t, where v̂_t is related to the time step t. The global feature vector v̄ is input into the first layer LSTM1 of the RNN, and x_{1:T} is input into LSTM1 through the word embedding matrix Embedding; v̂_t is input into the second layer LSTM2 of the RNN. Each LSTM determines the hidden vector at the current time step t from the hidden vectors at the previous time step t-1 (the layer-indexed notation below is reconstructed from the surrounding description): the first layer LSTM1 determines the hidden vector h^1_t from the hidden vectors h^1_{t-1} and h^2_{t-1}, and the second layer LSTM2 determines the hidden vector h^2_t from the hidden vectors h^1_t and h^2_{t-1}. The output h^1_t of LSTM1 is used to train the weights in Soft Attention, and the output h^2_t of LSTM2 is output through the softmax layer to the discriminant network D; the loss corresponding to this round of training can further be computed and used in alternating training to adjust and optimize the image description information generation network G. In this embodiment, the word embedding matrix Embedding is a model used for linear transformation.
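A minimal sketch of one step of this double-layer top-down decoder follows; it reuses the SoftAttention sketch given earlier, and the exact input arrangement of the two LSTM cells (which vectors are concatenated at each layer) is an assumption reconstructed from the description above:

```python
import torch
import torch.nn as nn

class TopDownDecoder(nn.Module):
    """Two-layer LSTM decoder in the top-down style: LSTM1 consumes the global
    feature v̄, the previous LSTM2 hidden state and the word embedding; its
    output h1 drives soft attention; LSTM2 consumes v̂_t and h1."""
    def __init__(self, vocab_size: int, feat_dim: int = 2048, d: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.attn = SoftAttention(feat_dim, d)         # from the earlier sketch
        self.lstm1 = nn.LSTMCell(feat_dim + d + d, d)  # [v̄; h2_{t-1}; E·x_t]
        self.lstm2 = nn.LSTMCell(feat_dim + d, d)      # [v̂_t; h1_t]
        self.out = nn.Linear(d, vocab_size)

    def step(self, v, v_bar, word, state):
        # v: (batch, k, feat_dim) local features; v_bar: (batch, feat_dim)
        (h1, c1), (h2, c2) = state
        h1, c1 = self.lstm1(torch.cat([v_bar, h2, self.embed(word)], 1), (h1, c1))
        v_hat = self.attn(v, h1)          # h1 drives the attention weights
        h2, c2 = self.lstm2(torch.cat([v_hat, h1], 1), (h2, c2))
        logits = self.out(h2)             # softmax over logits feeds D / the loss
        return logits, ((h1, c1), (h2, c2))
```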
Through the embodiments provided in this application, an initialized image description information generation network is constructed using a region-based convolutional neural network, an attention-serialized language model, and a double-layer long short-term memory network, and a discriminant network is then introduced for alternating training on this basis. This facilitates the optimization and improvement of the image description information generation network, overcoming the poor generation quality of image description information generated with the CNN-RNN structure in the related art.
As an optional solution, performing adversarial training on the initialized image description information generation network and the initialized discriminant network to obtain the target image description information generation network includes:
S1: Repeat the following steps until the target image description information generation network is obtained:
S12: Determine the current image description information generation network and the current discriminant network, where the initial value of the current image description information generation network is the initialized image description information generation network, and the initial value of the current discriminant network is the initialized discriminant network;
S14: Obtain a sample image and sample image description information corresponding to the sample image;
S16: Input the sample image and the sample image description information into the current image description information generation network to obtain sample image description generation information matching the sample image, or sample image reference description information matching the sample image, where the first matching degree between the sample image description generation information and the sample image is greater than the second matching degree between the sample image reference description information and the sample image;
S18: Determine the sample description information to be discriminated from the sample image description information, the sample image description generation information, or the sample image reference description information;
S20: Input the sample image and the sample description information to be discriminated into the current discriminant network to obtain a sample discrimination probability value and a sample feedback coefficient;
S22: When the sample feedback coefficient indicates that the sample discrimination probability value has not reached the convergence condition, adjust the current image description information generation network according to the sample discrimination probability value to obtain a trained image description information generation network, and adjust the current discriminant network according to the trained image description information generation network to obtain a trained discriminant network; take the trained image description information generation network as the current image description information generation network, and take the trained discriminant network as the current discriminant network. When the sample feedback coefficient indicates that the sample discrimination probability value has reached the convergence condition, take the current image description information generation network as the target image description information generation network.
This is described with reference to the example shown in FIG. 9. Assume the obtained sample image is image I and the corresponding sample image description information is x_{1:T}. The network frameworks of the current image description information generation network and the current discriminant network take the frameworks constructed in the above examples as examples.
Image I is input into the Faster R-CNN in the current image information generation network, which extracts the local feature vectors of image I, for example {v_1, v_2, …, v_k | k = 10, 11, 12, …, 100}, as well as the global feature vector v̄. The local feature vectors are input into Soft Attention for weighted average processing to obtain v̂_t, where v̂_t is related to the time step t. The global feature vector v̄, serving as the image feature vector of image I, is input into the double-layer LSTM and the discriminant network D, respectively. The sample image description information x_{1:T} is input into the word embedding matrix Embedding in the current image information generation network to obtain the word feature vectors corresponding to image I. The above image feature vector and word feature vectors constitute the image vector used to identify the features of image I.
Further, during adversarial training based on the network framework constructed above, the current discriminant network D obtains the positive sample {(I, x_{1:T})} and the negative samples {(I, y_{1:T})} and {(I, ŷ_{1:T})}. The positive sample {(I, x_{1:T})} is obtained from image I and the sample image description information x_{1:T}; the negative sample {(I, y_{1:T})} is obtained from image I and the sample image description generation information y_{1:T} generated by the current image description information generation network G; and {(I, ŷ_{1:T})} is obtained from image I and the sample image reference description information ŷ_{1:T} generated by the current image description information generation network G (the notation ŷ_{1:T} for the reference description is reconstructed from the surrounding description). The sample image reference description information ŷ_{1:T} is image description information generated by the current generation network G whose description quality differs from that of the sample image description generation information y_{1:T}; for example, the expression order of ŷ_{1:T} differs from that of y_{1:T}, or the expression habits of ŷ_{1:T} differ from those of y_{1:T}. It should be noted that the matching degree between the sample image description generation information y_{1:T} and image I is higher than that between the sample image reference description information ŷ_{1:T} and image I; that is, the generation quality of y_{1:T} is higher than that of ŷ_{1:T}.
Then, the current discriminant network D randomly selects one sample from the above positive and negative samples as the sample description information to be discriminated, discriminates it, and obtains the sample discrimination probability value p. Further, the language model Q also computes the corresponding evaluation score s. The sample discrimination probability value p and the evaluation score s are used to compute the sample feedback coefficient r, and the parameters in the current image description information generation network G are adjusted and optimized according to r, realizing training of the current image description information generation network.
When the sample feedback coefficient r indicates that the sample discrimination probability value p has not reached the convergence condition, the current image description information generation network G_k is adjusted according to p to obtain the trained image description information generation network G_{k+1}, and the current discriminant network D_k is adjusted according to G_{k+1} to obtain the trained discriminant network D_{k+1}; G_{k+1} is then taken as the current image description information generation network G_k, and D_{k+1} as the current discriminant network D_k, and the above steps are repeated to continue training. When the sample feedback coefficient r indicates that p has reached the convergence condition, the current image description information generation network G_k is taken as the target image description information generation network G_target.
Through the embodiments provided in this application, after the current image description information generation network and the current discriminant network are determined, alternating training is repeatedly performed on the two to realize adversarial training and optimization, until a target image description information generation network with improved generation quality is obtained. This overcomes the poor description quality of image description information obtained in the related art by merely performing simple encoding and decoding on images with the CNN-RNN structure, further improving the quality of image descriptions.
As an optional solution, before the current image description information generation network is adjusted according to the sample discrimination probability value to obtain a trained image description information generation network, and the current discriminant network is adjusted according to the trained image description information generation network to obtain a trained discriminant network, the method further includes:
S1: Determine the sample discrimination probability value output by the current discriminant network.
S2: Obtain, through a language model, the first matching degree between the sample image description generation information and the sample image, where the language model includes one or more parameters for evaluating the sample image description generation information.
S3: Perform weighted average processing on the sample discrimination probability value and the first matching degree to obtain the sample feedback coefficient.
It should be noted that the language model may include, but is not limited to, one or more indicator parameters for evaluating the generation quality of image description generation information, such as BLEU, ROUGE, METEOR, CIDEr, and SPICE. These parameters correlate with human subjective judgments of image description generation information; therefore, the comprehensive evaluation score over these parameters can be used to indicate the correlation between the sample image description generation information and the sample image, namely the matching degree, which can further be used to objectively reflect the generation quality of the image description generation information.
This is described with reference to the example shown in FIG. 9. During adversarial training, the image description information generation network G generates the image description generation information y_{1:T} corresponding to image I and sends it both to the discriminant network D for discrimination and to the language model Q to obtain the corresponding evaluation score. Then, based on the discrimination result p of the discriminant network D and the evaluation score s of the language model Q, the sample feedback coefficient r for adjusting the image description information generation network G is obtained, so that the generation network is trained and optimized according to r. The sample feedback coefficient r may be computed as, but is not limited to:
r = λ·p + (1-λ)·s        (9)
where λ is the weighted-average coefficient.
Through the embodiments provided in this application, the discriminant network and the language model jointly determine how the image description information generation network is adjusted and optimized, which helps improve the training quality of the generation network, so that the target image description information generated by the finally trained target image description information generation network is of better quality and reflects the content in the image more objectively and accurately.
As an optional solution, adjusting the current image description information generation network according to the sample discrimination probability value to obtain a trained image description information generation network includes:
S1: Adjust, according to the sample discrimination probability value, the parameters in at least one of the following structures in the current image description information generation network: the current region-based convolutional neural network, the current attention-serialized language model, and the current double-layer long short-term memory network.
Optionally, in this embodiment, when the image description information generation network is constructed from a region-based convolutional neural network, an attention-serialized language model, and a double-layer long short-term memory network, the parameters adjusted in the generation network during adversarial training include parameters in at least one of the following structures: the current region-based convolutional neural network, the current attention-serialized language model, and the current double-layer long short-term memory network. That is, during adversarial training, the parameters in at least one of these structures may be, but are not limited to being, adjusted and optimized to ensure better generation quality of the trained image description information generation network.
As an optional solution, adjusting the current discriminant network according to the trained image description information generation network to obtain a trained discriminant network includes:
S1: Obtain the trained sample image description generation information, or the trained sample image reference description information, output by the trained image description information generation network.
S2: Adjust the parameters in the convolutional neural network structure of the current discriminant network using the sample image description information, the trained sample image description generation information, or the trained sample image reference description information, to obtain the trained discriminant network.
Optionally, in this embodiment, when the discriminant network is constructed based on a convolutional neural network structure, during adversarial training the sample description information to be discriminated, randomly selected from the sample image description information, the trained sample image description generation information, or the trained sample image reference description information, may be, but is not limited to being, used to adjust and optimize the parameters in the convolutional neural network structure of the discriminant network, realizing joint training of the discriminant network and the image description information generation network.
As an optional solution, adjusting the current discriminant network according to the trained image description information generation network to obtain a trained discriminant network includes:
S1: Obtain the trained sample image description generation information, or the trained sample image reference description information, output by the trained image description information generation network.
S2: Adjust the parameters in the recurrent neural network structure of the current discriminant network using the sample image description information, the trained sample image description generation information, or the trained sample image reference description information, to obtain the trained discriminant network.
Optionally, in this embodiment, when the discriminant network is constructed based on a recurrent neural network structure, during adversarial training the sample description information to be discriminated, randomly selected from the sample image description information, the trained sample image description generation information, or the trained sample image reference description information, may be, but is not limited to being, used to adjust and optimize the parameters in the recurrent neural network structure of the discriminant network, realizing joint training of the discriminant network and the image description information generation network.
This is described with the following example. Assume the image description generation network G_θ, the discriminant network D_φ, the language model Q, and the training set S = {(I, x_{1:T})} are obtained. Adversarial training then proceeds through the following steps to obtain the optimal parameters θ of the generation network G_θ and the optimal parameters φ of the discriminant network D_φ (the algorithm below is reconstructed from the source, in which several formula images are not recoverable):
S1: Randomly initialize G_θ and D_φ.
S2: Pre-train G_θ on the training set S with the MLE method.
S3: Generate the pre-training set S_D with the pre-trained G_θ.
S4: Pre-train D_φ on S_D.
S5: Repeat the following steps until the convergence condition is met:
S6: for g-steps = 1:g do
S7: Generate a mini-batch {(I, y_{1:T})} with G_θ.
S8: Compute the p value through D_φ.
S9: Compute the s value through Q.
S10: Compute the r value by combining D_φ and Q.
S11: Update the parameters θ with the self-critical reinforcement learning method.
S12: end for
S13: for d-steps = 1:d do
S14: Generate negative samples {(I, y_{1:T})} with G_θ, and combine them with the negative samples {(I, ŷ_{1:T})} and the positive samples {(I, x_{1:T})}.
S15: Update the parameters φ.
S16: end for
Preferably, the parameters that may be, but are not limited to being, determined through the above adversarial training are as follows: λ = 0.3, Q = CIDEr-D, g = 1, d = 1. This is merely an example, and no limitation is imposed in this embodiment.
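Putting steps S5 to S16 together, a condensed training-loop sketch might look as follows. The interfaces G.sample (returning sampled captions and their per-sample summed log-probabilities), D(images, captions), and Q(captions, refs) are assumptions of this sketch that stand in for the concrete networks and metric; the greedy rollout serves as the self-critical baseline:

```python
import torch

def adversarial_train(G, D, Q, loader, opt_G, opt_D,
                      g=1, d=1, lam=0.3, epochs=10):
    """Alternating training of generator G and discriminator D (S5-S16)."""
    bce = torch.nn.BCELoss()
    for _ in range(epochs):
        for images, refs in loader:
            for _ in range(g):                             # g-steps: update theta
                samples, logp = G.sample(images)           # mini-batch {(I, y_1:T)}
                greedy, _ = G.sample(images, greedy=True)  # self-critical baseline
                r = lam * D(images, samples) + (1 - lam) * Q(samples, refs)
                r_base = lam * D(images, greedy) + (1 - lam) * Q(greedy, refs)
                loss_g = -((r - r_base).detach() * logp).mean()
                opt_G.zero_grad(); loss_g.backward(); opt_G.step()
            for _ in range(d):                             # d-steps: update phi
                fake, _ = G.sample(images)                 # negative samples (I, y)
                p_real = D(images, refs)                   # positive samples (I, x)
                p_fake = D(images, fake)
                loss_d = bce(p_real, torch.ones_like(p_real)) + \
                         bce(p_fake, torch.zeros_like(p_fake))
                opt_D.zero_grad(); loss_d.backward(); opt_D.step()
```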
FIG. 10 shows the correlations between each objective evaluation indicator (BLEU, ROUGE, METEOR, CIDEr, SPICE) and users' subjective evaluation. It can be seen that SPICE correlates most strongly with user evaluation, METEOR and CIDEr also correlate well, and BLEU and ROUGE are relatively low.
The generation quality of the target image description information generated by the target image description information generation network provided in the embodiments of this application is noticeably improved. The image description generation framework in the embodiments of this application can also be applied to other image captioning algorithms based on reinforcement learning training. FIGS. 11-12 show the comparison results on each evaluation indicator. In FIG. 11, the columns represent the different objective evaluation criteria BLEU, METEOR, ROUGE, CIDEr, and SPICE; in the last two columns, CNN-D and RNN-D are the discrimination results of the target image description information generation network obtained with the CNN discriminator and with the RNN discriminator, respectively, as proposed in the embodiments of this application. "None" refers to training without GANs; CNN-GAN and RNN-GAN are the results trained with the CNN discriminator and the RNN discriminator, respectively; Ensemble is the result of ensembling four CNN-GAN and four RNN-GAN models. The comparison results in FIG. 11 show that the training method of the embodiments of this application can effectively improve the values of all objective indicators, with improvements ranging from 1.28% to 13.93%. FIG. 12 shows the test results of various algorithms on the MSCOCO competition leaderboard; the last row shows that the generation quality of the solution provided in the embodiments of this application has been comprehensively optimized.
It should be noted that, for brevity, each of the foregoing method embodiments is expressed as a combination of a series of actions; however, those skilled in the art should understand that this application is not limited by the described order of actions, because according to this application some steps may be performed in other orders or simultaneously. In addition, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by this application.
According to another aspect of the embodiments of this application, an image description information generation apparatus for implementing the above image description information generation method is further provided. As an optional implementation, the image description information generation apparatus may be, but is not limited to being, applied to the hardware environment shown in FIG. 1. Optionally, as shown in FIG. 13, the apparatus may include:
1) an obtaining unit 1302, configured to obtain a target image to be processed;
2) an input unit 1304, configured to input the target image into a target image description information generation network, where the target image description information generation network is a generation network for generating image description information obtained after adversarial training with multiple sample images; the adversarial training is alternating training performed on an initialized image description information generation network matching the target image description information generation network and an initialized discriminant network; and the discriminant network is used to discriminate the output results of the image description information generation network;
3) a generation unit 1306, configured to generate, according to the output result of the target image description information generation network, target image description information for describing the target image.
Optionally, in this embodiment, the image description information generation apparatus may be, but is not limited to being, applied to image recognition scenarios, image retrieval scenarios, image verification scenarios, and other scenarios in which image description information matching the image content presented in an image needs to be obtained. As an optional solution, as shown in FIG. 14, the apparatus further includes:
1) a construction unit 1402, configured to construct the initialized image description information generation network and the initialized discriminant network before the target image to be processed is obtained;
2) a training unit 1404, configured to perform adversarial training on the initialized image description information generation network and the initialized discriminant network to obtain the target image description information generation network.
As an optional solution, the construction unit 1402 includes:
1) a first construction module, configured to construct a first initialized discriminant network based on a convolutional neural network structure, a first multi-layer perceptron, and a first classification network, where the first multi-layer perceptron and the first classification network are used to convert the feature vector output by the convolutional neural network structure into a probability value; the convolutional neural network structure includes M layers of convolution kernels, where the i-th layer of convolution kernels is used to perform a convolution operation on the sample image vector of the sample image according to the i-th size, i being a positive integer less than or equal to M; the sample image vector is determined according to the image feature vector of the sample image and the word feature vectors contained in the sample image description information corresponding to the sample image;
It should be noted that the multi-layer perceptron MLP may be, but is not limited to, a feed-forward neural network structure in which nodes of two adjacent layers are fully connected, nodes within the same layer are not connected, and there are no cross-layer connections.
For the structure of the first initialized discriminant network, refer to the description of FIG. 6 above, which is not repeated here. 2) a second construction module, configured to construct a second initialized discriminant network based on a recurrent neural network structure, a second multi-layer perceptron, and a second classification network, where the second multi-layer perceptron and the second classification network are used to convert the feature vector output by the recurrent neural network structure into a probability value; the recurrent neural network structure includes an N-layer long short-term memory network, where N is determined according to the sample image vector of the sample image, and the sample image vector is determined according to the image feature vector of the sample image and the word feature vectors contained in the sample image description information corresponding to the sample image.
For the structure of the second initialized discriminant network, refer to the description of FIG. 7 above, which is not repeated here.
As an optional solution, the construction unit 1402 includes:
1) a third construction module, configured to construct the initialized image description information generation network using a region-based convolutional neural network, an attention-serialized language model, and a double-layer long short-term memory network, where the region-based convolutional neural network is used to extract local feature vectors and a global feature vector from the sample image; the attention-serialized language model is used to perform weighted average processing on the local feature vectors to obtain an average feature vector; and the double-layer long short-term memory network is used to obtain the object vector to be discriminated from the average feature vector and the global feature vector, the object vector to be discriminated being input into the initialized discriminant network.
For the specific implementation of the third construction module, refer to the description of FIG. 8 above, which is not repeated here. As an optional solution, the training unit 1404 includes:
1) a processing module, configured to repeat the following steps until the target image description information generation network is obtained:
S1: Determine the current image description information generation network and the current discriminant network, where the initial value of the current image description information generation network is the initialized image description information generation network, and the initial value of the current discriminant network is the initialized discriminant network;
S2: Obtain the sample image and the sample image description information corresponding to the sample image;
S3: Input the sample image and the sample image description information into the current image description information generation network to obtain sample image description generation information matching the sample image, or sample image reference description information matching the sample image, where the first matching degree between the sample image description generation information and the sample image is greater than the second matching degree between the sample image reference description information and the sample image;
S4: Determine the sample description information to be discriminated from the sample image description information, the sample image description generation information, or the sample image reference description information;
S5: Input the sample image and the sample description information to be discriminated into the current discriminant network to obtain a sample discrimination probability value and a sample feedback coefficient;
S6: When the sample feedback coefficient indicates that the sample discrimination probability value has not reached the convergence condition, adjust the current image description information generation network according to the sample discrimination probability value to obtain a trained image description information generation network, and adjust the current discriminant network according to the trained image description information generation network to obtain a trained discriminant network; take the trained image description information generation network as the current image description information generation network, and take the trained discriminant network as the current discriminant network. When the sample feedback coefficient indicates that the sample discrimination probability value has reached the convergence condition, take the current image description information generation network as the target image description information generation network.
For the specific implementation process of the processing module, refer to the description of FIG. 9 above, which is not repeated here.
As an optional solution, the training unit 1404 further includes:
1) a determining module, configured to determine the sample discrimination probability value output by the current discriminant network before the current image description information generation network is adjusted according to the sample discrimination probability value to obtain a trained image description information generation network and the current discriminant network is adjusted according to the trained image description information generation network to obtain a trained discriminant network;
2) an acquisition module, configured to acquire, through a language model, the first matching degree between the sample image description generation information and the sample image, where the language model includes one or more parameters for evaluating the sample image description generation information;
3) a weighted average processing module, configured to perform weighted average processing on the sample discrimination probability value and the first matching degree to obtain the sample feedback coefficient.
For the specific implementation of the language model, refer to the relevant description above, which is not repeated here.
For the specific implementation process of the training unit, refer to the description of FIG. 9 above, which is not repeated here.
As an optional solution, the training unit implements, through the following steps, the adjusting of the current image description information generation network according to the sample discrimination probability value to obtain a trained image description information generation network:
S1: Adjust, according to the sample discrimination probability value, the parameters in at least one of the following structures in the current image description information generation network: the current region-based convolutional neural network, the current attention-serialized language model, and the current double-layer long short-term memory network.
As an optional solution, the training unit implements, through the following steps, the adjusting of the current discriminant network according to the trained image description information generation network to obtain a trained discriminant network:
S1: Obtain the trained sample image description generation information, or the trained sample image reference description information, output by the trained image description information generation network.
S2: Adjust the parameters in the convolutional neural network structure of the current discriminant network using the sample image description information, the trained sample image description generation information, or the trained sample image reference description information, to obtain the trained discriminant network.
As an optional solution, the training unit implements, through the following steps, the adjusting of the current discriminant network according to the trained image description information generation network to obtain a trained discriminant network:
S1: Obtain the trained sample image description generation information, or the trained sample image reference description information, output by the trained image description information generation network.
S2: Adjust the parameters in the recurrent neural network structure of the current discriminant network using the sample image description information, the trained sample image description generation information, or the trained sample image reference description information, to obtain the trained discriminant network.
It should be noted that, for the specific implementation of each unit and module in the image description information generation apparatus provided in the embodiments of this application, refer to the relevant description of the image description generation method provided in the embodiments of this application. According to yet another aspect of the embodiments of this application, an electronic device for implementing the above image description information generation method is further provided. As shown in FIG. 15, the electronic device includes a memory 1502 and a processor 1504; a computer program is stored in the memory 1502, and the processor 1504 is configured to execute the steps in any one of the above method embodiments through the computer program.
Optionally, in this embodiment, the electronic device may be located in at least one network device among multiple network devices of a computer network.
Optionally, in this embodiment, the processor may be configured to execute the following steps through the computer program:
S1: Obtain a target image to be processed.
S2: Input the target image into a target image description information generation network, where the target image description information generation network is a generation network for generating image description information obtained after adversarial training with multiple sample images; the adversarial training is alternating training performed on an initialized image description information generation network matching the target image description information generation network and an initialized discriminant network; and the discriminant network is used to discriminate the output results of the image description information generation network.
S3: Generate, according to the output result of the target image description information generation network, target image description information for describing the target image.
Optionally, a person of ordinary skill in the art can understand that the structure shown in FIG. 15 is merely illustrative; the electronic device may also be a terminal device such as a smartphone (for example, an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (MID), or a PAD. FIG. 15 does not limit the structure of the electronic device; for example, the electronic device may further include more or fewer components than those shown in FIG. 15 (such as a network interface), or have a configuration different from that shown in FIG. 15.
The memory 1502 may be used to store software programs and modules, such as the program instructions/modules corresponding to the image description information generation method and apparatus in the embodiments of this application; the processor 1504 runs the software programs and modules stored in the memory 1502, thereby executing various functional applications and data processing, that is, implementing the above image description information generation method. The memory 1502 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 1502 may further include memories remotely located relative to the processor 1504, and these remote memories may be connected to the terminal through a network. Examples of the network include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof. The memory 1502 may specifically, but not exclusively, be used to store information such as sample features of items and target virtual resource accounts. As an example, as shown in FIG. 15, the memory 1502 may include, but is not limited to, the obtaining unit 1302, the input unit 1304, the generation unit 1306, the construction unit 1402, and the training unit 1404 of the above image description information generation apparatus. In addition, it may also include, but is not limited to, other module units in the above apparatus, which are not repeated in this example.
Optionally, the transmission device 1506 is used to receive or send data via a network. Specific examples of the network may include wired networks and wireless networks. In one example, the transmission device 1506 includes a network interface controller (NIC), which can be connected to other network devices and routers through a network cable to communicate with the Internet or a local area network. In one example, the transmission device 1506 is a radio frequency (RF) module, which is used to communicate with the Internet wirelessly.
In addition, the electronic device further includes: a display 1508, configured to display the target image to be processed and the target image description information; and a connection bus 1510, configured to connect the module components in the electronic device.
According to yet another aspect of the embodiments of this application, a storage medium is further provided, where a computer program is stored in the storage medium and the computer program is configured to execute, at runtime, the steps in any one of the above method embodiments.
Optionally, in this embodiment, the storage medium may be configured to store a computer program for executing the following steps:
S1: Obtain a target image to be processed.
S2: Input the target image into a target image description information generation network, where the target image description information generation network is a generation network for generating image description information obtained after adversarial training with multiple sample images; the adversarial training is alternating training performed on an initialized image description information generation network matching the target image description information generation network and an initialized discriminant network; and the discriminant network is used to discriminate the output results of the image description information generation network.
S3: Generate, according to the output result of the target image description information generation network, target image description information for describing the target image.
Optionally, in this embodiment, a person of ordinary skill in the art can understand that all or some of the steps of the various methods in the above embodiments can be completed by instructing hardware related to the terminal device through a program. The program may be stored in a computer-readable storage medium, and the storage medium may include: a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and so on.
The sequence numbers of the above embodiments of this application are merely for description and do not represent the superiority or inferiority of the embodiments.
If the integrated unit in the above embodiments is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in the above computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part contributing to the existing technology, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to enable one or more computer devices (which may be personal computers, servers, network devices, or the like) to perform all or some of the steps of the methods described in the embodiments of this application.
In the above embodiments of this application, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the relevant descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed client may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division of units is merely a logical function division, and there may be other division manners in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or in other forms.
The units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
The above is only the preferred implementation of this application. It should be noted that those of ordinary skill in the art can also make several improvements and refinements without departing from the principles of this application, and these improvements and refinements shall also be regarded as falling within the protection scope of this application.

Claims (15)

  1. An image description information generation method, applied to a server, comprising:
    obtaining a target image to be processed;
    inputting the target image into a target image description information generation network, wherein the target image description information generation network is a generation network for generating image description information obtained through adversarial training with multiple sample images, the adversarial training being alternating training performed on an initialized image description information generation network and an initialized discriminant network, and the discriminant network being used to discriminate output results of the image description information generation network; and
    generating, according to an output result of the target image description information generation network, target image description information for describing the target image.
  2. The method according to claim 1, before the obtaining a target image to be processed, further comprising:
    constructing the initialized image description information generation network and the initialized discriminant network; and
    performing adversarial training on the initialized image description information generation network and the initialized discriminant network to obtain the target image description information generation network.
  3. The method according to claim 2, wherein constructing the initialized discriminant network comprises:
    constructing a first initialized discriminant network based on a convolutional neural network, a first multi-layer perceptron, and a first classification network, wherein the first multi-layer perceptron and the first classification network are used to convert a feature vector output by the convolutional neural network into a probability value, and the convolutional neural network comprises M layers of convolution kernels, an i-th layer of convolution kernels among the M layers being used to perform a convolution operation on a sample image vector of the sample image according to an i-th size, i being a positive integer less than or equal to M, the sample image vector being determined according to an image feature vector of the sample image and word feature vectors contained in sample image description information corresponding to the sample image; or
    constructing a second initialized discriminant network based on a recurrent neural network, a second multi-layer perceptron, and a second classification network, wherein the second multi-layer perceptron and the second classification network are used to convert a feature vector output by the recurrent neural network into a probability value, and the recurrent neural network comprises an N-layer long short-term memory network, N being determined according to the sample image vector of the sample image, the sample image vector being determined according to the image feature vector of the sample image and the word feature vectors contained in the sample image description information corresponding to the sample image.
  4. The method according to claim 3, wherein constructing the initialized image description information generation network comprises:
    constructing the initialized image description information generation network using a region-based convolutional neural network, an attention-serialized language model, and a double-layer long short-term memory network, wherein the region-based convolutional neural network is used to extract local feature vectors and a global feature vector from the sample image, the attention-serialized language model is used to perform weighted average processing on the local feature vectors to obtain an average feature vector, and the double-layer long short-term memory network is used to obtain an object vector to be discriminated from the average feature vector and the global feature vector and to input the object vector to be discriminated into the initialized discriminant network.
  5. The method according to claim 2, wherein performing adversarial training on the initialized image description information generation network and the initialized discriminant network to obtain the target image description information generation network comprises:
    obtaining the sample image and sample image description information corresponding to the sample image;
    inputting the sample image and the sample image description information into a current image description information generation network to obtain sample image description generation information matching the sample image, or sample image reference description information matching the sample image, wherein a first matching degree between the sample image description generation information and the sample image is greater than a second matching degree between the sample image reference description information and the sample image, an initial value of the current image description information generation network is the initialized image description information generation network, and network parameters of the current image description information generation network are adjusted and updated along with the training process; determining sample description information to be discriminated from the sample image description information, the sample image description generation information, or the sample image reference description information;
    inputting the sample image and the sample description information to be discriminated into the current discriminant network to obtain a sample discrimination probability value and a sample feedback coefficient; and
    when the sample feedback coefficient indicates that the sample discrimination probability value has not reached a convergence condition, adjusting the current image description information generation network according to the sample discrimination probability value to obtain a trained image description information generation network, adjusting the current discriminant network according to the trained image description information generation network to obtain a trained discriminant network, returning to the step of obtaining the sample image and the sample image description information corresponding to the sample image, and continuing training with the trained image description information generation network and the trained discriminant network, until, when the sample feedback coefficient indicates that the sample discrimination probability value has reached the convergence condition, taking the current image description information generation network as the target image description information generation network.
  6. The method according to claim 5, before the adjusting the current image description information generation network according to the sample discrimination probability value and the adjusting the current discriminant network according to the trained image description information generation network, further comprising:
    determining the sample discrimination probability value output by the current discriminant network;
    obtaining, through a language model, the first matching degree between the sample image description generation information and the sample image; and
    performing weighted average processing on the sample discrimination probability value and the first matching degree to obtain the sample feedback coefficient.
  7. The method according to claim 5, wherein adjusting the current image description information generation network according to the sample discrimination probability value comprises:
    adjusting, according to the sample discrimination probability value, parameters in at least one of the following structures in the current image description information generation network: a current region-based convolutional neural network, a current attention-serialized language model, and a current double-layer long short-term memory network.
  8. The method according to claim 5, wherein adjusting the current discriminant network according to the trained image description information generation network to obtain the trained discriminant network comprises:
    obtaining trained sample image description generation information or trained sample image reference description information output by the trained image description information generation network; and
    adjusting parameters in a convolutional neural network structure of the current discriminant network using the sample image description information, the trained sample image description generation information, or the trained sample image reference description information, to obtain the trained discriminant network.
  9. The method according to claim 5, wherein adjusting the current discriminant network according to the trained image description information generation network to obtain the trained discriminant network comprises:
    obtaining trained sample image description generation information, or trained sample image reference description information, output by the trained image description information generation network; and
    adjusting parameters in a recurrent neural network structure of the current discriminant network using the sample image description information, the trained sample image description generation information, or the trained sample image reference description information, to obtain the trained discriminant network.
  10. An image description information generation apparatus, comprising:
    an obtaining unit, configured to obtain a target image to be processed;
    an input unit, configured to input the target image into a target image description information generation network, wherein the target image description information generation network is a generation network for generating image description information obtained after adversarial training with multiple sample images, the adversarial training being alternating training performed on an initialized image description information generation network and an initialized discriminant network, and the discriminant network being used to discriminate output results of the image description information generation network; and
    a generation unit, configured to generate, according to an output result of the target image description information generation network, target image description information for describing the target image.
  11. The apparatus according to claim 10, further comprising:
    a construction unit, configured to construct the initialized image description information generation network and the initialized discriminant network before the target image to be processed is obtained; and
    a training unit, configured to perform adversarial training on the initialized image description information generation network and the initialized discriminant network to obtain the target image description information generation network.
  12. The apparatus according to claim 11, wherein the construction unit comprises:
    a first construction module, configured to construct a first initialized discriminant network based on a convolutional neural network, a first multi-layer perceptron, and a first classification network, wherein the first multi-layer perceptron and the first classification network are used to convert a feature vector output by the convolutional neural network into a probability value, and the convolutional neural network comprises M layers of convolution kernels, an i-th layer of convolution kernels among the M layers being used to perform a convolution operation on a sample image vector of the sample image according to an i-th size, i being a positive integer less than or equal to M, the sample image vector being determined according to an image feature vector of the sample image and word feature vectors contained in sample image description information corresponding to the sample image; or
    a second construction module, configured to construct a second initialized discriminant network based on a recurrent neural network, a second multi-layer perceptron, and a second classification network, wherein the second multi-layer perceptron and the second classification network are used to convert a feature vector output by the recurrent neural network into a probability value, and the recurrent neural network comprises an N-layer long short-term memory network, N being determined according to the sample image vector of the sample image, the sample image vector being determined according to the image feature vector of the sample image and the word feature vectors contained in the sample image description information corresponding to the sample image.
  13. The apparatus according to claim 12, wherein the construction unit comprises:
    a third construction module, configured to construct the initialized image description information generation network using a region-based convolutional neural network, an attention-serialized language model, and a double-layer long short-term memory network, wherein the region-based convolutional neural network is used to extract local feature vectors and a global feature vector from the sample image, the attention-serialized language model is used to perform weighted average processing on the local feature vectors to obtain an average feature vector, and the double-layer long short-term memory network is used to obtain an object vector to be discriminated from the average feature vector and the global feature vector and to input the object vector to be discriminated into the initialized discriminant network.
  14. The apparatus according to claim 11, wherein the training unit is configured to:
    obtain the sample image and sample image description information corresponding to the sample image;
    input the sample image and the sample image description information into a current image description information generation network to obtain sample image description generation information matching the sample image, or sample image reference description information matching the sample image, wherein a first matching degree between the sample image description generation information and the sample image is greater than a second matching degree between the sample image reference description information and the sample image, an initial value of the current image description information generation network is the initialized image description information generation network, and network parameters of the current image description information generation network are adjusted and updated along with the training process;
    determine sample description information to be discriminated from the sample image description information, the sample image description generation information, or the sample image reference description information;
    input the sample image and the sample description information to be discriminated into the current discriminant network to obtain a sample discrimination probability value and a sample feedback coefficient;
    when the sample feedback coefficient indicates that the sample discrimination probability value has not reached a convergence condition, adjust the current image description information generation network according to the sample discrimination probability value to obtain a trained image description information generation network, adjust the current discriminant network according to the trained image description information generation network to obtain a trained discriminant network, return to the step of obtaining the sample image and the sample image description information corresponding to the sample image, and continue training with the trained image description information generation network and the trained discriminant network, until
    when the sample feedback coefficient indicates that the sample discrimination probability value has reached the convergence condition, the current image description information generation network is taken as the target image description information generation network.
  15. An electronic device, comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to execute, through the computer program, the method according to any one of claims 1 to 9.
PCT/CN2019/111946 2018-11-30 2019-10-18 图像描述信息生成方法和装置及电子装置 WO2020108165A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP19891662.9A EP3889836A4 (en) 2018-11-30 2019-10-18 METHOD AND DEVICE FOR GENERATING IMAGE DESCRIPTION INFORMATION AND ELECTRONIC DEVICE
US17/082,002 US11783199B2 (en) 2018-11-30 2020-10-27 Image description information generation method and apparatus, and electronic device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811460241.9 2018-11-30
CN201811460241.9A CN109685116B (zh) 2018-11-30 2018-11-30 图像描述信息生成方法和装置及电子装置

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/082,002 Continuation US11783199B2 (en) 2018-11-30 2020-10-27 Image description information generation method and apparatus, and electronic device

Publications (1)

Publication Number Publication Date
WO2020108165A1 true WO2020108165A1 (zh) 2020-06-04

Family

ID=66185129

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/111946 WO2020108165A1 (zh) 2018-11-30 2019-10-18 图像描述信息生成方法和装置及电子装置

Country Status (4)

Country Link
US (1) US11783199B2 (zh)
EP (1) EP3889836A4 (zh)
CN (1) CN109685116B (zh)
WO (1) WO2020108165A1 (zh)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112052906A (zh) * 2020-09-14 2020-12-08 南京大学 一种基于指针网络的图像描述优化方法
CN112529154A (zh) * 2020-12-07 2021-03-19 北京百度网讯科技有限公司 图像生成模型训练方法和装置、图像生成方法和装置
CN113706663A (zh) * 2021-08-27 2021-11-26 脸萌有限公司 图像生成方法、装置、设备及存储介质
CN114372537A (zh) * 2022-01-17 2022-04-19 浙江大学 一种面向图像描述系统的通用对抗补丁生成方法及系统
CN115035304A (zh) * 2022-05-31 2022-09-09 中国科学院计算技术研究所 一种基于课程学习的图像描述生成方法及系统
CN115098727A (zh) * 2022-06-16 2022-09-23 电子科技大学 基于视觉常识知识表征的视频描述生成方法
CN116543146A (zh) * 2023-07-06 2023-08-04 贵州大学 一种基于窗口自注意与多尺度机制的图像密集描述方法
CN118334604A (zh) * 2024-06-12 2024-07-12 海信集团控股股份有限公司 基于多模态大模型的事故检测、数据集构建方法及设备

Families Citing this family (42)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11568207B2 (en) * 2018-09-27 2023-01-31 Deepmind Technologies Limited Learning observation representations by predicting the future in latent space
CN109685116B (zh) 2018-11-30 2022-12-30 腾讯科技(深圳)有限公司 图像描述信息生成方法和装置及电子装置
JP7096361B2 (ja) * 2018-12-14 2022-07-05 富士フイルム株式会社 ミニバッチ学習装置とその作動プログラム、作動方法、および画像処理装置
CN110188620B (zh) * 2019-05-08 2022-11-04 腾讯科技(深圳)有限公司 对抗测试看图说话系统的方法和相关装置
CN111915339A (zh) * 2019-05-09 2020-11-10 阿里巴巴集团控股有限公司 数据的处理方法、装置及设备
CN110348877B (zh) * 2019-05-27 2023-11-14 上海大学 基于大数据的智能业务推荐算法、计算机可读存储介质
CN112150174B (zh) * 2019-06-27 2024-04-02 百度在线网络技术(北京)有限公司 一种广告配图方法、装置及电子设备
CN110458282B (zh) * 2019-08-06 2022-05-13 齐鲁工业大学 一种融合多角度多模态的图像描述生成方法及系统
CN110458829B (zh) * 2019-08-13 2024-01-30 腾讯医疗健康(深圳)有限公司 基于人工智能的图像质控方法、装置、设备及存储介质
CN110633655A (zh) * 2019-08-29 2019-12-31 河南中原大数据研究院有限公司 一种attention-attack人脸识别攻击算法
US11120268B2 (en) * 2019-08-30 2021-09-14 Microsoft Technology Licensing, Llc Automatically evaluating caption quality of rich media using context learning
CN110717421A (zh) * 2019-09-25 2020-01-21 北京影谱科技股份有限公司 一种基于生成对抗网络的视频内容理解方法及装置
CN110727868B (zh) * 2019-10-12 2022-07-15 腾讯音乐娱乐科技(深圳)有限公司 对象推荐方法、装置和计算机可读存储介质
CN111105013B (zh) * 2019-11-05 2023-08-11 中国科学院深圳先进技术研究院 对抗网络架构的优化方法、图像描述生成方法和系统
CN111046187B (zh) * 2019-11-13 2023-04-18 山东财经大学 基于对抗式注意力机制的一样本知识图谱关系学习方法及系统
CN110941945B (zh) * 2019-12-02 2021-03-23 百度在线网络技术(北京)有限公司 语言模型预训练方法和装置
CN111126282B (zh) * 2019-12-25 2023-05-12 中国矿业大学 一种基于变分自注意力强化学习的遥感图像内容描述方法
CN111159454A (zh) * 2019-12-30 2020-05-15 浙江大学 基于Actor-Critic生成式对抗网络的图片描述生成方法及系统
CN111428091B (zh) * 2020-03-19 2020-12-08 腾讯科技(深圳)有限公司 一种编码器的训练方法、信息推荐的方法以及相关装置
CN111639547B (zh) * 2020-05-11 2021-04-30 山东大学 基于生成对抗网络的视频描述方法及系统
CN112084841B (zh) * 2020-07-27 2023-08-04 齐鲁工业大学 跨模态的图像多风格字幕生成方法及系统
CN111916050A (zh) * 2020-08-03 2020-11-10 北京字节跳动网络技术有限公司 语音合成方法、装置、存储介质和电子设备
CN112102294B (zh) * 2020-09-16 2024-03-01 推想医疗科技股份有限公司 生成对抗网络的训练方法及装置、图像配准方法及装置
CN112329801B (zh) * 2020-12-03 2022-06-14 中国石油大学(华东) 一种卷积神经网络非局部信息构建方法
CN112529857B (zh) * 2020-12-03 2022-08-23 重庆邮电大学 基于目标检测与策略梯度的超声图像诊断报告生成方法
US11893792B2 (en) * 2021-03-25 2024-02-06 Adobe Inc. Integrating video content into online product listings to demonstrate product features
CN113378919B (zh) * 2021-06-09 2022-06-14 重庆师范大学 融合视觉常识和增强多层全局特征的图像描述生成方法
CN113392775B (zh) * 2021-06-17 2022-04-29 广西大学 一种基于深度神经网络的甘蔗幼苗自动识别与计数方法
CN113361628B (zh) * 2021-06-24 2023-04-14 海南电网有限责任公司电力科学研究院 一种多任务学习下的cnn绝缘子老化光谱分类方法
CN113673349B (zh) * 2021-07-20 2022-03-11 广东技术师范大学 基于反馈机制的图像生成中文文本方法、系统及装置
CN113792853B (zh) * 2021-09-09 2023-09-05 北京百度网讯科技有限公司 字符生成模型的训练方法、字符生成方法、装置和设备
CN114006752A (zh) * 2021-10-29 2022-02-01 中电福富信息科技有限公司 基于gan压缩算法的dga域名威胁检测系统及其训练方法
CN113779282B (zh) * 2021-11-11 2022-01-28 南京码极客科技有限公司 基于自注意力和生成对抗网络的细粒度跨媒体检索方法
CN116152690A (zh) * 2021-11-17 2023-05-23 瑞昱半导体股份有限公司 视频分类系统与方法以及神经网络训练系统与方法
US20230153522A1 (en) * 2021-11-18 2023-05-18 Adobe Inc. Image captioning
CN114117682B (zh) * 2021-12-03 2022-06-03 湖南师范大学 齿轮箱的故障识别方法、装置、设备及存储介质
CN114386569B (zh) * 2021-12-21 2024-08-23 大连理工大学 一种使用胶囊网络的新型图像描述生成方法
CN114255386A (zh) * 2021-12-23 2022-03-29 国家电网有限公司信息通信分公司 一种数据处理方法及装置
US12100082B2 (en) 2022-11-09 2024-09-24 Mohamed bin Zayed University of Artificial Intelligence System and method of cross-modulated dense local fusion for few-shot image generation
CN116312861A (zh) * 2023-05-09 2023-06-23 济南作为科技有限公司 脱硝系统气体浓度预测方法、装置、设备及存储介质
CN116629346B (zh) * 2023-07-24 2023-10-20 成都云栈科技有限公司 一种语言模型训练方法及装置
CN116912639B (zh) * 2023-09-13 2024-02-09 腾讯科技(深圳)有限公司 图像生成模型的训练方法和装置、存储介质及电子设备

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160358024A1 (en) * 2015-06-03 2016-12-08 Hyperverge Inc. Systems and methods for image processing
CN107330444A (zh) * 2017-05-27 2017-11-07 苏州科技大学 一种基于生成对抗网络的图像自动文本标注方法
CN107451994A (zh) * 2017-07-25 2017-12-08 宸盛科华(北京)科技有限公司 基于生成对抗网络的物体检测方法及装置
CN108334497A (zh) * 2018-02-06 2018-07-27 北京航空航天大学 自动生成文本的方法和装置
CN109685116A (zh) * 2018-11-30 2019-04-26 腾讯科技(深圳)有限公司 图像描述信息生成方法和装置及电子装置

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106844442A (zh) * 2016-12-16 2017-06-13 广东顺德中山大学卡内基梅隆大学国际联合研究院 基于fcn特征提取的多模态循环神经网络图像描述方法
US10387776B2 (en) * 2017-03-10 2019-08-20 Adobe Inc. Recurrent neural network architectures which provide text describing images
CN107133354B (zh) 2017-05-25 2020-11-10 北京小米移动软件有限公司 图像描述信息的获取方法及装置
CN109754357B (zh) * 2018-01-26 2021-09-21 京东方科技集团股份有限公司 图像处理方法、处理装置以及处理设备
CN108564550B (zh) * 2018-04-25 2020-10-02 Oppo广东移动通信有限公司 图像处理方法、装置及终端设备

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160358024A1 (en) * 2015-06-03 2016-12-08 Hyperverge Inc. Systems and methods for image processing
CN107330444A (zh) * 2017-05-27 2017-11-07 苏州科技大学 一种基于生成对抗网络的图像自动文本标注方法
CN107451994A (zh) * 2017-07-25 2017-12-08 宸盛科华(北京)科技有限公司 基于生成对抗网络的物体检测方法及装置
CN108334497A (zh) * 2018-02-06 2018-07-27 北京航空航天大学 自动生成文本的方法和装置
CN109685116A (zh) * 2018-11-30 2019-04-26 腾讯科技(深圳)有限公司 图像描述信息生成方法和装置及电子装置

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112052906A (zh) * 2020-09-14 2020-12-08 南京大学 一种基于指针网络的图像描述优化方法
CN112052906B (zh) * 2020-09-14 2024-02-02 南京大学 一种基于指针网络的图像描述优化方法
CN112529154A (zh) * 2020-12-07 2021-03-19 北京百度网讯科技有限公司 图像生成模型训练方法和装置、图像生成方法和装置
CN113706663A (zh) * 2021-08-27 2021-11-26 脸萌有限公司 图像生成方法、装置、设备及存储介质
CN113706663B (zh) * 2021-08-27 2024-02-02 脸萌有限公司 图像生成方法、装置、设备及存储介质
CN114372537A (zh) * 2022-01-17 2022-04-19 浙江大学 一种面向图像描述系统的通用对抗补丁生成方法及系统
CN114372537B (zh) * 2022-01-17 2022-10-21 浙江大学 一种面向图像描述系统的通用对抗补丁生成方法及系统
CN115035304A (zh) * 2022-05-31 2022-09-09 中国科学院计算技术研究所 一种基于课程学习的图像描述生成方法及系统
CN115098727A (zh) * 2022-06-16 2022-09-23 电子科技大学 基于视觉常识知识表征的视频描述生成方法
CN116543146A (zh) * 2023-07-06 2023-08-04 贵州大学 一种基于窗口自注意与多尺度机制的图像密集描述方法
CN116543146B (zh) * 2023-07-06 2023-09-26 贵州大学 一种基于窗口自注意与多尺度机制的图像密集描述方法
CN118334604A (zh) * 2024-06-12 2024-07-12 海信集团控股股份有限公司 基于多模态大模型的事故检测、数据集构建方法及设备

Also Published As

Publication number Publication date
EP3889836A4 (en) 2022-01-26
CN109685116A (zh) 2019-04-26
CN109685116B (zh) 2022-12-30
US11783199B2 (en) 2023-10-10
US20210042579A1 (en) 2021-02-11
EP3889836A1 (en) 2021-10-06

Similar Documents

Publication Publication Date Title
WO2020108165A1 (zh) 图像描述信息生成方法和装置及电子装置
US20230196117A1 (en) Training method for semi-supervised learning model, image processing method, and device
CN109902546B (zh) 人脸识别方法、装置及计算机可读介质
JP7096444B2 (ja) 画像領域位置決め方法、モデル訓練方法及び関連装置
WO2021159714A1 (zh) 一种数据处理方法及相关设备
WO2019196633A1 (zh) 一种图像语义分割模型的训练方法和服务器
WO2024045444A1 (zh) 一种视觉问答任务的处理方法、装置、设备和非易失性可读存储介质
CN107480144B (zh) 具备跨语言学习能力的图像自然语言描述生成方法和装置
WO2023109714A1 (zh) 用于蛋白质表征学习的多模态信息融合方法、系统、终端及存储介质
CN110489567B (zh) 一种基于跨网络特征映射的节点信息获取方法及其装置
CN109919078A (zh) 一种视频序列选择的方法、模型训练的方法及装置
WO2021088935A1 (zh) 对抗网络架构的优化方法、图像描述生成方法和系统
CN112395979A (zh) 基于图像的健康状态识别方法、装置、设备及存储介质
CN113039555A (zh) 通过使用基于注意力的神经网络在视频剪辑中进行动作分类
US20200372639A1 (en) Method and system for identifying skin texture and skin lesion using artificial intelligence cloud-based platform
WO2020151175A1 (zh) 文本生成方法、装置、计算机设备及存储介质
WO2023231954A1 (zh) 一种数据的去噪方法以及相关设备
WO2023231753A1 (zh) 一种神经网络的训练方法、数据的处理方法以及设备
US20240232575A1 (en) Neural network obtaining method, data processing method, and related device
CN111091010A (zh) 相似度确定、网络训练、查找方法及装置和存储介质
WO2021036397A1 (zh) 目标神经网络模型的生成方法和装置
CN112529149A (zh) 一种数据处理方法及相关装置
WO2022063076A1 (zh) 对抗样本的识别方法及装置
WO2024114659A1 (zh) 一种摘要生成方法及其相关设备
CN108154165B (zh) 基于大数据与深度学习的婚恋对象匹配数据处理方法、装置、计算机设备和存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19891662

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2019891662

Country of ref document: EP

Effective date: 20210630