WO2019042244A1 - Image description generation method, model training method, device, and storage medium - Google Patents

Image description generation method, model training method, device, and storage medium

Info

Publication number
WO2019042244A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
feature vector
image
description information
training
Prior art date
Application number
PCT/CN2018/102469
Other languages
English (en)
French (fr)
Inventor
姜文浩
马林
刘威
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2019042244A1
Priority to US16/548,621 (US11270160B2)
Priority to US17/589,726 (US11907851B2)

Classifications

    • G: PHYSICS
      • G06: COMPUTING; CALCULATING OR COUNTING
        • G06F: ELECTRIC DIGITAL DATA PROCESSING
          • G06F 18/00: Pattern recognition
            • G06F 18/20: Analysing
              • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
                • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
                • G06F 18/217: Validation; Performance evaluation; Active pattern learning techniques
        • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00: Computing arrangements based on biological models
            • G06N 3/02: Neural networks
              • G06N 3/04: Architecture, e.g. interconnection topology
                • G06N 3/044: Recurrent networks, e.g. Hopfield networks
                • G06N 3/045: Combinations of networks
                • G06N 3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
              • G06N 3/08: Learning methods
                • G06N 3/084: Backpropagation, e.g. using gradient descent
        • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
          • G06T 9/00: Image coding
            • G06T 9/002: Image coding using neural networks
        • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V 10/00: Arrangements for image or video recognition or understanding
            • G06V 10/40: Extraction of image or video features
              • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
                • G06V 10/443: Local feature extraction by matching or filtering
                  • G06V 10/449: Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
                    • G06V 10/451: Biologically inspired filters with interaction between the filter responses, e.g. cortical complex cells
                      • G06V 10/454: Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
            • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
              • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
                • G06V 10/776: Validation; Performance evaluation
              • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the embodiments of the present application relate to the field of machine learning technologies, and in particular, to an image description generation method, a model training method, a device, and a storage medium.
  • the content information of an image can be converted into a textual description of the image by machine readable instructions.
  • the embodiment of the present application provides an image description generation method, a model training method, a terminal, and a storage medium.
  • An embodiment of the present application provides an image description generation method, where the method is applied to a computing device, and includes the following steps:
  • the embodiment of the present application provides a model training method for training a matching model and a computing model.
  • the method is applied to a computing device, and includes the following steps:
  • a matching model is trained according to the global feature vector and the text feature vector.
  • An embodiment of the present application provides a generating device, where the generating device includes a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or a set of instructions; the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by the processor to implement the image description generation method described above.
  • An embodiment of the present application provides a training device, where the training device includes a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or a set of instructions; the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by the processor to implement the model training method described above.
  • An embodiment of the present application provides a computer storage medium, where the storage medium stores at least one instruction, at least one program, a code set, or a set of instructions; the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by a processor to implement the image description generation method described above.
  • An embodiment of the present application provides a computer storage medium, where the storage medium stores at least one instruction, at least one program, a code set, or a set of instructions; the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by a processor to implement the model training method described above.
  • FIG. 1 is a schematic diagram of an implementation environment involved in an image description generation method and a model training method provided by various embodiments of the present application;
  • FIG. 2 is a flowchart of a method for training a model provided by an embodiment of the present application
  • FIG. 3 is a flowchart of a method for generating an image description provided by an embodiment of the present application.
  • FIG. 5A is a schematic structural diagram of an image description generating apparatus according to an embodiment of the present application.
  • FIG. 5B is a schematic structural diagram of an image description generating apparatus according to an embodiment of the present application.
  • FIG. 5C is a schematic structural diagram of an image description generating apparatus according to an embodiment of the present application.
  • FIG. 6A is a schematic structural diagram of a model training apparatus according to an embodiment of the present application.
  • FIG. 6B is a schematic structural diagram of a model training apparatus according to an embodiment of the present application.
  • FIG. 6C is a schematic structural diagram of a model training apparatus according to an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a terminal according to an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a server according to an embodiment of the present application.
  • CNN Convolutional Neural Network
  • RNN Recurrent Neural Network
  • An RNN is a neural network with fixed weights, external inputs, and an internal state; it can be regarded as behavioral dynamics of the internal state, with the weights and external inputs as parameters.
  • RNN is the most commonly used implementation model for decoders, and is responsible for translating the image vectors generated by the encoder into textual descriptions of the images.
  • LSTM Long Short-Term Memory
  • An RNN with an attention mechanism processes, at each step, only some pixels of the target image selected according to the previous state, instead of processing all pixels of the target image, which reduces the processing complexity of the task.
  • SGD Stochastic Gradient Descent
  • The cross-entropy cost function is a method for calculating the error between the predicted distribution and the actual distribution of the neural network. During backpropagation training of the neural network, the larger the error between the predicted distribution and the actual distribution, the larger the adjustment made to the parameters of the neural network.
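  • For reference (this is the standard definition, not a formula quoted from the application): for an actual distribution p and a predicted distribution q over the vocabulary, the cross-entropy cost is H(p, q) as below; the further q is from p, the larger the loss and the larger the resulting parameter adjustments during backpropagation.

$$ H(p, q) = -\sum_{x} p(x) \log q(x) $$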
  • In some image description generation methods, an acquired target image is first encoded by an encoder, such as a feature extraction model, to generate a global feature vector and an annotation vector set of the target image; the global feature vector and the annotation vector set are then input to a decoder, such as a calculation model, and the description information of the target image is finally obtained.
  • In such methods, the input parameters of the decoder only include the global feature vector and the annotation vector set of the target image, that is, the input parameters of the decoder only include the image information of the target image, which easily leads to generated image description information that is not accurate enough.
  • the embodiment of the present application provides an image description generation method, a model training method, a terminal, and a storage medium, which can improve the accuracy of the generated image description information. Specific technical solutions will be described in detail below.
  • FIG. 1 is a schematic diagram of an implementation environment involved in an image description generation method and a model training method provided by various embodiments of the present application.
  • The implementation environment includes a training device 110 and a generating device 120.
  • Training device 110 refers to a device for training a description generation model.
  • the description generation model is configured to generate description information of the training image based on the training image and the reference image description information corresponding thereto.
  • the training device 110 can be a computing device such as a computer terminal, a mobile terminal, and a server that can implement complex algorithms.
  • the description generation model includes a feature extraction model, a matching model, and a calculation model.
  • the feature extraction model is configured to generate a global feature vector and a set of annotation vectors of the training image according to the training image, and generate a corresponding text feature vector according to the reference image description information of the training image;
  • The matching model is used to generate a multimodal feature vector of the training image according to the global feature vector and the text feature vector acquired by the feature extraction model;
  • The calculation model is used to generate the description information of the training image according to the multimodal feature vector generated by the matching model and the global feature vector and annotation vector set generated by the feature extraction model.
  • The training device 110 continuously trains the calculation model in the description generation model according to the generated description information and the reference image description information of the training image.
  • The generating device 120 refers to a device that generates the description information of a target image according to the description generation model.
  • the generating device 120 can be a computing device such as a computer terminal, a mobile terminal, and a server that can implement complex algorithms.
  • The training device 110 and the generating device 120 may be the same device or different devices. If the training device 110 and the generating device 120 are the same device, the description generation model in the generating device 120 is a model trained and stored by the device itself; if the training device 110 and the generating device 120 are different devices, the description generation model in the generating device 120 may be a model trained by the training device 110 and obtained from the training device 110.
  • FIG. 2 is a flowchart of a method for training a model provided by an embodiment of the present application. This embodiment is illustrated by using the model training method in the training device shown in FIG. 1. As shown in FIG. 2, the model training method may include the following steps:
  • Step 201 Acquire a global feature vector and a set of annotation vectors of the training image, and a text feature vector of the reference image description information of the training image.
  • the training image is a preset image for training
  • the global feature vector is a vector having a preset length describing the overall feature of the training image
  • the annotation vector set is a set of vectors describing a plurality of sub-region features of the training image
  • the reference image description information is text information set in advance for describing the corresponding training image.
  • The training image may include at least one image; in actual implementation, in order to enlarge the training sample and improve the training accuracy, the training images may include multiple images, and the reference image description information of each training image may be 3 to 5 sentences.
  • Each sentence is able to describe the complete content of the training image on its own.
  • the global feature vector and the annotation vector set of the training image and the text feature vector of the reference image description information of the training image may be acquired by the feature extraction model.
  • The feature extraction model comprises two parts. The step of acquiring the global feature vector and the annotation vector set comprises: encoding the training image by the first part of the feature extraction model to generate the global feature vector and the annotation vector set of the training image. The step of acquiring the text feature vector comprises: encoding the reference image description information of the training image by the second part of the feature extraction model to generate the corresponding text feature vector.
  • The first part of the feature extraction model may be a pre-trained CNN, where the CNN includes multiple convolutional layers and multiple fully connected layers; the global feature vector may be generated by the last fully connected layer of the CNN, and the annotation vector set may be generated by the fourth convolutional layer of the CNN.
  • the first part is a VGG (Visual Geometry Group) network.
  • A fully connected layer is a network layer in which every neuron of the output layer is connected to every neuron of the input layer.
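  • A minimal sketch of this first part is given below, assuming a torchvision VGG-16 backbone (the text only states that a pre-trained CNN such as a VGG network is used; the choice of the fourth convolutional layer output as the annotation vectors follows the text, but all tensor names, shapes, and other details are illustrative assumptions, not the application's exact design).

```python
import torch
import torchvision.models as models

# Pre-trained CNN backbone; VGG-16 is chosen here because the text mentions a VGG network.
vgg = models.vgg16(weights="IMAGENET1K_V1").eval()

def encode_image(image: torch.Tensor):
    """image: (1, 3, 224, 224) tensor; returns (global feature vector, annotation vector set)."""
    with torch.no_grad():
        feats, annotation_maps, conv_seen = image, None, 0
        for layer in vgg.features:
            feats = layer(feats)
            if isinstance(layer, torch.nn.Conv2d):
                conv_seen += 1
                if conv_seen == 4:           # "the fourth convolutional layer" (illustrative choice)
                    annotation_maps = feats  # (1, C, H, W) sub-region feature maps
        pooled = vgg.avgpool(feats).flatten(1)
        global_feat = vgg.classifier(pooled)                # output of the last fully connected layer
        a_set = annotation_maps.flatten(2).transpose(1, 2)  # (1, H*W, C): one vector per sub-region
    return global_feat, a_set

g, a = encode_image(torch.randn(1, 3, 224, 224))
print(g.shape, a.shape)
```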
  • The second part of the feature extraction model may encode the reference image description information of the training image by the Fisher Vector technique.
  • The global feature vector and the annotation vector set may be generated by the first part first and the text feature vector by the second part afterwards; or the text feature vector may be generated by the second part first and the global feature vector and the annotation vector set by the first part afterwards; or the text feature vector may be generated by the second part while the global feature vector and the annotation vector set are generated by the first part.
  • Step 202 Train the matching model according to the global feature vector and the text feature vector, and generate a multimodal feature vector of the training image by using the trained matching model.
  • the matching model includes two parts, wherein the first part is used to convert the global feature vector into a global feature matching vector, and the second part is used to convert the text feature vector into a text feature matching vector.
  • the first portion of the matching model may be a first neural network
  • the second portion of the matching model may be a second neural network
  • The first neural network and/or the second neural network may be a fully connected multi-layer neural network.
  • the following is an example in which the first part of the matching model is the first neural network and the second part is the second neural network.
  • A ranking loss (Rank-Loss) method may be used to obtain a target loss function over the distributions of the global feature matching vector and the text feature matching vector, and the target loss function is then optimized by SGD.
  • The condition for judging whether the training of the matching model is complete includes: detecting whether the value of the target loss function changes during the training process; if the value of the target loss function no longer changes, the matching model has been trained.
  • the training image is again input to the first neural network to obtain a multi-modal feature vector of the training image.
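  • The application does not spell out the exact form of the ranking loss; the sketch below shows one common margin-based formulation under that assumption, with two small fully connected networks mapping the global feature vector and the text feature vector into a shared space. All dimensions, layer counts, and names are illustrative.

```python
import torch
import torch.nn as nn

class MatchingModel(nn.Module):
    """Two fully connected multi-layer networks mapping image and text features
    into a common space (dimensions and depths here are illustrative)."""
    def __init__(self, img_dim=1000, txt_dim=512, joint_dim=256):
        super().__init__()
        # First part: global feature vector -> global feature matching vector.
        self.img_net = nn.Sequential(nn.Linear(img_dim, joint_dim), nn.ReLU(),
                                     nn.Linear(joint_dim, joint_dim))
        # Second part: text feature vector -> text feature matching vector.
        self.txt_net = nn.Sequential(nn.Linear(txt_dim, joint_dim), nn.ReLU(),
                                     nn.Linear(joint_dim, joint_dim))

    def forward(self, img_feat, txt_feat):
        return self.img_net(img_feat), self.txt_net(txt_feat)

def rank_loss(img_vec, txt_vec, margin=0.2):
    """A common margin-based ranking (Rank-Loss) formulation; the application
    does not give the exact target loss function."""
    scores = img_vec @ txt_vec.t()               # pairwise image-text similarities
    pos = scores.diag().unsqueeze(1)             # matched pairs sit on the diagonal
    off_diag = 1.0 - torch.eye(scores.size(0))
    cost = (margin + scores - pos).clamp(min=0) * off_diag
    return cost.sum() / off_diag.sum()

model = MatchingModel()
opt = torch.optim.SGD(model.parameters(), lr=0.01)   # SGD, as stated above
img_batch, txt_batch = torch.randn(8, 1000), torch.randn(8, 512)
for _ in range(100):                                 # in practice: until the loss stops changing
    opt.zero_grad()
    i_vec, t_vec = model(img_batch, txt_batch)
    loss = rank_loss(i_vec, t_vec)
    loss.backward()
    opt.step()
# After training, passing a training image's global feature vector through img_net
# again yields the multimodal feature vector used by the calculation model.
```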
  • Step 203 Input the multimodal feature vector, the global feature vector, and the annotation vector set to the calculation model to obtain image description information of the training image.
  • the calculation model includes n deep networks, and n is a positive integer.
  • the step includes: generating image description information according to the multimodal feature vector, the global feature vector, the annotation vector set, and the n depth networks.
  • the input parameter of the at least one depth network of the n depth networks includes a splicing vector.
  • When i > 1, the splicing vector is the vector obtained by splicing the output vector of the (i-1)-th deep network and the multimodal feature vector, where 1 ≤ i ≤ n, i.e., i is greater than or equal to 1 and less than or equal to n.
  • When i = 1, the input parameters of the deep network include a splicing vector obtained by splicing the multimodal feature vector and the annotation vector set; for example, if the calculation model includes three deep networks, the input parameter of the third deep network includes a splicing vector obtained by splicing the output vector of the second deep network and the multimodal feature vector.
  • The n deep networks may be LSTMs with an attention mechanism, GRUs (Gated Recurrent Units), or other RNNs.
  • The specific steps of generating the image description information include:
  • the multi-modal feature vector M and the annotation vector set A are spliced to obtain a first splicing vector A'.
  • The splicing of the multi-modal feature vector M and the annotation vector set A is a formal concatenation only, not an element-wise addition.
  • If the length of the multi-modal feature vector M is n1 and the length of the annotation vector set A is n2, the length of the first splicing vector A' is n1+n2.
  • The annotation vector set A is usually placed on top, and the multi-modal feature vector M below.
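  • As a small illustration of this splicing (purely an assumption about the tensor layout, following the lengths n1 and n2 described above):

```python
import torch

M = torch.randn(5)    # multi-modal feature vector, length n1 = 5 (illustrative)
A = torch.randn(12)   # annotation vector set, flattened to length n2 = 12 (illustrative)

# Formal concatenation only, no element-wise addition: A on top, M below.
A_prime = torch.cat([A, M], dim=0)
print(A_prime.shape)  # torch.Size([17]), i.e. n1 + n2
```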
  • the first stitching vector A' and the global feature vector are input to the first depth network to obtain a first output vector h(t).
  • h(t) is the hidden state of the LSTM after the current time step t, that is, the output value of the activation function of the LSTM's hidden-layer memory cells; h(t-1) is the hidden state of the LSTM after the previous time step t-1.
  • the first output vector h(t) and the multi-modal feature vector M are spliced to obtain a second splicing vector A".
  • The splicing of the first output vector h(t) and the multi-modal feature vector M is similar to the method in the first step; it is also a formal concatenation rather than an addition, and will not be described again here.
  • The position of the multi-modal feature vector M should be kept consistent during the two splicing operations.
  • In this example, both splicings place the multi-modal feature vector M below.
  • the second splicing vector A" is input to the second depth network to obtain image description information.
  • a linear regression method is used to predict the next generated word in the image description information to obtain a corresponding Chinese character, and finally image description information is obtained.
  • the image description information is a statement that can describe the complete content of the training image separately.
  • the linear regression method may be a Softmax regression method.
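  • Putting the above steps together, one possible shape for a two-deep-network calculation model is sketched below. This is not the application's exact design: plain LSTM cells stand in for attention-based LSTMs, the annotation vector set is summarized into a single vector rather than attended over, and all dimensions, names, and wiring details are assumptions.

```python
import torch
import torch.nn as nn

class TwoStageDecoder(nn.Module):
    """Sketch of a calculation model with two deep networks (illustrative only)."""
    def __init__(self, feat_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, feat_dim)            # previous word x(t)
        self.global_proj = nn.Linear(feat_dim, hidden_dim)         # global feature -> initial state
        self.lstm1 = nn.LSTMCell(2 * feat_dim, hidden_dim)         # consumes A' = [A; M]
        self.lstm2 = nn.LSTMCell(hidden_dim + 2 * feat_dim, hidden_dim)  # consumes A'' = [h(t); M] and x(t)
        self.word_head = nn.Linear(hidden_dim, vocab_size)         # Softmax regression over words

    def init_states(self, global_feat):
        h0 = torch.tanh(self.global_proj(global_feat))
        zero = torch.zeros_like(h0)
        return (h0, zero.clone()), (zero.clone(), zero.clone())

    def step(self, M, A_summary, prev_word, state1, state2):
        x_t = self.embed(prev_word)
        a_prime = torch.cat([A_summary, M], dim=-1)      # step 1: splice A (top) with M (below)
        h1, c1 = self.lstm1(a_prime, state1)             # step 2: first deep network -> h(t)
        a_double = torch.cat([h1, M], dim=-1)            # step 3: splice h(t) with M, M kept below
        h2, c2 = self.lstm2(torch.cat([a_double, x_t], dim=-1), state2)  # step 4: second deep network
        return self.word_head(h2), (h1, c1), (h2, c2)    # next-word scores (before softmax)

dec = TwoStageDecoder()
M, A_summary, g = torch.randn(1, 256), torch.randn(1, 256), torch.randn(1, 256)
s1, s2 = dec.init_states(g)
logits, s1, s2 = dec.step(M, A_summary, torch.tensor([0]), s1, s2)
print(logits.shape)   # torch.Size([1, 10000])
```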
  • Step 204 If the reference image description information and the generated image description information do not match, the calculation model is trained according to the image description information and the reference image description information.
  • The cross-entropy cost function is used as the loss function to calculate the error between the distribution of the predicted words and the distribution of the real words; the parameters of the calculation model are continuously adjusted by SGD to optimize the calculation model, that is, the calculation model is trained until the value of the loss function no longer changes, i.e., until the error between the two distributions cannot be reduced further.
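  • A schematic version of this training step might look as follows; `decoder`, the feature tensors, and the stopping threshold are placeholders standing in for the actual calculation model and training data, not details given in the application.

```python
import torch
import torch.nn as nn

decoder = nn.Linear(256, 10000)              # stand-in for the calculation model (two deep networks)
step_inputs = torch.randn(20, 256)           # one (spliced) input vector per time step
ref_words = torch.randint(0, 10000, (20,))   # reference image description, as word indices

criterion = nn.CrossEntropyLoss()            # cross-entropy cost function
optimizer = torch.optim.SGD(decoder.parameters(), lr=0.01)

prev_loss = None
for epoch in range(1000):
    optimizer.zero_grad()
    logits = decoder(step_inputs)            # predicted word distribution at each time step
    loss = criterion(logits, ref_words)      # error between predicted and real word distributions
    loss.backward()                          # backpropagation
    optimizer.step()                         # SGD parameter adjustment
    if prev_loss is not None and abs(prev_loss - loss.item()) < 1e-7:
        break                                # loss value no longer changes: training is done
    prev_loss = loss.item()
```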
  • In summary, the model training method trains the matching model according to the training image and the corresponding reference image description information, so that the multi-modal feature vector generated by the trained matching model includes predicted text information; the multi-modal feature vector containing the predicted text information is then input to the calculation model, so that the description information of the training image generated by the calculation model is more accurate; finally, the calculation model is trained according to the more accurate description information and the reference image description information, which improves the accuracy of the image description information generated by the description generation model.
  • FIG. 3 is a flowchart of a method for generating an image description provided by an embodiment of the present application. This embodiment is exemplified by the image description generating method used in the generating device shown in FIG. 1. As shown in FIG. 3, the image description generation method may include the following steps:
  • the generating device acquires the description generation model.
  • The step of acquiring the description generation model by the generating device may include: the generating device sending an acquisition request to the training device and receiving the description generation model returned by the training device, or the generating device receiving the description generation model actively sent by the training device.
  • Step 301 Acquire a target image.
  • a pre-stored target image can be read.
  • the target image may be an image collected and saved by the generating device itself, or may be an image obtained and saved from other devices in advance, or may be an image downloaded and saved from the network in advance.
  • the generating device may also send an image acquisition request to other devices, receive a target image returned by other devices, or receive a target image that is actively sent by other devices.
  • the target image is generally different from the training image.
  • Step 302 Generate a first global feature vector and a first annotation vector set of the target image.
  • The target image is input to the feature extraction model; in the process of generating the image description, the target image only needs to be encoded by the first part of the feature extraction model to generate the first global feature vector and the first annotation vector set of the target image.
  • Step 303 Input a target image to the matching model, and generate a first multi-modal feature vector of the target image by using the matching model;
  • the matching model is a model trained according to the reference image description information of the training image and the training image.
  • the target image is encoded by the first part of the trained matching model to generate a first multimodal feature vector of the target image. Since the matching model in this embodiment is the matching model that has been trained in the above embodiment, the generated multimodal feature vector contains the predicted text information.
  • Step 304 Generate target image description information of the target image according to the first multimodal feature vector, the first global feature vector, and the first annotation vector set.
  • the target image description information is obtained by inputting the first multimodal feature vector, the first global feature vector, and the first annotation vector set to the calculation model.
  • the calculation model in this embodiment is a calculation model trained according to the image description information of the training image and the reference image description information in the above embodiment.
  • the calculation model includes n depth networks, where n is a positive integer, and the step includes: according to the first multimodal feature vector, the first global feature vector, the first annotation vector set, and the n Deep network, generating image description information.
  • the input parameter of the at least one depth network of the n deep networks includes a splicing vector.
  • If i = 1, the splicing vector is the vector obtained by splicing the first multimodal feature vector and the first annotation vector set; if i > 1, the splicing vector is the vector obtained by splicing the output vector of the (i-1)-th deep network and the first multimodal feature vector, where 1 ≤ i ≤ n.
  • the n deep networks may be LSTM with attention mechanism or other RNNs.
  • specific steps for generating image description information include:
  • the first multi-modal feature vector M and the first annotation vector set A are spliced to obtain a first splicing vector A'.
  • the first stitching vector A' and the first global feature vector are input to the first depth network to obtain a first output vector h(t).
  • the first output vector h(t) and the first multi-modal feature vector M are spliced to obtain a second splicing vector A".
  • the second stitching vector A" is input to the second depth network to obtain target image description information.
  • h(t) = LSTM(x(t), h(t-1), A'')
  • The difference is that, at each time step t, the distribution of the next generated Chinese character in the target image description information is calculated according to the output h(t) of the second deep network; the next generated Chinese character is then determined by the greedy search algorithm or the beam search algorithm and used as the input vector x(t) of the function h(t) at the next time step t+1. After continuous recursion, the complete target image description is finally obtained.
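  • A minimal sketch of this recursive decoding loop is given below, greedy variant only (beam search would instead keep the top-k partial sentences at each step). The `decoder_step` function, the start/end token ids, and the maximum length are placeholders, not values from the application.

```python
import torch

VOCAB_SIZE, START, END, MAX_LEN = 10000, 1, 2, 20

def decoder_step(prev_word, state):
    """Placeholder for one step of the calculation model: in the real model this would
    evaluate h(t) = LSTM(x(t), h(t-1), A'') and return a distribution over the vocabulary."""
    state = state + 1                        # dummy recurrent state update
    return torch.randn(VOCAB_SIZE), state

def greedy_decode():
    words, state, prev = [], torch.zeros(1), START
    for _ in range(MAX_LEN):
        logits, state = decoder_step(prev, state)
        prev = int(logits.argmax())          # pick the most probable next word (greedy search)
        if prev == END:
            break
        words.append(prev)                   # this word becomes the input x at the next time step
    return words

print(greedy_decode())
```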
  • step 304 is similar to step 203 in the foregoing embodiment, and details are not described herein again.
  • The image description generation method provided by the above embodiment is generally used in a device that needs an image retrieval function. After the training of the description generation model is completed, the device collects a large number of target images, generates corresponding target image description information for each target image through the trained description generation model, and stores the target images and the target image description information in the device's database in one-to-one correspondence.
  • The input image description information may be at least one keyword describing the image content, or a complete sentence description.
  • the device will search for the target image description information related to the target image according to the image description information input by the user, and then find the corresponding target image, and provide the found target image to the user.
  • the calculation model generally includes two attention mechanism-based LSTM networks. As shown in FIG. 4, the specific steps of the above image description generation method are exemplified below with the target image as the image 1.
  • Image 1 is an image acquired by the device in daily life, and the image content of the image 1 is "a group of people sitting around the table taking a photo”.
  • the target image is input to the feature extraction model, and the target image is encoded by the first part of the feature extraction model, and the global feature vector and the annotation vector set of the target image are obtained.
  • the target image is input to the trained matching model, and the target image is encoded by the first part of the matching model to obtain a multimodal feature vector of the target image.
  • the matching model is a model trained according to the reference image description information of the training image and the training image.
  • The multi-modal feature vector output by the matching model and the annotation vector set output by the feature extraction model are spliced to obtain a first splicing vector, and the first splicing vector and the global feature vector are input to the first deep network to obtain the first output vector.
  • the multi-modal feature vector and the first output vector are spliced to obtain a second splicing vector, and the second splicing vector is input to the second depth network.
  • The distribution of the next generated Chinese character in the target image description information is calculated according to the output vector h(t), and the next generated Chinese character is determined by the greedy search algorithm or the beam search algorithm and used as the input vector at the next time step. For example, at the first time step, the distribution of the first generated word is calculated according to the output vector h(1), the algorithm determines that the first generated word is "one", and "one" is used as the input vector x(2) of the function h(2) at the second time step; the distribution of the second generated word is likewise calculated according to the output vector h(2), the algorithm determines that the second generated word is "group", and "group" is used as the input vector x(3) of the function h(3) at the third time step, and so on.
  • In summary, the image description generation method obtains a multi-modal feature vector of the target image by inputting the target image to the matching model. Because the matching model is trained according to the training image and the reference image description information of the training image, the multimodal feature vector generated by the matching model contains predicted text information; the multimodal feature vector containing the predicted text information is then input to the calculation model to obtain the target image description information of the target image, so that the generated target image description information is more accurate, achieving the effect of improving the accuracy of the generated image description information.
  • step 303 may be performed first, and then step 302 is performed. Step 303 may also be performed at the same time as step 302.
  • When the description generation model provided by each of the above embodiments is used in English, the target image description information generated by the description generation model is in English, and when the user needs to query one or more images, the image description information entered into the device for the desired images is also an English keyword or text description. Therefore, the process of generating the target image description information changes slightly.
  • The following still uses the image 1 as the target image, and the calculation model includes two attention-based LSTM networks. The specific steps include:
  • the target image is acquired, and the image 1 is an image acquired by the device in daily life, and the image content of the image 1 is “a group of people are sitting at the table and taking a photo”.
  • the target image is input to the feature extraction model, and the target image is encoded by the first part of the feature extraction model to generate a global feature vector and a set of annotation vectors of the target image.
  • the target image is input to the trained matching model, and the target image is encoded by the first part of the matching model to generate a multimodal feature vector of the target image.
  • the matching model is a model trained according to the reference image description information of the training image and the training image.
  • The multi-modal feature vector output by the matching model and the annotation vector set output by the feature extraction model are spliced to obtain a first splicing vector, and the first splicing vector and the global feature vector are input to the first deep network to obtain the first output vector.
  • the multi-modal feature vector and the first output vector are spliced to obtain a second splicing vector, and the second splicing vector is input to the second depth network.
  • The distribution of the next generated English word in the target image description information is calculated according to the output vector h(t), and the next generated English word is determined by the greedy search algorithm or the beam search algorithm and used as the input vector x(t) of the function h(t) at the next time step t+1. For example, at the first time step, the distribution of the first generated English word is calculated according to the output vector h(1), the algorithm determines that the first generated English word is "a", and "a" is used as the input vector x(2) of the function h(2) at the second time step; the distribution of the second generated word is likewise calculated according to the output vector h(2), and so on.
  • FIG. 5A is a schematic structural diagram of an image description generating apparatus provided by an embodiment of the present application.
  • the image description generating apparatus may include: an obtaining module 510 and a generating module 520.
  • An obtaining module 510 configured to acquire a target image
  • a generating module 520 configured to generate a first global feature vector and a first annotation vector set of the target image
  • the generating module 520 is further configured to input the target image acquired by the acquiring module 510 to a matching model, and generate a first multi-modal feature vector of the target image by using the matching model; Training the model according to the training image and the reference image description information of the training image;
  • the generating module 520 is further configured to generate target image description information of the target image according to the first multimodal feature vector, the first global feature vector, and the first annotation vector set.
  • the image description generating apparatus obtains a multi-modal feature vector of the target image by inputting the target image to the matching model, and the matching model is trained according to the reference image description information of the training image and the training image.
  • the model so the multimodal feature vector generated by the matching model contains the predicted text information; and the multimodal feature vector containing the predicted text information is input to the calculation model, so that the target image description of the target image generated by the calculation model is The information is more accurate and achieves the effect of improving the accuracy of the generated image description information.
  • the generating module 520 is further configured to input the first multimodal feature vector, the first global feature vector, and the first annotation vector set to a computing model to obtain the target image description information;
  • The calculation model is a model trained based on the image description information of the training image and the reference image description information.
  • the computational model includes n deep networks, n being a positive integer.
  • the generating module 520 is further configured to generate the target image description information according to the first multimodal feature vector, the first global feature vector, the first annotation vector set, and the n depth networks. ;
  • the input parameter of the at least one depth network of the n depth networks includes a splicing vector.
  • If i = 1, the splicing vector is a vector obtained by splicing the first multimodal feature vector and the first annotation vector set; if i > 1, the splicing vector is a vector obtained by splicing the output vector of the (i-1)-th deep network and the first multimodal feature vector, where 1 ≤ i ≤ n.
  • the device further includes: a splicing module 530, as shown in FIG. 5B.
  • a splicing module 530 configured to splicing the first multi-modal feature vector and the first annotation vector set to obtain a first splicing vector
  • the generating module 520 is further configured to input the first stitching vector and the first global feature vector to the first depth network to obtain a first output vector;
  • the splicing module 530 is further configured to splicing the first output vector and the first multi-modal feature vector to obtain a second splicing vector;
  • the generating module 520 is further configured to input the second stitching vector to the second depth network to obtain the target image description information.
  • the apparatus further includes: a training module 540, as shown in FIG. 5C.
  • the acquiring module 510 is further configured to acquire a second global feature vector and a second annotation vector set of the training image, and a text feature vector of the reference image description information of the training image;
  • the training module 540 is configured to train the matching model according to the second global feature vector and the text feature vector.
  • The generating module 520 is further configured to generate a second multi-modal feature vector of the training image by using the matching model obtained by training;
  • the generating module 520 is further configured to input the second multimodal feature vector, the second global feature vector, and the second annotation vector set to a calculation model to obtain image description information of the training image;
  • the training module 540 is further configured to train the calculation model according to the image description information and the reference image description information when the reference image description information and the generated image description information do not match.
  • The image description generating apparatus provided by the foregoing embodiment is only illustrated by the above division into functional modules. In actual applications, the functions may be allocated to different functional modules as required, that is, the internal structure of the computing device, such as a server or a terminal, is divided into different functional modules to perform all or part of the functions described above.
  • the embodiment of the image description generating device and the image description generating method provided by the foregoing embodiments are in the same concept, and the specific implementation process is described in detail in the method embodiment, and details are not described herein again.
  • FIG. 6A is a schematic structural diagram of a model training apparatus according to an embodiment of the present application.
  • the model training apparatus is used to train a matching model and a calculation model as described in the foregoing embodiments.
  • the apparatus can include an acquisition module 610 and a training module 620.
  • An obtaining module 610 configured to acquire a global feature vector and a label vector set of the training image, and a text feature vector of the reference image description information of the training image;
  • the training module 620 is configured to train the matching model according to the global feature vector and the text feature vector.
  • In summary, the model training apparatus trains the matching model according to the training image and the reference image description information corresponding to it, so that the multi-modal feature vector generated by the trained matching model includes predicted text information; the multi-modal feature vector containing the predicted text information is then input to the calculation model, so that the description information of the training image generated by the calculation model is more accurate; finally, the calculation model is trained according to the more accurate description information of the training image and the reference image description information of the training image, which improves the accuracy of the image description information generated by the description generation model.
  • the device further includes: a generating module 630, as shown in FIG. 6B.
  • A generating module 630 configured to generate a multi-modal feature vector of the training image by using the matching model obtained by training;
  • the generating module 630 is further configured to input the multi-modal feature vector, the global feature vector, and the annotation vector set to a calculation model to obtain image description information of the training image;
  • the training module 620 is further configured to: when the reference image description information and the generated image description information do not match, train the calculation model according to the image description information and the reference image description information
  • the calculation model includes n deep networks, and n is a positive integer.
  • the generating module 630 is further configured to generate the image description information according to the multimodal feature vector, the global feature vector, the annotation vector set, and the n depth networks;
  • the input parameter of the at least one depth network of the n depth networks includes a splicing vector.
  • If i = 1, the splicing vector is a vector obtained by splicing the multi-modal feature vector and the annotation vector set; if i > 1, the splicing vector is a vector obtained by splicing the output vector of the (i-1)-th deep network and the multi-modal feature vector, where 1 ≤ i ≤ n.
  • the device further includes: a splicing module 640, as shown in FIG. 6C.
  • a splicing module 640 configured to splicing the multi-modal feature vector and the annotation vector set to obtain a first splicing vector
  • the generating module 630 is further configured to input the first stitching vector and the global feature vector to the first depth network to obtain a first output vector;
  • the splicing module 640 is further configured to splicing the first output vector and the multi-modal feature vector to obtain a second splicing vector;
  • the generating module 630 is further configured to input the second stitching vector to the second depth network to obtain the image description information.
  • The model training device provided by the foregoing embodiment is only illustrated by the above division into functional modules. In actual applications, the functions may be allocated to different functional modules as required, that is, the internal structure of the computing device, such as a server or a terminal, is divided into different functional modules to perform all or part of the functions described above.
  • The embodiment of the model training device and the model training method provided by the above embodiments belong to the same concept; the specific implementation process is described in detail in the method embodiment, and details are not described herein again.
  • The embodiment of the present application further provides a computer readable storage medium, which may be the computer readable storage medium included in the memory, or may be a computer readable storage medium that exists separately and is not assembled into the terminal or the server.
  • The computer readable storage medium stores at least one instruction, at least one program, a code set, or a set of instructions. When the computer readable storage medium is used in a device, the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by a processor to implement the image description generation method in the above embodiments; and/or the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by the processor to implement the model training method in the above embodiments.
  • FIG. 7 is a block diagram of a terminal 700 provided by an embodiment of the present application.
  • The terminal may include a radio frequency (RF) circuit 701, a memory 702 including one or more computer readable storage media, an input unit 703, a display unit 704, a sensor 705, an audio circuit 706, a WiFi module 707, a processor 708, and a power supply 709, among other components.
  • RF radio frequency
  • The terminal structure shown in FIG. 7 does not constitute a limitation to the terminal; the terminal may include more or fewer components than those illustrated, combine certain components, or use a different component arrangement. Specifically:
  • The RF circuit 701 can be used to receive and transmit signals during information transmission and reception or during a call. Specifically, after downlink information of a base station is received, it is handed to one or more processors 708 for processing; in addition, uplink data is sent to the base station.
  • the RF circuit 701 includes, but is not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, and a low noise amplifier (LNA, Low Noise Amplifier), duplexer, etc. In addition, the RF circuit 701 can also communicate with the network and other devices through wireless communication.
  • SIM Subscriber Identity Module
  • LNA Low Noise Amplifier
  • the wireless communication may use any communication standard or protocol, including but not limited to Global System of Mobile communication (GSM), General Packet Radio Service (GPRS), and Code Division Multiple Access (CDMA). , Code Division Multiple Access), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, Short Messaging Service (SMS), and the like.
  • GSM Global System of Mobile communication
  • GPRS General Packet Radio Service
  • CDMA Code Division Multiple Access
  • WCDMA Wideband Code Division Multiple Access
  • LTE Long Term Evolution
  • SMS Short Messaging Service
  • the memory 702 can be used to store software programs and modules, and the processor 708 executes various functional applications and data processing by running software programs and modules stored in the memory 702.
  • the memory 702 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may be stored according to Data created by the use of the terminal (such as audio data, phone book, etc.).
  • Memory 702 can include high speed random access memory, and can also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid state storage device. Accordingly, memory 702 may also include a memory controller to provide access to memory 702 by the processor 708 and the input unit 703.
  • the input unit 703 can be configured to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function controls.
  • input unit 703 can include a touch-sensitive surface as well as other input devices.
  • A touch-sensitive surface, also known as a touch screen or trackpad, collects the user's touch operations on or near it (such as operations performed by the user with a finger, a stylus, or any other suitable object or accessory on or near the touch-sensitive surface), and drives the corresponding connecting device according to a preset program.
  • the touch-sensitive surface can include two portions of a touch detection device and a touch controller.
  • The touch detection device detects the user's touch orientation, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, sends the coordinates to the processor 708, and can receive commands from the processor 708 and execute them.
  • touch-sensitive surfaces can be implemented in a variety of types, including resistive, capacitive, infrared, and surface acoustic waves.
  • the input unit 703 can also include other input devices.
  • other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control buttons, switch buttons, etc.), trackballs, mice, joysticks, and the like.
  • Display unit 704 can be used to display information entered by the user or information provided to the user, as well as various graphical user interfaces of the terminal, which can be composed of graphics, text, icons, video, and any combination thereof.
  • the display unit 704 can include a display panel.
  • the display panel can be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like.
  • the touch-sensitive surface can cover the display panel, and when the touch-sensitive surface detects a touch operation thereon or nearby, it is transmitted to the processor 708 to determine the type of the touch event, and then the processor 708 displays the type according to the type of the touch event. A corresponding visual output is provided on the panel.
  • Although the touch-sensitive surface and the display panel are implemented as two separate components to perform input and output functions, in some embodiments the touch-sensitive surface can be integrated with the display panel to implement input and output functions.
  • the terminal may also include at least one type of sensor 705, such as a light sensor, motion sensor, and other sensors.
  • the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display panel according to the brightness of the ambient light, and the proximity sensor may close the display panel and/or the backlight when the terminal moves to the ear.
  • the gravity acceleration sensor can detect the magnitude of acceleration in all directions (usually three axes). When it is stationary, it can detect the magnitude and direction of gravity.
  • The terminal can also be configured with a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, and other sensors, which are not described here again.
  • the audio circuit 706, the speaker, and the microphone provide an audio interface between the user and the terminal.
  • The audio circuit 706 can transmit the electrical signal converted from received audio data to the speaker, which converts it into a sound signal for output; on the other hand, the microphone converts a collected sound signal into an electrical signal, which is received by the audio circuit 706 and converted into audio data. The audio data is then output to the processor 708 for processing, sent via the RF circuit 701 to, for example, another terminal, or output to the memory 702 for further processing.
  • the audio circuit 706 may also include an earbud jack to provide communication between the peripheral earphone and the terminal.
  • WiFi is a short-range wireless transmission technology.
  • the terminal can help users to send and receive emails, browse web pages, and access streaming media through the WiFi module 707. It provides wireless broadband Internet access for users.
  • Although FIG. 7 shows the WiFi module 707, it can be understood that it is not a necessary part of the terminal and can be omitted as needed without changing the essence of the application.
  • the processor 708 is the control center of the terminal, which connects various portions of the entire handset using various interfaces and lines, by executing or executing software programs and/or modules stored in the memory 702, and invoking data stored in the memory 702, executing The terminal's various functions and processing data, so as to monitor the terminal as a whole.
  • the processor 708 may include one or more processing cores; in some embodiments of the present application, the processor 708 may integrate an application processor and a modem processor, where the application processor mainly processes The operating system, user interface, applications, etc., the modem processor primarily handles wireless communications. It will be appreciated that the above described modem processor may also not be integrated into the processor 708.
  • the terminal also includes a power source 709 (such as a battery) that supplies power to the various components.
  • the power source can be logically coupled to the processor 708 through a power management system to manage functions such as charging, discharging, and power management through the power management system.
  • the power supply 709 can also include any one or more of a DC or AC power source, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
  • the terminal may further include a camera, a Bluetooth module, and the like, and details are not described herein again.
  • the processor 708 in the terminal runs at least one instruction stored in the memory 702, thereby implementing the image description generation method provided in the foregoing various method embodiments, and/or the model training method.
  • FIG. 8 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • the server is for implementing the image description generation method provided in the various embodiments described above, and/or the model training method. Specifically:
  • the server 800 includes a central processing unit (CPU) 801, a system memory 804 including a random access memory (RAM) 802 and a read only memory (ROM) 803, and a system bus 805 that connects the system memory 804 and the central processing unit 801.
  • the server 800 also includes a basic input/output system (I/O system) 806 that facilitates the transfer of information between the various devices within the computer, and a mass storage device 807 for storing an operating system 813, application programs 814, and other program modules 815.
  • the basic input/output system 806 includes a display 808 for displaying information and an input device 809 such as a mouse or keyboard for user input of information.
  • the display 808 and input device 809 are both connected to the central processing unit 801 via an input and output controller 810 that is coupled to the system bus 805.
  • the basic input/output system 806 can also include an input/output controller 810 for receiving and processing input from a plurality of other devices such as a keyboard, a mouse, or an electronic stylus.
  • similarly, the input/output controller 810 also provides output to a display screen, a printer, or another type of output device.
  • the mass storage device 807 is connected to the central processing unit 801 by a mass storage controller (not shown) connected to the system bus 805.
  • the mass storage device 807 and its associated computer readable medium provide non-volatile storage for the server 800. That is, the mass storage device 807 can include a computer readable medium (not shown) such as a hard disk or a CD-ROM drive.
  • the computer readable medium can include computer storage media and communication media.
  • Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
  • Computer storage media include RAM, ROM, EPROM, EEPROM, flash memory or other solid state storage technologies, CD-ROM, DVD or other optical storage, tape cartridges, magnetic tape, magnetic disk storage or other magnetic storage devices.
  • according to various embodiments of the present application, the server 800 can also run on a remote computer connected to a network such as the Internet. That is, the server 800 can be connected to the network 812 through the network interface unit 811 connected to the system bus 805, or the network interface unit 811 can be used to connect to other types of networks or remote computer systems (not shown).
  • the memory also stores at least one instruction that is configured to be executed by one or more processors.
  • the at least one instruction includes instructions for performing the image description generation method provided by the various embodiments described above, and/or the model training method.
  • a person skilled in the art may understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the related hardware; the program may be stored in a computer readable storage medium.
  • the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.

Abstract

Embodiments of the present application disclose an image description generation method, a model training method, a device, and a storage medium. The method includes: acquiring a target image; generating a first global feature vector and a first annotation vector set of the target image; inputting the target image into a matching model, and generating a first multimodal feature vector of the target image through the matching model, the matching model being a model obtained by training according to a training image and reference image description information of the training image; and generating target image description information of the target image according to the first multimodal feature vector, the first global feature vector, and the first annotation vector set.

Description

图像描述生成方法、模型训练方法、设备和存储介质
本申请要求于2017年8月30日提交中国专利局、申请号为201710763735.3,申请名称为“图像描述生成方法、模型训练方法、设备和存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请实施例涉及机器学习技术领域,特别涉及一种图像描述生成方法、模型训练方法、设备和存储介质。
背景技术
随着图像识别技术的发展,通过机器可读指令就能将图像的内容信息转化为图像的文字描述。
发明内容
本申请实施例提供了一种图像描述生成方法、模型训练方法、终端和存储介质。
本申请实施例提供一种图像描述生成方法,该方法应用于计算设备,包括以下步骤:
获取目标图像;
生成所述目标图像的第一全局特征向量和第一标注向量集合;
输入所述目标图像至匹配模型,通过所述匹配模型生成所述目标图像的第一多模态特征向量,其中,所述匹配模型为根据训练图像和所述训练图像的参考图像描述信息训练得到的模型;
根据所述第一多模态特征向量、所述第一全局特征向量和所述第一标注向量集合,生成所述目标图像的目标图像描述信息。
本申请实施例提供一种模型训练方法,用于训练匹配模型和计算模 型,该方法应用于计算设备,包括以下步骤:
获取训练图像的全局特征向量和标注向量集合,以及所述训练图像的参考图像描述信息的文本特征向量;
根据所述全局特征向量和所述文本特征向量训练匹配模型。
本申请实施例提供了一种生成设备,所述生成设备包括处理器和存储器,所述存储器中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行以实现如上所述的图像描述生成方法。
本申请实施例提供了一种训练设备,所述训练设备包括处理器和存储器,所述存储器中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行以实现如上所述的模型训练方法。
本申请实施例提供了一种计算机存储介质,所述存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由处理器加载并执行以实现如上所述的图像描述生成方法。
本申请实施例提供了一种计算机存储介质,所述存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由处理器加载并执行以实现如上所述的模型训练方法。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图 仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是本申请各个实施例提供的图像描述生成方法和模型训练方法所涉及的实施环境的示意图;
图2是本申请一个实施例提供的模型训练方法的方法流程图;
图3是本申请一个实施例提供的图像描述生成方法的方法流程图;
图4是本申请一个实施例提供的图像描述生成方法的流程图;
图5A是本申请一个实施例提供的图像描述生成装置的结构示意图;
图5B是本申请一个实施例提供的图像描述生成装置的结构示意图;
图5C是本申请一个实施例提供的图像描述生成装置的结构示意图;
图6A是本申请一个实施例提供的模型训练装置的结构示意图;
图6B是本申请一个实施例提供的模型训练装置的结构示意图;
图6C是本申请一个实施例提供的模型训练装置的结构示意图;
图7是本申请一个实施例提供的终端的结构示意图;
图8是本申请一个实施例提供的服务器的结构示意图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施例作进一步地详细描述。
为了便于描述,下面先对本申请各个实施例中所涉及的术语做简单介绍。
CNN(Convolution Neural Network,卷积神经网络),是一种直接从图像底层的像素特征开始,逐层对图像进行特征提取的前馈神经网络,是编码器最常用的实现模型,负责将图像编码成向量。
RNN(Recurrent Neural Network,递归神经网络),是一种具有固定权值、外部输入和内部状态的神经网络,可以将其看作是以权值和外部输入为参数,关于内部状态的行为动力学。RNN是解码器最常用的实现模型,负责将编码器生成的图像向量翻译成图像的文字描述。
LSTM(Long-Short-Term Memory,长短时记忆),是一种时间递归神经网络,用于处理和预测时间序列中间隔或者延迟相对较长时间的重要事件,属于一种特殊的RNN。
注意力机制(Attention Mechanism),常被运用在RNN上。带有注意力机制的RNN,在每次处理目标图像的部分像素时,都会根据当前状态的前一个状态所关注的目标图像的部分像素去处理,而不是根据目标图像的全部像素去处理,可以减少任务的处理复杂度。
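The attention mechanism described above can be sketched in a few lines of Python. The following is a minimal additive-attention module (one common formulation; the embodiments do not fix a particular attention variant, so the module and dimension names are illustrative assumptions), which weights the annotation vectors of the image regions according to the decoder's previous hidden state:

```python
# Minimal additive-attention sketch; names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, annot_dim, hidden_dim, attn_dim):
        super().__init__()
        self.w_a = nn.Linear(annot_dim, attn_dim)   # projects annotation vectors
        self.w_h = nn.Linear(hidden_dim, attn_dim)  # projects previous hidden state
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, annotations, h_prev):
        # annotations: (num_regions, annot_dim); h_prev: (hidden_dim,)
        scores = self.v(torch.tanh(self.w_a(annotations) + self.w_h(h_prev))).squeeze(-1)
        weights = torch.softmax(scores, dim=0)                # one weight per image region
        context = (weights.unsqueeze(-1) * annotations).sum(dim=0)
        return context, weights
```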
SGD(Stochastic Gradient Descent,随机梯度下降),是一种最小化目标函数的方法,在每次迭代一个或者一批新的样本时,只考虑将当前样本点的损失趋于最小而不考虑其他样本点,且每迭代一个或者一批新的样本就会更新一次目标函数中的所有参数。
交叉熵代价函数(Cross-Entropy Cost Function),是一种用来计算神经网络的预测分布与实际分布之间的误差的方法,在反向传播训练神经网络的过程中,若预测分布与实际分布之间的误差越大,则对神经网络的各种参数的调整幅度越大。
在相关技术中,图像描述生成方法可包括如下步骤:首先通过编码器,如特征提取模型,对获取到的目标图像进行编码,生成目标图像的全局特征向量和标注向量集合,然后输入目标图像的全局特征向量和标注向量集合至解码器,如计算模型,最后得到目标图像的描述信息。
相关技术提供的图像描述生成方法中,解码器的输入参数只包括目标图像的全局特征向量和标注向量集合,也即解码器的输入参数只包括目标图像的图像信息,容易导致生成的图像描述信息不够准确。
有鉴于此,本申请实施例提供了一种图像描述生成方法、模型训练方法、终端和存储介质,可以提高生成的图像描述信息的准确率。具体技术方案将在下面详细描述。
请参考图1,其示出了本申请各个实施例提供的图像描述生成方法和模型训练方法所涉及的实施环境的示意图,如图1所示,该实施环境包括:训练设备110和生成设备120。
训练设备110是指用于训练描述生成模型的设备。该描述生成模型用于根据训练图像和其所对应的参考图像描述信息生成训练图像的描述信息。实际实现时,该训练设备110可以为诸如电脑终端、手机终端和服务器之类的可以实现复杂算法的计算设备。
在本申请一些实施例中,该描述生成模型包括特征提取模型、匹配模型和计算模型。特征提取模型用于根据训练图像生成训练图像的全局特征向量和标注向量集合,以及根据训练图像的参考图像描述信息生成对应的文本特征向量;匹配模型用于根据通过特征提取模型获取到的全局特征向量和文本特征向量生成训练图像的多模态特征向量;计算模型用于根据匹配模型生成的多模态特征向量,以及特征提取模型生成的全局特征向量和标注向量集合,生成训练图像的描述信息。实际实现时,训练设备110会根据生成的描述信息和训练图像的参考图像描述信息不断地训练描述生成模型中的计算模型。
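As a rough illustration of how the three sub-models described in the preceding paragraph are chained, the following Python sketch wires a feature extraction model, a matching model, and a computation model together; every name here is a placeholder for the components described above, not an interface defined by the application:

```python
# Rough sketch of the description generation pipeline; all names are placeholders.
def describe_image(image, feature_extractor, matching_model, computation_model):
    # 特征提取模型: global feature vector + annotation vector set
    global_feat, annotations = feature_extractor.encode_image(image)
    # 匹配模型: multimodal feature vector containing predicted text information
    multimodal = matching_model.encode_image(image)
    # 计算模型: description generated from the three inputs
    return computation_model.decode(multimodal, global_feat, annotations)
```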
生成设备120是指用于根据描述生成模型生成目标图像的描述信息的设备。实际实现时,该生成设备120可以为诸如电脑终端、手机终端和服务器之类的可以实现复杂算法的计算设备。
在本申请一些实施例中,训练设备110和生成设备120可以为同一个设备,也可以为不同的设备。若训练设备110和生成设备120为同一个设备,则生成设备120中的描述生成模型即为自身预先训练并存储的模型;若训练设备110和生成设备120为不同的设备,则生成设备120中的描述生成模型可以为从训练设备110中获取的由训练设备110训练得到的模型。
请参考图2,其示出了本申请一个实施例提供的模型训练方法的方法流程图,本实施例以该模型训练方法用于图1所示的训练设备中来举例说明。如图2所示,该模型训练方法可以包括以下步骤:
步骤201,获取训练图像的全局特征向量和标注向量集合,以及训练图像的参考图像描述信息的文本特征向量。
训练图像为预先设置的用于训练的图像,全局特征向量为描述训练 图像的整体特征的具有预设长度的向量,标注向量集合为多个描述训练图像的子区域特征的向量的集合,训练图像的参考图像描述信息为预先设置的用于描述对应的训练图像的文本信息。其中,训练图像可以包括至少一张图像,且实际实现时,为了增加训练样本进而提高训练准确度,训练图像可以包括多张,每个训练图像的参考图像描述信息可以为3至5个语句且每个语句都为能单独描述该训练图像的完整内容的语句。
在本申请一些实施例中,可以通过特征提取模型获取训练图像的全局特征向量和标注向量集合,以及训练图像的参考图像描述信息的文本特征向量。特征提取模型包括两个部分,其中,获取全局特征向量和标注向量集合的步骤包括:通过特征提取模型的第一部分对训练图像进行编码,生成训练图像的全局特征向量和标注向量集合;获取文本特征向量的步骤包括:通过特征提取模型的第二部分对训练图像的参考图像描述信息进行编码,生成对应的文本特征向量。在本申请一些实施例中,特征提取模型的第一部分可以为预先训练好的CNN,CNN包括多个卷积层和多个全连接层,则可以通过CNN的最后一个全连接层生成全局特征向量,并通过CNN的第四个卷积层生成标注向量集合,比如,第一部分为VGG(Visual Geometry Group,视觉几何组)网络。全连接层为输出层的每个神经元和输入层的每个神经元都连接的网络层。在本申请一些实施例中,特征提取模型的第二部分可以通过费舍尔向量Fisher Vector技术对训练图像的参考图像描述信息进行编码。
实际实现时,可以先通过第一部分生成全局特征向量和标注向量集合,之后通过第二部分生成文本特征向量,也可以先通过第二部分生成文本特征向量,之后通过第一部分生成全局特征向量和标注向量集合,还可以在通过第一部分生成全局特征向量和标注向量集合的同时,通过第二部分生成文本特征向量。
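As an illustration of the first part of the feature extraction model, the sketch below uses a pretrained VGG16 from torchvision to produce a global feature vector from the last fully connected hidden layer and an annotation vector set from a convolutional feature map. The exact layers, input size, and file name are assumptions for illustration (the embodiments mention, for example, a specific convolutional layer; here the final convolutional feature map is used for simplicity), and the second part of the model (Fisher Vector encoding of the reference descriptions) is not shown:

```python
# Hedged sketch: global feature vector and annotation vectors from a pretrained VGG16.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

vgg = models.vgg16(weights="IMAGENET1K_V1").eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("target.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    conv_maps = vgg.features(img)                        # (1, 512, 7, 7) feature maps
    # Annotation vector set: one 512-d vector per spatial region (7*7 = 49 regions).
    annotations = conv_maps.flatten(2).squeeze(0).t()    # (49, 512)
    # Global feature vector: output of the last fully connected hidden layer.
    x = vgg.avgpool(conv_maps).flatten(1)
    global_feat = vgg.classifier[:-1](x)                 # (1, 4096)
```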
步骤202,根据全局特征向量和文本特征向量训练匹配模型,并通过训练得到的匹配模型生成训练图像的多模态特征向量。
匹配模型包括两个部分,其中,第一部分用于将全局特征向量转化成全局特征匹配向量,第二部分用于将文本特征向量转化成文本特征匹 配向量。在本申请一些实施例中,匹配模型的第一部分可以为第一神经网络,匹配模型的第二部分可以为第二神经网络,并且,第一神经网络,和/或,第二神经网络可以为全连接多层神经网络。
下述除特殊说明外,均以匹配模型的第一部分为第一神经网络且第二部分为第二神经网络来举例说明。
将所有训练图像的全局特征匹配向量和每一个训练图像对应的所有文本特征匹配向量映射到第一神经网络的隐含空间,在这个隐含空间中计算每一个全局特征匹配向量和每一个文本特征匹配向量的匹配度,也即比较每一个全局特征匹配向量和每一个文本特征匹配向量的相似度,并根据匹配度调整每一个全局特征匹配向量和每一个文本特征匹配向量的位置关系,也即训练匹配模型,使得描述同一个训练图像的全局特征匹配向量和文本特征匹配向量的距离比较近,描述不同训练图像的全局特征匹配向量和文本特征匹配向量的距离比较远,以及,使得描述同一个训练图像的文本特征匹配向量彼此之间的距离比较近。在本申请一些实施例中,在调整过程中可以采用排序损失(Rank-Loss)方法获取全局特征匹配向量和文本特征匹配向量在分布上的目标损失函数,并通过SGD对目标损失函数进行处理。其中,判断匹配模型是否训练完毕的条件包括:检测训练过程中目标损失函数的值是否变化;若目标损失函数的值不变,则匹配模型训练完毕。
在匹配模型训练完毕后,将训练图像再次输入至第一神经网络,得到训练图像的多模态特征向量。
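A minimal sketch of such a matching model is shown below: two fully connected networks embed the image global feature vector and the sentence feature (for example a Fisher Vector) into a shared space, and a margin-based rank loss pulls matched image-sentence pairs together while pushing mismatched pairs apart. The dimensions, the margin, and this particular Rank-Loss formulation are assumptions rather than the patented configuration. After training, feeding an image's global feature through the image branch yields the multimodal feature vector used later by the computation model:

```python
# Hedged sketch of the matching model trained with a margin-based rank loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchingModel(nn.Module):
    def __init__(self, img_dim=4096, txt_dim=2048, embed_dim=512):
        super().__init__()
        self.img_net = nn.Sequential(nn.Linear(img_dim, 1024), nn.ReLU(), nn.Linear(1024, embed_dim))
        self.txt_net = nn.Sequential(nn.Linear(txt_dim, 1024), nn.ReLU(), nn.Linear(1024, embed_dim))

    def forward(self, img_feat, txt_feat):
        return (F.normalize(self.img_net(img_feat), dim=-1),
                F.normalize(self.txt_net(txt_feat), dim=-1))

def rank_loss(img_emb, txt_emb, margin=0.2):
    # img_emb, txt_emb: (batch, embed_dim); row i of each describes the same image.
    sim = img_emb @ txt_emb.t()                        # pairwise cosine similarities
    pos = sim.diag().unsqueeze(1)                      # similarity of matched pairs
    cost_img = (margin + sim - pos).clamp(min=0)       # mismatched captions for an image
    cost_txt = (margin + sim - pos.t()).clamp(min=0)   # mismatched images for a caption
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    return cost_img.masked_fill(mask, 0).mean() + cost_txt.masked_fill(mask, 0).mean()

model = MatchingModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
```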
步骤203,输入多模态特征向量、全局特征向量和标注向量集合至计算模型,得到训练图像的图像描述信息。
实际实现时,计算模型包括n个深度网络,n为正整数,则本步骤包括:根据多模态特征向量、全局特征向量、标注向量集合和这n个深度网络,生成图像描述信息。其中,n个深度网络中的至少一个深度网络的输入参数包括拼接向量,当第i个深度网络的输入参数包括拼接向量时,若i=1,则拼接向量为多模态特征向量和标注向量集合拼接得到的向量,若i>1,则拼接向量为第i-1个深度网络的输出向量和多模态 特征向量拼接得到的向量,1≤i≤n,即i大于或等于1,且小于或等于n。比如,计算模型只包括一个深度网络,则该深度网络的输入参数包括拼接向量,且拼接向量为多模态特征向量和标注向量集合拼接得到的向量;又比如,计算模型包括三个深度网络,其中,第3个深度网络的输入参数包括拼接向量,则该拼接向量为第2个深度网络的输出向量和多模态特征向量拼接得到的向量。在本申请一些实施例中,这n个深度网络可以为带注意力机制的LSTM,也可以为GRU(Gated Recurrent Unit,门控性循环单元),还可以为其他的RNN。
为了便于描述,下述以这n个深度网络为带注意力机制的LSTM且n=2来举例说明,则根据多模态特征向量、全局特征向量、标注向量集合和这n个深度网络,生成图像描述信息的具体步骤包括:
第一,将多模态特征向量M和标注向量集合A拼接,得到第一拼接向量A'。
其中,将多模态特征向量M和标注向量集合A拼接,仅为形式上的加法拼接,比如,多模态特征向量M的长度为n1,标注向量集合A的长度为n2,则第一拼接向量A'的长度为n1+n2。实际实现时,在拼接过程中,通常把标注向量集合A放在上方,而把多模态特征向量M放在下方。
第二,输入第一拼接向量A'和全局特征向量至第1个深度网络,得到第一输出向量h(t)。
当深度网络为带注意力机制的LSTM时,第1个深度网络可以表示为一个带内部状态的函数:h(t)=LSTM(0,h(t-1),A')。其中,h(t)为经过当前时间步骤t之后LSTM的隐含状态,也即LSTM中间隐层记忆单元的激活函数的输出值,h(t-1)是经过前一个时间步骤t-1之后LSTM的隐含状态。
第三,将第一输出向量h(t)和多模态特征向量M拼接,得到第二拼接向量A"。
将第一输出向量h(t)和多模态特征向量M拼接,同第一步骤中的方法类似,也为形式上的加法拼接,在此不再赘述。实际实现时,在两次 拼接过程中,多模态特征向量M所在的位置应保持一致,比如,两次拼接都为将多模态特征向量M放在下方。
第四,输入第二拼接向量A"至第2个深度网络,得到图像描述信息。
当深度网络为带注意力机制的LSTM时,第2个深度网络也可以表示为一个带内部状态的函数,但不同的是,h(t)=LSTM(x(t),h(t-1),A″)。其中,在每一个时间步骤t时,第2个深度网络的输入包括参考图像描述信息中第t个字的嵌入向量(Embedding Vector)x(t)。
对于在每一个时间步骤t时输出的h(t),采用线性回归方法对图像描述信息中的下一个生成的字进行预测,得到对应的中文字,最后得到图像描述信息。图像描述信息为一个能单独描述该训练图像的完整内容的语句。在本申请一些实施例中,该线性回归方法可以为Softmax回归方法。
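The four steps above can be sketched, in a much simplified form, as the following computation model with two attention LSTMs (reusing the AdditiveAttention module from the earlier sketch). It concatenates the multimodal vector M with the annotation set A into A', runs the first LSTM with attention over A', concatenates the first LSTM's output with M into A'', and lets the second LSTM consume the reference word embeddings under teacher forcing, with a linear layer producing the next-word distribution. The dimensions, the initialization of the first LSTM from the global feature, and other details are assumptions rather than the patented configuration:

```python
# Much-simplified sketch of the two attention-LSTM computation model (training mode).
import torch
import torch.nn as nn

class TwoLayerAttnDecoder(nn.Module):
    def __init__(self, vocab_size, feat_dim=512, hidden_dim=512, embed_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn1 = AdditiveAttention(feat_dim, hidden_dim, hidden_dim)
        self.attn2 = AdditiveAttention(feat_dim, hidden_dim, hidden_dim)
        self.lstm1 = nn.LSTMCell(feat_dim, hidden_dim)               # input: context over A'
        self.lstm2 = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)   # input: word embedding + context over A''
        self.init_h = nn.Linear(feat_dim, hidden_dim)                # global feature -> initial state
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, global_feat, annotations, multimodal, words):
        # annotations: (k, feat_dim); multimodal: (feat_dim,); words: (T,) reference word ids
        a_prime = torch.cat([annotations, multimodal.unsqueeze(0)], dim=0)   # A' = [A; M]
        h1 = c1 = self.init_h(global_feat)
        h2 = c2 = torch.zeros_like(h1)
        logits = []
        for t in range(words.size(0)):
            ctx1, _ = self.attn1(a_prime, h1)
            h1, c1 = self.lstm1(ctx1.unsqueeze(0), (h1.unsqueeze(0), c1.unsqueeze(0)))
            h1, c1 = h1.squeeze(0), c1.squeeze(0)
            a_second = torch.stack([h1, multimodal], dim=0)                  # A'' = [h(t); M]
            ctx2, _ = self.attn2(a_second, h2)
            x_t = torch.cat([self.embed(words[t]), ctx2], dim=-1)
            h2, c2 = self.lstm2(x_t.unsqueeze(0), (h2.unsqueeze(0), c2.unsqueeze(0)))
            h2, c2 = h2.squeeze(0), c2.squeeze(0)
            logits.append(self.out(h2))
        return torch.stack(logits)   # (T, vocab_size): one next-word distribution per step
```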
步骤204,若参考图像描述信息和生成的图像描述信息不匹配,则根据图像描述信息和参考图像描述信息训练计算模型。
判断生成的图像描述信息与训练图像的参考图像描述信息是否匹配,也即计算两者的误差,实际实现时,采用交叉熵代价函数作为损失函数来计算预测的字的分布和真实的字的分布之间的误差,并通过SGD不断地调整计算模型中的各类参数,对计算模型进行优化也即训练计算模型,直至损失函数的值不再发生变化,也即两者的误差值无法再减小。
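The training step described above (cross-entropy between the predicted word distributions and the reference description, optimized with SGD until the loss stops decreasing) might look roughly as follows, reusing the decoder sketch above; the hyper-parameters and token layout are placeholders:

```python
# Hedged sketch of one SGD training step with a cross-entropy cost function.
import torch
import torch.nn as nn

decoder = TwoLayerAttnDecoder(vocab_size=10000)         # from the earlier sketch
optimizer = torch.optim.SGD(decoder.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

def train_step(global_feat, annotations, multimodal, ref_words):
    # ref_words: (T+1,) reference description ids, including the end token
    logits = decoder(global_feat, annotations, multimodal, ref_words[:-1])
    loss = criterion(logits, ref_words[1:])              # predict the next reference word at each step
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```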
综上所述,本实施例提供的模型训练方法,通过根据训练图像和其所对应的参考图像描述信息训练匹配模型,使得通过训练完毕后的匹配模型生成的多模态特征向量包含预测的文本信息;再将包含预测的文本信息的多模态特征向量输入至计算模型,使得通过计算模型生成的训练图像的描述信息更为准确;最后根据较为准确的描述信息和参考图像描述信息训练计算模型;达到了提高描述生成模型生成的图像描述信息的准确率的效果。
请参考图3,其示出了本申请一个实施例提供的图像描述生成方法 的方法流程图,本实施例以该图像描述生成方法用于图1所示的生成设备中来举例说明。如图3所示,该图像描述生成方法可以包括以下步骤:
在训练设备训练描述生成模型完毕后,生成设备会获取该描述生成模型。在本申请一些实施例中,生成设备获取该描述生成模型的步骤可以包括:生成设备发送获取请求至训练设备,接收训练设备返回的描述生成模型,或者,生成设备接收训练设备主动发送的描述生成模型。
步骤301,获取目标图像。
在本申请一些实施例中,可以读取预先存储的目标图像。其中,目标图像可以为生成设备自身采集并保存的图像,也可以为预先从其他设备中获取并保存的图像,还可以为预先从网络中下载并保存的图像。当然,实际实现时,生成设备还可以发送图像获取请求至其他设备,接收其他设备返回的目标图像;或者,接收其他设备主动发送的目标图像。
实际实现时,目标图像一般与训练图像为不同的图像。
步骤302,生成目标图像的第一全局特征向量和第一标注向量集合。
输入目标图像至特征提取模型,其中,在生成图像描述的过程中,只需通过特征提取模型中的第一部分对目标图像进行编码,生成目标图像的第一全局特征向量和第一标注向量集合即可。
步骤303,输入目标图像至匹配模型,通过匹配模型生成目标图像的第一多模态特征向量;匹配模型为根据训练图像和训练图像的参考图像描述信息训练得到的模型。
实际实现时,通过训练完毕的匹配模型的第一部分对目标图像进行编码,生成目标图像的第一多模态特征向量。由于本实施例中的匹配模型为上述实施例中已经训练完毕的匹配模型,因此,生成的多模态特征向量包含预测的文本信息。
步骤304,根据第一多模态特征向量、第一全局特征向量和第一标注向量集合,生成目标图像的目标图像描述信息。
实际实现时,通过输入第一多模态特征向量、第一全局特征向量和第一标注向量集合至计算模型,得到目标图像描述信息。其中,本实施例中的计算模型为上述实施例中根据训练图像的图像描述信息和参考 图像描述信息训练得到的计算模型。
在本申请一些实施例中,计算模型包括n个深度网络,n为正整数,则本步骤包括:根据第一多模态特征向量、第一全局特征向量、第一标注向量集合和这n个深度网络,生成图像描述信息。其中,n个深度网络中的至少一个深度网络的输入参数包括拼接向量,当第i个深度网络的输入参数包括拼接向量时,若i=1,则拼接向量为第一多模态特征向量和第一标注向量集合拼接得到的向量,若i>1,则拼接向量为第i-1个深度网络的输出向量和第一多模态特征向量拼接得到的向量,1≤i≤n。在本申请一些实施例中,这n个深度网络可以为带注意力机制的LSTM,也可以为其他的RNN。
为了便于描述,下述以这n个深度网络为带注意力机制的LSTM且n=2来举例说明,则根据第一多模态特征向量、第一全局特征向量、第一标注向量集合和这n个深度网络,生成图像描述信息的具体步骤包括:
第一,将第一多模态特征向量M和第一标注向量集合A拼接,得到第一拼接向量A'。
第二,输入第一拼接向量A'和第一全局特征向量至第1个深度网络,得到第一输出向量h(t)。
第三,将第一输出向量h(t)和第一多模态特征向量M拼接,得到第二拼接向量A"。
第四,输入第二拼接向量A"至第2个深度网络,得到目标图像描述信息。
当深度网络为带注意力机制的LSTM时,第2个深度网络同样可以表示为一个带内部状态的函数:h(t)=LSTM(x(t),h(t-1),A″)。但不同的是,在每一个时间步骤t时,根据第2个深度网络的输出h(t)计算目标图像描述信息中下一个生成的中文字的分布,再通过贪心搜索算法或者束搜索(beam search)算法确定下一个生成的中文字,并将其作为在下一个时间步骤t+1时函数h(t)的输入向量x(t),经过不断地递归运算后,最终得到完整的目标图像描述信息。
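At inference time the distribution produced at each step is decoded word by word as described above. A greedy decoding sketch is given below (beam search would instead keep the k highest-scoring partial sentences at each step); the step-function interface and the special token ids are assumptions:

```python
# Greedy decoding sketch; step_fn and token ids are illustrative assumptions.
import torch

def greedy_decode(step_fn, start_id, end_id, max_len=20):
    """step_fn(prev_word_id, state) -> (logits over the vocabulary, new state)."""
    word, state, caption = start_id, None, []
    for _ in range(max_len):
        logits, state = step_fn(word, state)
        word = int(torch.argmax(logits))       # pick the most probable next word
        if word == end_id:
            break
        caption.append(word)
    return caption
```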
步骤304的具体实施过程同上述实施例中的步骤203类似,在此不 再赘述。
上述实施例提供的图像描述生成方法,通常用于需要具备图像检索功能的设备中。在描述生成模型训练完毕后,该设备会采集大量的目标图像,通过已经训练好的描述生成模型为每一个目标图像生成其所对应的目标图像描述信息,并将目标图像与目标图像描述信息以一一对应的方式存储在设备的数据库中。当用户需要查询某个或者某些图像时,只需输入所需查询的图像的图像描述信息即可,输入的图像描述信息可以为至少一个描述图像内容的关键词,也可以为一句完整的文字描述。该设备会根据用户输入的图像描述信息,在数据库中查找是否存在与之相关的目标图像描述信息,进而找到对应的目标图像,并将找到的目标图像提供给用户。在本申请一些实施例中,计算模型通常包括2个基于注意力机制的LSTM网络,如图4所示,下面以目标图像为图像1来举例说明上述图像描述生成方法的具体步骤。
第一,获取目标图像。图像1为设备在日常生活中采集得到的图像,图像1的图像内容为“一群人围坐在餐桌前拍合照”。
第二,输入目标图像至特征提取模型,通过特征提取模型的第一部分对目标图像进行编码,得到目标图像的全局特征向量和标注向量集合。
第三,输入目标图像至训练完毕的匹配模型,通过该匹配模型的第一部分对目标图像进行编码,得到目标图像的多模态特征向量。其中,该匹配模型为根据训练图像和训练图像的参考图像描述信息训练得到的模型。
第四,将匹配模型输出的多模态特征向量和特征提取模型输出的标注向量集合拼接,得到第一拼接向量,并将第一拼接向量和全局特征向量输入至第1个深度网络,得到第一输出向量。
第五,将多模态特征向量和第一输出向量拼接,得到第二拼接向量,并将第二拼接向量输入至第2个深度网络。在每一个时间步骤t时,根据输出向量h(t)计算目标图像描述信息中下一个生成的中文字的分布,再通过贪心搜索算法或者束搜索算法确定下一个生成的中文字,并将其 作为在下一个时间步骤t+1时函数h(t)的输入向量x(t),比如,在第一个时间步骤时,根据输出向量h(1)计算第一个生成的字的分布,再通过算法确定第一个生成的字为“一”,并将“一”作为第二个时间步骤中函数h(2)的输入向量x(2),同样根据输出向量h(2)计算第二个生成的字的分布,再通过算法确定第二个生成的字为“群”,并将“群”作为第三个时间步骤中函数h(3)的输入向量x(3),以此类推,经过不断地递归运算后,最终得到完整的目标图像描述信息“一群人围坐在餐桌前拍合照”。
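Returning to the image retrieval scenario mentioned before these steps (generated descriptions stored in a database and matched against the user's query), the use case can be illustrated with a toy keyword match over stored descriptions; a real system would use a proper text index, and all names and data here are illustrative:

```python
# Toy sketch of retrieving images by matching query keywords against stored descriptions.
database = {
    "img_001.jpg": "一群人围坐在餐桌前拍合照",
    "img_002.jpg": "一只狗在草地上奔跑",
}

def search(query_keywords):
    return [path for path, desc in database.items()
            if all(kw in desc for kw in query_keywords)]

print(search(["餐桌", "合照"]))   # -> ['img_001.jpg']
```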
综上所述,本实施例提供的图像描述生成方法,通过输入目标图像至匹配模型,得到目标图像的多模态特征向量,由于匹配模型为根据训练图像和训练图像的参考图像描述信息训练得到的模型,因此通过匹配模型生成的多模态特征向量包含预测的文本信息;再将包含预测的文本信息的多模态特征向量输入至计算模型,得到目标图像的目标图像描述信息,使得生成的目标图像描述信息更为准确,达到了提高生成的图像描述信息的准确率的效果。
需要说明的第一点是,本实施例对上述步骤302和303的先后执行顺序并不做限定,只需在步骤304之前执行即可。实际实现时,也可以先执行步骤303,再执行步骤302,还可以在执行步骤302的同时,执行步骤303。
需要说明的第二点是,若上述各个实施例提供的描述生成模型运用在英文场合,则通过描述生成模型生成的目标图像描述信息为英文形式的描述信息,而当用户需要查询某个或者某些图像时,向设备输入的所需查询的图像的图像描述信息,也皆为英文形式的关键词或者文字描述。因此,生成目标图像描述信息的过程会发生微小的变化,下面仍以目标图像为图像1,且计算模型为2个带注意力机制的LSTM网络来举例说明,具体步骤包括:
第一,获取目标图像,图像1为设备在日常生活中采集得到的图像, 图像1的图像内容为“a group of people are sitting at the table and taking a photo”。
第二,输入目标图像至特征提取模型,通过特征提取模型的第一部分对目标图像进行编码,生成目标图像的全局特征向量和标注向量集合。
第三,输入目标图像至训练完毕的匹配模型,通过该匹配模型的第一部分对目标图像进行编码,生成目标图像的多模态特征向量。其中,该匹配模型为根据训练图像和训练图像的参考图像描述信息训练得到的模型。
第四,将匹配模型输出的多模态特征向量和特征提取模型输出的标注向量集合拼接,得到第一拼接向量,并将第一拼接向量和全局特征向量输入至第1个深度网络,得到第一输出向量。
第五,将多模态特征向量和第一输出向量拼接,得到第二拼接向量,并将第二拼接向量输入至第2个深度网络。在每一个时间步骤t时,根据输出向量h(t)计算目标图像描述信息中下一个生成的英文单词的分布,再通过贪心搜索算法或者束搜索算法确定下一个生成的英文单词,并将其作为在下一个时间步骤t+1时函数h(t)的输入向量x(t),比如,在第一个时间步骤时,根据输出向量h(1)计算第一个生成的英文单词的分布,再通过算法确定第一个生成的英文单词为“a”,并将“a”作为第二个时间步骤中函数h(2)的输入向量x(2),同样根据输出向量h(2)计算第二个生成的英文单词的分布,再通过算法确定第二个生成的英文单词为“group”,并将“group”作为第三个时间步骤中函数h(3)的输入向量x(3),以此类推,经过不断地递归运算后,最终得到完整的目标图像描述信息“a group of people are sitting at the table and taking a photo”。
请参考图5A,其示出了本申请一个实施例提供的图像描述生成装置的结构示意图,如图5A所示,该图像描述生成装置可以包括:获取模块510和生成模块520。
获取模块510,用于获取目标图像;
生成模块520,用于生成所述目标图像的第一全局特征向量和第一 标注向量集合;
所述生成模块520,还用于输入所述获取模块510获取到的所述目标图像至匹配模型,通过所述匹配模型生成所述目标图像的第一多模态特征向量;所述匹配模型为根据训练图像和所述训练图像的参考图像描述信息训练得到的模型;
所述生成模块520,还用于根据所述第一多模态特征向量、所述第一全局特征向量和所述第一标注向量集合,生成所述目标图像的目标图像描述信息。
综上所述,本实施例提供的图像描述生成装置,通过输入目标图像至匹配模型,得到目标图像的多模态特征向量,由于匹配模型为根据训练图像和训练图像的参考图像描述信息训练得到的模型,因此通过匹配模型生成的多模态特征向量包含预测的文本信息;再将包含预测的文本信息的多模态特征向量输入至计算模型,使得通过计算模型生成的目标图像的目标图像描述信息更为准确,达到了提高生成的图像描述信息的准确率的效果。
基于上述实施例提供的图像描述生成装置,在本申请一些实施例中,
所述生成模块520,还用于输入所述第一多模态特征向量、所述第一全局特征向量和所述第一标注向量集合至计算模型,得到所述目标图像描述信息;所述计算模型为根据所述训练图像的图像描述信息和所述参考图像描述信息训练得到的模型。
在本申请一些实施例中,所述计算模型包括n个深度网络,n为正整数。
所述生成模块520,还用于根据所述第一多模态特征向量、所述第一全局特征向量、所述第一标注向量集合和所述n个深度网络,生成所述目标图像描述信息;
其中,所述n个深度网络中的至少一个深度网络的输入参数包括拼接向量,当第i个深度网络的输入参数包括所述拼接向量时,若i=1,则所述拼接向量为所述第一多模态特征向量和所述第一标注向量集合拼 接得到的向量,若i>1,则所述拼接向量为第i-1个深度网络的输出向量和所述第一多模态特征向量拼接得到的向量,1≤i≤n。
在本申请一些实施例中,所述n=2,所述装置还包括:拼接模块530,如图5B所示。
拼接模块530,用于将所述第一多模态特征向量和所述第一标注向量集合拼接,得到第一拼接向量;
所述生成模块520,还用于输入所述第一拼接向量和所述第一全局特征向量至第1个深度网络,得到第一输出向量;
所述拼接模块530,还用于将所述第一输出向量和所述第一多模态特征向量拼接,得到第二拼接向量;
所述生成模块520,还用于输入所述第二拼接向量至第2个深度网络,得到所述目标图像描述信息。
在本申请一些实施例中,所述装置还包括:训练模块540,如图5C所示。
所述获取模块510,还用于获取所述训练图像的第二全局特征向量和第二标注向量集合,以及所述训练图像的参考图像描述信息的文本特征向量;
训练模块540,用于根据所述第二全局特征向量和所述文本特征向量训练所述匹配模型。
在本申请一些实施例中,
所述生成模块520,还用于通过训练得到的匹配模型生成所述训练图像的第二多模态特征向量;
所述生成模块520,还用于输入所述第二多模态特征向量、所述第二全局特征向量和所述第二标注向量集合至计算模型,得到所述训练图像的图像描述信息;
所述训练模块540,还用于在所述参考图像描述信息和生成的所述图像描述信息不匹配时,根据所述图像描述信息和所述参考图像描述信息训练所述计算模型。
需要说明的是:上述实施例提供的图像描述生成装置,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将服务器或终端等计算设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的图像描述生成装置和图像描述生成方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
请参考图6A,其示出了本申请一个实施例提供的模型训练装置的结构示意图,如图6A所示,该模型训练装置用于训练如上述实施例中所述的匹配模型和计算模型,该装置可以包括:获取模块610和训练模块620。
获取模块610,用于获取训练图像的全局特征向量和标注向量集合,以及所述训练图像的参考图像描述信息的文本特征向量;
训练模块620,用于根据所述全局特征向量和所述文本特征向量训练匹配模型。
综上所述,本实施例提供的模型训练装置,通过根据训练图像和其所对应的参考图像描述信息训练匹配模型,使得通过训练完毕后的匹配模型生成的多模态特征向量包含预测的文本信息;再将包含预测的文本信息的多模态特征向量输入至计算模型,使得通过计算模型生成的训练图像的描述信息更为准确;最后根据较为准确的训练图像的描述信息和训练图像的参考图像描述信息训练计算模型;达到了提高描述生成模型生成的图像描述信息的准确率的效果。
基于上述实施例提供的模型训练装置,在本申请一些实施例中,所述装置还包括:生成模块630,如图6B所示。
生成模块630,用于通过训练得到的匹配模型生成所述训练图像的多模态特征向量;
所述生成模块630,还用于输入所述多模态特征向量、所述全局特征向量和所述标注向量集合至计算模型,得到所述训练图像的图像描述 信息;
所述训练模块620,还用于在所述参考图像描述信息和生成的所述图像描述信息不匹配时,根据所述图像描述信息和所述参考图像描述信息训练所述计算模型
在本申请一些实施例中,所述计算模型包括n个深度网络,n为正整数,
所述生成模块630,还用于根据所述多模态特征向量、所述全局特征向量、所述标注向量集合和所述n个深度网络,生成所述图像描述信息;
其中,所述n个深度网络中的至少一个深度网络的输入参数包括拼接向量,当第i个深度网络的输入参数包括所述拼接向量时,若i=1,则所述拼接向量为所述多模态特征向量和所述标注向量集合拼接得到的向量,若i>1,则所述拼接向量为第i-1个深度网络的输出向量和所述多模态特征向量拼接得到的向量,1≤i≤n。
在本申请一些实施例中,所述n=2,所述装置还包括:拼接模块640,如图6C所示。
拼接模块640,用于将所述多模态特征向量和所述标注向量集合拼接,得到第一拼接向量;
所述生成模块630,还用于输入所述第一拼接向量和所述全局特征向量至第1个深度网络,得到第一输出向量;
所述拼接模块640,还用于将所述第一输出向量和所述多模态特征向量拼接,得到第二拼接向量;
所述生成模块630,还用于输入所述第二拼接向量至第2个深度网络,得到所述图像描述信息。
需要说明的是:上述实施例提供的模型训练装置,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将服务器或终端等计算设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上 述实施例提供的模型训练装置和模型训练方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
本申请实施例还提供了一种计算机可读存储介质,该计算机可读存储介质可以是存储器中所包含的计算机可读存储介质;也可以是单独存在,未装配入终端或者服务器中的计算机可读存储介质。该计算机可读存储介质存储有至少一条指令、至少一段程序、代码集或指令集,并且,当该计算机可读存储介质用于生成设备中时,所述至少一条指令、所述至少一段程序、所述代码集或指令集由处理器加载并执行以实现上述实施例中的图像描述生成方法;
当该计算机可读存储介质用于训练设备中时,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行以实现上述实施例中的模型训练方法。
图7其示出了本申请一个实施例提供的终端700的框图,该终端可以包括射频(RF,Radio Frequency)电路701、包括有一个或一个以上计算机可读存储介质的存储器702、输入单元703、显示单元704、传感器705、音频电路706、无线保真(WiFi,Wireless Fidelity)模块707、包括有一个或者一个以上处理核心的处理器708、以及电源709等部件。本领域技术人员可以理解,图7中示出的终端结构并不构成对终端的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。其中:
RF电路701可用于收发信息或通话过程中,信号的接收和发送,特别地,将基站的下行信息接收后,交由一个或者一个以上处理器708处理;另外,将涉及上行的数据发送给基站。通常,RF电路701包括但不限于天线、至少一个放大器、调谐器、一个或多个振荡器、用户身份模块(SIM,Subscriber Identity Module)卡、收发信机、耦合器、低噪声放大器(LNA,Low Noise Amplifier)、双工器等。此外,RF电路701还可以通过无线通信与网络和其他设备通信。所述无线通信可以使 用任一通信标准或协议,包括但不限于全球移动通讯系统(GSM,Global System of Mobile communication)、通用分组无线服务(GPRS,General Packet Radio Service)、码分多址(CDMA,Code Division Multiple Access)、宽带码分多址(WCDMA,Wideband Code Division Multiple Access)、长期演进(LTE,Long Term Evolution)、电子邮件、短消息服务(SMS,Short Messaging Service)等。
存储器702可用于存储软件程序以及模块,处理器708通过运行存储在存储器702的软件程序以及模块,从而执行各种功能应用以及数据处理。存储器702可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需的应用程序(比如声音播放功能、图像播放功能等)等;存储数据区可存储根据终端的使用所创建的数据(比如音频数据、电话本等)等。此外,存储器702可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他易失性固态存储器件。相应地,存储器702还可以包括存储器控制器,以提供处理器708和输入单元703对存储器702的访问。
输入单元703可用于接收输入的数字或字符信息,以及产生与用户设置以及功能控制有关的键盘、鼠标、操作杆、光学或者轨迹球信号输入。具体地,在一个具体的实施例中,输入单元703可包括触敏表面以及其他输入设备。触敏表面,也称为触摸显示屏或者触控板,可收集用户在其上或附近的触摸操作(比如用户使用手指、触笔等任何适合的物体或附件在触敏表面上或在触敏表面附近的操作),并根据预先设定的程式驱动相应的连接装置。在本申请一些实施例中,触敏表面可包括触摸检测装置和触摸控制器两个部分。其中,触摸检测装置检测用户的触摸方位,并检测触摸操作带来的信号,将信号传送给触摸控制器;触摸控制器从触摸检测装置上接收触摸信息,并将它转换成触点坐标,再送给处理器708,并能接收处理器708发来的命令并加以执行。此外,可以采用电阻式、电容式、红外线以及表面声波等多种类型实现触敏表面。除了触敏表面,输入单元703还可以包括其他输入设备。具体地,其他 输入设备可以包括但不限于物理键盘、功能键(比如音量控制按键、开关按键等)、轨迹球、鼠标、操作杆等中的一种或多种。
显示单元704可用于显示由用户输入的信息或提供给用户的信息以及终端的各种图形用户接口,这些图形用户接口可以由图形、文本、图标、视频和其任意组合来构成。显示单元704可包括显示面板,在本申请一些实施例中,可以采用液晶显示器(LCD,Liquid Crystal Display)、有机发光二极管(OLED,Organic Light-Emitting Diode)等形式来配置显示面板。进一步的,触敏表面可覆盖显示面板,当触敏表面检测到在其上或附近的触摸操作后,传送给处理器708以确定触摸事件的类型,随后处理器708根据触摸事件的类型在显示面板上提供相应的视觉输出。虽然在图7中,触敏表面与显示面板是作为两个独立的部件来实现输入和输入功能,但是在某些实施例中,可以将触敏表面与显示面板集成而实现输入和输出功能。
终端还可包括至少一种传感器705,比如光传感器、运动传感器以及其他传感器。具体地,光传感器可包括环境光传感器及接近传感器,其中,环境光传感器可根据环境光线的明暗来调节显示面板的亮度,接近传感器可在终端移动到耳边时,关闭显示面板和/或背光。作为运动传感器的一种,重力加速度传感器可检测各个方向上(一般为三轴)加速度的大小,静止时可检测出重力的大小及方向,可用于识别手机姿态的应用(比如横竖屏切换、相关游戏、磁力计姿态校准)、振动识别相关功能(比如计步器、敲击)等;至于终端还可配置的陀螺仪、气压计、湿度计、温度计、红外线传感器等其他传感器,在此不再赘述。
音频电路706、扬声器,传声器可提供用户与终端之间的音频接口。音频电路706可将接收到的音频数据转换后的电信号,传输到扬声器,由扬声器转换为声音信号输出;另一方面,传声器将收集的声音信号转换为电信号,由音频电路706接收后转换为音频数据,再将音频数据输出处理器708处理后,经RF电路701以发送给比如另一终端,或者将音频数据输出至存储器702以便进一步处理。音频电路706还可能包括耳塞插孔,以提供外设耳机与终端的通信。
WiFi属于短距离无线传输技术,终端通过WiFi模块707可以帮助用户收发电子邮件、浏览网页和访问流式媒体等,它为用户提供了无线的宽带互联网访问。虽然图7示出了WiFi模块707,但是可以理解的是,其并不属于终端的必须构成,完全可以根据需要在不改变申请的本质的范围内而省略。
处理器708是终端的控制中心,利用各种接口和线路连接整个手机的各个部分,通过运行或执行存储在存储器702内的软件程序和/或模块,以及调用存储在存储器702内的数据,执行终端的各种功能和处理数据,从而对终端进行整体监控。在本申请一些实施例中,处理器708可包括一个或多个处理核心;在本申请一些实施例中,处理器708可集成应用处理器和调制解调处理器,其中,应用处理器主要处理操作系统、用户界面和应用程序等,调制解调处理器主要处理无线通信。可以理解的是,上述调制解调处理器也可以不集成到处理器708中。
终端还包括给各个部件供电的电源709(比如电池),优选的,电源可以通过电源管理系统与处理器708逻辑相连,从而通过电源管理系统实现管理充电、放电、以及功耗管理等功能。电源709还可以包括一个或一个以上的直流或交流电源、再充电系统、电源故障检测电路、电源转换器或者逆变器、电源状态指示器等任意组件。
尽管未示出,终端还可以包括摄像头、蓝牙模块等,在此不再赘述。具体在本实施例中,终端中的处理器708会运行存储在存储器702中的至少一条指令,从而实现上述各个方法实施例中所提供的图像描述生成方法,和/或,模型训练方法。
请参考图8,其示出了本申请一个实施例提供的服务器的结构示意图。该服务器用于实施上述各个实施例中所提供的图像描述生成方法,和/或,模型训练方法。具体来讲:
所述服务器800包括中央处理单元(CPU)801、包括随机存取存储器(RAM)802和只读存储器(ROM)803的系统存储器804,以及连接系统存储器804和中央处理单元801的系统总线805。所述服务器800 还包括帮助计算机内的各个器件之间传输信息的基本输入/输出系统(I/O系统)806,和用于存储操作系统813、应用程序814和其他程序模块815的大容量存储设备807。
所述基本输入/输出系统806包括有用于显示信息的显示器808和用于用户输入信息的诸如鼠标、键盘之类的输入设备809。其中所述显示器808和输入设备809都通过连接到系统总线805的输入输出控制器810连接到中央处理单元801。所述基本输入/输出系统806还可以包括输入输出控制器810以用于接收和处理来自键盘、鼠标、或电子触控笔等多个其他设备的输入。类似地,输入输出控制器810还提供输出到显示屏、打印机或其他类型的输出设备。
所述大容量存储设备807通过连接到系统总线805的大容量存储控制器(未示出)连接到中央处理单元801。所述大容量存储设备807及其相关联的计算机可读介质为服务器800提供非易失性存储。也就是说,所述大容量存储设备807可以包括诸如硬盘或者CD-ROM驱动器之类的计算机可读介质(未示出)。
不失一般性,所述计算机可读介质可以包括计算机存储介质和通信介质。计算机存储介质包括以用于存储诸如计算机可读指令、数据结构、程序模块或其他数据等信息的任何方法或技术实现的易失性和非易失性、可移动和不可移动介质。计算机存储介质包括RAM、ROM、EPROM、EEPROM、闪存或其他固态存储其技术,CD-ROM、DVD或其他光学存储、磁带盒、磁带、磁盘存储或其他磁性存储设备。当然,本领域技术人员可知所述计算机存储介质不局限于上述几种。上述的系统存储器804和大容量存储设备807可以统称为存储器。
根据本申请的各种实施例,所述服务器800还可以通过诸如因特网等网络连接到网络上的远程计算机运行。也即服务器800可以通过连接在所述系统总线805上的网络接口单元811连接到网络812,或者说,也可以使用网络接口单元811来连接到其他类型的网络或远程计算机系统(未示出)。
所述存储器还包括至少一条指令,且经配置以由一个或者一个以上 处理器执行。上述至少一条指令包含用于执行上述各个实施例所提供的图像描述生成方法,和/或,模型训练方法的指令。
应当理解的是,在本文中使用的,除非上下文清楚地支持例外情况,单数形式“一个”(“a”、“an”、“the”)旨在也包括复数形式。还应当理解的是,在本文中使用的“和/或”是指包括一个或者一个以上相关联地列出的项目的任意和所有可能组合。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。
以上所述仅为本申请的示例性实施例,并不用以限制本申请,凡在本申请实施例的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (14)

  1. 一种图像描述生成方法,应用于计算设备,所述方法包括:
    获取目标图像;
    生成所述目标图像的第一全局特征向量和第一标注向量集合;
    输入所述目标图像至匹配模型,通过所述匹配模型生成所述目标图像的第一多模态特征向量,其中,所述匹配模型为根据训练图像和所述训练图像的参考图像描述信息训练得到的模型;
    根据所述第一多模态特征向量、所述第一全局特征向量和所述第一标注向量集合,生成所述目标图像的目标图像描述信息。
  2. 根据权利要求1所述的方法,所述根据所述第一多模态特征向量、所述第一全局特征向量和所述第一标注向量集合,生成所述目标图像的目标图像描述信息,包括:
    输入所述第一多模态特征向量、所述第一全局特征向量和所述第一标注向量集合至计算模型,得到所述目标图像描述信息,其中,所述计算模型为根据所述训练图像的图像描述信息和所述参考图像描述信息训练得到的模型。
  3. 根据权利要求2所述的方法,所述计算模型包括n个深度网络,n为正整数;
    所述输入所述第一多模态特征向量、所述第一全局特征向量和所述第一标注向量集合至计算模型,得到所述目标图像描述信息,包括:
    根据所述第一多模态特征向量、所述第一全局特征向量、所述第一标注向量集合和所述n个深度网络,生成所述目标图像描述信息;
    其中,所述n个深度网络中的至少一个深度网络的输入参数包括拼接向量,当第i个深度网络的输入参数包括所述拼接向量时,若i等于1,则所述拼接向量为所述第一多模态特征向量和所述第一标注向量集合拼接得到的向量,若i大于1,则所述拼接向量为第i-1个深度网络的输出向量和所述第一多模态特征向量拼接得到的向量,其中,i大于或等于1,且小于或等于n。
  4. 根据权利要求3所述的方法,所述n等于2;
    所述根据所述第一多模态特征向量、所述第一全局特征向量、所述第一标注向量和所述n个深度网络,生成所述目标图像描述信息,包括:
    将所述第一多模态特征向量和所述第一标注向量集合拼接,得到第一拼接向量;
    输入所述第一拼接向量和所述第一全局特征向量至第1个深度网络,得到第一输出向量;
    将所述第一输出向量和所述第一多模态特征向量拼接,得到第二拼接向量;
    输入所述第二拼接向量至第2个深度网络,得到所述目标图像描述信息。
  5. 根据权利要求1至4任一所述的方法,所述方法还包括:
    获取所述训练图像的第二全局特征向量和第二标注向量集合,以及所述训练图像的参考图像描述信息的文本特征向量;
    根据所述第二全局特征向量和所述文本特征向量训练所述匹配模型。
  6. 根据权利要求5所述的方法,所述方法还包括:
    通过训练得到的匹配模型生成所述训练图像的第二多模态特征向量;
    输入所述第二多模态特征向量、所述第二全局特征向量和所述第二标注向量集合至计算模型,得到所述训练图像的图像描述信息;
    若所述参考图像描述信息和所述训练图像的图像描述信息不匹配,则根据所述训练图像的图像描述信息和所述参考图像描述信息训练所述计算模型。
  7. 一种模型训练方法,应用于计算设备,用于训练如权利要求1至6任一所述的所述匹配模型和计算模型,所述方法包括:
    获取训练图像的全局特征向量和标注向量集合,以及所述训练图像的参考图像描述信息的文本特征向量;
    根据所述全局特征向量和所述文本特征向量训练匹配模型。
  8. 根据权利要求7所述的方法,所述方法还包括:
    通过训练得到的匹配模型生成所述训练图像的多模态特征向量;
    输入所述多模态特征向量、所述全局特征向量和所述标注向量集合至计算模型,得到所述训练图像的图像描述信息;
    若所述参考图像描述信息和所述训练图像的图像描述信息不匹配,则根据所述图像描述信息和所述参考图像描述信息训练所述计算模型。
  9. 根据权利要求8所述的方法,所述计算模型包括n个深度网络,n为正整数;
    所述输入所述多模态特征向量、所述全局特征向量和所述标注向量集合至计算模型,得到所述训练图像的图像描述信息,包括:
    根据所述多模态特征向量、所述全局特征向量、所述标注向量集合和所述n个深度网络,生成所述图像描述信息;
    其中,所述n个深度网络中的至少一个深度网络的输入参数包括拼接向量,当第i个深度网络的输入参数包括所述拼接向量时,若i等于1,则所述拼接向量为所述多模态特征向量和所述标注向量集合拼接得到的向量,若i大于1,则所述拼接向量为第i-1个深度网络的输出向量和所述多模态特征向量拼接得到的向量,其中,i大于或等于1,且小于或等于n。
  10. 根据权利要求9所述的方法,所述n=2;
    所述根据所述多模态特征向量、所述全局特征向量、所述标注向量集合和所述n个深度网络,生成所述图像描述信息,包括:
    将所述多模态特征向量和所述标注向量集合拼接,得到第一拼接向量;
    输入所述第一拼接向量和所述全局特征向量至第1个深度网络,得到第一输出向量;
    将所述第一输出向量和所述多模态特征向量拼接,得到第二拼接向量;
    输入所述第二拼接向量至第2个深度网络,得到所述图像描述信息。
  11. 一种生成设备,所述生成设备包括处理器和存储器,所述存储器中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行以实现如权利要求1至6任一所述的图像描述生成方法。
  12. 一种训练设备,所述训练设备包括处理器和存储器,所述存储器中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行以实现如权利要求7至10任一所述的模型训练方法。
  13. 一种计算机可读存储介质,所述存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由处理器加载并执行以实现如权利要求1至6任一所述的图像描述生成方法。
  14. 一种计算机可读存储介质,所述存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由处理器加载并执行以实现如权利要求7至10任一所述的模型训练方法。
PCT/CN2018/102469 2017-08-30 2018-08-27 图像描述生成方法、模型训练方法、设备和存储介质 WO2019042244A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US16/548,621 US11270160B2 (en) 2017-08-30 2019-08-22 Image description generation method, model training method, device and storage medium
US17/589,726 US11907851B2 (en) 2017-08-30 2022-01-31 Image description generation method, model training method, device and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710763735.3 2017-08-30
CN201710763735.3A CN108305296B (zh) 2017-08-30 2017-08-30 图像描述生成方法、模型训练方法、设备和存储介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/548,621 Continuation US11270160B2 (en) 2017-08-30 2019-08-22 Image description generation method, model training method, device and storage medium

Publications (1)

Publication Number Publication Date
WO2019042244A1 true WO2019042244A1 (zh) 2019-03-07

Family

ID=62869528

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/102469 WO2019042244A1 (zh) 2017-08-30 2018-08-27 图像描述生成方法、模型训练方法、设备和存储介质

Country Status (4)

Country Link
US (2) US11270160B2 (zh)
CN (2) CN108305296B (zh)
TW (1) TWI803514B (zh)
WO (1) WO2019042244A1 (zh)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008922A (zh) * 2019-04-12 2019-07-12 腾讯科技(深圳)有限公司 用于终端设备的图像处理方法、设备、装置、介质
CN112529104A (zh) * 2020-12-23 2021-03-19 东软睿驰汽车技术(沈阳)有限公司 一种车辆故障预测模型生成方法、故障预测方法及装置
CN112836754A (zh) * 2021-02-05 2021-05-25 方玉明 一种面向图像描述模型泛化能力评估方法
CN114417974A (zh) * 2021-12-22 2022-04-29 北京百度网讯科技有限公司 模型训练方法、信息处理方法、装置、电子设备和介质
CN114743018A (zh) * 2022-04-21 2022-07-12 平安科技(深圳)有限公司 图像描述生成方法、装置、设备及介质
CN115019071A (zh) * 2022-05-19 2022-09-06 昆明理工大学 光学图像与sar图像匹配方法、装置、电子设备及介质
CN114743018B (zh) * 2022-04-21 2024-05-31 平安科技(深圳)有限公司 图像描述生成方法、装置、设备及介质

Families Citing this family (65)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108305296B (zh) 2017-08-30 2021-02-26 深圳市腾讯计算机系统有限公司 图像描述生成方法、模型训练方法、设备和存储介质
CN110163050B (zh) * 2018-07-23 2022-09-27 腾讯科技(深圳)有限公司 一种视频处理方法及装置、终端设备、服务器及存储介质
WO2020019220A1 (zh) * 2018-07-25 2020-01-30 华为技术有限公司 在预览界面中显示业务信息的方法及电子设备
CN108900856B (zh) * 2018-07-26 2020-04-28 腾讯科技(深圳)有限公司 一种视频帧率预测方法、装置及设备
CN109241998B (zh) * 2018-08-06 2020-12-29 百度在线网络技术(北京)有限公司 模型训练方法、装置、设备及存储介质
CN109344288B (zh) * 2018-09-19 2021-09-24 电子科技大学 一种基于多模态特征结合多层注意力机制的结合视频描述方法
US11308133B2 (en) * 2018-09-28 2022-04-19 International Business Machines Corporation Entity matching using visual information
CN111401394B (zh) * 2019-01-02 2023-04-07 中国移动通信有限公司研究院 一种图像标注方法及装置、计算机可读存储介质
CN109920016B (zh) * 2019-03-18 2021-06-25 北京市商汤科技开发有限公司 图像生成方法及装置、电子设备和存储介质
CN109947526B (zh) * 2019-03-29 2023-04-11 北京百度网讯科技有限公司 用于输出信息的方法和装置
CN111832584A (zh) * 2019-04-16 2020-10-27 富士通株式会社 图像处理装置及其训练装置和训练方法
CN110188620B (zh) * 2019-05-08 2022-11-04 腾讯科技(深圳)有限公司 对抗测试看图说话系统的方法和相关装置
US11315038B2 (en) * 2019-05-16 2022-04-26 International Business Machines Corporation Method to measure similarity of datasets for given AI task
CN113743535B (zh) * 2019-05-21 2024-05-24 北京市商汤科技开发有限公司 神经网络训练方法及装置以及图像处理方法及装置
CN110349229B (zh) * 2019-07-09 2023-06-02 北京金山数字娱乐科技有限公司 一种图像描述方法及装置
CN110413814A (zh) * 2019-07-12 2019-11-05 智慧芽信息科技(苏州)有限公司 图像数据库建立方法、搜索方法、电子设备和存储介质
CN110443863B (zh) * 2019-07-23 2023-04-07 中国科学院深圳先进技术研究院 文本生成图像的方法、电子设备和存储介质
US11023783B2 (en) * 2019-09-11 2021-06-01 International Business Machines Corporation Network architecture search with global optimization
US10943353B1 (en) 2019-09-11 2021-03-09 International Business Machines Corporation Handling untrainable conditions in a network architecture search
CN112488144B (zh) * 2019-09-12 2024-03-19 中国移动通信集团广东有限公司 网络设置提示生成方法、装置及电子设备、存储介质
US11429809B2 (en) 2019-09-24 2022-08-30 Beijing Sensetime Technology Development Co., Ltd Image processing method, image processing device, and storage medium
CN110853653B (zh) * 2019-11-21 2022-04-12 中科智云科技有限公司 一种基于自注意力和迁移学习的声纹识别方法
CN113094538A (zh) * 2019-12-23 2021-07-09 中国电信股份有限公司 图像的检索方法、装置和计算机可读存储介质
CN111160275B (zh) * 2019-12-30 2023-06-23 深圳元戎启行科技有限公司 行人重识别模型训练方法、装置、计算机设备和存储介质
US11410083B2 (en) * 2020-01-07 2022-08-09 International Business Machines Corporation Determining operating range of hyperparameters
CN113139566B (zh) * 2020-01-20 2024-03-12 北京达佳互联信息技术有限公司 图像生成模型的训练方法及装置、图像处理方法及装置
CN111368898B (zh) * 2020-02-28 2022-10-25 同济大学 一种基于长短时记忆网络变体的图像描述生成方法
CN111340195B (zh) * 2020-03-09 2023-08-22 创新奇智(上海)科技有限公司 网络模型的训练方法及装置、图像处理方法及存储介质
CN111444367B (zh) * 2020-03-24 2022-10-14 哈尔滨工程大学 一种基于全局与局部注意力机制的图像标题生成方法
US11620475B2 (en) * 2020-03-25 2023-04-04 Ford Global Technologies, Llc Domain translation network for performing image translation
CN111753825A (zh) * 2020-03-27 2020-10-09 北京京东尚科信息技术有限公司 图像描述生成方法、装置、系统、介质及电子设备
CN113538604B (zh) * 2020-04-21 2024-03-19 中移(成都)信息通信科技有限公司 图像生成方法、装置、设备及介质
TWI752478B (zh) * 2020-04-27 2022-01-11 台達電子工業股份有限公司 影像處理方法與影像處理系統
CN111598041B (zh) * 2020-05-25 2023-05-02 青岛联合创智科技有限公司 一种用于物品查找的图像生成文本方法
CN111767727B (zh) * 2020-06-24 2024-02-06 北京奇艺世纪科技有限公司 数据处理方法及装置
CN111914672B (zh) * 2020-07-08 2023-08-04 浙江大华技术股份有限公司 图像标注方法和装置及存储介质
CN111881926A (zh) * 2020-08-24 2020-11-03 Oppo广东移动通信有限公司 图像生成、图像生成模型的训练方法、装置、设备及介质
CN112242185A (zh) * 2020-09-09 2021-01-19 山东大学 基于深度学习的医学图像报告自动生成方法及系统
CN112073582B (zh) * 2020-09-09 2021-04-06 中国海洋大学 基于触摸行为序列的智能手机使用情境识别方法
CN112183022A (zh) * 2020-09-25 2021-01-05 北京优全智汇信息技术有限公司 一种估损方法和装置
CN112200268A (zh) * 2020-11-04 2021-01-08 福州大学 一种基于编码器-解码器框架的图像描述方法
CN112487891B (zh) * 2020-11-17 2023-07-18 云南电网有限责任公司 一种应用于电力作业现场的视觉智能动态识别模型构建方法
CN112270163B (zh) * 2020-12-07 2021-09-10 北京沃东天骏信息技术有限公司 一种文本生成方法及装置、存储介质
CN112669262B (zh) * 2020-12-08 2023-01-06 上海交通大学 一种电机轮轴震动异常检测与预测系统与方法
CN112802086A (zh) * 2020-12-30 2021-05-14 深兰人工智能芯片研究院(江苏)有限公司 密度估计方法、装置、电子设备及存储介质
CN112966617B (zh) * 2021-03-11 2022-10-21 北京三快在线科技有限公司 摆盘图像的生成方法、图像生成模型的训练方法及装置
CN112926671B (zh) * 2021-03-12 2024-04-19 云知声智能科技股份有限公司 一种图像文本匹配的方法、装置、电子设备和存储介质
CN113112001A (zh) * 2021-04-01 2021-07-13 北京嘀嘀无限科技发展有限公司 一种充电数据处理方法、装置和电子设备
CN113077456B (zh) * 2021-04-20 2022-01-04 北京大学 基于功能性磁共振成像构建网络模型的训练方法和装置
CN113408430B (zh) * 2021-06-22 2022-09-09 哈尔滨理工大学 基于多级策略和深度强化学习框架的图像中文描述系统及方法
CN113673349B (zh) * 2021-07-20 2022-03-11 广东技术师范大学 基于反馈机制的图像生成中文文本方法、系统及装置
CN113487762B (zh) * 2021-07-22 2023-07-04 东软睿驰汽车技术(沈阳)有限公司 一种编码模型生成方法、充电数据获取方法及装置
US11445267B1 (en) * 2021-07-23 2022-09-13 Mitsubishi Electric Research Laboratories, Inc. Low-latency captioning system
CN113420880B (zh) * 2021-08-24 2021-11-19 苏州浪潮智能科技有限公司 网络模型训练方法、装置、电子设备及可读存储介质
CN113822348A (zh) * 2021-09-13 2021-12-21 深圳中兴网信科技有限公司 模型训练方法、训练装置、电子设备和可读存储介质
CN114298121A (zh) * 2021-10-09 2022-04-08 腾讯科技(深圳)有限公司 基于多模态的文本生成方法、模型训练方法和装置
CN114399629A (zh) * 2021-12-22 2022-04-26 北京沃东天骏信息技术有限公司 一种目标检测模型的训练方法、目标检测的方法和装置
CN114003758B (zh) * 2021-12-30 2022-03-08 航天宏康智能科技(北京)有限公司 图像检索模型的训练方法和装置以及检索方法和装置
CN114881242B (zh) * 2022-04-21 2023-03-24 西南石油大学 一种基于深度学习的图像描述方法及系统、介质和电子设备
CN115359323B (zh) * 2022-08-31 2023-04-25 北京百度网讯科技有限公司 图像的文本信息生成方法和深度学习模型的训练方法
CN115587347A (zh) * 2022-09-28 2023-01-10 支付宝(杭州)信息技术有限公司 虚拟世界的内容处理方法及装置
CN115856425B (zh) * 2022-11-21 2023-10-17 中国人民解放军32802部队 一种基于隐空间概率预测的频谱异常检测方法及装置
CN115512006B (zh) * 2022-11-23 2023-04-07 有米科技股份有限公司 基于多图像元素的图像智能合成方法及装置
CN116819489A (zh) * 2023-08-25 2023-09-29 摩尔线程智能科技(北京)有限责任公司 动态物体检测方法、模型训练方法、装置、设备及介质
CN117710234B (zh) * 2024-02-06 2024-05-24 青岛海尔科技有限公司 基于大模型的图片生成方法、装置、设备和介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070245400A1 (en) * 1998-11-06 2007-10-18 Seungyup Paek Video description system and method
CN105631468A (zh) * 2015-12-18 2016-06-01 华南理工大学 一种基于rnn的图片描述自动生成方法
CN105760507A (zh) * 2016-02-23 2016-07-13 复旦大学 基于深度学习的跨模态主题相关性建模方法
CN106777125A (zh) * 2016-12-16 2017-05-31 广东顺德中山大学卡内基梅隆大学国际联合研究院 一种基于神经网络及图像关注点的图像描述生成方法
CN106846306A (zh) * 2017-01-13 2017-06-13 重庆邮电大学 一种超声图像自动描述方法和系统
CN108305296A (zh) * 2017-08-30 2018-07-20 深圳市腾讯计算机系统有限公司 图像描述生成方法、模型训练方法、设备和存储介质

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102902821B (zh) * 2012-11-01 2015-08-12 北京邮电大学 基于网络热点话题的图像高级语义标注、检索方法及装置
CN105005755B (zh) * 2014-04-25 2019-03-29 北京邮电大学 三维人脸识别方法和系统
US9311342B1 (en) * 2015-02-09 2016-04-12 Sony Corporation Tree based image storage system
CN104778272B (zh) * 2015-04-24 2018-03-02 西安交通大学 一种基于区域挖掘和空间编码的图像位置估计方法
CN106326288B (zh) * 2015-06-30 2019-12-03 阿里巴巴集团控股有限公司 图像搜索方法及装置
US9836671B2 (en) * 2015-08-28 2017-12-05 Microsoft Technology Licensing, Llc Discovery of semantic similarities between images and text
CN105389326B (zh) * 2015-09-16 2018-08-31 中国科学院计算技术研究所 基于弱匹配概率典型相关性模型的图像标注方法
US10423874B2 (en) * 2015-10-02 2019-09-24 Baidu Usa Llc Intelligent image captioning
US10504010B2 (en) * 2015-10-02 2019-12-10 Baidu Usa Llc Systems and methods for fast novel visual concept learning from sentence descriptions of images
CN105701502B (zh) * 2016-01-06 2020-11-10 福州大学 一种基于蒙特卡罗数据均衡的图像自动标注方法
US9811765B2 (en) * 2016-01-13 2017-11-07 Adobe Systems Incorporated Image captioning with weak supervision
CN105893573B (zh) * 2016-03-31 2019-07-23 天津大学 一种基于地点的多模态媒体数据主题提取模型
CN106446782A (zh) * 2016-08-29 2017-02-22 北京小米移动软件有限公司 图像识别方法及装置
CN106650789B (zh) * 2016-11-16 2023-04-07 同济大学 一种基于深度lstm网络的图像描述生成方法
CN106844442A (zh) * 2016-12-16 2017-06-13 广东顺德中山大学卡内基梅隆大学国际联合研究院 基于fcn特征提取的多模态循环神经网络图像描述方法
CN107066973B (zh) * 2017-04-17 2020-07-21 杭州电子科技大学 一种利用时空注意力模型的视频内容描述方法

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070245400A1 (en) * 1998-11-06 2007-10-18 Seungyup Paek Video description system and method
CN105631468A (zh) * 2015-12-18 2016-06-01 华南理工大学 一种基于rnn的图片描述自动生成方法
CN105760507A (zh) * 2016-02-23 2016-07-13 复旦大学 基于深度学习的跨模态主题相关性建模方法
CN106777125A (zh) * 2016-12-16 2017-05-31 广东顺德中山大学卡内基梅隆大学国际联合研究院 一种基于神经网络及图像关注点的图像描述生成方法
CN106846306A (zh) * 2017-01-13 2017-06-13 重庆邮电大学 一种超声图像自动描述方法和系统
CN108305296A (zh) * 2017-08-30 2018-07-20 深圳市腾讯计算机系统有限公司 图像描述生成方法、模型训练方法、设备和存储介质

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008922A (zh) * 2019-04-12 2019-07-12 腾讯科技(深圳)有限公司 用于终端设备的图像处理方法、设备、装置、介质
CN112529104A (zh) * 2020-12-23 2021-03-19 东软睿驰汽车技术(沈阳)有限公司 一种车辆故障预测模型生成方法、故障预测方法及装置
CN112836754A (zh) * 2021-02-05 2021-05-25 方玉明 一种面向图像描述模型泛化能力评估方法
CN114417974A (zh) * 2021-12-22 2022-04-29 北京百度网讯科技有限公司 模型训练方法、信息处理方法、装置、电子设备和介质
CN114417974B (zh) * 2021-12-22 2023-06-20 北京百度网讯科技有限公司 模型训练方法、信息处理方法、装置、电子设备和介质
CN114743018A (zh) * 2022-04-21 2022-07-12 平安科技(深圳)有限公司 图像描述生成方法、装置、设备及介质
CN114743018B (zh) * 2022-04-21 2024-05-31 平安科技(深圳)有限公司 图像描述生成方法、装置、设备及介质
CN115019071A (zh) * 2022-05-19 2022-09-06 昆明理工大学 光学图像与sar图像匹配方法、装置、电子设备及介质
CN115019071B (zh) * 2022-05-19 2023-09-19 昆明理工大学 光学图像与sar图像匹配方法、装置、电子设备及介质

Also Published As

Publication number Publication date
CN110599557A (zh) 2019-12-20
TWI803514B (zh) 2023-06-01
US20220156518A1 (en) 2022-05-19
CN110599557B (zh) 2022-11-18
US11907851B2 (en) 2024-02-20
CN108305296A (zh) 2018-07-20
CN108305296B (zh) 2021-02-26
TW201843654A (zh) 2018-12-16
US11270160B2 (en) 2022-03-08
US20190377979A1 (en) 2019-12-12

Similar Documents

Publication Publication Date Title
WO2019042244A1 (zh) 图像描述生成方法、模型训练方法、设备和存储介质
KR102360659B1 (ko) 기계번역 방법, 장치, 컴퓨터 기기 및 기억매체
US10956771B2 (en) Image recognition method, terminal, and storage medium
WO2018107921A1 (zh) 回答语句确定方法及服务器
CN110334360B (zh) 机器翻译方法及装置、电子设备及存储介质
CN108228270B (zh) 启动资源加载方法及装置
WO2020147369A1 (zh) 自然语言处理方法、训练方法及数据处理设备
CN110827826B (zh) 语音转换文字方法、电子设备
WO2017088434A1 (zh) 人脸模型矩阵训练方法、装置及存储介质
CN111159338A (zh) 一种恶意文本的检测方法、装置、电子设备及存储介质
CN111090489B (zh) 一种信息控制方法及电子设备
KR20200106703A (ko) 사용자 선택 기반의 정보를 제공하는 방법 및 장치
CN115237618A (zh) 请求处理方法、装置、计算机设备及可读存储介质
CN112488157A (zh) 一种对话状态追踪方法、装置、电子设备及存储介质
US20220287110A1 (en) Electronic device and method for connecting device thereof
US11308965B2 (en) Voice information processing method and apparatus, and terminal
CN109347721B (zh) 一种信息发送方法及终端设备
US20230186031A1 (en) Electronic device for providing voice recognition service using user data and operating method thereof
US20220319499A1 (en) Electronic device for processing user utterance and controlling method thereof
CN117216553A (zh) 推荐模型的预训练方法、调整方法、推荐方法及相关产品
CN114281929A (zh) 一种数据处理方法和相关装置
KR20230064504A (ko) 음성 인식 서비스를 제공하는 전자 장치 및 이의 동작 방법
CN112579734A (zh) 一种发音预测方法及电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18850300

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18850300

Country of ref document: EP

Kind code of ref document: A1