WO2020073700A1 - Training method and apparatus for image description model, and storage medium - Google Patents
Training method and apparatus for image description model, and storage medium
- Publication number: WO2020073700A1 (application PCT/CN2019/094891)
- Authority: WIPO (PCT)
- Prior art keywords: neural network, image, recursive, sentence, decoding
Classifications
- G06N 3/08 — Computing arrangements based on biological models; neural networks; learning methods
- G06N 3/088 — Non-supervised learning, e.g. competitive learning
- G06N 3/044 — Neural network architecture; recurrent networks, e.g. Hopfield networks
- G06N 3/045 — Combinations of networks
- G06N 3/047 — Probabilistic or stochastic networks
- G06F 18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
Description
- This application claims priority to Chinese Patent Application No. 201811167476.9, entitled "Training method and apparatus for image description model, and storage medium", filed with the China Patent Office on October 8, 2018, the entire contents of which are incorporated herein by reference.
- the present application relates to the field of artificial intelligence technology, and in particular, to an image description model training method, device, and storage medium.
- Image description (image captioning) refers to automatically generating a piece of descriptive text from an image, that is, "speaking by looking at the picture".
- In order to generate the descriptive text corresponding to an image, it is first necessary to detect the objects in the image, understand the relationships between the objects, and then express them in reasonable language.
- Image description technology can be used in image retrieval services to help visually impaired people understand images. It can also be used for image scene classification and automatic summary classification of images in user albums. Image description technology can also be used for teaching infants and young children, helping infants and young children learn to speak and recognize objects and behaviors in images.
- In some techniques, manually labeled image-sentence pairs can be used to train the image description model. Semi-supervised learning techniques can also be used, adding images and sentences without correspondence to the training process.
- Sentence data without correspondence can be used to train a language model, and a separate image set can be used to train an object recognition model. Domain adaptation methods can also be used to transfer paired image and sentence data from one data domain to another, using only images and sentences without correspondence in the target domain.
- Some embodiments of the present application provide an image description model training method, device, and storage medium, to avoid dependence on paired image samples and sentence samples, and to improve the accuracy of image description.
- An embodiment of the present application provides an image description model training method, which is executed by an electronic device.
- the image description model includes a convolutional encoding neural network and a recursive decoding neural network; the method includes: obtaining an image feature vector of an image sample through the convolutional encoding neural network; decoding the image feature vector through the recursive decoding neural network to obtain a sentence describing the image sample; determining the degree of matching between the decoded sentence and the image sample, and adjusting the recursive decoding neural network according to the degree of matching; and determining the smoothness of the decoded sentence, and adjusting the recursive decoding neural network according to the smoothness.
- An embodiment of the present application provides an apparatus for training an image description model.
- the image description model includes a convolutional encoding neural network and a recursive decoding neural network.
- the apparatus includes:
- an encoding module configured to obtain image feature vectors of image samples through the convolutional encoding neural network;
- a decoding module configured to decode the image feature vector through the recursive decoding neural network to obtain a sentence describing the image sample; and
- an adjustment module configured to determine the degree of matching between the decoded sentence and the image sample and adjust the recursive decoding neural network according to the degree of matching; and to determine the smoothness of the decoded sentence and adjust the recursive decoding neural network according to the smoothness.
- An embodiment of the present application also provides an electronic device.
- the electronic device includes: a processor; and
- a memory connected to the processor, where the memory stores machine-readable instructions executable by the processor to perform the steps in any image description model training method provided by the embodiments of the present application.
- An embodiment of the present application further provides a non-volatile computer-readable storage medium, wherein the storage medium stores machine-readable instructions, and the machine-readable instructions may be executed by a processor to complete the foregoing method.
- In the technical solutions provided by the embodiments of the present application, the recursive decoding neural network is trained and adjusted according to the smoothness of the sentences it decodes and the degree of matching between the decoded sentences and the image samples.
- In this way, paired image samples and sentence samples are not needed as the training set, which removes the dependence on paired image samples and sentence samples, expands the range of the training set, and improves the accuracy of the image description model.
- FIG. 1 is a schematic diagram of an operating environment in some embodiments of this application.
- FIGS. 2A and 2B are schematic structural diagrams of the model training device 116 in the embodiment of the present application.
- FIG. 3 is a flowchart of an image description model training method provided by some embodiments of the present application.
- FIGS. 4A and 4B are another flowchart of a training method of an image description model in some embodiments of this application.
- FIG. 5 is a schematic structural diagram of a recursive decoding neural network in some embodiments of this application.
- FIG. 6 is a schematic structural diagram of a recursive discriminant neural network in some embodiments of this application.
- FIG. 7 is a schematic structural diagram of an image description model training device in some embodiments of this application.
- FIG. 8 is another schematic structural diagram of an image description model training device in some embodiments of the present application.
- Deep learning is one of the technologies and research fields of machine learning; it implements artificial intelligence (AI) in computer systems by establishing artificial neural networks with a hierarchical structure.
- AI systems refer to computer systems that exhibit intelligent behavior.
- the functions of the AI system include learning, maintaining a large knowledge base, performing reasoning, applying analytical capabilities, discerning the relationship between facts, exchanging ideas with others, understanding the communication of others, and perceiving and understanding the situation.
- Unlike rule-based intelligent systems, AI systems enable machines to progress through their own learning and judgment.
- An AI system creates new knowledge by looking for previously unknown patterns in data, and drives solutions by learning patterns in the data.
- With continued use, the recognition rate of an AI system can improve, and the user's taste can be understood more accurately.
- In AI systems, neural networks are usually used. A neural network is a computer system designed, constructed, and configured to simulate the human nervous system.
- the neural network architecture consists of an input layer, an output layer, and one or more hidden layers.
- the input layer inputs data into the neural network.
- the output layer produces guess results.
- the hidden layers assist in information propagation. These systems learn to process tasks or make decisions by studying examples. A neural network is based on a collection of connected units called neurons (artificial neurons); each connection between neurons can transmit a signal to another neuron.
- the image description model based on deep learning may adopt an "encoding-decoding" process: first, a convolutional neural network (CNN) is used to extract image features and encode the entire image as a single feature vector of fixed dimension; then, a recurrent neural network (RNN) is used to decode that vector, generating the related words one by one in order.
- CNN is a feed-forward neural network that extracts features from an image layer by layer, starting from the pixel features at the bottom layer. It is a commonly used implementation model for the encoder and is responsible for encoding images into vectors.
- RNN is a neural network with fixed weights, external inputs, and internal states; it can be viewed as behavioral dynamics over its internal states, with the weights and external inputs as parameters. RNN is a commonly used implementation model for the decoder and is responsible for translating the image vector generated by the encoder into a textual description of the image.
- At present, both semi-supervised and domain-adaptation methods build on supervised learning, adding images and sentences without correspondence in order to improve results. These methods still require paired image and sentence data to participate in model training.
- The training set not only needs to be large enough, but also needs to be as diverse as possible. However, annotating images with corresponding sentence descriptions is a very time-consuming and labor-intensive process.
- Moreover, if the size of the training set is reduced, the accuracy of image description is also reduced.
- embodiments of the present application provide a training method for image description models, which can avoid dependence on paired image samples and sentence samples, expand the range of the training set, and thereby improve the accuracy of the image description model.
- the image description model training method provided by the embodiments of the present application may be executed by any computer device with data processing capabilities, for example, a terminal device or a server. After the training of the image description model is completed according to the method provided in the embodiments of the present application, the trained image description model may be applied in the server or terminal device to generate a corresponding description sentence for a specified image, for example, to provide users with image retrieval services, to automatically classify images in user albums, and so on.
- FIG. 1 is a schematic diagram of an operating environment 100 in some embodiments of this application. As shown in FIG. 1, the image description model training method of the embodiment of the present application may be executed by the model training device 116.
- the model training device 116 is used to train the image description model to obtain a trained image description model, and to provide the trained image description model to the server 112, so that the server 112 can provide image description generation services to the terminal device 104, such as image retrieval services for users; or, the trained image description model is provided to the terminal device 104, so that the terminal device 104 provides image description generation services for users, such as automatic classification of images in user albums.
- the model training device 116 may be implemented on one or more independent data processing devices or a distributed computer network, or may be integrated in the server 112 or the terminal device 104.
- when the server 112 provides an image description generation service for the terminal device 104, as shown in FIG. 1, multiple users connect to the server 112 via the network 106 through the applications 108-a to 108-c executed on their respective terminal devices (e.g., terminal devices 104-a to 104-c).
- the server 112 provides the image description generation service to the terminal device 104; for example, it receives an image provided by a user, generates a corresponding description sentence for the image according to the image description generation model in the server 112, and returns the generated description sentence to the terminal device 104.
- the terminal device 104 refers to a terminal device with data processing capabilities, including but not limited to a smartphone (with a communication module installed), a palmtop computer, a tablet computer, a smart TV, and so on.
- Operating systems are installed on these communication terminals, including but not limited to: the Android operating system, the Symbian operating system, the Windows Mobile operating system, the Apple iPhone OS operating system, and so on.
- the network 106 may include a local area network (LAN) and a wide area network (WAN) such as the Internet.
- the network 106 may be implemented using any well-known network protocol, including various wired or wireless protocols.
- the server 112 may be implemented on one or more independent data processing devices or distributed computer networks.
- the model training device 116 provided by the embodiment of the present application will be described below with reference to FIGS. 2A and 2B.
- FIG. 2A and 2B are schematic structural diagrams of a model training device 116 in some embodiments of this application.
- the model training device 116 includes a convolutional coding neural network 201 and a recursive decoding neural network 202.
- the convolutional coding neural network 201 is used to extract the image feature vector 214 for the input image sample 211.
- a CNN may be used as the convolutional coding neural network 201 to extract the image feature vector 214 of the image sample 211; for example, an Inception series network or a ResNet series network may be used.
- the convolutional coding neural network 201 may be a pre-trained coding network.
- the recursive decoding neural network 202 is used to decode the image feature vector 214 into a sentence 213 describing the image sample.
- the recursive decoding neural network 202 may be implemented by a Long Short-Term Memory (LSTM) network, or in other ways, for example, by a Gated Recurrent Unit (GRU).
- LSTM is a time-recursive neural network suitable for processing and predicting important events with relatively long intervals or delays in a time series; it is a special kind of RNN.
- the training of the recursive decoding neural network 202 is implemented in the following manner:
- In order to implement the training of the recursive decoding neural network 202, the model training device 116 further includes a recursive discriminant neural network 203 and an adjustment module 204.
- the recursive discriminant neural network 203 is used to judge the smoothness of each sentence output by the recursive decoding neural network 202 and to provide the judgment result to the adjustment module 204; the adjustment module 204 adjusts the recursive decoding neural network 202 according to the judgment result of the recursive discriminant neural network 203.
- the adjustment module 204 is used to adjust the recursive decoding neural network 202, including adjusting the recursive decoding neural network 202 according to the degree of matching between the decoded sentence 213 and the image sample 211, and according to the smoothness of the decoded sentence.
- the adjustment module 204 also performs training adjustment on the recursive discriminant neural network 203 so that the recursive discriminant neural network 203 can more accurately recognize the smoothness of the input sentence.
- the adjustment module 204 may include: a discrimination reward unit 2041, an object reward unit 2042, and a discrimination loss unit 2043.
- the discriminant reward unit 2041 and the object reward unit 2042 are used to adjust the recursive decoding neural network 202
- the discriminant loss unit 2043 is used to adjust the recursive discriminating neural network 203.
- the discriminating reward unit 2041 is configured to adjust the recursive decoding neural network 202 according to the smoothness of the decoded sentence determined by the recursive discriminating neural network 203.
- the smoothness may be a score for the sentences decoded by the recursive decoding neural network.
- the object reward unit 2042 is used to adjust the recursive decoding neural network 202 according to the matching degree between each word in the sentence decoded by the recursive decoding neural network 202 and the image sample 211.
- the objects contained in the image, and the score corresponding to each object, may be detected in advance by the object detection model. The object reward unit 2042 then compares the words contained in the decoded sentence with the objects detected by the object detection model to obtain the degree of matching between the decoded words and the image sample 211, and adjusts and optimizes the recursive decoding neural network 202 according to this matching degree.
- the discriminant loss unit 2043 is used to determine the discriminant loss according to the discrimination results of the recursive discriminant neural network 203 on the sentence samples 212 and on the sentences 213 obtained by the recursive decoding neural network 202, and to adjust the recursive discriminant neural network 203 according to the discriminant loss.
- In order for the recursive discriminant neural network 203 to discriminate as accurately as possible, it also needs to be trained with manually written sentence samples 212, so that when a manually written, fluent sentence sample 212 is input, the recursive discriminant neural network 203 outputs 1, and when a sentence decoded by the recursive decoding neural network 202 is input, it outputs a value between 0 and 1. The higher the output value, the higher the smoothness of the sentence, that is, the more fluent the sentence.
- Specifically, the sentence samples 212 and the sentences 213 decoded by the recursive decoding neural network 202 may be input to the recursive discriminant neural network 203; the recursive discriminant neural network 203 scores the sentence samples 212 and the sentences 213 to obtain a discriminant loss, and the recursive discriminant neural network is adjusted according to the obtained discriminant loss.
- the adjustment module 204 may further include: an image reconstruction reward unit 2044, a sentence reconstruction loss unit 2045, and an image reconstruction loss unit 2046.
- the image reconstruction reward unit 2044 is configured to reconstruct an image according to the decoded sentence, determine the similarity between the reconstructed image and the image sample 211, and optimize and adjust the recursive decoding neural network 202 according to the similarity.
- Specifically, the sentence feature vector corresponding to the decoded sentence can be obtained through the recursive discriminant neural network 203; the sentence feature vector is mapped to the image feature space to obtain a corresponding image feature vector, which is compared with the image feature vector 214 obtained by the convolutional coding neural network 201 to obtain the similarity.
- the image reconstruction reward unit 2044 may map the sentence feature vector to the image feature space through a fully connected layer, so as to obtain the image feature vector after reconstruction.
- the sentence reconstruction loss unit 2045 is used to obtain the sentence feature vector corresponding to the sentence sample 212, map it to the image feature space through the fully connected layer to obtain a corresponding image feature vector, input that image feature vector into the recursive decoding neural network 202 to reconstruct a sentence, compare the reconstructed sentence with the sentence sample 212, and optimize and adjust the recursive decoding neural network 202 according to the comparison result.
- the image reconstruction loss unit 2046 is configured to reconstruct an image according to the decoded sentence, determine the degree of difference between the reconstructed image and the image sample, and optimize and adjust the recursive discriminant neural network according to the degree of difference.
- Specifically, the image reconstruction loss unit 2046 can obtain the sentence feature vector corresponding to the decoded sentence through the recursive discriminant neural network 203, map the sentence feature vector to the image feature space to obtain a corresponding image feature vector, and compare the image feature vector obtained by mapping with the image feature vector 214 obtained by the convolutional coding neural network 201 to obtain the degree of difference between the reconstructed image and the image sample, that is, the image reconstruction loss; the recursive discriminant neural network 203 is then optimally adjusted according to the image reconstruction loss.
- the adjustment module 204 can optimize and adjust the recursive decoding neural network 202 and the recursive discrimination neural network 203, thereby improving the accuracy of the recursive decoding neural network 202 and the recursive discrimination neural network 203.
- FIG. 3 is a flowchart of an image description model training method provided by an embodiment of the present application.
- the image description model includes a convolutional coding neural network and a recursive decoding neural network. As shown in FIG. 3, the method includes the following steps:
- S301 Obtain an image feature vector of an image sample through the convolutional coding neural network.
- a pre-trained convolutional coding neural network may be used to obtain image feature vectors corresponding to image samples.
- the image feature vector may be further subjected to dimensionality reduction processing to obtain the dimensionality-reduced image feature vector.
- S302 Decode the image feature vector through the recursive decoding neural network to obtain a sentence describing the image sample.
- the dimensionality-reduced image feature vector may be input into the recursive decoding neural network, which decodes the dimensionality-reduced image feature vector to obtain the sentence describing the image sample.
- Specifically, the image feature vector is input into the recursive decoding neural network to obtain n output probability distributions, and for each probability distribution the word corresponding to the maximum probability value is selected from the word list; the selected words form a sentence describing the image sample.
- S303 Determine the matching degree between the sentence obtained by decoding and the image sample, and adjust the recursive decoding neural network according to the matching degree.
- each object contained in the image sample and the weight corresponding to each object may be determined according to the detection result of the object detection model on the image sample; a matching operation is then performed between each word in the decoded sentence and each object contained in the image sample, and the matching degree is determined according to the matching result and the weight corresponding to each object.
- S304 Determine the smoothness of the decoded sentence, and adjust the recursive decoding neural network according to the smoothness.
- the smoothness of the decoded sentence can be determined by various methods.
- the decoded sentence may be input to a recursive discriminant neural network, and the smoothness of the decoded sentence may be determined according to the first output of the recursive discriminant neural network at various times.
- the more fluent the sentence decoded by the recursive decoding neural network, the more the recursive discriminant neural network will consider the sentence to be a manually written fluent sentence, that is, the higher the smoothness.
- In order to improve the discrimination accuracy of the recursive discriminant neural network, the recursive discriminant neural network needs to be trained according to sentence samples.
- Specifically, sentence samples can be input to the recursive discriminant neural network to obtain the second output of the recursive discriminant neural network at various times; a discriminant loss is then determined according to the first output and the second output, and the recursive discriminant neural network is adjusted according to the discriminant loss.
- In the above technical solution, the recursive decoding neural network is trained and adjusted according to the smoothness of the sentences it decodes and the degree of matching between the decoded sentences and the image samples.
- In this way, paired image samples and sentence samples are not needed as the training set, which removes the dependence on paired image samples and sentence samples, expands the range of the training set, and improves the accuracy of the image description model.
- FIGS. 4A and 4B are a flowchart of another image description model training method in an embodiment of the present application.
- the training method of the image description model provided by the embodiment of the present application will be described below with reference to FIGS. 4A and 4B. As shown in the figures, the method includes the following steps:
- S401 Obtain an image feature vector of an image sample through a convolutional coding neural network.
- the image feature vector of the image sample may be obtained through the convolutional coding neural network 201 shown in FIGS. 2A and 2B.
- Here, the Inception-V4 convolutional neural network is taken as an example of the convolutional coding neural network for description.
- the vector output from the pooling layer (Average Pooling layer) of the Inception-V4 convolutional neural network may be used as the image feature vector.
- the operation of the convolutional coding neural network can be expressed by the following formula (1):
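- (The body of formula (1) did not survive the text extraction; reconstructed from the definitions below, it reads: f_im = CNN(I).)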
- I represents the image sample input to the convolutional coding neural network;
- CNN represents the convolutional coding neural network;
- f_im is the obtained image feature vector;
- the dimension of f_im may be 1536.
- S402 Perform dimensionality reduction processing on the image feature vectors of the image samples to obtain image feature vectors after dimensionality reduction.
- the image feature vector f_im may be subjected to dimensionality reduction processing.
- the dimensionality reduction process can be implemented through a fully connected layer, as shown in the following formula (2):
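- (The body of formula (2) is likewise missing; given the definitions below, a plausible reconstruction is: x_{-1} = FC(f_im).)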
- x_{-1} represents the image feature vector after dimensionality reduction;
- FC() represents the dimensionality reduction operation through the fully connected layer.
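As an illustrative sketch (not the patent's implementation), the encoding of formula (1) and the dimensionality reduction of formula (2) could be combined as follows in PyTorch. The ResNet-50 backbone (standing in for Inception-V4), the 2048-dimensional pooled width, and the target dimension of 512 are all assumptions for the example:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ConvEncoder(nn.Module):
    """Convolutional encoding network: image -> fixed-dimension feature vector."""
    def __init__(self, feat_dim=512):
        super().__init__()
        backbone = models.resnet50(weights=None)  # stand-in for Inception-V4
        # Keep everything up to and including the global average-pooling layer.
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        # Fully connected layer performing the dimensionality reduction of formula (2).
        self.fc = nn.Linear(2048, feat_dim)  # 2048 is ResNet-50's pooled width

    def forward(self, images):
        f_im = self.cnn(images).flatten(1)   # formula (1): f_im = CNN(I)
        x_minus1 = self.fc(f_im)             # formula (2): x_{-1} = FC(f_im)
        return x_minus1

encoder = ConvEncoder()
x_minus1 = encoder(torch.randn(2, 3, 224, 224))  # batch of two images
print(x_minus1.shape)  # torch.Size([2, 512])
```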
- S403 Input the dimensionality-reduced image feature vector into a recursive decoding neural network, and the recursive decoding neural network decodes the dimensionality-reduced image feature vector to obtain a sentence describing the image sample.
- step S403 includes:
- S4031 Input the image feature vector to the recursive decoding neural network to obtain n probability distributions of the output, where n is a positive integer, indicating the length of the sentence obtained by the decoding.
- the above operation of the LSTM unit may be expressed as the following formula (3):
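- (Formula (3) is not reproduced in this text; based on the variable definitions below, it can plausibly be reconstructed as: [p_{t+1}, h_{t+1}] = LSTM(x_t, h_t), for t = -1, 0, …, n-1.)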
- x_t represents the word vector corresponding to the t-th word; h_t and h_{t+1} represent the hidden states of the LSTM unit at time t and time t+1, respectively;
- p_{t+1} represents the probability distribution, that is, the probability corresponding to each word in the word list;
- n represents the length of the sentence.
- Since the recursive decoding neural network in the embodiment of the present application does not know the sentence corresponding to the image sample, the input of the recursive decoding neural network at each time step needs to be obtained from the word output at the previous time step, where x_t is the word vector corresponding to the word S_t output by the LSTM unit at time t, as shown in formula (4):
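- (Formula (4) is missing from this text; a reconstruction consistent with the definitions below is: x_t = W_e · S_t, where S_t is assumed to be the one-hot representation of the word output at time t.)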
- W_e represents the word embedding matrix;
- a word vector is a vector used to map words or phrases from the word list to real numbers.
- In step S4031, according to the probability distribution p_{t+1} output by the LSTM unit, the word corresponding to the maximum probability value in p_{t+1} is selected from the word list as the output word at time t+1.
- In this way, the recursive decoding neural network can obtain a sentence of length n: S_1 S_2 … S_n.
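A minimal sketch of this greedy decoding loop, with toy dimensions, randomly initialized weights, and illustrative names (`greedy_decode`, `to_vocab`); note how the image feature is fed at t = -1 and each chosen word is fed back as the next input:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 512, 512
W_e = nn.Embedding(vocab_size, embed_dim)        # word embedding, formula (4)
lstm = nn.LSTMCell(embed_dim, hidden_dim)        # decoder LSTM unit
to_vocab = nn.Linear(hidden_dim, vocab_size)     # hidden state -> word scores

def greedy_decode(x_minus1, max_len=20, end_token=1):
    """Decode a sentence without ground truth: feed the image feature at t=-1,
    then feed each selected word back in as the next input."""
    h = torch.zeros(1, hidden_dim)
    c = torch.zeros(1, hidden_dim)
    x = x_minus1                                  # first input is x_{-1}
    words = []
    for _ in range(max_len):
        h, c = lstm(x, (h, c))                    # formula (3)
        p = torch.softmax(to_vocab(h), dim=-1)    # probability distribution p_{t+1}
        word = p.argmax(dim=-1)                   # greedy: maximum-probability word
        if word.item() == end_token:
            break
        words.append(word.item())
        x = W_e(word)                             # formula (4): x_t = W_e S_t
    return words

sentence = greedy_decode(torch.randn(1, embed_dim))
print(sentence)  # word-list indices S_1 ... S_n
```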
- the sentence contains multiple words; each time the recursive decoding neural network decodes a word, the degree of matching between that word and the image sample can be determined, that is, whether the image sample contains the object corresponding to the word.
- step S404 may include the following operations:
- S4041 Determine each object included in the image sample and the weight corresponding to each object according to the detection result of the image sample by the object detection model.
- the object detection model may be used to detect the image sample in advance to obtain the object contained in the image and the score (weight) corresponding to the object.
- S4042 Perform a matching operation on each word in the decoded sentence and each object included in the image sample, and determine the matching degree according to the matching result and the weight corresponding to each object.
- the matching operation can be performed according to the following formula (5) and the degree of association between the words obtained by the recursive decoding neural network and the image samples is determined:
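- (The body of formula (5) is missing; from the definitions below it can be reconstructed as: r(s_t) = Σ_{i=1..N_c} I(s_t = c_i) · v_i, where the symbol r(s_t) for the matching degree is an assumption.)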
- s_t represents a word decoded by the recursive decoding neural network;
- c_i represents an object detected by the object detection model from the image sample;
- v_i is the weight corresponding to this object;
- N_c represents the number of objects detected by the object detection model from the image sample;
- r(s_t) represents the degree of matching between the word s_t and the image sample;
- I() represents the indicator function.
- The matching degree can be used as an object recognition reward and fed back to the recursive decoding neural network, thereby optimizing the recursive decoding neural network.
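A small illustrative sketch of the object reward of formula (5); the detection results and the dictionary-based matching are assumptions for the example:

```python
def object_reward(decoded_words, detections):
    """Formula (5) sketch: reward each decoded word s_t with the confidence
    v_i of a detected object c_i when s_t names that object, else 0.

    detections: dict mapping detected object name c_i -> confidence v_i.
    """
    return [detections.get(word, 0.0) for word in decoded_words]

# Example: the detector found a dog (0.92) and a ball (0.65) in the image.
detections = {"dog": 0.92, "ball": 0.65}
sentence = ["a", "dog", "plays", "with", "a", "ball"]
print(object_reward(sentence, detections))
# [0.0, 0.92, 0.0, 0.0, 0.0, 0.65]
```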
- S405 Adjust the recursive decoding neural network according to the matching degree.
- the matching degree may be fed back to the recursive decoding neural network, whereby the weight parameters in the recursive decoding neural network and the word embedding matrix W_e are corrected.
- the LSTM unit has three gates: forget gate, input gate, and output gate.
- the forget gate determines how much the state of the unit at the previous moment is retained until the current moment.
- the input gate determines how much of the network input at the current time is saved to the unit state at the current time.
- the output gate determines how much of the current unit state is output as the current output value of the LSTM.
- a gate is actually a fully connected layer; its input is a vector and its output is a real-valued vector between 0 and 1.
- when the gate output is 0, any vector multiplied by it yields a zero vector, which is equivalent to letting nothing pass; when the output is 1, multiplying any vector by it leaves the vector unchanged, which is equivalent to letting everything pass.
- the matching degree calculated in step S404 can be fed back to the recursive decoding neural network for adjusting the weight matrix of each gate, so as to modify and adjust the recursive decoding neural network.
- the word embedding matrix W_e is also updated according to the matching degree.
- S406 Input the decoded sentence into a recursive discriminant neural network, determine the smoothness of the decoded sentence according to the first output of the recursive discriminant neural network at various times, and adjust the recursive decoding neural network according to the smoothness.
- Specifically, the decoded sentence is input to the recursive discriminant neural network, and the smoothness of the decoded sentence is determined according to the first output of the recursive discriminant neural network at various times.
- FIG. 6 is a schematic diagram of a recursive discriminant neural network provided by an embodiment of this application.
- the LSTM unit is used as an example to implement the recursive discriminant neural network.
- the recursive discriminant neural network can also be implemented in other ways, for example, using a GRU unit.
- the operation of the LSTM unit in the recursive discriminant neural network can be as shown in the following formula (6):
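- (Formula (6) is not reproduced in this text; based on the definitions below, it can be reconstructed as: [q_t, h_t] = LSTM(x_t, h_{t-1}).)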
- the input of the LSTM unit at time t includes x_t and h_{t-1}; the output includes q_t and h_t;
- x_t represents the word vector corresponding to the t-th word; h_{t-1} represents the hidden state of the LSTM unit at time t-1;
- q_t represents the score of the smoothness of the sentence at time t; h_t represents the hidden state of the LSTM unit at time t.
- q t may be a value between 0 and 1.
- When a manually written, fluent sentence is input, the recursive discriminant neural network outputs 1; when a sentence generated by the recursive decoding neural network is input, the recursive discriminant neural network outputs a value between 0 and 1.
- the goal of the recursive decoding neural network is to fool the recursive discriminant neural network by trying to generate sentences that cause the recursive discriminant neural network to output 1, so that the generated sentences look fluent.
- the smoothness of the decoded sentence may be determined according to the following formula (7).
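- (The body of formula (7) is missing; a plausible reconstruction, consistent with the description of the smoothness r_adv in the device embodiment below, is: r_adv = (1/n) Σ_{t=1..n} log q_t.)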
- n is a positive integer, representing the length of the sentence obtained by decoding.
- After the smoothness is obtained, it may be fed back to the recursive decoding neural network, so that the recursive decoding neural network adjusts its parameters and the word embedding matrix W_e, thereby improving the accuracy of the recursive decoding neural network.
- Specifically, the smoothness may be used to adjust the weight matrix of each gate in the recursive decoding neural network and the word embedding matrix W_e, thereby correcting and adjusting the recursive decoding neural network.
- In order to improve the discrimination accuracy of the recursive discriminant neural network, the method may further include step S407 of training the recursive discriminant neural network according to sentence samples. Step S407 may include the following operations:
- S4071 Determine a discriminant loss according to the first output and the second output of the recursive discriminant neural network at various times.
- the discriminant loss can be calculated according to the following formula (8):
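- (Formula (8) is missing from this text. Writing q̂_t for the second output at time t, obtained on a sentence sample, and q_t for the first output, obtained on a decoded sentence, a reconstruction consistent with the definitions below is: L_adv = -(1/l) Σ_{t=1..l} log q̂_t - (1/n) Σ_{t=1..n} log(1 - q_t); the symbol q̂_t is an assumption.)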
- L_adv represents the discriminant loss;
- q_t represents the first output of the recursive discriminant neural network at time t;
- l is a positive integer indicating the length of the sentence sample;
- n is a positive integer indicating the length of the decoded sentence.
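A minimal numerical sketch of this discriminant loss, under the reconstruction of formula (8) given above (the per-time-step outputs are made-up example values):

```python
import numpy as np

def discriminant_loss(q_real, q_fake):
    """Formula (8) sketch: train the discriminator to output values near 1
    for real sentence samples (second output, length l) and near 0 for
    decoded sentences (first output, length n)."""
    q_real = np.asarray(q_real)   # per-time-step outputs on a sentence sample
    q_fake = np.asarray(q_fake)   # per-time-step outputs on a decoded sentence
    return -(np.mean(np.log(q_real)) + np.mean(np.log(1.0 - q_fake)))

print(discriminant_loss([0.9, 0.8, 0.95], [0.2, 0.1, 0.3, 0.25]))
```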
- the discriminant loss is used to adjust the parameters of the recursive discriminant neural network.
- Specifically, the discriminant loss can be fed back to the recursive discriminant neural network to adjust the weight matrix of each gate of the LSTM unit in the recursive discriminant neural network, thereby adjusting and correcting the recursive discriminant neural network.
- Through the above steps, the parameters of the recursive decoding neural network are corrected according to the smoothness of the decoded sentence and its degree of matching with the image samples.
- The recursive discriminant neural network, which supervises the learning of the recursive decoding neural network, is also corrected and adjusted.
- the method may further include any one or more of the following steps S408 to S410:
- S408 Reconstruct the image according to the sentence obtained by decoding, determine the similarity between the reconstructed image and the image sample, and optimize the recursive decoding neural network according to the similarity.
- the step S408 includes the following operations:
- S4081 Obtain a sentence feature vector corresponding to the decoded sentence through the recursive discrimination neural network.
- a fully connected layer can be added at the end of the recursive discriminant neural network. This fully connected layer can map the sentence feature vector to the image feature space, thereby obtaining the image feature vector corresponding to the reconstructed image.
- the similarity between the reconstructed image and the image sample can be calculated according to the following formula (9):
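- (The body of formula (9) is missing; given the definitions below, a plausible reconstruction is: r_im = -||x' - x_{-1}||^2, i.e., the negative squared Euclidean distance between the two global feature representations.)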
- r_im represents the similarity between the reconstructed image and the image sample;
- x_{-1} represents the global feature representation of the image sample;
- x' represents the global feature representation of the reconstructed image.
- S4084 Modify and adjust the recursive decoding neural network according to the similarity.
- the correction adjustment performed on the recursive decoding neural network is similar to step S405.
- the calculated similarity r_im can be fed back to the recursive decoding neural network, so as to adjust the weight matrix of each gate and the word embedding matrix in the recursive decoding neural network, thereby improving the accuracy of the recursive decoding neural network.
- S409 Reconstruct the image according to the sentence obtained by the decoding, determine the degree of difference between the reconstructed image and the image sample, and modify and adjust the recursive discriminant neural network according to the degree of difference.
- the degree of difference between the reconstructed image and the image sample can be determined according to the following formula (10):
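- (Formula (10) is likewise missing; by analogy with formula (9), it can be reconstructed as: L_im = ||x' - x_{-1}||^2, the squared Euclidean distance between the reconstructed image features and the image sample features.)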
- the degree of difference can be fed back to the recursive discriminant neural network to adjust the weight matrix of each gate in the recursive discriminant neural network, thereby improving the precision of the recursive discriminant neural network.
- S410 Obtain a reconstructed sentence corresponding to the sentence sample through the recursive decoding neural network, compare the sentence sample with the reconstructed sentence, and modify and adjust the recursive decoding neural network according to the comparison result.
- the step S410 includes the following operations:
- S4101 Acquire a sentence feature vector corresponding to a sentence sample, map the sentence feature vector to an image feature space, and obtain a corresponding image feature vector.
- S4102 Use the recursive decoding neural network to obtain a reconstructed sentence corresponding to the image feature vector obtained by the mapping.
- S4103 Compare the sentence sample with the reconstructed sentence, and modify and adjust the recursive decoding neural network according to the comparison result.
- the comparison between the sentence sample and the reconstructed sentence can be achieved by the following formula (11):
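- (The body of formula (11) is missing; assuming a standard cross-entropy reconstruction loss, it can be reconstructed as: L_sen = -Σ_{t=1..l} log p(s_t | s_1, …, s_{t-1}), where p is the word probability produced by the recursive decoding neural network.)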
- L_sen represents the sentence reconstruction loss;
- s_t represents the t-th word in the sentence obtained by decoding.
- After the reconstruction loss L_sen is obtained, it can be fed back to the recursive decoding neural network to adjust the weight matrix of each gate and the word embedding matrix in the recursive decoding neural network.
- the operation of reconstructing the sentence is similar to that of a denoising autoencoder.
- Specifically, two kinds of noise can be added to the input sentence: randomly removing some words, and randomly shuffling the order of some words; the loss of the denoising autoencoder can be as shown in the above formula (11).
- In the embodiments of the present application, the recursive decoding neural network is trained and adjusted according to the smoothness of the sentences it decodes and the degree of matching between the decoded sentences and the image samples.
- In this way, paired image samples and sentence samples are not needed as the training set, which removes the dependence on paired image samples and sentence samples, expands the range of the training set, and improves the accuracy of the image description model.
- Through the reconstruction losses and rewards described above, the recursive decoding neural network and the recursive discriminant neural network are further corrected and adjusted, thereby further improving the accuracy of the image description model.
- Beam search is a heuristic graph search algorithm, usually used when the solution space of the graph is relatively large. In order to reduce the space and time occupied by the search, at each step of depth expansion some lower-quality nodes are pruned and some higher-quality nodes are kept; a sketch of the procedure is given below.
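A generic, self-contained sketch of beam search over a per-step scoring callback; `step_log_probs` is a hypothetical stand-in for one decoder step, and the toy step function exists only to make the example runnable:

```python
import math

def beam_search(step_log_probs, beam_width=3, max_len=10, end_token=1):
    """At each depth, expand every kept partial sentence, then prune back
    to the beam_width highest-scoring candidates.

    step_log_probs(prefix) -> list of (word, log_prob) continuations.
    """
    beams = [([], 0.0)]                       # (word sequence, total log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for word, lp in step_log_probs(seq):
                if word == end_token:
                    finished.append((seq, score + lp))
                else:
                    candidates.append((seq + [word], score + lp))
        # Prune: keep only the beam_width best partial sentences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if not beams:
            break
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])

# Toy step function: always offers the same three continuations.
def toy_step(seq):
    return [(2, math.log(0.5)), (3, math.log(0.3)), (1, math.log(0.2))]

print(beam_search(toy_step))
```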
- As shown in FIG. 7, the image description model training device 700 includes:
- an encoding module 701, configured to obtain image feature vectors of image samples through the convolutional encoding neural network;
- a decoding module 702, configured to decode the image feature vector through the recursive decoding neural network to obtain a sentence describing the image sample; and
- an adjustment module 703, configured to determine the matching degree between the decoded sentence and the image sample and adjust the recursive decoding neural network according to the matching degree; and to determine the smoothness of the decoded sentence and adjust the recursive decoding neural network according to the smoothness.
- the apparatus 700 further includes:
- the dimension reduction module 704 is configured to perform dimension reduction processing on the image feature vector after the encoding module 701 obtains the image feature vector of the image sample to obtain the image feature vector after dimension reduction;
- the decoding module 702 is further configured to input the reduced-dimensional image feature vector into the recursive decoding neural network, which decodes the reduced-dimensional image feature vector to obtain the sentence describing the image sample.
- the sentence contains multiple words; the decoding module 702 is further used to:
- input the image feature vector into the recursive decoding neural network to obtain n probability distributions, and select, from the word list, the word corresponding to the maximum probability value in each probability distribution to form a sentence describing the image sample.
- the adjustment module 703 further includes an object reward unit 7031, configured to: determine each object contained in the image sample and the weight corresponding to each object according to the detection result of the object detection model on the image sample; perform a matching operation between each word in the decoded sentence and each object contained in the image sample; determine the matching degree according to the matching result and the weight corresponding to each object; and adjust the recursive decoding neural network according to the matching degree.
- the adjustment module 703 further includes: a discrimination reward unit 7032, which is used to:
- input the decoded sentence into a recursive discriminant neural network, determine the smoothness of the decoded sentence according to the first output of the recursive discriminant neural network at each time, and adjust the recursive decoding neural network according to the smoothness.
- the discrimination reward unit 7032 is further configured to determine the smoothness according to the above formula (7), where:
- r_adv represents the smoothness;
- q_t represents the first output of the recursive discriminant neural network at time t;
- n is a positive integer representing the length of the decoded sentence.
- the adjustment module 703 further includes a discriminant loss unit 7033, configured to: input sentence samples into the recursive discriminant neural network to obtain the second output of the recursive discriminant neural network at various times; determine a discriminant loss according to the first output and the second output; and adjust the recursive discriminant neural network according to the discriminant loss.
- the discriminant loss unit 7033 is further configured to determine the discriminant loss according to the above formula (8), where:
- L_adv represents the discriminant loss;
- q_t represents the first output of the recursive discriminant neural network at time t;
- l is a positive integer representing the length of the sentence sample;
- n is a positive integer representing the length of the decoded sentence.
- the adjustment module 703 further includes an image reconstruction reward unit 7034, configured to: reconstruct an image according to the decoded sentence; determine the similarity between the reconstructed image and the image sample; and adjust the recursive decoding neural network according to the similarity.
- the image reconstruction reward unit 7034 is further configured to: obtain the sentence feature vector corresponding to the decoded sentence through the recursive discriminant neural network; map the sentence feature vector to the image feature space to obtain a corresponding image feature vector; and compare the image feature vector obtained by mapping with the image feature vector of the image sample to determine the similarity between the reconstructed image and the image sample.
- the similarity r_im between the reconstructed image and the image sample can be calculated according to the above formula (9). The calculated similarity r_im can then be fed back to the recursive decoding neural network, so as to adjust the weight matrix of each gate and the word embedding matrix in the recursive decoding neural network, thereby improving the accuracy of the recursive decoding neural network.
- the adjustment module 703 further includes an image reconstruction loss unit 7035, configured to: reconstruct an image according to the decoded sentence; determine the degree of difference between the reconstructed image and the image sample; and adjust the recursive discriminant neural network according to the degree of difference.
- the degree of difference L_im can be calculated using the above formula (10). After the degree of difference is calculated, it can be fed back to the recursive discriminant neural network, so as to adjust the weight matrix of each gate in the recursive discriminant neural network, thereby improving the precision of the recursive discriminant neural network.
- the adjustment module 703 further includes a sentence reconstruction loss unit 7036, configured to: obtain, through the recursive decoding neural network, a reconstructed sentence corresponding to the sentence sample; compare the sentence sample with the reconstructed sentence; and correct and adjust the recursive decoding neural network according to the comparison result.
- Specifically, the sentence reconstruction loss unit 7036 may obtain the sentence feature vector corresponding to the sentence sample, map the sentence feature vector to the image feature space to obtain a corresponding image feature vector, obtain through the recursive decoding neural network a reconstructed sentence corresponding to the mapped image feature vector, compare the sentence sample with the reconstructed sentence, and adjust the recursive decoding neural network according to the comparison result.
- the sentence reconstruction loss unit 7036 may obtain the sentence reconstruction loss L_sen according to the above formula (11). After the reconstruction loss L_sen is obtained, it can be fed back to the recursive decoding neural network to adjust the weight matrix of each gate and the word embedding matrix in the recursive decoding neural network.
- In the embodiments of the present application, the recursive decoding neural network is trained and adjusted according to the smoothness of the sentences it decodes and the degree of matching between the decoded sentences and the image samples.
- In this way, paired image samples and sentence samples are not needed as the training set, which removes the dependence on paired image samples and sentence samples, expands the range of the training set, and improves the accuracy of the image description model.
- In addition, the recursive decoding neural network and the recursive discriminant neural network are further corrected and adjusted, thereby further improving the accuracy of the image description model.
- FIG. 8 is another schematic structural diagram of an image description model training device in some embodiments of the present application.
- the image description model training device 800 may be the model training device 116 shown in FIG. 1, or may be a component integrated in the model training device 116.
- the image description model training device 800 includes one or more processors (CPU) 802, a network interface 804, a memory 806, and a communication bus 808 for interconnecting these components.
- the network interface 804 is used to implement a network connection between the image description model training apparatus 800 and an external device.
- the image description model training apparatus 800 may further include one or more output devices 812 (e.g., one or more visual displays) and/or one or more input devices 814 (e.g., a keyboard, a mouse, or other input controls).
- the memory 806 may be a high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random-access solid-state storage devices; or a non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid-state storage devices.
- the memory 806 includes:
- Operating system 816 including programs for handling various basic system services and for performing hardware-related tasks
- An image description model training system 818, which is used to: obtain image feature vectors of image samples through the convolutional coding neural network; decode the image feature vectors through the recursive decoding neural network to obtain sentences describing the image samples; determine the degree of matching between the decoded sentences and the image samples, and adjust the recursive decoding neural network according to the degree of matching; and determine the smoothness of the decoded sentences, and adjust the recursive decoding neural network according to the smoothness.
- the specific operations and functions of the image description model training system 818 may refer to the above method embodiments, and details are not described herein again.
- the functional modules in the embodiments of the present application may be integrated into one processing unit, or each module may exist alone physically, or two or more modules may be integrated into one unit.
- the above integrated unit can be implemented in the form of hardware or software function unit.
- the functional modules of the foregoing embodiments may be located in one terminal or network node, or may be distributed to multiple terminals or network nodes.
- each embodiment of the present application may be implemented by a data processing program executed by a data processing device such as a computer.
- the data processing program constitutes this application.
- the data processing program is usually stored in a storage medium and is executed by reading the program directly out of the storage medium, or by installing or copying the program into a storage device (such as a hard disk and/or memory) of the data processing device. Therefore, such a storage medium also constitutes the present application.
- Storage media can use any type of recording method, such as paper storage media (such as paper tape), magnetic storage media (such as floppy disks, hard disks, flash memory), optical storage media (such as CD-ROM), and magneto-optical storage media (such as MO).
- the present application also provides a storage medium in which a data processing program is stored, and the data processing program is used to perform any one of the embodiments of the above-mentioned methods of the present application.
- the program may be stored in a computer-readable storage medium, and the storage medium may include: Read only memory (ROM, Read Only Memory), random access memory (RAM, Random Access Memory), magnetic disk or optical disk, etc.
Abstract
An embodiment of the present application discloses a training method for an image description model, the image description model including a convolutional encoding neural network and a recursive decoding neural network. The method includes: obtaining an image feature vector of an image sample through the convolutional encoding neural network; decoding the image feature vector through the recursive decoding neural network to obtain a sentence describing the image sample; determining the degree of matching between the decoded sentence and the image sample, and adjusting the recursive decoding neural network according to the degree of matching; and determining the smoothness of the decoded sentence, and adjusting the recursive decoding neural network according to the smoothness.
Description
The present application claims priority to Chinese Patent Application No. 201811167476.9, entitled "Method, Apparatus, and Storage Medium for Training an Image Description Model", filed with the Chinese Patent Office on October 8, 2018, the entire contents of which are incorporated herein by reference.
The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, and a storage medium for training an image description model.
Image captioning refers to automatically generating a descriptive text for an image, i.e., "speaking about what is seen in a picture". To generate the descriptive text corresponding to an image, the objects in the image must first be detected and the relationships between them understood, and the result then expressed in reasonable language.
Image captioning technology can be used in image retrieval services, to help visually impaired people understand images, in image scene classification, and in the automatic summarization and categorization of images in user albums. It can also be used in early childhood education, to help young children learn to speak and to recognize the objects and behaviors in images.
In some technologies, manually annotated image-sentence pairs may be used to train an image description model. Semi-supervised learning may also be used, in which unpaired images and sentences are used during model training: sentence data without corresponding images can be used to train a language model, and a separate image set can be used to train an object recognition model. Domain adaptation methods may also be used to transfer paired image and sentence data from one data domain to another, with only unpaired images and sentences used in the target domain.
Summary
Some embodiments of the present application provide a method, an apparatus, and a storage medium for training an image description model, so as to avoid dependence on paired image samples and sentence samples and improve the accuracy of image captioning.
An embodiment of the present application provides a method for training an image description model, performed by an electronic device, the image description model including a convolutional encoding neural network and a recursive decoding neural network; the method includes:
obtaining an image feature vector of an image sample through the convolutional encoding neural network;
decoding the image feature vector through the recursive decoding neural network to obtain a sentence describing the image sample;
determining a degree of matching between the decoded sentence and the image sample, and adjusting the recursive decoding neural network according to the degree of matching; and
determining the fluency of the decoded sentence, and adjusting the recursive decoding neural network according to the fluency.
An embodiment of the present application provides an apparatus for training an image description model, the image description model including a convolutional encoding neural network and a recursive decoding neural network; the apparatus includes:
an encoding module, configured to obtain an image feature vector of an image sample through the convolutional encoding neural network;
a decoding module, configured to decode the image feature vector through the recursive decoding neural network to obtain a sentence describing the image sample; and
an adjustment module, configured to determine a degree of matching between the decoded sentence and the image sample and adjust the recursive decoding neural network according to the degree of matching, and to determine the fluency of the decoded sentence and adjust the recursive decoding neural network according to the fluency.
An embodiment of the present application further provides an electronic device, including:
a processor; and
a memory connected to the processor, the memory storing machine-readable instructions executable by the processor to perform the steps of any of the image description model training methods provided in the embodiments of the present application.
An embodiment of the present application further provides a non-volatile computer-readable storage medium storing machine-readable instructions executable by a processor to perform the above method.
In the technical solutions provided in the embodiments of the present application, the recursive decoding neural network is trained and adjusted according to the fluency of the sentence decoded by the recursive decoding neural network and the degree of matching between the decoded sentence and the image sample. In this way, paired image samples and sentence samples are not needed as a training set during the training of the image description model, which removes the dependence on paired image and sentence samples, expands the range of usable training sets, and improves the accuracy of the image description model.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Clearly, the accompanying drawings in the following description show merely some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from them without creative effort. In the drawings:
FIG. 1 is a schematic diagram of an operating environment in some embodiments of the present application;
FIG. 2A and FIG. 2B are schematic structural diagrams of a model training apparatus 116 in embodiments of the present application;
FIG. 3 is a flowchart of a method for training an image description model according to some embodiments of the present application;
FIG. 4A and FIG. 4B are another flowchart of a method for training an image description model in some embodiments of the present application;
FIG. 5 is a schematic structural diagram of a recursive decoding neural network in some embodiments of the present application;
FIG. 6 is a schematic structural diagram of a recursive discriminative neural network in some embodiments of the present application;
FIG. 7 is a schematic structural diagram of an image description model training apparatus in some embodiments of the present application; and
FIG. 8 is another schematic structural diagram of an image description model training apparatus in some embodiments of the present application.
To make the technical solutions and advantages of the present application clearer, the present application is further described in detail below with reference to the accompanying drawings and embodiments.
For brevity and clarity of description, the solutions of the present application are set forth below through several representative embodiments; not all implementations are shown herein. The many details in the embodiments are merely intended to help understand the solutions of the present application, whose implementation need not be limited to these details. To avoid unnecessarily obscuring the solutions of the present application, some implementations are not described in detail, and only an outline is given. Hereinafter, "including" means "including but not limited to", and "according to ..." means "at least according to ..., but not limited to only according to ...". "Including" in the specification and claims means including at least to some extent, and should be interpreted as meaning that, in addition to the features mentioned thereafter, other features may also be present.
At present, owing to the successful application of deep learning (DL) in the vision field, researchers have also introduced it into the field of image captioning, using neural machine translation methods to generate descriptive sentences.
Deep learning is one of the techniques and research fields of machine learning; it implements artificial intelligence (AI) in computer systems by building artificial neural networks with a hierarchical structure.
An artificial intelligence (AI) system is a computer system that exhibits intelligent behavior. The functions of an AI system include learning, maintaining a large knowledge base, performing reasoning, applying analytical capabilities, discerning relationships between facts, exchanging ideas with others, understanding the communication of others, and perceiving and understanding situations.
Unlike rule-based intelligent systems, an AI system can make a machine progress continuously through its own learning and judgment. An AI system creates new knowledge by finding previously unknown patterns in data, and drives solutions by learning data patterns. With continued use, the recognition rate of an AI system can increase, and it can understand user tastes more accurately.
Neural networks are commonly used in AI systems. A neural network is a computer system designed, constructed, and configured to simulate the human nervous system. A neural network architecture consists of an input layer, an output layer, and one or more hidden layers: the input layer feeds data into the neural network, the output layer produces the inferred result, and the hidden layers assist in propagating information. Such systems learn to handle tasks or make decisions by studying examples. A neural network, or artificial neural network, is based on a collection of connected units called neurons or artificial neurons, and each connection between neurons can transmit a signal to another neuron.
In some embodiments of the present application, a deep-learning-based image description model may adopt an "encode-decode" pipeline: a convolutional neural network (CNN) is first used to extract an image feature vector, encoding the entire image into a feature vector of fixed dimension; a recurrent neural network (RNN) is then used for decoding, generating the related words one by one in temporal order.
A CNN is a feed-forward neural network that extracts features from an image layer by layer, starting directly from the pixel features at the bottom of the image. It is a common implementation model for the encoder and is responsible for encoding an image into a vector.
An RNN is a neural network with fixed weights, external inputs, and an internal state; it can be regarded as behavioral dynamics over the internal state, with the weights and external inputs as parameters. An RNN is a common implementation model for the decoder and is responsible for translating the image vector produced by the encoder into a textual description of the image.
At present, both semi-supervised and domain-adaptation methods build on supervised learning by adding unpaired images and sentences to improve the results, and they still require paired image and sentence data to participate in model training. Annotating images with corresponding sentence descriptions is a very time-consuming and labor-intensive process, and the training set must be not only large enough but also as diverse as possible. Moreover, if the size of the training set decreases, the accuracy of image captioning also decreases.
To this end, embodiments of the present application provide a method for training an image description model that can avoid dependence on paired image samples and sentence samples and expand the range of usable training sets, thereby improving the accuracy of the image description model.
The image description model training method provided in the embodiments of the present application may be performed by any computer device with data processing capability, for example, a terminal device or a server. After the training of the image description model is completed according to the method provided in the embodiments of the present application, the trained image description model may be deployed on the server or the terminal device to generate a corresponding descriptive sentence for a specified image, for example, to provide users with an image retrieval service or to automatically classify the images in a user album.
FIG. 1 is a schematic diagram of an operating environment 100 in some embodiments of the present application. As shown in FIG. 1, the image description model training method of the embodiments of the present application may be performed by a model training apparatus 116.
In some embodiments of the present application, the model training apparatus 116 is configured to train the image description model to obtain a trained image description model, and to provide the trained image description model to a server 112, so that the server 112 provides an image caption generation service for terminal devices 104, for example, an image retrieval service for users; or to provide the trained image description model to a terminal device 104, so that the terminal device 104 provides an image caption generation service for users, for example, automatically classifying the images in a user album.
In some embodiments, the model training apparatus 116 may be implemented on one or more independent data processing apparatuses or on a distributed computer network, or may be integrated in the server 112 or the terminal device 104.
In some embodiments, when the server 112 provides the image caption generation service for the terminal devices 104, as shown in FIG. 1, multiple users connect to the server 112 over a network 106 through applications 108-a to 108-c running on their respective terminal devices (for example, terminal devices 104-a to 104-c). The server 112 provides the image caption generation service to the terminal devices 104, for example, receiving an image provided by a user, generating a corresponding descriptive sentence for the image according to the image caption generation model in the server 112, and returning the generated descriptive sentence to the terminal device 104.
In some embodiments, the terminal device 104 refers to a terminal device with data computing and processing capability, including but not limited to a smartphone (with a communication module installed), a palmtop computer, a tablet computer, a smart TV, and the like, each running an operating system, including but not limited to the Android operating system, the Symbian operating system, the Windows Mobile operating system, the Apple iPhone OS operating system, and the like.
In some embodiments, the network 106 may include a local area network (LAN) and a wide area network (WAN) such as the Internet. The network 106 may be implemented using any well-known network protocol, including various wired or wireless protocols.
In some embodiments, the server 112 may be implemented on one or more independent data processing apparatuses or on a distributed computer network.
The model training apparatus 116 provided in the embodiments of the present application is described below with reference to FIG. 2A and FIG. 2B.
FIG. 2A and FIG. 2B are schematic structural diagrams of the model training apparatus 116 in some embodiments of the present application. As shown in FIG. 2A, the model training apparatus 116 includes a convolutional encoding neural network 201 and a recursive decoding neural network 202.
The convolutional encoding neural network 201 is configured to extract an image feature vector 214 for an input image sample 211.
In some embodiments, a CNN may be used as the convolutional encoding neural network 201 to extract the image feature vector 214 of the image sample 211, for example, an Inception-X series network or a ResNet series network.
In some embodiments, the convolutional encoding neural network 201 may be a pre-trained encoding network.
The recursive decoding neural network 202 is configured to decode the image feature vector 214 into a sentence 213 describing the image sample. In some embodiments, the recursive decoding neural network 202 may be implemented by a long short-term memory network (LSTM), or in other ways, for example, with gated recurrent units (GRU).
An LSTM is a temporally recursive neural network, used for processing and predicting important events separated by relatively long intervals or delays in a time series; it is a special kind of RNN.
Since the image samples used in the embodiments of the present application have no corresponding sentence samples, the recursive decoding neural network 202 is trained in the following manner:
the recursive decoding neural network 202 is trained and adjusted according to the degree of matching between the sentence 213 decoded by the recursive decoding neural network 202 and the image sample 211; and
the recursive decoding neural network 202 is trained and adjusted according to the fluency of the decoded sentence 213.
Referring to FIG. 2B, in some embodiments, to train the recursive decoding neural network 202, the model training apparatus 116 further includes a recursive discriminative neural network 203 and an adjustment module 204.
The recursive discriminative neural network 203 is configured to, after the recursive decoding neural network 202 outputs a sentence, judge the fluency of the sentence and provide the judgment result to the adjustment module 204, which adjusts the recursive decoding neural network 202 according to the judgment result of the recursive discriminative neural network 203.
The adjustment module 204 is configured to adjust the recursive decoding neural network 202, including adjusting it according to the degree of matching between the decoded sentence 213 and the image sample 211 and according to the fluency of the decoded sentence 213. In addition, the adjustment module 204 also trains and adjusts the recursive discriminative neural network 203, so that the recursive discriminative neural network 203 can more accurately recognize the fluency of an input sentence.
In some embodiments, the adjustment module 204 may include a discrimination reward unit 2041, an object reward unit 2042, and a discrimination loss unit 2043. The discrimination reward unit 2041 and the object reward unit 2042 are used to adjust the recursive decoding neural network 202, while the discrimination loss unit 2043 is used to adjust the recursive discriminative neural network 203.
The discrimination reward unit 2041 is configured to adjust the recursive decoding neural network 202 according to the fluency of the decoded sentence as determined by the recursive discriminative neural network 203.
In some embodiments, the fluency may be a score assigned to the sentence decoded by the recursive decoding neural network.
The object reward unit 2042 is configured to adjust the recursive decoding neural network 202 according to the degree of matching between each word in the sentence decoded by the recursive decoding neural network 202 and the image sample 211.
In some embodiments, for the image sample 211, the objects contained in the image and the score corresponding to each object may be detected in advance by an object detection model. The object reward unit 2042 then compares the words contained in the decoded sentence with the objects detected by the object detection model to obtain the degree of matching between the decoded words and the image sample 211, and adjusts and optimizes the recursive decoding neural network 202 according to this degree of matching.
The discrimination loss unit 2043 is configured to determine a discrimination loss according to the judgment results of the recursive discriminative neural network 203 on the sentence sample 212 and on the sentence 213 decoded by the recursive decoding neural network 202, and to adjust the recursive discriminative neural network 203 according to the discrimination loss.
In some embodiments of the present application, to enable the recursive discriminative neural network 203 to discriminate as accurately as possible, manually written sentence samples 212 are also used to train the recursive discriminative neural network 203, so that when a very fluent, manually written sentence sample 212 is input, the recursive discriminative neural network 203 outputs 1, and when a sentence generated by the recursive decoding neural network 202 is input, it outputs a value between 0 and 1. The higher the output value, the higher the fluency of the sentence, i.e., the more fluent the sentence.
In some embodiments, the sentence sample 212 and the sentence 213 decoded by the recursive decoding neural network 202 may be input into the recursive discriminative neural network 203 in turn, the discrimination loss may be obtained according to the scores given by the recursive discriminative neural network 203 to the sentence sample 212 and the sentence 213, and the recursive discriminative neural network may be adjusted according to the obtained discrimination loss.
In some embodiments, in addition to the discrimination reward unit 2041, the object reward unit 2042, and the discrimination loss unit 2043 described above, to further improve the accuracy of the image description model, the adjustment module 204 may further include an image reconstruction reward unit 2044, a sentence reconstruction loss unit 2045, and an image reconstruction loss unit 2046.
The image reconstruction reward unit 2044 is configured to reconstruct an image from the decoded sentence, determine the similarity between the reconstructed image and the image sample 211, and optimize and adjust the recursive decoding neural network 202 according to the similarity.
For example, a sentence feature vector corresponding to the decoded sentence may be obtained through the recursive discriminative neural network 203, mapped into the image feature space to obtain a corresponding image feature vector, and compared with the image feature vector 214 obtained by the convolutional encoding neural network 201 to obtain the similarity.
In some embodiments, the image reconstruction reward unit 2044 may map the sentence feature vector into the image feature space through a fully connected layer, thereby obtaining the reconstructed image feature vector.
The sentence reconstruction loss unit 2045 is configured to obtain a sentence feature vector corresponding to the sentence sample 212, map it into the image feature space through a fully connected layer to obtain a corresponding image feature vector, input the image feature vector into the recursive decoding neural network 202 to reconstruct a sentence, compare the reconstructed sentence with the sentence sample 212, and optimize and adjust the recursive decoding neural network 202 according to the comparison result.
The image reconstruction loss unit 2046 is configured to reconstruct an image from the decoded sentence, determine the degree of difference between the reconstructed image and the image sample, and optimize and adjust the recursive discriminative neural network according to the degree of difference.
In some embodiments, the image reconstruction loss unit 2046 may obtain, through the recursive discriminative neural network 203, a sentence feature vector corresponding to the decoded sentence; map the sentence feature vector into the image feature space to obtain a corresponding image feature vector; and compare the image feature vector obtained through the mapping with the image feature vector 214 obtained by the convolutional encoding neural network 201, to obtain the degree of difference between the reconstructed image and the image sample, i.e., the image reconstruction loss, and then optimize and adjust the recursive discriminative neural network 203 according to the image reconstruction loss.
Through the above units, the adjustment module 204 can optimize and adjust the recursive decoding neural network 202 and the recursive discriminative neural network 203, thereby improving the accuracy of both.
The image description model training method provided in the embodiments of the present application is described below with reference to FIG. 3. FIG. 3 is a flowchart of a method for training an image description model according to an embodiment of the present application. The image description model includes a convolutional encoding neural network and a recursive decoding neural network. As shown in FIG. 3, the method includes the following steps:
S301: Obtain an image feature vector of an image sample through the convolutional encoding neural network.
In some embodiments, a pre-trained convolutional encoding neural network may be used to obtain the image feature vector corresponding to the image sample. In addition, after the image feature vector of the image sample is obtained, dimension reduction may further be performed on the image feature vector to obtain a dimension-reduced image feature vector.
S302: Decode the image feature vector through the recursive decoding neural network to obtain a sentence describing the image sample.
In some embodiments, the dimension-reduced image feature vector may be input into the recursive decoding neural network, and the recursive decoding neural network decodes the dimension-reduced image feature vector to obtain the sentence describing the image sample.
In some embodiments, the sentence contains multiple words, and decoding the image feature vector through the recursive decoding neural network to obtain the sentence describing the image sample includes:
inputting the image feature vector into the recursive decoding neural network to obtain n output probability distributions, where n is a positive integer representing the length of the decoded sentence; and
for each probability distribution, selecting from the vocabulary the word corresponding to the maximum probability value in the probability distribution, to compose the sentence describing the image sample.
S303: Determine the degree of matching between the decoded sentence and the image sample, and adjust the recursive decoding neural network according to the degree of matching.
In some embodiments, the objects contained in the image sample and the weights corresponding to the objects may be determined according to the detection result of an object detection model on the image sample;
each word contained in the decoded sentence is matched against the objects contained in the image sample, the degree of matching between the decoded sentence and the image sample is determined according to the matching result and the weights corresponding to the objects, and the recursive decoding neural network is then adjusted according to the degree of matching.
S304: Determine the fluency of the decoded sentence, and adjust the recursive decoding neural network according to the fluency.
In some embodiments, the fluency of the decoded sentence may be determined in various ways. For example, the decoded sentence may be input into a recursive discriminative neural network, and the fluency of the decoded sentence may be determined according to the first outputs of the recursive discriminative neural network at the respective time steps.
For example, the more fluent the sentence decoded by the recursive decoding neural network, the more the recursive discriminative neural network will regard the sentence as a fluent, manually written one, i.e., the higher the fluency.
In some embodiments, to improve the discrimination accuracy of the recursive discriminative neural network, the recursive discriminative neural network needs to be trained with sentence samples.
For example, a sentence sample may be input into the recursive discriminative neural network to obtain the second outputs of the recursive discriminative neural network at the respective time steps;
the recursive discriminative neural network is then adjusted according to the first outputs and the second outputs of the recursive discriminative neural network at the respective time steps.
In the technical solutions provided in the embodiments of the present application, the recursive decoding neural network is trained and adjusted according to the fluency of the sentence decoded by the recursive decoding neural network and the degree of matching between the decoded sentence and the image sample. In this way, paired image samples and sentence samples are not needed as a training set during the training of the image description model, which removes the dependence on paired image and sentence samples, expands the range of usable training sets, and improves the accuracy of the image description model.
FIG. 4 is a flowchart of another method for training an image description model in an embodiment of the present application. The method is described below with reference to FIG. 4. As shown in FIG. 4, the method includes the following steps:
S401: Obtain an image feature vector of an image sample through the convolutional encoding neural network.
In some embodiments of the present application, the image feature vector of the image sample may be obtained through the convolutional encoding neural network 201 shown in FIG. 2.
The following description takes a convolutional encoding neural network implemented as an Inception-V4 convolutional neural network as an example.
In some embodiments, the vector output by the average pooling layer of the Inception-V4 convolutional neural network may be taken as the image feature vector. The operation of the convolutional encoding neural network can be expressed by the following Equation (1):

$$f_{im} = \mathrm{CNN}(I) \quad (1)$$

where $I$ denotes the image sample input into the convolutional encoding neural network, $\mathrm{CNN}$ denotes the convolutional encoding neural network, and $f_{im}$ is the resulting image feature vector; the dimension of $f_{im}$ may be 1536.
S402: Perform dimension reduction on the image feature vector of the image sample to obtain a dimension-reduced image feature vector.
In some embodiments, after the image feature vector $f_{im}$ corresponding to the image sample is obtained, dimension reduction may be performed on $f_{im}$ in view of the computational cost of subsequent operations.
For example, after the convolutional encoding neural network, the dimension reduction may be implemented by a fully connected layer, as expressed by the following Equation (2):

$$x_{-1} = \mathrm{FC}(f_{im}) \quad (2)$$

where $x_{-1}$ denotes the dimension-reduced image feature vector and $\mathrm{FC}(\cdot)$ denotes the dimension-reduction operation performed by the fully connected layer. An illustrative sketch of this encoding and dimension-reduction pipeline is given below.
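As a rough illustration of Equations (1) and (2), the following PyTorch sketch builds a convolutional encoder and a fully connected dimension-reduction layer. It is a sketch under assumptions, not the patent's implementation: torchvision ships Inception-V3 rather than Inception-V4 (its pooled feature is 2048-dimensional instead of the 1536 mentioned above), and the 512-dimensional reduced feature is likewise an assumed choice.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Stand-in encoder: Inception-V3 with its classifier head removed, so the
# forward pass returns the average-pooled feature f_im of Eq. (1).
cnn = models.inception_v3(weights=None)
cnn.fc = nn.Identity()
cnn.eval()

# Eq. (2): a fully connected layer reduces f_im to x_{-1}; 512 is assumed.
reduce_fc = nn.Linear(2048, 512)

with torch.no_grad():
    image = torch.randn(1, 3, 299, 299)  # a dummy image sample I
    f_im = cnn(image)                    # f_im = CNN(I), shape (1, 2048)
x_minus1 = reduce_fc(f_im)               # dimension-reduced feature x_{-1}
```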
S403: Input the dimension-reduced image feature vector into the recursive decoding neural network, and the recursive decoding neural network decodes the dimension-reduced image feature vector to obtain a sentence describing the image sample.
FIG. 5 is a schematic structural diagram of the recursive decoding neural network in some embodiments of the present application. The embodiment shown in FIG. 5 takes a recursive decoding neural network implemented with LSTM units as an example. Referring to FIG. 4 and FIG. 5, step S403 includes:
S4031: Input the image feature vector into the recursive decoding neural network to obtain n output probability distributions, where n is a positive integer representing the length of the decoded sentence.
In some embodiments, the above operation of the LSTM unit can be expressed by the following Equation (3):

$$[p_{t+1}, h_{t+1}] = \mathrm{LSTM}(x_t, h_t), \quad t \in \{-1, 0, \dots, n-1\} \quad (3)$$

where $p_{t+1}$ is the probability distribution output at time step $t+1$ and $h_t$ is the hidden state of the LSTM unit at time step $t$.
In addition, since the sentence corresponding to the image sample is unknown when the recursive decoding neural network is trained in the embodiments of the present application, the input of the recursive decoding neural network at each time step needs to be derived from the word output at the previous time step, where:
when $t = -1$, $x_t = x_{-1}$, i.e., $x_t$ is the dimension-reduced image feature vector;
when $t \in \{0, \dots, n-1\}$, $x_t$ is the word vector corresponding to the word $S_t$ output by the LSTM unit at time step $t$, as shown in Equation (4):

$$x_t = W_e S_t, \quad t \in \{0, \dots, n-1\} \quad (4)$$

where $W_e$ denotes the word embedding.
A word embedding is a vector used to map a word or phrase from the vocabulary to real numbers.
S4032: For each probability distribution, select from the vocabulary the word corresponding to the maximum probability value in the probability distribution, to compose the sentence describing the image sample.
For example, according to the probability distribution $p_{t+1}$ output by the LSTM unit in step S4031, the word corresponding to the maximum probability value in $p_{t+1}$ is selected from the vocabulary as the output at time step $t+1$.
Through the above operations, the recursive decoding neural network can obtain a sentence of length $n$: $S_1 S_2 \dots S_n$. A minimal sketch of this greedy decoding loop is given below.
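The following sketch illustrates the greedy decoding loop of Equations (3) and (4) with an LSTM cell. The vocabulary size, hidden size, fixed sentence length, and the linear layer mapping the hidden state to the probability distribution $p_{t+1}$ are assumptions for illustration, not values from the patent text.

```python
import torch
import torch.nn as nn

vocab_size, hidden_size, max_len = 10000, 512, 20   # assumed sizes
lstm = nn.LSTMCell(hidden_size, hidden_size)
W_e = nn.Embedding(vocab_size, hidden_size)         # word embedding W_e
to_vocab = nn.Linear(hidden_size, vocab_size)       # maps h_{t+1} to p_{t+1}

def greedy_decode(x_minus1):
    """x_minus1: (1, hidden_size) dimension-reduced image feature."""
    h = torch.zeros(1, hidden_size)
    c = torch.zeros(1, hidden_size)
    x_t, words = x_minus1, []
    for _ in range(max_len):                 # t = -1, 0, ..., n-1
        h, c = lstm(x_t, (h, c))             # Eq. (3)
        p_next = to_vocab(h).softmax(dim=-1) # probability distribution p_{t+1}
        s_next = p_next.argmax(dim=-1)       # pick the max-probability word
        words.append(s_next.item())
        x_t = W_e(s_next)                    # Eq. (4): x_t = W_e * S_t
    return words                             # word indices of S_1 S_2 ... S_n

sentence = greedy_decode(torch.randn(1, hidden_size))
```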
S404: Determine the degree of matching between the decoded sentence and the image sample.
In some embodiments, the sentence contains multiple words. Each time the recursive decoding neural network decodes a word, the degree of matching between that word and the image sample can be judged, i.e., whether the image sample contains the object corresponding to the word.
In some embodiments, step S404 may include the following operations:
S4041: Determine, according to the detection result of an object detection model on the image sample, the objects contained in the image sample and the weights corresponding to the objects.
In some embodiments, the image sample may be detected in advance using an object detection model, to obtain the objects contained in the image and the score (weight) corresponding to each object.
S4042: Match each word in the decoded sentence against the objects contained in the image sample, and determine the degree of matching according to the matching result and the weights corresponding to the objects.
In some embodiments, the matching operation may be performed, and the degree of matching between a word decoded by the recursive decoding neural network and the image sample determined, according to the following Equation (5):

$$r_c(s_t) = \sum_{i=1}^{N_c} I(s_t = c_i)\, v_i \quad (5)$$

where $s_t$ denotes a word decoded by the recursive decoding neural network, $c_i$ denotes an object detected from the image sample by the object detection model, $v_i$ is the weight corresponding to that object, $N_c$ denotes the number of objects detected from the image sample by the object detection model, $r_c(s_t)$ denotes the degree of matching between the word $s_t$ and the image sample, and $I(\cdot)$ denotes the indicator function.
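A small sketch of Equation (5) follows. The dictionary mapping each detected object to its weight is an assumed output format for the object detection model, used here only for illustration.

```python
# Eq. (5): sum of weights v_i of detected objects c_i that equal the word s_t.
def concept_reward(word: str, detections: dict[str, float]) -> float:
    return sum(v for c, v in detections.items() if word == c)

detections = {"dog": 0.9, "frisbee": 0.7, "grass": 0.4}  # hypothetical scores
print(concept_reward("dog", detections))  # 0.9
print(concept_reward("cat", detections))  # 0.0
```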
S405: Adjust the recursive decoding neural network according to the degree of matching.
In some embodiments, the degree of matching may be fed back to the recursive decoding neural network to correct the weight parameters and the word embedding $W_e$ in the recursive decoding neural network. In this way, through continuous training, the weight parameters and the word embedding $W_e$ in the recursive decoding neural network are corrected step by step, improving the accuracy of the recursive decoding neural network.
For a recursive decoding neural network implemented with LSTM units, an LSTM unit has three gates: a forget gate, an input gate, and an output gate. The forget gate determines how much of the cell state at the previous time step is retained at the current time step; the input gate determines how much of the network input at the current time step is saved into the cell state; and the output gate determines how much of the current cell state is passed to the current output value of the LSTM.
A gate is in fact a fully connected layer whose input is a vector and whose output is a vector of real numbers between 0 and 1. When a gate outputs 0, any vector multiplied by it yields the zero vector, which is equivalent to letting nothing through; when it outputs 1, any vector multiplied by it is unchanged, which is equivalent to letting everything through. The standard gate equations are shown below.
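For reference, a textbook formulation of the three gates (a standard formulation supplied for context, not reproduced from the patent figures) is:

$$
\begin{aligned}
f_t &= \sigma\left(W_f [h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i [h_{t-1}, x_t] + b_i\right) \\
o_t &= \sigma\left(W_o [h_{t-1}, x_t] + b_o\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\left(W_c [h_{t-1}, x_t] + b_c\right) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

where $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication, and $W_f$, $W_i$, $W_o$, and $W_c$ are the gate weight matrices adjusted by the feedback described here.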
Therefore, the degree of matching calculated in step S404 can be fed back to the recursive decoding neural network and used to adjust the weight matrices of the gates, thereby correcting and adjusting the recursive decoding neural network.
In addition, during each round of training, the word embedding $W_e$ also changes as its parameters are updated according to the degree of matching.
S406: Input the decoded sentence into a recursive discriminative neural network, determine the fluency of the decoded sentence according to the first outputs of the recursive discriminative neural network at the respective time steps, and adjust the recursive decoding neural network according to the fluency.
In some embodiments, the decoded sentence may be input into the recursive discriminative neural network, and the fluency of the decoded sentence may be determined according to the first outputs of the recursive discriminative neural network at the respective time steps.
FIG. 6 is a schematic diagram of the recursive discriminative neural network provided in an embodiment of the present application. The embodiment shown in FIG. 6 again takes a recursive discriminative neural network implemented with LSTM units as an example. Note that the recursive discriminative neural network may also be implemented in other ways, for example, with GRU units.
Referring to FIG. 6, the operation of the LSTM unit in the recursive discriminative neural network can be expressed by the following Equation (6):

$$[q_t, \tilde{h}_t] = \mathrm{LSTM}^{adv}(x_t, \tilde{h}_{t-1}) \quad (6)$$

where the input of the LSTM unit at time step $t$ includes $x_t$ and $\tilde{h}_{t-1}$, and the output includes $q_t$ and $\tilde{h}_t$. Here, $x_t$ denotes the word vector corresponding to the $t$-th word, $\tilde{h}_{t-1}$ denotes the hidden state of the LSTM unit at time step $t-1$, $q_t$ denotes the score of the fluency of the sentence at time step $t$, and $\tilde{h}_t$ denotes the hidden state of the LSTM unit at time step $t$.
In some embodiments, $q_t$ may be a value between 0 and 1. When a fluent, manually written sentence is input, the recursive discriminative neural network outputs 1; when a sentence generated by the recursive decoding neural network is input, it outputs a value between 0 and 1. The goal of the recursive decoding neural network is to fool the recursive discriminative neural network, generating as far as possible sentences that make the recursive discriminative neural network output 1, and thereby generating sentences that look fluent.
In some embodiments, the fluency of the decoded sentence may be determined according to the following Equation (7):

$$r_{adv} = \frac{1}{n} \sum_{t=1}^{n} \log q_t \quad (7)$$

where $r_{adv}$ denotes the fluency, $q_t$ denotes the value output by the recursive discriminative neural network at time step $t$, and $n$ is a positive integer denoting the length of the decoded sentence.
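The following sketch computes Equation (7) from the per-step discriminator scores. The log-average form and the numerical clamp are assumptions consistent with the discrimination loss of Equation (8) below.

```python
import torch

def adversarial_reward(q: torch.Tensor) -> torch.Tensor:
    """Eq. (7): q holds the discriminator outputs q_1 ... q_n, each in (0, 1)."""
    return torch.log(q.clamp_min(1e-8)).mean()

q = torch.tensor([0.8, 0.6, 0.9])
print(adversarial_reward(q))  # the closer to 0, the more fluent the sentence
```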
In some embodiments, after the fluency is obtained, the fluency may be fed back to the recursive decoding neural network to adjust the parameters of the recursive decoding neural network and the word embedding $W_e$, thereby improving the accuracy of the recursive decoding neural network. Here, similarly to step S405, the fluency may be used to adjust the weight matrices of the gates in the recursive decoding neural network as well as the word embedding $W_e$, thereby correcting and adjusting the recursive decoding neural network.
S407: Input a sentence sample into the recursive discriminative neural network to obtain the second outputs of the recursive discriminative neural network at the respective time steps; and adjust the recursive discriminative neural network according to the first outputs and the second outputs of the recursive discriminative neural network at the respective time steps.
In some embodiments, step S407 may include the following operations:
S4071: Determine a discrimination loss according to the first outputs and the second outputs of the recursive discriminative neural network at the respective time steps.
For example, the discrimination loss may be calculated according to the following Equation (8):

$$L_{adv} = -\left[ \frac{1}{l} \sum_{t=1}^{l} \log \hat{q}_t + \frac{1}{n} \sum_{t=1}^{n} \log\left(1 - q_t\right) \right] \quad (8)$$

where $L_{adv}$ denotes the discrimination loss, $q_t$ denotes the first output of the recursive discriminative neural network at time step $t$ (on the decoded sentence), $\hat{q}_t$ denotes the second output of the recursive discriminative neural network at time step $t$ (on the sentence sample), $l$ is a positive integer denoting the length of the sentence sample, and $n$ is a positive integer denoting the length of the decoded sentence.
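A sketch of Equation (8) is given below; it is the standard GAN-style discriminator loss averaged over per-step scores, with an assumed numerical clamp for stability.

```python
import torch

def discrimination_loss(q_hat: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """q_hat: scores on a sentence sample (length l);
    q: scores on a decoded sentence (length n)."""
    eps = 1e-8
    real_term = torch.log(q_hat.clamp_min(eps)).mean()
    fake_term = torch.log((1 - q).clamp_min(eps)).mean()
    return -(real_term + fake_term)

q_hat = torch.tensor([0.9, 0.95, 0.85, 0.9])  # discriminator on a real sentence
q = torch.tensor([0.3, 0.4, 0.2])             # discriminator on a decoded sentence
print(discrimination_loss(q_hat, q))
```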
S4072: Adjust the recursive discriminative neural network according to the discrimination loss.
Here, the discrimination loss is used to adjust the parameters of the recursive discriminative neural network.
For example, where the recursive discriminative neural network is implemented with LSTM units, the discrimination loss may be fed back to the recursive discriminative neural network to adjust the weight matrices of the gates of its LSTM units, thereby adjusting and correcting the recursive discriminative neural network.
Through steps S404 to S407, on the one hand, the parameters of the recursive decoding neural network are corrected with respect to the fluency of the decoded sentence and its degree of matching with the image sample; on the other hand, the recursive discriminative neural network that supervises the learning of the recursive decoding neural network is also corrected and adjusted.
To further improve the accuracy of the recursive decoding neural network and the recursive discriminative neural network, in some embodiments the method may further include any one or more of the following steps S408 to S410:
S408: Reconstruct an image from the decoded sentence, determine the similarity between the reconstructed image and the image sample, and optimize the recursive decoding neural network according to the similarity.
In some embodiments, step S408 includes the following operations:
S4081: Obtain, through the recursive discriminative neural network, a sentence feature vector corresponding to the decoded sentence.
S4082: Map the sentence feature vector into the image feature space to obtain a corresponding image feature vector.
In some embodiments, a fully connected layer may be added at the end of the recursive discriminative neural network; this fully connected layer maps the sentence feature vector into the image feature space, thereby obtaining the image feature vector corresponding to the reconstructed image.
S4083: Compare the image feature vector obtained through the mapping with the image feature vector of the image sample, to determine the similarity between the reconstructed image and the image sample.
In some embodiments, the similarity between the reconstructed image and the image sample may be calculated according to the following Equation (9):

$$r_{im} = -\left\| x_{-1} - x' \right\|_2^2 \quad (9)$$

where $x'$ denotes the image feature vector obtained through the mapping.
S4084: Correct and adjust the recursive decoding neural network according to the similarity.
Here, the correction and adjustment of the recursive decoding neural network is similar to step S405: the calculated similarity $r_{im}$ may be fed back to the recursive decoding neural network to adjust the weight matrices of the gates and the word embedding in the recursive decoding neural network, so as to improve its accuracy.
S409: Reconstruct an image from the decoded sentence, determine the degree of difference between the reconstructed image and the image sample, and correct and adjust the recursive discriminative neural network according to the degree of difference.
In some embodiments, the degree of difference between the reconstructed image and the image sample may be determined according to the following Equation (10):

$$L_{im} = \left\| x_{-1} - x' \right\|_2^2 \quad (10)$$

Here, after the degree of difference is calculated, it may be fed back to the recursive discriminative neural network to adjust the weight matrices of the gates in the recursive discriminative neural network, thereby improving the accuracy of the recursive discriminative neural network.
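The following sketch computes the two feature-space reconstruction terms of Equations (9) and (10); the 512-dimensional features are the assumed sizes carried over from the earlier encoder sketch.

```python
import torch

def image_reconstruction_terms(x_minus1: torch.Tensor, x_prime: torch.Tensor):
    """x_minus1: encoder feature; x_prime: sentence feature mapped to image space."""
    l_im = torch.sum((x_minus1 - x_prime) ** 2)  # Eq. (10): difference L_im
    r_im = -l_im                                 # Eq. (9): similarity r_im
    return r_im, l_im

x_minus1 = torch.randn(512)
x_prime = torch.randn(512)
r_im, l_im = image_reconstruction_terms(x_minus1, x_prime)
print(r_im, l_im)
```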
S410: Obtain, through the recursive decoding neural network, a reconstructed sentence corresponding to the sentence sample, compare the sentence sample with the reconstructed sentence, and correct and adjust the recursive decoding neural network according to the comparison result.
In some embodiments, step S410 includes the following operations:
S4101: Obtain a sentence feature vector corresponding to the sentence sample, and map the sentence feature vector into the image feature space to obtain a corresponding image feature vector.
S4102: Obtain, through the recursive decoding neural network, a reconstructed sentence corresponding to the image feature vector obtained through the mapping.
S4103: Compare the sentence sample with the reconstructed sentence, and correct and adjust the recursive decoding neural network according to the comparison result.
In some embodiments, the comparison between the sentence sample and the reconstructed sentence may be implemented through the following Equation (11):

$$L_{sen} = -\sum_{t=1}^{l} \log p\left(s_t \mid s_1, \dots, s_{t-1}, x'\right) \quad (11)$$

where $s_1 s_2 \dots s_l$ is the sentence sample and $x'$ is the image feature vector obtained through the mapping. After the reconstruction loss $L_{sen}$ is obtained, it may be fed back to the recursive decoding neural network to adjust the weight matrices of the gates and the word embedding in the recursive decoding neural network.
In this step, the sentence reconstruction operation is similar to a denoising autoencoder. To make the resulting image description model more robust, two kinds of noise may be added to the input sentence: randomly removing some words, and randomly shuffling the order of some words. The loss of the denoising autoencoder can then be as shown in Equation (11) above.
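The two noise types can be sketched as follows; the drop probability and the size of the local shuffling window are assumed values for illustration.

```python
import random

def add_noise(words: list[str], p_drop: float = 0.1, window: int = 3) -> list[str]:
    # Randomly remove some words (keep at least one).
    kept = [w for w in words if random.random() > p_drop] or words[:1]
    # Randomly shuffle the order of words within small local windows.
    noisy = kept[:]
    for i in range(0, len(noisy), window):
        chunk = noisy[i:i + window]
        random.shuffle(chunk)
        noisy[i:i + window] = chunk
    return noisy

print(add_noise("a dog is catching a frisbee on the grass".split()))
```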
Through the technical solutions provided in the embodiments of the present application, the recursive decoding neural network is trained and adjusted according to the fluency of the sentence decoded by the recursive decoding neural network and the degree of matching between the decoded sentence and the image sample. In this way, paired image samples and sentence samples are not needed as a training set during the training of the image description model, which removes the dependence on paired image and sentence samples, expands the range of usable training sets, and improves the accuracy of the image description model.
Further, through the image reconstruction and sentence reconstruction operations, the recursive decoding neural network and the recursive discriminative neural network are corrected and adjusted further, thereby further improving the accuracy of the image description model.
After the training of the image description model is completed, the trained image description model is used with a beam search method to obtain a sentence description of an input image.
Beam search is a heuristic graph search algorithm, usually used when the solution space of a graph is large: to reduce the space and time taken by the search, at each step of depth expansion some lower-quality nodes are pruned and some higher-quality nodes are kept.
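A minimal beam-search sketch is shown below. The `step` callable, which returns candidate next words with their probabilities, is an assumed stand-in for the trained decoder's one-step output.

```python
import math

def beam_search(step, start, beam_width: int = 3, max_len: int = 20):
    beams = [([start], 0.0)]  # (word sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for word, prob in step(seq):  # next-word candidates from the decoder
                candidates.append((seq + [word], score + math.log(prob)))
        # Prune: keep only the beam_width highest-scoring partial sentences.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0][0]

# Toy usage with a fixed next-word distribution standing in for the decoder.
toy_step = lambda seq: [("a", 0.6), ("dog", 0.3), ("grass", 0.1)]
print(beam_search(toy_step, "<bos>", beam_width=2, max_len=3))
```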
The image description model training method provided in the embodiments of the present application has been described above.
The image description model training apparatus provided in the embodiments of the present application is described below with reference to the accompanying drawings.
FIG. 7 is a schematic structural diagram of an image description model training apparatus according to some embodiments of the present application. As shown in FIG. 7, the apparatus 700 includes:
an encoding module 701, configured to obtain an image feature vector of an image sample through the convolutional encoding neural network;
a decoding module 702, configured to decode the image feature vector through the recursive decoding neural network to obtain a sentence describing the image sample; and
an adjustment module 703, configured to determine the degree of matching between the decoded sentence and the image sample and adjust the recursive decoding neural network according to the degree of matching, and to determine the fluency of the decoded sentence and adjust the recursive decoding neural network according to the fluency.
In some embodiments, the apparatus 700 further includes:
a dimension reduction module 704, configured to, after the encoding module 701 obtains the image feature vector of the image sample, perform dimension reduction on the image feature vector to obtain a dimension-reduced image feature vector;
the decoding module 702 is further configured to input the dimension-reduced image feature vector into the recursive decoding neural network, and the recursive decoding neural network decodes the dimension-reduced image feature vector to obtain the sentence describing the image sample.
In some embodiments, the sentence contains multiple words, and the decoding module 702 is further configured to:
input the image feature vector into the recursive decoding neural network to obtain n output probability distributions, where n is a positive integer representing the length of the decoded sentence; and
for each probability distribution, select from the vocabulary the word corresponding to the maximum probability value in the probability distribution, to compose the sentence describing the image sample.
In some embodiments, the adjustment module 703 further includes an object reward unit 7031, configured to:
determine, according to the detection result of an object detection model on the image sample, the objects contained in the image sample and the weights corresponding to the objects; and
match each word contained in the decoded sentence against the objects contained in the image sample, determine the degree of matching between the decoded sentence and the image sample according to the matching result and the weights corresponding to the objects, and adjust the recursive decoding neural network according to the degree of matching.
In some embodiments, the adjustment module 703 further includes a discrimination reward unit 7032, configured to:
input the decoded sentence into the recursive discriminative neural network, determine the fluency of the decoded sentence according to the first outputs of the recursive discriminative neural network at the respective time steps, and adjust the recursive decoding neural network according to the fluency.
In some embodiments, the discrimination reward unit 7032 is further configured to determine the fluency according to the following equation:

$$r_{adv} = \frac{1}{n} \sum_{t=1}^{n} \log q_t$$

where $r_{adv}$ denotes the fluency, $q_t$ denotes the first output of the recursive discriminative neural network at time step $t$, and $n$ is a positive integer denoting the length of the decoded sentence.
In some embodiments, the adjustment module 703 further includes a discrimination loss unit 7033, configured to:
input a sentence sample into the recursive discriminative neural network to obtain the second outputs of the recursive discriminative neural network at the respective time steps; and
adjust the recursive discriminative neural network according to the first outputs and the second outputs of the recursive discriminative neural network at the respective time steps.
In some embodiments, the discrimination loss unit 7033 is further configured to:
determine the discrimination loss according to the first outputs and the second outputs of the recursive discriminative neural network at the respective time steps, according to the following equation:

$$L_{adv} = -\left[ \frac{1}{l} \sum_{t=1}^{l} \log \hat{q}_t + \frac{1}{n} \sum_{t=1}^{n} \log\left(1 - q_t\right) \right]$$

and adjust the recursive discriminative neural network according to the discrimination loss;
where $q_t$ denotes the first output of the recursive discriminative neural network at time step $t$, $\hat{q}_t$ denotes the second output of the recursive discriminative neural network at time step $t$, $l$ is a positive integer denoting the length of the sentence sample, $L_{adv}$ denotes the discrimination loss, and $n$ is a positive integer denoting the length of the decoded sentence.
In some embodiments, the adjustment module 703 further includes an image reconstruction reward unit 7034, configured to:
reconstruct an image from the decoded sentence;
determine the similarity between the reconstructed image and the image sample; and
adjust the recursive decoding neural network according to the similarity.
In some embodiments, the image reconstruction reward unit 7034 is further configured to:
obtain a sentence feature vector corresponding to the decoded sentence;
map the sentence feature vector into the image feature space to obtain a corresponding image feature vector; and
compare the image feature vector obtained through the mapping with the image feature vector of the image sample, to determine the similarity between the reconstructed image and the image sample.
In some embodiments, the similarity $r_{im}$ between the reconstructed image and the image sample may be calculated according to Equation (9) above. The calculated similarity $r_{im}$ may then be fed back to the recursive decoding neural network to adjust the weight matrices of the gates and the word embedding in the recursive decoding neural network, so as to improve the accuracy of the recursive decoding neural network.
In some embodiments, the adjustment module 703 further includes an image reconstruction loss unit 7035, configured to:
reconstruct an image from the decoded sentence;
determine the degree of difference between the reconstructed image and the image sample; and
adjust the recursive discriminative neural network according to the degree of difference.
In some embodiments, the degree of difference $L_{im}$ may be calculated using Equation (10) above. After the degree of difference $L_{im}$ is calculated, it may be fed back to the recursive discriminative neural network to adjust the weight matrices of the gates in the recursive discriminative neural network, thereby improving the accuracy of the recursive discriminative neural network.
In some embodiments, the adjustment module 703 further includes a sentence reconstruction loss unit 7036, configured to:
obtain, through the recursive decoding neural network, a reconstructed sentence corresponding to the sentence sample, compare the sentence sample with the reconstructed sentence, and correct and adjust the recursive decoding neural network according to the comparison result.
In some embodiments, the sentence reconstruction loss unit 7036 may obtain a sentence feature vector corresponding to the sentence sample, map the sentence feature vector into the image feature space to obtain a corresponding image feature vector, obtain a reconstructed sentence corresponding to the image feature vector obtained through the mapping, compare the sentence sample with the reconstructed sentence, and adjust the recursive decoding neural network according to the comparison result.
In some embodiments, the sentence reconstruction loss unit 7036 may obtain the sentence reconstruction loss $L_{sen}$ according to Equation (11) above. After the reconstruction loss $L_{sen}$ is obtained, it may be fed back to the recursive decoding neural network to adjust the weight matrices of the gates and the word embedding in the recursive decoding neural network.
Through the technical solutions provided in the embodiments of the present application, the recursive decoding neural network is trained and adjusted according to the fluency of the sentence decoded by the recursive decoding neural network and the degree of matching between the decoded sentence and the image sample. In this way, paired image samples and sentence samples are not needed as a training set during the training of the image description model, which removes the dependence on paired image and sentence samples, expands the range of usable training sets, and improves the accuracy of the image description model.
Further, through the image reconstruction and sentence reconstruction operations, the recursive decoding neural network and the recursive discriminative neural network are corrected and adjusted further, thereby further improving the accuracy of the image description model.
After the training of the image description model is completed, the trained image description model is used with a beam search method to obtain a sentence description of an input image.
FIG. 8 is another schematic structural diagram of an image description model training apparatus in some embodiments of the present application. The image description model training apparatus 800 may be the model training apparatus 116 shown in FIG. 1, or may be a component integrated in the model training apparatus 116.
As shown in FIG. 8, the image description model training apparatus 800 includes one or more processors (CPUs) 802, a network interface 804, a memory 806, and a communication bus 808 for interconnecting these components.
In some embodiments, the network interface 804 is configured to implement a network connection between the image description model training apparatus 800 and an external device.
The image description model training apparatus 800 may further include one or more output devices 812 (for example, one or more visual displays) and/or one or more input devices 814 (for example, a keyboard, a mouse, or other input controls).
The memory 806 may be a high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid-state storage devices; or a non-volatile memory, such as one or more magnetic disk storage devices, optical disc storage devices, flash memory devices, or other non-volatile solid-state storage devices.
The memory 806 includes:
an operating system 816, including programs for handling various basic system services and for performing hardware-related tasks; and
an image description model training system 818, configured to obtain an image feature vector of an image sample through the convolutional encoding neural network; decode the image feature vector through the recursive decoding neural network to obtain a sentence describing the image sample; determine the degree of matching between the decoded sentence and the image sample, and adjust the recursive decoding neural network according to the degree of matching; and determine the fluency of the decoded sentence, and adjust the recursive decoding neural network according to the fluency.
In some embodiments, for the specific operations and functions of the image description model training system 818, reference may be made to the method embodiments above; details are not repeated here.
In addition, the functional modules in the embodiments of the present application may be integrated into one processing unit, each module may exist alone physically, or two or more modules may be integrated into one unit. The integrated unit above may be implemented in the form of hardware or in the form of a software functional unit. The functional modules of the embodiments may be located in one terminal or network node, or may be distributed across multiple terminals or network nodes.
In addition, each embodiment of the present application may be implemented by a data processing program executed by a data processing device such as a computer. Clearly, the data processing program constitutes the present application. In addition, a data processing program usually stored in a storage medium is executed by reading the program directly out of the storage medium, or by installing or copying the program into a storage device (such as a hard disk and/or a memory) of the data processing device. Therefore, such a storage medium also constitutes the present application. The storage medium may use any type of recording method, for example, a paper storage medium (such as paper tape), a magnetic storage medium (such as a floppy disk, a hard disk, or a flash memory), an optical storage medium (such as a CD-ROM), or a magneto-optical storage medium (such as an MO disc).
Therefore, the present application also provides a storage medium storing a data processing program, where the data processing program is used to perform any one of the embodiments of the above methods of the present application.
A person of ordinary skill in the art may understand that all or some of the steps of the methods in the above embodiments may be completed by a program instructing relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
The above descriptions are merely preferred embodiments of the present application and are not intended to limit the protection scope of the present application. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall fall within the protection scope of the present application.
Claims (16)
- A method for training an image description model, performed by an electronic device, the image description model including a convolutional encoding neural network and a recursive decoding neural network, the method including: obtaining an image feature vector of an image sample through the convolutional encoding neural network; decoding the image feature vector through the recursive decoding neural network to obtain a sentence describing the image sample; determining a degree of matching between the decoded sentence and the image sample, and adjusting the recursive decoding neural network according to the degree of matching; and determining the fluency of the decoded sentence, and adjusting the recursive decoding neural network according to the fluency.
- The method according to claim 1, further including, after the obtaining of the image feature vector of the image sample: performing dimension reduction on the image feature vector to obtain a dimension-reduced image feature vector; where the decoding of the image feature vector through the recursive decoding neural network to obtain the sentence describing the image sample includes: inputting the dimension-reduced image feature vector into the recursive decoding neural network, the recursive decoding neural network decoding the dimension-reduced image feature vector to obtain the sentence describing the image sample.
- The method according to claim 1, where the decoding of the image feature vector through the recursive decoding neural network to obtain the sentence describing the image sample includes: inputting the image feature vector into the recursive decoding neural network to obtain n output probability distributions, where n is a positive integer representing the length of the decoded sentence; and for each probability distribution, selecting from the vocabulary the word corresponding to the maximum probability value in the probability distribution, to compose the sentence describing the image sample.
- The method according to claim 1, where the determining of the degree of matching between the decoded sentence and the image sample includes: determining, according to the detection result of an object detection model on the image sample, the objects contained in the image sample and the weights corresponding to the objects; and matching each word contained in the decoded sentence against the objects contained in the image sample, and determining the degree of matching according to the matching result and the weights corresponding to the objects.
- The method according to claim 1, where the determining of the fluency of the decoded sentence includes: inputting the decoded sentence into a recursive discriminative neural network, and determining the fluency of the decoded sentence according to the first outputs of the recursive discriminative neural network at the respective time steps.
- The method according to claim 5, further including: inputting a sentence sample into the recursive discriminative neural network to obtain the second outputs of the recursive discriminative neural network at the respective time steps; and adjusting the recursive discriminative neural network according to the first outputs and the second outputs of the recursive discriminative neural network at the respective time steps.
- The method according to claim 5, further including: reconstructing an image from the decoded sentence; determining the degree of difference between the reconstructed image and the image sample; and adjusting the recursive discriminative neural network according to the degree of difference.
- The method according to claim 1, further including: reconstructing an image from the decoded sentence; determining the similarity between the reconstructed image and the image sample; and adjusting the recursive decoding neural network according to the similarity.
- The method according to claim 10, where the reconstructing of the image from the decoded sentence includes: obtaining a sentence feature vector corresponding to the decoded sentence; and mapping the sentence feature vector into the image feature space to obtain a corresponding image feature vector; and where the determining of the similarity between the reconstructed image and the image sample includes: comparing the image feature vector obtained through the mapping with the image feature vector of the image sample, to determine the similarity between the reconstructed image and the image sample.
- The method according to claim 1, further including: obtaining a sentence feature vector corresponding to a sentence sample, and mapping the sentence feature vector into the image feature space to obtain a corresponding image feature vector; obtaining a reconstructed sentence corresponding to the image feature vector obtained through the mapping; and comparing the sentence sample with the reconstructed sentence, and adjusting the recursive decoding neural network according to the comparison result.
- An apparatus for training an image description model, the image description model including a convolutional encoding neural network and a recursive decoding neural network, the apparatus including: an encoding module, configured to obtain an image feature vector of an image sample through the convolutional encoding neural network; a decoding module, configured to decode the image feature vector through the recursive decoding neural network to obtain a sentence describing the image sample; and an adjustment module, configured to determine a degree of matching between the decoded sentence and the image sample and adjust the recursive decoding neural network according to the degree of matching, and to determine the fluency of the decoded sentence and adjust the recursive decoding neural network according to the fluency.
- The apparatus according to claim 13, where the adjustment module is further configured to input the decoded sentence into a recursive discriminative neural network, and determine the fluency of the decoded sentence according to the first outputs of the recursive discriminative neural network at the respective time steps.
- An electronic device, including: a processor; and a memory connected to the processor, the memory storing machine-readable instructions executable by the processor to perform the method according to any one of claims 1 to 12.
- A non-volatile computer-readable storage medium storing machine-readable instructions executable by a processor to perform the method according to any one of claims 1 to 12.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP19870771.3A EP3866068A4 (en) | 2018-10-08 | 2019-07-05 | METHOD AND DEVICE FOR FORMING IMAGE DESCRIPTION MODEL AND INFORMATION HOLDER |
US17/075,618 US12073321B2 (en) | 2018-10-08 | 2020-10-20 | Method and apparatus for training image caption model, and storage medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811167476.9A CN110147806B (zh) | 2018-10-08 | 2018-10-08 | 图像描述模型的训练方法、装置及存储介质 |
CN201811167476.9 | 2018-10-08 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/075,618 Continuation US12073321B2 (en) | 2018-10-08 | 2020-10-20 | Method and apparatus for training image caption model, and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020073700A1 true WO2020073700A1 (zh) | 2020-04-16 |
Family
ID=67588352
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/094891 WO2020073700A1 (zh) | 2018-10-08 | 2019-07-05 | 图像描述模型的训练方法、装置及存储介质 |
Country Status (4)
Country | Link |
---|---|
US (1) | US12073321B2 (zh) |
EP (1) | EP3866068A4 (zh) |
CN (1) | CN110147806B (zh) |
WO (1) | WO2020073700A1 (zh) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112528894A (zh) * | 2020-12-17 | 2021-03-19 | 科大讯飞股份有限公司 | 一种差异项判别方法及装置 |
CN114359942A (zh) * | 2022-01-11 | 2022-04-15 | 平安科技(深圳)有限公司 | 基于人工智能的字幕提取方法、装置、设备和存储介质 |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111537853A (zh) * | 2020-06-29 | 2020-08-14 | 国网山东省电力公司菏泽供电公司 | 基于多源异构数据分析的开关柜局部放电智能检测方法 |
CN112183525B (zh) * | 2020-09-15 | 2023-11-24 | 中保车服科技服务股份有限公司 | 一种文本识别模型的构建及文本识别方法和装置 |
CN112927136B (zh) * | 2021-03-05 | 2022-05-10 | 江苏实达迪美数据处理有限公司 | 一种基于卷积神经网络域适应的图像缩小方法及系统 |
CN113657390B (zh) * | 2021-08-13 | 2022-08-12 | 北京百度网讯科技有限公司 | 文本检测模型的训练方法和检测文本方法、装置和设备 |
US12008331B2 (en) * | 2021-12-23 | 2024-06-11 | Microsoft Technology Licensing, Llc | Utilizing visual and textual aspects of images with recommendation systems |
CN114881242B (zh) * | 2022-04-21 | 2023-03-24 | 西南石油大学 | 一种基于深度学习的图像描述方法及系统、介质和电子设备 |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105654135A (zh) * | 2015-12-30 | 2016-06-08 | 成都数联铭品科技有限公司 | 一种基于递归神经网络的图像文字序列识别系统 |
CN106446782A (zh) * | 2016-08-29 | 2017-02-22 | 北京小米移动软件有限公司 | 图像识别方法及装置 |
CN107480144A (zh) * | 2017-08-03 | 2017-12-15 | 中国人民大学 | 具备跨语言学习能力的图像自然语言描述生成方法和装置 |
CN108228686A (zh) * | 2017-06-15 | 2018-06-29 | 北京市商汤科技开发有限公司 | 用于实现图文匹配的方法、装置和电子设备 |
CN108288067A (zh) * | 2017-09-12 | 2018-07-17 | 腾讯科技(深圳)有限公司 | 图像文本匹配模型的训练方法、双向搜索方法及相关装置 |
CN108416059A (zh) * | 2018-03-22 | 2018-08-17 | 北京市商汤科技开发有限公司 | 图像描述模型的训练方法和装置、设备、介质、程序 |
US10089742B1 (en) * | 2017-03-14 | 2018-10-02 | Adobe Systems Incorporated | Automatically segmenting images based on natural language phrases |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107491764A (zh) | 2017-08-25 | 2017-12-19 | 电子科技大学 | 一种基于深度卷积神经网络的违规驾驶检测方法 |
CN107657008B (zh) * | 2017-09-25 | 2020-11-03 | 中国科学院计算技术研究所 | 基于深度判别排序学习的跨媒体训练及检索方法 |
CN107832292B (zh) * | 2017-11-02 | 2020-12-29 | 合肥工业大学 | 一种基于神经网络模型的图像到汉语古诗的转换方法 |
-
2018
- 2018-10-08 CN CN201811167476.9A patent/CN110147806B/zh active Active
-
2019
- 2019-07-05 WO PCT/CN2019/094891 patent/WO2020073700A1/zh unknown
- 2019-07-05 EP EP19870771.3A patent/EP3866068A4/en active Pending
-
2020
- 2020-10-20 US US17/075,618 patent/US12073321B2/en active Active
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105654135A (zh) * | 2015-12-30 | 2016-06-08 | 成都数联铭品科技有限公司 | 一种基于递归神经网络的图像文字序列识别系统 |
CN106446782A (zh) * | 2016-08-29 | 2017-02-22 | 北京小米移动软件有限公司 | 图像识别方法及装置 |
US10089742B1 (en) * | 2017-03-14 | 2018-10-02 | Adobe Systems Incorporated | Automatically segmenting images based on natural language phrases |
CN108228686A (zh) * | 2017-06-15 | 2018-06-29 | 北京市商汤科技开发有限公司 | 用于实现图文匹配的方法、装置和电子设备 |
CN107480144A (zh) * | 2017-08-03 | 2017-12-15 | 中国人民大学 | 具备跨语言学习能力的图像自然语言描述生成方法和装置 |
CN108288067A (zh) * | 2017-09-12 | 2018-07-17 | 腾讯科技(深圳)有限公司 | 图像文本匹配模型的训练方法、双向搜索方法及相关装置 |
CN108416059A (zh) * | 2018-03-22 | 2018-08-17 | 北京市商汤科技开发有限公司 | 图像描述模型的训练方法和装置、设备、介质、程序 |
Non-Patent Citations (1)
Title |
---|
See also references of EP3866068A4 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112528894A (zh) * | 2020-12-17 | 2021-03-19 | 科大讯飞股份有限公司 | 一种差异项判别方法及装置 |
CN112528894B (zh) * | 2020-12-17 | 2024-05-31 | 科大讯飞股份有限公司 | 一种差异项判别方法及装置 |
CN114359942A (zh) * | 2022-01-11 | 2022-04-15 | 平安科技(深圳)有限公司 | 基于人工智能的字幕提取方法、装置、设备和存储介质 |
Also Published As
Publication number | Publication date |
---|---|
EP3866068A1 (en) | 2021-08-18 |
EP3866068A4 (en) | 2022-04-13 |
US20210034981A1 (en) | 2021-02-04 |
CN110147806A (zh) | 2019-08-20 |
US12073321B2 (en) | 2024-08-27 |
CN110147806B (zh) | 2023-04-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020073700A1 (zh) | 图像描述模型的训练方法、装置及存储介质 | |
CN108733792B (zh) | 一种实体关系抽取方法 | |
Kumar et al. | Syntax-guided controlled generation of paraphrases | |
US10255275B2 (en) | Method and system for generation of candidate translations | |
CN112528637B (zh) | 文本处理模型训练方法、装置、计算机设备和存储介质 | |
CN110929515A (zh) | 基于协同注意力和自适应调整的阅读理解方法及系统 | |
CN112287089B (zh) | 用于自动问答系统的分类模型训练、自动问答方法及装置 | |
US11475225B2 (en) | Method, system, electronic device and storage medium for clarification question generation | |
US11775770B2 (en) | Adversarial bootstrapping for multi-turn dialogue model training | |
US20220058444A1 (en) | Asymmetric adversarial learning framework for multi-turn dialogue response generation | |
CN112131883B (zh) | 语言模型训练方法、装置、计算机设备和存储介质 | |
US20200134455A1 (en) | Apparatus and method for training deep learning model | |
CN110472255B (zh) | 神经网络机器翻译方法、模型、电子终端以及存储介质 | |
WO2023137911A1 (zh) | 基于小样本语料的意图分类方法、装置及计算机设备 | |
CN111125367A (zh) | 一种基于多层次注意力机制的多种人物关系抽取方法 | |
US20200042547A1 (en) | Unsupervised text simplification using autoencoders with a constrained decoder | |
CN111881292B (zh) | 一种文本分类方法及装置 | |
WO2021034941A1 (en) | A method for multi-modal retrieval and clustering using deep cca and active pairwise queries | |
CN111598183A (zh) | 一种多特征融合图像描述方法 | |
CN114186063A (zh) | 跨域文本情绪分类模型的训练方法和分类方法 | |
CN115861995A (zh) | 一种视觉问答方法、装置及电子设备和存储介质 | |
CN117556276B (zh) | 用于确定文本和视频之间的相似度的方法和装置 | |
Yang | [Retracted] Application of LSTM Neural Network Technology Embedded in English Intelligent Translation | |
CN114386480A (zh) | 视频内容描述模型的训练方法、应用方法、设备及介质 | |
CN114330367A (zh) | 一种基于句子的语义相似度获得方法、装置以及设备 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19870771 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
ENP | Entry into the national phase |
Ref document number: 2019870771 Country of ref document: EP Effective date: 20210510 |