CN111488473B - Picture description generation method, device and computer readable storage medium - Google Patents

Info

Publication number
CN111488473B
Authority
CN
China
Prior art keywords
word
sentence
picture
training sample
description
Prior art date
Legal status
Active
Application number
CN201910078978.2A
Other languages
Chinese (zh)
Other versions
CN111488473A (en)
Inventor
王晶
梅涛
Current Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201910078978.2A
Publication of CN111488473A
Application granted
Publication of CN111488473B
Legal status: Active

Classifications

    • G: Physics
    • G06: Computing; Calculating or Counting
    • G06F: Electric Digital Data Processing
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: Physics
    • G06: Computing; Calculating or Counting
    • G06N: Computing Arrangements Based on Specific Computational Models
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Abstract

The disclosure relates to a method, a device, and a computer-readable storage medium for generating picture descriptions, and relates to the technical field of artificial intelligence. The method comprises the following steps: extracting a first feature of each target picture in a picture stream through a multi-class classification model; extracting a second feature of each target picture through a multi-instance classification model or a multi-label classification model; inputting the first feature and the second feature into a generator of a trained GAN model, and determining a word selection probability distribution of each word in a vocabulary; and selecting the word with the highest probability in the word selection probability distribution to generate a sentence description of each target picture, the sentence descriptions of the target pictures forming a paragraph description of the picture stream. The technical solution of the disclosure can improve the accuracy of the description.

Description

Picture description generation method, device and computer readable storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a method for generating a picture description, an apparatus for generating a picture description, and a computer readable storage medium.
Background
Generating a story description (i.e., a paragraph description) composed of multiple sentences from an ordered picture stream given as input is one of the important tasks in the fields of computer vision and natural language processing.
In the related art, a story description may be generated by retrieval, i.e., a suitable story description is found for the picture stream from an existing dataset; alternatively, the entire story description may be learned as one long sentence based on an RNN (Recurrent Neural Network).
Disclosure of Invention
The inventors of the present disclosure found that the above-described related art has the following problems: the retrieval-based approach can only select from story descriptions that already exist in the dataset; the RNN-based approach of generating the entire description as one long sentence suffers from learning difficulties, resulting in inaccurate descriptions.
In view of this, the present disclosure proposes a technical solution for generating a picture description, which can improve the accuracy of the description.
According to some embodiments of the present disclosure, there is provided a method for generating a picture description, comprising: extracting a first feature of each target picture in a picture stream through a multi-class classification model; extracting a second feature of each target picture through a multi-instance classification model or a multi-label classification model; inputting the first feature and the second feature into a generator of a trained GAN (Generative Adversarial Network) model, and determining a word selection probability distribution of each word in a vocabulary; and selecting the word with the highest probability in the word selection probability distribution to generate a sentence description of each target picture, the sentence descriptions of the target pictures forming a paragraph description of the picture stream.
In some embodiments, the generator of the GAN model includes a first RNN model and a second RNN model; the determining the word selection probability distribution of each word in the word list comprises the following steps: inputting the first feature into the first RNN model to generate a theme vector of each target picture; and inputting the second feature and the topic vector into the second RNN model, and determining the word selection probability distribution of each word in the word list.
In some embodiments, the discriminator of the GAN model includes a first classifier for determining whether the sentence descriptions conform to the actual content of the corresponding target picture, and a second classifier for determining whether the paragraph descriptions conform to the language style of the pre-labeled paragraph sample.
In some embodiments, the discriminators of the GAN model are trained with sentence descriptions and paragraph descriptions generated by the generator of the GAN model as negative samples and pre-labeled sentence samples and paragraph samples as positive samples.
In some embodiments, the method further comprises: sampling words in the vocabulary according to each word in the sentence description and the word selection probability distribution, and generating a training sample sentence corresponding to each word; inputting each training sample sentence into a discriminator of the GAN model, and determining the positive category probability of each training sample sentence; determining a first weight of each word corresponding to each training sample sentence according to the positive category probability; and updating the generator of the GAN model according to the first weight of each word.
In some embodiments, generating the training sample sentences comprises: selecting a word of the sentence description as a target word, and retaining each word of the sentence description arranged in front of the target word as the word at the corresponding position in the training sample sentence; inputting the target word into the generator of the GAN model, and determining the word selection probability distribution of each word in the vocabulary; and sampling the words in the vocabulary according to the word selection probability distribution to determine each word arranged behind the target word in the training sample sentence.
In some embodiments, the method further comprises: generating a training sample paragraph corresponding to each word according to each training sample sentence; inputting the training sample paragraphs into the discriminator of the GAN model, and determining the positive category probability of each training sample paragraph; determining a second weight of each word corresponding to each training sample paragraph according to the positive category probability of each training sample paragraph; and updating the generator of the GAN model according to the second weight.
In some embodiments, generating the training sample paragraphs comprises: selecting one of the training sample sentences as a target sentence; taking the target sentence as the sentence at the corresponding position in the training sample paragraph; and taking each sentence description arranged in front of and behind the target sentence in the paragraph description as the sentence at the corresponding position in the training sample paragraph.
According to other embodiments of the present disclosure, there is provided a generation apparatus of a picture description, including: the extraction unit is used for extracting first characteristics of each target picture in the picture stream through a multi-class classification model, and extracting second characteristics of each target picture through a multi-instance classification model or a multi-label classification model; the determining unit is used for inputting the first feature and the second feature into a trained generator of the GAN model and determining word selection probability distribution of each word in the word list; and the generating unit is used for selecting the word with the highest probability in the word selection probability distribution, generating the sentence description of each target picture, and forming the paragraph description of the picture stream by the sentence description of each target picture.
In some embodiments, the generator of the GAN model includes a first RNN model and a second RNN model; the determining unit inputs the first feature into the first RNN model, generates a theme vector of each target picture, inputs the second feature and the theme vector into the second RNN model, and determines word selection probability distribution of each word in the word list.
In some embodiments, the discriminator of the GAN model includes a first classifier for determining whether the sentence descriptions conform to the actual content of the corresponding target picture, and a second classifier for determining whether the paragraph descriptions conform to the language style of the pre-labeled paragraph sample.
In some embodiments, the generating device further includes: and the training unit is used for training the discriminator of the GAN model by taking the sentence description and the paragraph description generated by the generator of the GAN model as negative samples and taking the pre-marked sentence sample and the paragraph sample as positive samples.
In some embodiments, the generating unit samples words in the vocabulary according to each word in the sentence description and the word selection probability distribution, and generates each training sample sentence corresponding to each word; the determining unit inputs each training sample sentence into a discriminator of the GAN model, determines positive category probability of each training sample sentence, and determines a first weight of each word corresponding to each training sample sentence according to the positive category probability; and the training unit is used for updating the generator of the GAN model according to the first weight of each word.
In some embodiments, the generating unit selects one word of the sentence description as a target word, retains each word of the sentence description arranged in front of the target word as the word at the corresponding position in the training sample sentence, inputs the target word into the generator of the GAN model, determines the word selection probability distribution of each word in the vocabulary, and samples the words in the vocabulary according to the word selection probability distribution to determine each word arranged behind the target word in the training sample sentence.
In some embodiments, the generating unit generates a training sample paragraph corresponding to each word according to each training sample sentence; the determining unit inputs the training sample paragraphs into a discriminator of the GAN model, determines positive category probabilities of the training sample paragraphs, and determines second weights of words corresponding to the training sample paragraphs according to the positive category probabilities of the training sample paragraphs; and the training unit updates the generator of the GAN model according to the second weight.
In some embodiments, the generating unit selects one of the training sample sentences as a target sentence, the target sentence is used as a sentence at a corresponding position in the training sample paragraph, and each of the sentence descriptions arranged before and after the target sentence is used as a sentence at a corresponding position in the training sample paragraph.
According to still further embodiments of the present disclosure, there is provided a generation apparatus of a picture description, comprising: a memory; and a processor coupled to the memory, the processor configured to perform the method of generating a picture description in any of the embodiments described above based on instructions stored in the memory.
According to still further embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of generating a picture description in any of the above embodiments.
In the above embodiments, the features of the main target in each picture are extracted based on the multi-class classification model, and the features of multiple targets in each picture are extracted based on the multi-label classification model, so that the interrelations between the features of the picture stream and the targets can be deeply mined; an RNN then generates a sentence description for each picture, and the sentence descriptions finally form the paragraph description, avoiding the learning difficulties caused by long sentences. The paragraph description obtained in this way is more logically coherent and more consistent with the content of the picture stream, thereby improving the accuracy of the description.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The disclosure may be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 illustrates a flow chart of some embodiments of a method of generating a picture description of the present disclosure;
FIG. 2 illustrates a flow chart of some embodiments of step 130 of FIG. 1;
FIG. 3 illustrates a flow chart of some embodiments of a GAN model generator training method of the present disclosure;
FIG. 4 illustrates a flow chart of some embodiments of step 310 in FIG. 3;
fig. 5 illustrates a flow chart of further embodiments of the GAN model generator training method of the present disclosure;
FIG. 6 illustrates a block diagram of some embodiments of a generation apparatus of a picture description of the present disclosure;
FIG. 7 shows a block diagram of further embodiments of a generation apparatus of a picture description of the present disclosure;
fig. 8 shows a block diagram of still further embodiments of a generation apparatus of the picture descriptions of the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless it is specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective parts shown in the drawings are not drawn in actual scale for convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but should be considered part of the specification where appropriate.
In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
Fig. 1 illustrates a flow chart of some embodiments of a method of generating a picture description of the present disclosure.
As shown in fig. 1, the method includes: step 110, extracting first characteristics of each target picture; step 120, extracting a second feature of each target picture; step 130, determining a word selection probability distribution; step 140, generating sentence descriptions and paragraph descriptions.
In step 110, a first feature of each target picture in the picture stream is extracted by a multi-class classification model. For example, the picture stream may be an ordered plurality of pictures. In a multi-class classification model, a sample belongs to one and only one of a plurality of classes, and the different classes are mutually exclusive.
In some embodiments, a CNN (Convolutional Neural Network) within the multi-class classification model is used to extract depth features of each target picture as the first features, so that the main target in each target picture can be captured.
In step 120, a second feature of each target picture is extracted by a multi-instance classification (multi-instance learning) model or a multi-label classification model. In the multi-label classification model, a sample can belong to a plurality of categories (or labels), and the different categories are related to one another, so that features of multiple targets in each target picture can be extracted to obtain the attribute features of the picture, i.e., the second features. Multiple targets can thus be identified, the generated sentence descriptions conform more closely to the picture content, and logical continuity between sentences in the final paragraph description is facilitated, improving the accuracy of the description.
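By way of illustration only, the feature extraction of steps 110 and 120 could be sketched as follows in PyTorch (an assumed framework, not part of the disclosure); the ResNet-152 backbone, the number of attribute labels, and the linear multi-label head are illustrative assumptions, and a real multi-label head would be trained separately with per-label sigmoid supervision.

```python
# Minimal sketch of steps 110 and 120; assumes torchvision >= 0.13.
import torch
import torch.nn as nn
import torchvision.models as models

backbone = models.resnet152(weights=models.ResNet152_Weights.DEFAULT)
backbone.fc = nn.Identity()           # expose the 2048-d penultimate ("depth") features
backbone.eval()

NUM_LABELS = 1000                     # assumed size of the attribute-label set
multi_label_head = nn.Linear(2048, NUM_LABELS)  # illustrative, untrained placeholder

def extract_features(pictures):       # pictures: (N, 3, 224, 224), normalized
    with torch.no_grad():
        depth = backbone(pictures)                      # first features, (N, 2048)
        attrs = torch.sigmoid(multi_label_head(depth))  # second (attribute) features
    return depth, attrs
```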
In step 130, the first feature and the second feature are input into a generator of a trained GAN model to determine the word selection probability distribution of each word in the vocabulary. The GAN model is composed of a generator and a discriminator; the generator may be, for example, an RNN model for generating the sentence descriptions and paragraph descriptions of the pictures.
In some embodiments, the generator of the GAN model includes a first RNN model and a second RNN model. For example, step 130 may be implemented by the embodiment of fig. 2.
Fig. 2 shows a flow chart of some embodiments of step 130 in fig. 1.
As shown in fig. 2, step 130 may include: step 1310, generating a theme vector of each target picture; and step 1320, determining a word choice probability distribution.
In step 1310, a first feature is input into a first RNN model, and a topic vector (topic vector) for each target picture is generated.
In some embodiments, the depth features are input into the first RNN model in sequence to encode the information of each picture, yielding a series of encoded information, i.e., the topic vectors. For example, at the first moment, the first RNN model takes the depth features of the first picture as input and outputs a network state that serves as input at the next moment; at each subsequent moment, the first RNN model takes the depth features of the corresponding picture and the network state output at the previous moment as inputs, and outputs the network state of the current moment. The network state output at each moment is one piece of encoded information, and each piece of encoded information corresponds to one sentence description.
The first RNN model thus fuses the first features of the pictures in the picture stream to obtain the topic vector of each picture. Each topic vector is equivalent to the central idea of the sentence description of the corresponding picture, and the topic vectors generated in this way are associated with one another, so that the sentence descriptions in the final paragraph description are logically coherent, improving accuracy.
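For illustration, the behavior of the first RNN model described above might be sketched as follows (the GRU cell type and dimensions are assumptions):

```python
import torch
import torch.nn as nn

class TopicRNN(nn.Module):
    """Illustrative first RNN: consumes the depth features of one picture
    stream in order and emits the network state at each moment as the topic
    vector of the corresponding picture."""
    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)

    def forward(self, depth_feats):                     # (num_pics, feat_dim)
        states, _ = self.gru(depth_feats.unsqueeze(0))  # state at every moment
        return states.squeeze(0)                        # (num_pics, hidden_dim)
```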
In step 1320, the second feature and the topic vector are input into the second RNN model to determine the word selection probability distribution of each word in the vocabulary. For example, suppose there are two words a and b in the vocabulary, and the word selection probability distribution output by the second RNN model for the current word is 0.6 and 0.4; then a has a 60% probability of being selected as the current word, and b a 40% probability. Fusing the multi-label classification features with the associated topic vectors in this way to determine the word selection probability distribution makes the generated sentence descriptions more logically coherent, improving the accuracy of the description.
After the word selection probability distribution is determined, the sentence descriptions and the paragraph description can be generated by step 140 in fig. 1.
In step 140, the word with the highest probability in the word selection probability distribution is selected, and sentence descriptions of each target picture are generated, and the sentence descriptions of each target picture constitute paragraph descriptions of the picture stream.
In some embodiments, the encoded information of the first target picture is taken as the initial state of the second RNN model, the attribute features of the first target picture are taken as the input at the first moment, and the output at the first moment is computed; a preset sentence start symbol is then taken as the input at the second moment, the word selection probability distribution for the first word of the sentence description is output, and the word with the highest probability is taken as the generated first word.
The first word is then taken as the input of the second RNN model at the third moment, the word selection probability distribution for the second word is output, and the word with the highest probability is taken as the generated second word; and so on, until an end symbol is generated or the maximum sentence length is reached. The generated words are connected to form the sentence description of the first target picture.
Sentence descriptions of the other target pictures can be generated by the same process, and all sentence descriptions are connected to form the paragraph description of the picture stream.
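A minimal sketch of this greedy decoding loop (continuing the sketches above; the `SentenceRNN` layout, token ids, and maximum length are assumptions):

```python
class SentenceRNN(nn.Module):
    """Illustrative second RNN; the layer sizes and layout are assumptions."""
    def __init__(self, attr_dim=1000, vocab_size=10000, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attr_proj = nn.Linear(attr_dim, embed_dim)  # attribute features as first input
        self.cell = nn.GRUCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)     # fully connected layer to vocab size

def greedy_decode(dec, topic_vector, attrs, bos_id=1, eos_id=2, max_len=20):
    h = topic_vector.unsqueeze(0)                        # encoded info as initial state
    h = dec.cell(dec.attr_proj(attrs).unsqueeze(0), h)   # first moment: attribute features
    word, words = torch.tensor([bos_id]), []             # second moment: start symbol
    for _ in range(max_len):
        h = dec.cell(dec.embed(word), h)
        probs = torch.softmax(dec.out(h), dim=-1)        # word selection distribution
        word = probs.argmax(dim=-1)                      # greedy: most probable word
        if word.item() == eos_id:
            break
        words.append(word.item())
    return words                                         # word ids of the sentence description
```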
In some embodiments, the generator of the GAN model may be pre-trained with pre-labeled word selection probability distributions, in which the probability of the pre-labeled word is set to 1 and the probabilities of all other words are set to 0. For example, the attribute features of the target picture and the encoded information output by the first RNN model are input into the second RNN model; the hidden-layer output of the second RNN model at each moment is mapped to the dimension of the vocabulary through a fully connected layer, giving the word selection probability distribution over the whole vocabulary; the cross-entropy loss between the obtained distribution and the labeled distribution is computed; and the parameters of the generator are updated by back propagation until the generator reaches the optimum of the pre-training stage, at which point training stops.
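Since the labeled distribution is one-hot, the cross entropy against it reduces to standard cross entropy against the annotated word ids; one pre-training step could therefore be sketched as follows (teacher forcing is an assumption):

```python
criterion = nn.CrossEntropyLoss()

def pretrain_step(dec, optimizer, hidden_states, target_ids):
    # hidden_states: (T, hidden_dim) hidden-layer outputs of the second RNN
    # target_ids:    (T,) pre-labeled word ids
    logits = dec.out(hidden_states)       # fully connected layer to vocabulary dimensions
    loss = criterion(logits, target_ids)  # cross entropy vs. the labeled distribution
    optimizer.zero_grad()
    loss.backward()                       # back propagation
    optimizer.step()
    return loss.item()
```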
In some embodiments, the discriminators of the GAN model include a first classifier and a second classifier. The first classifier is used for judging whether the sentence description accords with the actual content of the corresponding target picture; the second classifier is used for judging whether the paragraph description accords with the language style of the pre-marked paragraph sample.
In some embodiments, both the first classifier and the second classifier may be implemented with deep neural networks. The first classifier outputs the probability that a sentence description belongs to the positive or negative category, and the second classifier does the same for a paragraph description; that is, the outputs measure the degree to which the sentence description matches the picture content and the degree to which the paragraph description matches the habits of human expression.
For example, some paragraph samples may be pre-labeled for determining whether the generated paragraph descriptions conform to the habits of human expression. The language style of the paragraph samples conforms to those habits; for example, it may feature varied forms of expression, a rich vocabulary, and logical consistency between preceding and following sentences.
In some embodiments, the discriminators of the GAN model may be trained with sentence descriptions and paragraph descriptions generated by the generator of the GAN model as negative samples, and pre-labeled sentence samples and paragraph samples as positive samples.
For example, a batch of sentence descriptions and paragraph descriptions generated by the pre-trained generator are used as negative samples, and a batch of sentence samples and paragraph samples selected from the labeled data are used as positive samples, to train the first classifier and the second classifier.
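One discriminator update under such positive/negative batches could look as follows; `clf` stands for either classifier and is assumed to map an already-encoded sample to a positive-category probability in (0, 1):

```python
bce = nn.BCELoss()

def discriminator_step(clf, optimizer, pos_batch, neg_batch):
    pos_prob = clf(pos_batch)             # labeled samples: should approach 1
    neg_prob = clf(neg_batch)             # generator outputs: should approach 0
    loss = bce(pos_prob, torch.ones_like(pos_prob)) + \
           bce(neg_prob, torch.zeros_like(neg_prob))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```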
In some embodiments, the trained discriminator may then be used to formally train the generator. For example, this may be achieved by the embodiment of fig. 3.
Fig. 3 illustrates a flow chart of some embodiments of the GAN model generator training method of the present disclosure.
As shown in fig. 3, the method includes: step 310, generating each training sample sentence; step 320, determining a positive class probability; step 330, determining a first weight of each word; step 340, updating the generator of the GAN model.
In step 310, the words in the vocabulary are sampled according to the words and the word selection probability distribution in the sentence description, and each training sample sentence corresponding to each word is generated. That is, after the sentence descriptions of the target pictures are generated, a corresponding training sample sentence may be generated based on each word in the sentence descriptions.
In some embodiments, suppose the generated sentence description has 20 words. For each N (an integer from 1 to 20), the first N words of the sentence description are fixed, the word selection probability distribution of the next word is computed based on the fixed words, and the next word is determined by sampling the vocabulary according to that distribution; this continues until the entire training sample sentence is completed (when N is 20, the sentence description itself may be used directly as the training sample sentence).
Here, sampling according to the word selection probability distribution means that words are drawn at random from the vocabulary following that distribution, rather than the word with the highest probability always being selected. For example, suppose there are two words a and b in the vocabulary, and the word selection probability distribution output by the second RNN model for the current word is 0.6 and 0.4; then a is sampled as the target word with a probability of 60% and b with a probability of 40%, instead of a being selected directly as the target word.
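In code terms, this sampling step is a single multinomial draw, as in this two-word sketch:

```python
import torch

probs = torch.tensor([0.6, 0.4])               # word selection distribution over {a, b}
idx = torch.multinomial(probs, num_samples=1)  # index 0 (a) about 60% of the time
```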
In this way, every word in a sentence description generated by the generator has a training sample sentence generated from that word alone, and the discriminator computes a weight for that training sample sentence, which yields a weight for the corresponding word that evaluates the correctness of the word. Compared with training the generator on the correctness of the sentence description as a whole, taking per-word weights as the basis for training enables finer and more accurate training, thereby improving the accuracy of the description.
In some embodiments, step 310 may be implemented by the embodiment of fig. 4.
Fig. 4 shows a flow chart of some embodiments of step 310 in fig. 3.
As shown in fig. 4, step 310 may include: step 3110, determining words before the target word; step 3120, determining a word selection probability distribution; and step 3130, determining the word following the target word.
In step 3110, a word of the sentence description is selected as a target word, and words arranged in front of the target word in the sentence description are reserved as words at corresponding positions in the training sample sentence. For example, the first two words of the sentence description are fixed, i.e. the second word is selected as the target word, and the first two words of the training sample sentence are identical to the first two words of the sentence description.
In step 3120, the target word is input into the generator of the GAN model, and the word selection probability distribution of each word in the vocabulary is determined. For example, a second word is input into the generator, and a word selection probability distribution is determined.
In step 3130, the words in the vocabulary are sampled according to the word selection probability distribution to determine the words arranged after the target word in the training sample sentence.
For example, the third word of the training sample sentence is determined by sampling the vocabulary according to the word selection probability distribution. The third word is then input into the generator, which determines the fourth word by sampling; and so on, until all words of the training sample sentence arranged after the second word have been generated.
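Putting steps 3110 to 3130 together, a rollout could be sketched as follows (continuing the illustrative `SentenceRNN` above; the helper name and arguments are assumptions):

```python
def sample_training_sentence(dec, h_prefix, prefix_ids, target_id, eos_id=2, max_len=20):
    """Keep the words up to and including the target word, then sample the
    remaining words from the word selection distribution instead of argmax."""
    sentence = list(prefix_ids) + [target_id]      # fixed prefix from the description
    word, h = torch.tensor([target_id]), h_prefix  # target word as generator input
    while len(sentence) < max_len:
        h = dec.cell(dec.embed(word), h)
        probs = torch.softmax(dec.out(h), dim=-1)
        word = torch.multinomial(probs, num_samples=1).squeeze(0)  # sampled, not argmax
        if word.item() == eos_id:
            break
        sentence.append(word.item())
    return sentence
```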
After the training sample sentences corresponding to the words in the sentence description are generated, the generator can be formally trained according to the rest steps in fig. 3.
In step 320, each training sample sentence is input into the discriminator of the GAN model, and the positive category probability of each training sample sentence is determined. For example, after each training sample sentence is input into the first classifier, the first classifier outputs the classification probability distribution of the training sample sentence, and the positive category probability is recorded.
In step 330, a first weight of each word corresponding to each training sample sentence is determined based on the positive category probability of that training sample sentence. For example, when each word in the sentence description was determined, its word selection probability was obtained; that probability may be multiplied by the positive category probability of the training sample sentence generated from the word, and the product used as the first weight of the word.
In step 340, the generator of the GAN model is updated according to the first weight.
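Reading the first weight as a per-word reward, the update of step 340 admits a policy-gradient style sketch (this REINFORCE-like interpretation is an assumption, not stated in the disclosure):

```python
def generator_step(optimizer, log_probs, first_weights):
    # log_probs:     (T,) log word selection probability of each generated word
    # first_weights: (T,) word probability x positive category probability
    loss = -(first_weights.detach() * log_probs).sum()  # reinforce highly weighted words
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```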
Fig. 5 illustrates a flow chart of further embodiments of the GAN model generator training methods of the present disclosure.
As shown in fig. 5, the method includes: step 510, generating each training sample paragraph; step 520, determining a positive class probability; step 530, determining a second weight of each word; step 540, update the generator of the GAN model.
In step 510, a training sample paragraph corresponding to each word is generated from each training sample sentence.
In some embodiments, one training sample sentence is selected as the target sentence. For example, a training sample sentence generated from the target word may be selected as the target sentence, and the position of the target sentence in the paragraph description recorded. The target sentence is then used as the sentence at the corresponding position in the training sample paragraph. Finally, the sentence descriptions arranged in front of and behind the target sentence in the paragraph description are used as the sentences at the corresponding positions in the training sample paragraph.
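The assembly of a training sample paragraph is then a simple splice; a sketch:

```python
def build_training_paragraph(paragraph_sentences, target_pos, training_sentence):
    # Replace the sentence at the target position with the sampled training
    # sample sentence; all other sentence descriptions keep their positions.
    paragraph = list(paragraph_sentences)
    paragraph[target_pos] = training_sentence
    return paragraph
```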
In step 520, the training sample paragraphs are input into the discriminator of the GAN model, and the positive category probability of each training sample paragraph is determined. For example, after each training sample paragraph is input into the second classifier, the second classifier outputs the classification probability distribution of the training sample paragraph, and the positive category probability is recorded.
In step 530, a second weight for each word corresponding to each training sample paragraph is determined based on the positive category probabilities for each training sample paragraph.
In step 540, the generator of the GAN model is updated according to the second weight.
In some embodiments, the degree of matching between the generated paragraph description and the pre-labeled paragraph sample may be calculated based on the weights of the sentence descriptions and natural language processing evaluation metrics (e.g., METEOR, CIDEr).
When the matching degree is low, an adversarial learning strategy may be adopted to train the generator and the discriminator iteratively and in competition: after each round of training the generator, the discriminator is trained, so that both achieve better results. For example, the discriminator of the GAN model is trained and updated by the method in any of the embodiments described above, and the generator of the GAN model is trained by the method in any of the embodiments described above, until the matching degree of the generated paragraph descriptions meets the requirement.
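At a high level, this alternation could be orchestrated as below, reusing the step functions sketched earlier; `sample_negatives`, `score_samples`, and `evaluate_matching` are hypothetical stand-ins (the last for the METEOR/CIDEr computation), not real APIs:

```python
def adversarial_training(dec, clf, g_opt, d_opt, data, max_rounds=100, threshold=0.9):
    for _ in range(max_rounds):
        neg = sample_negatives(dec, data)                  # hypothetical: generator outputs
        discriminator_step(clf, d_opt, data.labeled, neg)  # update the discriminator first
        log_probs, weights = score_samples(dec, clf, neg)  # hypothetical: per-word weights
        generator_step(g_opt, log_probs, weights)          # weighted generator update
        if evaluate_matching(dec, data) >= threshold:      # hypothetical matching check
            break
```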
In the above embodiments, the features of the main target in each picture are extracted based on the multi-class classification model, and the features of multiple targets in each picture are extracted based on the multi-label classification model, so that the interrelations between the features of the picture stream and the targets can be deeply mined; an RNN then generates a sentence description for each picture, and the sentence descriptions finally form the paragraph description, avoiding the learning difficulties caused by long sentences. The paragraph description obtained in this way is more logically coherent and more consistent with the content of the picture stream, thereby improving the accuracy of the description.
Fig. 6 illustrates a block diagram of some embodiments of a generation apparatus of a picture description of the present disclosure.
As shown in fig. 6, the generation apparatus 6 of the picture description includes an extraction unit 61, a determination unit 62, and a generation unit 63.
The extracting unit 61 extracts a first feature of each target picture in the picture stream by the multi-class classification model, and extracts a second feature of each target picture by the multi-instance classification model or the multi-label classification model.
The determining unit 62 inputs the first feature and the second feature into a generator of the trained GAN model, and determines a word selection probability distribution of each word in the vocabulary. For example, the generator of the GAN model includes a first RNN model and a second RNN model.
In some embodiments, the determining unit 62 inputs the first feature into a first RNN model, generates a topic vector for each target picture, inputs the second feature and the topic vector into a second RNN model, and determines a word selection probability distribution for each word in the vocabulary.
The generating unit 63 selects a word with the highest probability in the word selection probability distribution, and generates a sentence description of each target picture. The sentence descriptions of the respective target pictures constitute a paragraph description of the picture stream.
In some embodiments, the discriminators of the GAN model include a first classifier and a second classifier. The first classifier is used for judging whether the sentence description accords with the actual content of the corresponding target picture, and the second classifier is used for judging whether the paragraph description accords with the language style of the pre-marked paragraph sample.
In some embodiments, the generation means 6 of the picture description further comprises a training unit 64. The training unit 64 trains the discriminators of the GAN model with the sentence descriptions and the paragraph descriptions generated by the generator of the GAN model as negative samples and with the pre-labeled sentence samples and the paragraph samples as positive samples.
In some embodiments, the generating unit 63 samples words in the vocabulary according to the word and the word selection probability distribution in the sentence description, and generates each training sample sentence corresponding to each word. The determining unit 62 inputs each training sample sentence into the discriminator of the GAN model, determines the positive class probability of each training sample sentence, and determines the first weight of each word corresponding to each training sample sentence based on the positive class probability. The training unit 64 updates the generator of the GAN model according to the first weight.
In some embodiments, the generation unit 63 selects one word of the sentence description as the target word. The generating unit 63 retains each word arranged in front of the target word in the sentence description as a word at a corresponding position in the training sample sentence; the generating unit 63 inputs the target word into a generator of the GAN model, and determines a word selection probability distribution of each word in the vocabulary; the generating unit 63 samples each word in the vocabulary according to the word selection probability distribution to determine each word arranged after the target word in the training sample sentence.
In some embodiments, the generating unit 63 generates a training sample paragraph corresponding to each word from each training sample sentence. The determining unit 62 inputs the training sample paragraphs to the discriminator of the GAN model, determines the positive class probability of each training sample paragraph, and determines the second weight of each word corresponding to each training sample paragraph based on the positive class probability of each training sample paragraph. The training unit 64 updates the generator of the GAN model according to the second weight.
In some embodiments, the generating unit 63 selects one training sample sentence as the target sentence, the target sentence as the sentence at the corresponding position in the training sample paragraph, and each of the sentence descriptions arranged before and after the target sentence as the sentence at the corresponding position in the training sample paragraph.
In the above embodiments, the features of the main target in each picture are extracted based on the multi-class classification model, and the features of multiple targets in each picture are extracted based on the multi-label classification model, so that the interrelations between the features of the picture stream and the targets can be deeply mined; an RNN then generates a sentence description for each picture, and the sentence descriptions finally form the paragraph description, avoiding the learning difficulties caused by long sentences. The paragraph description obtained in this way is more logically coherent and more consistent with the content of the picture stream, thereby improving the accuracy of the description.
Fig. 7 shows a block diagram of further embodiments of a generation apparatus of a picture description of the present disclosure.
As shown in fig. 7, the picture description generating apparatus 7 of this embodiment includes: a memory 71 and a processor 72 coupled to the memory 71, the processor 72 being configured to perform the method of generating a picture description in any one of the embodiments of the present disclosure based on instructions stored in the memory 71.
The memory 71 may include, for example, a system memory, a fixed nonvolatile storage medium, and the like. The system memory stores, for example, an operating system, application programs, a boot loader, a database, and other programs.
Fig. 8 shows a block diagram of still further embodiments of a generation apparatus of the picture descriptions of the present disclosure.
As shown in fig. 8, the picture description generating apparatus 8 of this embodiment includes: a memory 810 and a processor 820 coupled to the memory 810, the processor 820 being configured to perform the method of generating a picture description in any of the foregoing embodiments based on instructions stored in the memory 810.
Memory 810 may include, for example, system memory, a fixed nonvolatile storage medium, and the like. The system memory stores, for example, an operating system, application programs, a boot loader, and other programs.
The generation means 8 of the picture description may further comprise an input-output interface 830, a network interface 840, a storage interface 850, etc. These interfaces 830, 840, 850 and the memory 810 and the processor 820 may be connected by, for example, a bus 860. The input/output interface 830 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, a touch screen, and the like. The network interface 840 provides a connection interface for various networking devices. Storage interface 850 provides a connection interface for external storage devices such as SD cards, U-discs, and the like.
It will be appreciated by those skilled in the art that embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
Thus far, the method of generating a picture description, the device for generating a picture description, and the computer-readable storage medium according to the present disclosure have been described in detail. In order to avoid obscuring the concepts of the present disclosure, some details known in the art are not described. How to implement the solutions disclosed herein will be fully apparent to those skilled in the art from the above description.
The methods and systems of the present disclosure may be implemented in a number of ways. For example, the methods and systems of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, firmware. The above-described sequence of steps for the method is for illustration only, and the steps of the method of the present disclosure are not limited to the sequence specifically described above unless specifically stated otherwise. Furthermore, in some embodiments, the present disclosure may also be implemented as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
Although some specific embodiments of the present disclosure have been described in detail by way of example, it should be understood by those skilled in the art that the above examples are for illustration only and are not intended to limit the scope of the present disclosure. It will be appreciated by those skilled in the art that modifications may be made to the above embodiments without departing from the scope and spirit of the disclosure. The scope of the present disclosure is defined by the appended claims.

Claims (11)

1. A method of generating a picture description, comprising:
extracting first characteristics of each target picture in a picture stream through a multi-class classification model, wherein in a picture classification result of the multi-class classification model, one picture belongs to and only belongs to one of a plurality of first classes, and the plurality of first classes have mutual exclusion relations;
extracting second characteristics of each target picture through a multi-instance classification model or a multi-label classification model, wherein in a picture classification result of the multi-label classification model, one picture belongs to a plurality of second categories, and the plurality of second categories have association relations;
inputting the first characteristic and the second characteristic into a generator of a trained generative adversarial network (GAN) model, and determining a word selection probability distribution of each word in a word list;
and selecting the word with the highest probability in the word selection probability distribution, and generating sentence descriptions of each target picture, wherein the sentence descriptions of each target picture form paragraph descriptions of the picture stream.
2. The method of generating according to claim 1, wherein,
the generator of the GAN model includes a first RNN model and a second RNN model;
the determining the word selection probability distribution of each word in the word list comprises the following steps:
inputting the first feature into the first RNN model to generate a theme vector of each target picture;
and inputting the second feature and the topic vector into the second RNN model, and determining the word selection probability distribution of each word in the word list.
3. The method of generating according to claim 1, wherein,
the discriminator of the GAN model comprises a first classifier and a second classifier, wherein the first classifier is used for judging whether the sentence description accords with the actual content of the corresponding target picture, and the second classifier is used for judging whether the paragraph description accords with the language style of the pre-marked paragraph sample.
4. The generation method according to claim 1, further comprising:
and training a discriminator of the GAN model by taking sentence descriptions and paragraph descriptions generated by a generator of the GAN model as negative samples and taking pre-labeled sentence samples and paragraph samples as positive samples.
5. The generation method according to any one of claims 1 to 4, further comprising:
according to each word in the sentence description and the word selection probability distribution, sampling the words in the word list to generate each training sample sentence corresponding to each word;
inputting each training sample sentence into a discriminator of the GAN model, and determining the positive category probability of each training sample sentence;
determining a first weight of each word corresponding to each training sample sentence according to the positive category probability of each training sample sentence;
and updating a generator of the GAN model according to the first weight of each word.
6. The generating method according to claim 5, wherein the generating each training sample sentence corresponding to each word includes:
selecting one word of the sentence description as a target word, and reserving each word arranged in front of the target word in the sentence description as a word at a corresponding position in the training sample sentence;
inputting the target word into a generator of the GAN model, and determining word selection probability distribution of each word in the word list;
and according to the word selection probability distribution, each word in the word list is sampled to determine each word arranged behind the target word in the training sample sentence.
7. The generation method according to claim 5, further comprising:
generating training sample paragraphs corresponding to the words according to the training sample sentences;
inputting the training sample paragraphs into a discriminator of the GAN model, and determining positive category probabilities of the training sample paragraphs;
determining a second weight of each word corresponding to each training sample paragraph according to the positive category probability of each training sample paragraph;
and updating the generator of the GAN model according to the second weight.
8. The generating method according to claim 7, wherein the generating each training sample paragraph corresponding to each word includes:
selecting one training sample sentence as a target sentence;
taking the target sentence as a sentence at a corresponding position in the training sample paragraph;
and taking each sentence description arranged in front of and behind the target sentence in the paragraph descriptions as a sentence at a corresponding position in the training sample paragraph.
9. A picture description generation apparatus, comprising:
the extraction unit is used for extracting first characteristics of each target picture in the picture stream through a multi-class classification model, extracting second characteristics of each target picture through a multi-instance classification model or a multi-label classification model, wherein one picture belongs to and only belongs to one of a plurality of first classes in a picture classification result of the multi-class classification model, the plurality of first classes have mutual exclusion relations, and one picture belongs to a plurality of second classes in a picture classification result of the multi-label classification model, and the plurality of second classes have association relations;
the determining unit is used for inputting the first characteristic and the second characteristic into a generator of a trained generative adversarial network (GAN) model and determining the word selection probability distribution of each word in the word list;
and the generating unit is used for selecting the word with the highest probability in the word selection probability distribution, generating the sentence description of each target picture, and forming the paragraph description of the picture stream by the sentence description of each target picture.
10. A picture description generation apparatus, comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the method of generating a picture description of any of claims 1-8 based on instructions stored in the memory.
11. A computer readable storage medium having stored thereon a computer program which when executed by a processor implements the method of generating a picture description according to any of claims 1-8.
CN201910078978.2A 2019-01-28 2019-01-28 Picture description generation method, device and computer readable storage medium Active CN111488473B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910078978.2A CN111488473B (en) 2019-01-28 2019-01-28 Picture description generation method, device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910078978.2A CN111488473B (en) 2019-01-28 2019-01-28 Picture description generation method, device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111488473A (en) 2020-08-04
CN111488473B (en) 2023-11-07

Family

ID=71794049

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910078978.2A Active CN111488473B (en) 2019-01-28 2019-01-28 Picture description generation method, device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111488473B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114358203B (en) * 2022-01-11 2024-09-27 平安科技(深圳)有限公司 Training method and device for image description sentence generation module and electronic equipment
US20240143936A1 (en) * 2022-10-31 2024-05-02 Zoom Video Communications, Inc. Intelligent prediction of next step sentences from a communication session

Citations (3)

Publication number Priority date Publication date Assignee Title
CN106980867A (en) * 2016-01-15 2017-07-25 奥多比公司 Semantic concept in embedded space is modeled as distribution
CN107330444A (en) * 2017-05-27 2017-11-07 苏州科技大学 A kind of image autotext mask method based on generation confrontation network
CN108648746A (en) * 2018-05-15 2018-10-12 南京航空航天大学 A kind of open field video natural language description generation method based on multi-modal Fusion Features

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US20090290802A1 (en) * 2008-05-22 2009-11-26 Microsoft Corporation Concurrent multiple-instance learning for image categorization
US8249366B2 (en) * 2008-06-16 2012-08-21 Microsoft Corporation Multi-label multi-instance learning for image classification

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
CN106980867A (en) * 2016-01-15 2017-07-25 奥多比公司 Semantic concept in embedded space is modeled as distribution
CN107330444A (en) * 2017-05-27 2017-11-07 苏州科技大学 A kind of image autotext mask method based on generation confrontation network
CN108648746A (en) * 2018-05-15 2018-10-12 南京航空航天大学 A kind of open field video natural language description generation method based on multi-modal Fusion Features

Non-Patent Citations (1)

Title
Chen Hongjun et al. Picture description method and optimization based on CNN-RNN deep learning. Natural Science Journal of Xiangtan University, 2018, 40(2), full text. *

Also Published As

Publication number Publication date
CN111488473A (en) 2020-08-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant