CN111488473B - Picture description generation method, device and computer readable storage medium - Google Patents

Info

Publication number
CN111488473B
Authority
CN
China
Prior art keywords
word
sentence
picture
training sample
description
Prior art date
Legal status
Active
Application number
CN201910078978.2A
Other languages
Chinese (zh)
Other versions
CN111488473A (en)
Inventor
王晶
梅涛
Current Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201910078978.2A
Publication of CN111488473A
Application granted
Publication of CN111488473B
Legal status: Active

Classifications

    • G: Physics
    • G06: Computing; Calculating or Counting
    • G06F: Electric Digital Data Processing
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: Physics
    • G06: Computing; Calculating or Counting
    • G06N: Computing Arrangements Based on Specific Computational Models
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Abstract

The disclosure relates to a method, a device, and a computer-readable storage medium for generating picture descriptions, and relates to the technical field of artificial intelligence. The method comprises the following steps: extracting a first feature of each target picture in a picture stream through a multi-class classification model; extracting a second feature of each target picture through a multi-instance classification model or a multi-label classification model; inputting the first feature and the second feature into a generator of a trained GAN model, and determining a word selection probability distribution of each word in a vocabulary; and selecting the word with the highest probability in the word selection probability distribution to generate a sentence description of each target picture, the sentence descriptions of the target pictures forming a paragraph description of the picture stream. The technical solution of the disclosure can improve the accuracy of the description.

Description

Picture description generation method, device and computer readable storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a method for generating a picture description, an apparatus for generating a picture description, and a computer readable storage medium.
Background
Generating a story description (i.e., a paragraph description) composed of multiple sentences from an ordered picture stream given as input is one of the important tasks in the fields of computer vision and natural language processing.
In the related art, a story description may be generated by retrieval, i.e., a suitable story description is found for the picture stream from an existing dataset; alternatively, the entire story description may be learned as one long sentence based on an RNN (Recurrent Neural Network).
Disclosure of Invention
The inventors of the present disclosure found that the above-described related art has the following problems: the retrieval-based approach can only select from story descriptions that already exist in the dataset; the RNN-based approach of generating the entire description as one long sentence suffers from learning difficulties, resulting in inaccurate descriptions.
In view of this, the present disclosure proposes a technical solution for generating a picture description, which can improve the accuracy of the description.
According to some embodiments of the present disclosure, there is provided a method for generating a picture description, comprising: extracting a first feature of each target picture in a picture stream through a multi-class classification model; extracting a second feature of each target picture through a multi-instance classification model or a multi-label classification model; inputting the first feature and the second feature into a generator of a trained GAN (Generative Adversarial Network) model, and determining a word selection probability distribution of each word in a vocabulary; and selecting the word with the highest probability in the word selection probability distribution to generate a sentence description of each target picture, the sentence descriptions of the target pictures forming a paragraph description of the picture stream.
In some embodiments, the generator of the GAN model includes a first RNN model and a second RNN model; the determining the word selection probability distribution of each word in the word list comprises the following steps: inputting the first feature into the first RNN model to generate a theme vector of each target picture; and inputting the second feature and the topic vector into the second RNN model, and determining the word selection probability distribution of each word in the word list.
In some embodiments, the discriminator of the GAN model includes a first classifier for determining whether the sentence descriptions conform to the actual content of the corresponding target picture, and a second classifier for determining whether the paragraph descriptions conform to the language style of the pre-labeled paragraph sample.
In some embodiments, the discriminators of the GAN model are trained with sentence descriptions and paragraph descriptions generated by the generator of the GAN model as negative samples and pre-labeled sentence samples and paragraph samples as positive samples.
In some embodiments, the method further comprises: sampling words in the vocabulary according to each word in the sentence description and the word selection probability distribution, and generating a training sample sentence corresponding to each word; inputting each training sample sentence into a discriminator of the GAN model, and determining the positive category probability of each training sample sentence; determining a first weight of each word corresponding to each training sample sentence according to the positive category probability; and updating the generator of the GAN model according to the first weight of each word.
In some embodiments, generating the training sample sentences comprises: selecting a word of the sentence description as a target word, and retaining each word of the sentence description arranged in front of the target word as the word at the corresponding position in the training sample sentence; inputting the target word into the generator of the GAN model, and determining the word selection probability distribution of each word in the vocabulary; and sampling the words in the vocabulary according to the word selection probability distribution to determine each word arranged behind the target word in the training sample sentence.
In some embodiments, the method further comprises: generating a training sample paragraph corresponding to each word according to each training sample sentence; inputting the training sample paragraphs into the discriminator of the GAN model, and determining the positive category probability of each training sample paragraph; determining a second weight of each word corresponding to each training sample paragraph according to the positive category probability of each training sample paragraph; and updating the generator of the GAN model according to the second weight.
In some embodiments, generating the training sample paragraphs comprises: selecting one of the training sample sentences as a target sentence; taking the target sentence as the sentence at the corresponding position in the training sample paragraph; and taking each sentence description arranged in front of and behind the target sentence in the paragraph description as the sentence at the corresponding position in the training sample paragraph.
According to other embodiments of the present disclosure, there is provided a generation apparatus of a picture description, including: the extraction unit is used for extracting first characteristics of each target picture in the picture stream through a multi-class classification model, and extracting second characteristics of each target picture through a multi-instance classification model or a multi-label classification model; the determining unit is used for inputting the first feature and the second feature into a trained generator of the GAN model and determining word selection probability distribution of each word in the word list; and the generating unit is used for selecting the word with the highest probability in the word selection probability distribution, generating the sentence description of each target picture, and forming the paragraph description of the picture stream by the sentence description of each target picture.
In some embodiments, the generator of the GAN model includes a first RNN model and a second RNN model; the determining unit inputs the first feature into the first RNN model, generates a theme vector of each target picture, inputs the second feature and the theme vector into the second RNN model, and determines word selection probability distribution of each word in the word list.
In some embodiments, the discriminator of the GAN model includes a first classifier for determining whether the sentence descriptions conform to the actual content of the corresponding target picture, and a second classifier for determining whether the paragraph descriptions conform to the language style of the pre-labeled paragraph sample.
In some embodiments, the generating device further includes: and the training unit is used for training the discriminator of the GAN model by taking the sentence description and the paragraph description generated by the generator of the GAN model as negative samples and taking the pre-marked sentence sample and the paragraph sample as positive samples.
In some embodiments, the generating unit samples words in the vocabulary according to each word in the sentence description and the word selection probability distribution, and generates each training sample sentence corresponding to each word; the determining unit inputs each training sample sentence into a discriminator of the GAN model, determines positive category probability of each training sample sentence, and determines a first weight of each word corresponding to each training sample sentence according to the positive category probability; and the training unit is used for updating the generator of the GAN model according to the first weight of each word.
In some embodiments, the generating unit selects one word of the sentence description as a target word, retains each word of the sentence description arranged in front of the target word as the word at the corresponding position in the training sample sentence, inputs the target word into the generator of the GAN model, determines the word selection probability distribution of each word in the vocabulary, and samples the words in the vocabulary according to the word selection probability distribution to determine each word arranged behind the target word in the training sample sentence.
In some embodiments, the generating unit generates a training sample paragraph corresponding to each word according to each training sample sentence; the determining unit inputs the training sample paragraphs into a discriminator of the GAN model, determines positive category probabilities of the training sample paragraphs, and determines second weights of words corresponding to the training sample paragraphs according to the positive category probabilities of the training sample paragraphs; and the training unit updates the generator of the GAN model according to the second weight.
In some embodiments, the generating unit selects one of the training sample sentences as a target sentence, the target sentence is used as a sentence at a corresponding position in the training sample paragraph, and each of the sentence descriptions arranged before and after the target sentence is used as a sentence at a corresponding position in the training sample paragraph.
According to still further embodiments of the present disclosure, there is provided a generation apparatus of a picture description, comprising: a memory; and a processor coupled to the memory, the processor configured to perform the method of generating a picture description in any of the embodiments described above based on instructions stored in the memory.
According to still further embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of generating a picture description in any of the above embodiments.
In the above embodiments, the features of the main target in each picture are extracted based on the multi-class classification model, and the features of multiple targets in each picture are extracted based on the multi-label classification model, so that the interrelations between the features of the picture stream and the targets can be deeply mined; an RNN then generates a sentence description for each picture, and the sentence descriptions finally form the paragraph description, avoiding the learning difficulties caused by long sentences. The paragraph description obtained in this way is more logically coherent and more consistent with the content of the picture stream, thereby improving the accuracy of the description.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The disclosure may be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 illustrates a flow chart of some embodiments of a method of generating a picture description of the present disclosure;
FIG. 2 illustrates a flow chart of some embodiments of step 130 of FIG. 1;
FIG. 3 illustrates a flow chart of some embodiments of a GAN model generator training method of the present disclosure;
FIG. 4 illustrates a flow chart of some embodiments of step 310 in FIG. 3;
fig. 5 illustrates a flow chart of further embodiments of the GAN model generator training method of the present disclosure;
FIG. 6 illustrates a block diagram of some embodiments of a generation apparatus of a picture description of the present disclosure;
FIG. 7 shows a block diagram of further embodiments of a generation apparatus of a picture description of the present disclosure;
fig. 8 shows a block diagram of still further embodiments of a generation apparatus of the picture descriptions of the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless it is specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective parts shown in the drawings are not drawn in actual scale for convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but should be considered part of the specification where appropriate.
In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
Fig. 1 illustrates a flow chart of some embodiments of a method of generating a picture description of the present disclosure.
As shown in fig. 1, the method includes: step 110, extracting first characteristics of each target picture; step 120, extracting a second feature of each target picture; step 130, determining a word selection probability distribution; step 140, generating sentence descriptions and paragraph descriptions.
In step 110, a first feature of each target picture in the picture stream is extracted by a multi-class classification model. For example, the picture stream may be an ordered plurality of pictures. In a multi-class classification model, a sample belongs to one and only one of a plurality of classes, and the different classes are mutually exclusive.
In some embodiments, a CNN (Convolutional Neural Network) within the multi-class classification model is used to extract depth features of each target picture as the first features, so that the main target in each target picture can be captured.
In step 120, a second feature of each target picture is extracted by a multi-instance classification (multi-instance learning) model or a multi-label classification model. In the multi-label classification model, a sample can belong to a plurality of categories (or labels), and the different categories are related to one another, so that features of multiple targets in each target picture can be extracted to obtain the attribute features of the picture, i.e., the second features. Multiple targets can thus be identified, the generated sentence descriptions conform more closely to the picture content, and logical continuity between sentences in the final paragraph description is facilitated, improving the accuracy of the description.
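By way of illustration only, the feature extraction of steps 110 and 120 could be sketched as follows in PyTorch (an assumed framework, not part of the disclosure); the ResNet-152 backbone, the number of attribute labels, and the linear multi-label head are illustrative assumptions, and a real multi-label head would be trained separately with per-label sigmoid supervision.

```python
# Minimal sketch of steps 110 and 120; assumes torchvision >= 0.13.
import torch
import torch.nn as nn
import torchvision.models as models

backbone = models.resnet152(weights=models.ResNet152_Weights.DEFAULT)
backbone.fc = nn.Identity()           # expose the 2048-d penultimate ("depth") features
backbone.eval()

NUM_LABELS = 1000                     # assumed size of the attribute-label set
multi_label_head = nn.Linear(2048, NUM_LABELS)  # illustrative, untrained placeholder

def extract_features(pictures):       # pictures: (N, 3, 224, 224), normalized
    with torch.no_grad():
        depth = backbone(pictures)                      # first features, (N, 2048)
        attrs = torch.sigmoid(multi_label_head(depth))  # second (attribute) features
    return depth, attrs
```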
In step 130, the first feature and the second feature are input into a generator of a trained GAN model to determine the word selection probability distribution of each word in the vocabulary. The GAN model is composed of a generator and a discriminator; the generator may be, for example, an RNN model for generating the sentence descriptions and paragraph descriptions of the pictures.
In some embodiments, the generator of the GAN model includes a first RNN model and a second RNN model. For example, step 130 may be implemented by the embodiment of fig. 2.
Fig. 2 shows a flow chart of some embodiments of step 130 in fig. 1.
As shown in fig. 2, step 130 may include: step 1310, generating a theme vector of each target picture; and step 1320, determining a word choice probability distribution.
In step 1310, a first feature is input into a first RNN model, and a topic vector (topic vector) for each target picture is generated.
In some embodiments, the depth features are input into the first RNN model in sequence to encode the information of each picture, yielding a series of encoded information, i.e., the topic vectors. For example, at the first moment, the first RNN model takes the depth features of the first picture as input and outputs a network state that serves as input at the next moment; at each subsequent moment, the first RNN model takes the depth features of the corresponding picture and the network state output at the previous moment as inputs, and outputs the network state of the current moment. The network state output at each moment is one piece of encoded information, and each piece of encoded information corresponds to one sentence description.
The first RNN model thus fuses the first features of the pictures in the picture stream to obtain the topic vector of each picture. Each topic vector is equivalent to the central idea of the sentence description of the corresponding picture, and the topic vectors generated in this way are associated with one another, so that the sentence descriptions in the final paragraph description are logically coherent, improving accuracy.
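For illustration, the behavior of the first RNN model described above might be sketched as follows (the GRU cell type and dimensions are assumptions):

```python
import torch
import torch.nn as nn

class TopicRNN(nn.Module):
    """Illustrative first RNN: consumes the depth features of one picture
    stream in order and emits the network state at each moment as the topic
    vector of the corresponding picture."""
    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)

    def forward(self, depth_feats):                     # (num_pics, feat_dim)
        states, _ = self.gru(depth_feats.unsqueeze(0))  # state at every moment
        return states.squeeze(0)                        # (num_pics, hidden_dim)
```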
In step 1320, the second feature and the topic vector are input into the second RNN model to determine the word selection probability distribution of each word in the vocabulary. For example, suppose there are two words a and b in the vocabulary, and the word selection probability distribution output by the second RNN model for the current word is 0.6 and 0.4; then a has a 60% probability of being selected as the current word, and b a 40% probability. Fusing the multi-label classification features with the associated topic vectors in this way to determine the word selection probability distribution makes the generated sentence descriptions more logically coherent, improving the accuracy of the description.
After the word selection probability distribution is determined, the sentence descriptions and the paragraph description can be generated by step 140 in fig. 1.
In step 140, the word with the highest probability in the word selection probability distribution is selected, and sentence descriptions of each target picture are generated, and the sentence descriptions of each target picture constitute paragraph descriptions of the picture stream.
In some embodiments, the encoded information of the first target picture is taken as the initial state of the second RNN model, the attribute features of the first target picture are taken as the input at the first moment, and the output at the first moment is computed; a preset sentence start symbol is then taken as the input at the second moment, the word selection probability distribution for the first word of the sentence description is output, and the word with the highest probability is taken as the generated first word.
The first word is then taken as the input of the second RNN model at the third moment, the word selection probability distribution for the second word is output, and the word with the highest probability is taken as the generated second word; and so on, until an end symbol is generated or the maximum sentence length is reached. The generated words are connected to form the sentence description of the first target picture.
Sentence descriptions of the other target pictures can be generated by the same process, and all sentence descriptions are connected to form the paragraph description of the picture stream.
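A minimal sketch of this greedy decoding loop (continuing the sketches above; the `SentenceRNN` layout, token ids, and maximum length are assumptions):

```python
class SentenceRNN(nn.Module):
    """Illustrative second RNN; the layer sizes and layout are assumptions."""
    def __init__(self, attr_dim=1000, vocab_size=10000, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attr_proj = nn.Linear(attr_dim, embed_dim)  # attribute features as first input
        self.cell = nn.GRUCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)     # fully connected layer to vocab size

def greedy_decode(dec, topic_vector, attrs, bos_id=1, eos_id=2, max_len=20):
    h = topic_vector.unsqueeze(0)                        # encoded info as initial state
    h = dec.cell(dec.attr_proj(attrs).unsqueeze(0), h)   # first moment: attribute features
    word, words = torch.tensor([bos_id]), []             # second moment: start symbol
    for _ in range(max_len):
        h = dec.cell(dec.embed(word), h)
        probs = torch.softmax(dec.out(h), dim=-1)        # word selection distribution
        word = probs.argmax(dim=-1)                      # greedy: most probable word
        if word.item() == eos_id:
            break
        words.append(word.item())
    return words                                         # word ids of the sentence description
```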
In some embodiments, the generator of the GAN model may be pre-trained with pre-labeled word selection probability distributions, in which the probability of the pre-labeled word is set to 1 and the probabilities of all other words are set to 0. For example, the attribute features of the target picture and the encoded information output by the first RNN model are input into the second RNN model; the hidden-layer output of the second RNN model at each moment is mapped to the dimension of the vocabulary through a fully connected layer, giving the word selection probability distribution over the whole vocabulary; the cross-entropy loss between the obtained distribution and the labeled distribution is computed; and the parameters of the generator are updated by back propagation until the generator reaches the optimum of the pre-training stage, at which point training stops.
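Since the labeled distribution is one-hot, the cross entropy against it reduces to standard cross entropy against the annotated word ids; one pre-training step could therefore be sketched as follows (teacher forcing is an assumption):

```python
criterion = nn.CrossEntropyLoss()

def pretrain_step(dec, optimizer, hidden_states, target_ids):
    # hidden_states: (T, hidden_dim) hidden-layer outputs of the second RNN
    # target_ids:    (T,) pre-labeled word ids
    logits = dec.out(hidden_states)       # fully connected layer to vocabulary dimensions
    loss = criterion(logits, target_ids)  # cross entropy vs. the labeled distribution
    optimizer.zero_grad()
    loss.backward()                       # back propagation
    optimizer.step()
    return loss.item()
```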
In some embodiments, the discriminators of the GAN model include a first classifier and a second classifier. The first classifier is used for judging whether the sentence description accords with the actual content of the corresponding target picture; the second classifier is used for judging whether the paragraph description accords with the language style of the pre-marked paragraph sample.
In some embodiments, both the first classifier and the second classifier may be implemented with deep neural networks. The first classifier outputs the probability that a sentence description belongs to the positive or negative category, and the second classifier does the same for a paragraph description; that is, the outputs measure the degree to which the sentence description matches the picture content and the degree to which the paragraph description matches the habits of human expression.
For example, some paragraph samples may be pre-labeled for determining whether the generated paragraph descriptions conform to the habits of human expression. The language style of the paragraph samples conforms to those habits; for example, it may feature varied forms of expression, a rich vocabulary, and logical consistency between preceding and following sentences.
In some embodiments, the discriminators of the GAN model may be trained with sentence descriptions and paragraph descriptions generated by the generator of the GAN model as negative samples, and pre-labeled sentence samples and paragraph samples as positive samples.
For example, a batch of sentence descriptions and paragraph descriptions generated by the pre-trained generator are used as negative samples, and a batch of sentence samples and paragraph samples selected from the labeled data are used as positive samples, to train the first classifier and the second classifier.
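One discriminator update under such positive/negative batches could look as follows; `clf` stands for either classifier and is assumed to map an already-encoded sample to a positive-category probability in (0, 1):

```python
bce = nn.BCELoss()

def discriminator_step(clf, optimizer, pos_batch, neg_batch):
    pos_prob = clf(pos_batch)             # labeled samples: should approach 1
    neg_prob = clf(neg_batch)             # generator outputs: should approach 0
    loss = bce(pos_prob, torch.ones_like(pos_prob)) + \
           bce(neg_prob, torch.zeros_like(neg_prob))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```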
In some embodiments, the trained discriminator may then be used to formally train the generator. For example, this may be achieved by the embodiment of fig. 3.
Fig. 3 illustrates a flow chart of some embodiments of the GAN model generator training method of the present disclosure.
As shown in fig. 3, the method includes: step 310, generating each training sample sentence; step 320, determining a positive class probability; step 330, determining a first weight of each word; step 340, updating the generator of the GAN model.
In step 310, the words in the vocabulary are sampled according to the words and the word selection probability distribution in the sentence description, and each training sample sentence corresponding to each word is generated. That is, after the sentence descriptions of the target pictures are generated, a corresponding training sample sentence may be generated based on each word in the sentence descriptions.
In some embodiments, suppose the generated sentence description has 20 words. For each N (an integer from 1 to 20), the first N words of the sentence description are fixed, the word selection probability distribution of the next word is computed based on the fixed words, and the next word is determined by sampling the vocabulary according to that distribution; this continues until the entire training sample sentence is completed (when N is 20, the sentence description itself may be used directly as the training sample sentence).
Here, sampling according to the word selection probability distribution means that words are drawn at random from the vocabulary following that distribution, rather than the word with the highest probability always being selected. For example, suppose there are two words a and b in the vocabulary, and the word selection probability distribution output by the second RNN model for the current word is 0.6 and 0.4; then a is sampled as the target word with a probability of 60% and b with a probability of 40%, instead of a being selected directly as the target word.
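In code terms, this sampling step is a single multinomial draw, as in this two-word sketch:

```python
import torch

probs = torch.tensor([0.6, 0.4])               # word selection distribution over {a, b}
idx = torch.multinomial(probs, num_samples=1)  # index 0 (a) about 60% of the time
```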
In this way, every word in a sentence description generated by the generator has a training sample sentence generated from that word alone, and the discriminator computes a weight for that training sample sentence, which yields a weight for the corresponding word that evaluates the correctness of the word. Compared with training the generator on the correctness of the sentence description as a whole, taking per-word weights as the basis for training enables finer and more accurate training, thereby improving the accuracy of the description.
In some embodiments, step 310 may be implemented by the embodiment of fig. 4.
Fig. 4 shows a flow chart of some embodiments of step 310 in fig. 3.
As shown in fig. 4, step 310 may include: step 3110, determining words before the target word; step 3120, determining a word selection probability distribution; and step 3130, determining the word following the target word.
In step 3110, a word of the sentence description is selected as a target word, and words arranged in front of the target word in the sentence description are reserved as words at corresponding positions in the training sample sentence. For example, the first two words of the sentence description are fixed, i.e. the second word is selected as the target word, and the first two words of the training sample sentence are identical to the first two words of the sentence description.
In step 3120, the target word is input into the generator of the GAN model, and the word selection probability distribution of each word in the vocabulary is determined. For example, a second word is input into the generator, and a word selection probability distribution is determined.
In step 3130, the words in the vocabulary are sampled according to the word selection probability distribution to determine the words arranged after the target word in the training sample sentence.
For example, the third word of the training sample sentence is determined by sampling the vocabulary according to the word selection probability distribution. The third word is then input into the generator, which determines the fourth word by sampling; and so on, until all words of the training sample sentence arranged after the second word have been generated.
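Putting steps 3110 to 3130 together, a rollout could be sketched as follows (continuing the illustrative `SentenceRNN` above; the helper name and arguments are assumptions):

```python
def sample_training_sentence(dec, h_prefix, prefix_ids, target_id, eos_id=2, max_len=20):
    """Keep the words up to and including the target word, then sample the
    remaining words from the word selection distribution instead of argmax."""
    sentence = list(prefix_ids) + [target_id]      # fixed prefix from the description
    word, h = torch.tensor([target_id]), h_prefix  # target word as generator input
    while len(sentence) < max_len:
        h = dec.cell(dec.embed(word), h)
        probs = torch.softmax(dec.out(h), dim=-1)
        word = torch.multinomial(probs, num_samples=1).squeeze(0)  # sampled, not argmax
        if word.item() == eos_id:
            break
        sentence.append(word.item())
    return sentence
```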
After the training sample sentences corresponding to the words in the sentence description are generated, the generator can be formally trained according to the rest steps in fig. 3.
In step 320, each training sample sentence is input into the discriminator of the GAN model, and the positive category probability of each training sample sentence is determined. For example, after each training sample sentence is input into the first classifier, the first classifier outputs the classification probability distribution of the training sample sentence, and the positive category probability is recorded.
In step 330, a first weight of each word corresponding to each training sample sentence is determined based on the positive category probability of that training sample sentence. For example, when each word in the sentence description was determined, its word selection probability was obtained; that probability may be multiplied by the positive category probability of the training sample sentence generated from the word, and the product used as the first weight of the word.
In step 340, the generator of the GAN model is updated according to the first weight.
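Reading the first weight as a per-word reward, the update of step 340 admits a policy-gradient style sketch (this REINFORCE-like interpretation is an assumption, not stated in the disclosure):

```python
def generator_step(optimizer, log_probs, first_weights):
    # log_probs:     (T,) log word selection probability of each generated word
    # first_weights: (T,) word probability x positive category probability
    loss = -(first_weights.detach() * log_probs).sum()  # reinforce highly weighted words
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```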
Fig. 5 illustrates a flow chart of further embodiments of the GAN model generator training methods of the present disclosure.
As shown in fig. 5, the method includes: step 510, generating each training sample paragraph; step 520, determining a positive class probability; step 530, determining a second weight of each word; step 540, update the generator of the GAN model.
In step 510, a training sample paragraph corresponding to each word is generated from each training sample sentence.
In some embodiments, one training sample sentence is selected as the target sentence. For example, a training sample sentence generated from the target word may be selected as the target sentence, and the position of the target sentence in the paragraph description recorded. The target sentence is then used as the sentence at the corresponding position in the training sample paragraph. Finally, the sentence descriptions arranged in front of and behind the target sentence in the paragraph description are used as the sentences at the corresponding positions in the training sample paragraph.
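The assembly of a training sample paragraph is then a simple splice; a sketch:

```python
def build_training_paragraph(paragraph_sentences, target_pos, training_sentence):
    # Replace the sentence at the target position with the sampled training
    # sample sentence; all other sentence descriptions keep their positions.
    paragraph = list(paragraph_sentences)
    paragraph[target_pos] = training_sentence
    return paragraph
```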
In step 520, the training sample paragraphs are input into the discriminator of the GAN model, and the positive category probability of each training sample paragraph is determined. For example, after each training sample paragraph is input into the second classifier, the second classifier outputs the classification probability distribution of the training sample paragraph, and the positive category probability is recorded.
In step 530, a second weight for each word corresponding to each training sample paragraph is determined based on the positive category probabilities for each training sample paragraph.
In step 540, the generator of the GAN model is updated according to the second weight.
In some embodiments, the degree of matching between the generated paragraph description and the pre-labeled paragraph sample may be calculated based on the weights of the sentence descriptions and natural language processing evaluation metrics (e.g., METEOR, CIDEr).
When the matching degree is low, an adversarial learning strategy may be adopted to train the generator and the discriminator iteratively and in competition: after each round of training the generator, the discriminator is trained, so that both achieve better results. For example, the discriminator of the GAN model is trained and updated by the method in any of the embodiments described above, and the generator of the GAN model is trained by the method in any of the embodiments described above, until the matching degree of the generated paragraph descriptions meets the requirement.
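At a high level, this alternation could be orchestrated as below, reusing the step functions sketched earlier; `sample_negatives`, `score_samples`, and `evaluate_matching` are hypothetical stand-ins (the last for the METEOR/CIDEr computation), not real APIs:

```python
def adversarial_training(dec, clf, g_opt, d_opt, data, max_rounds=100, threshold=0.9):
    for _ in range(max_rounds):
        neg = sample_negatives(dec, data)                  # hypothetical: generator outputs
        discriminator_step(clf, d_opt, data.labeled, neg)  # update the discriminator first
        log_probs, weights = score_samples(dec, clf, neg)  # hypothetical: per-word weights
        generator_step(g_opt, log_probs, weights)          # weighted generator update
        if evaluate_matching(dec, data) >= threshold:      # hypothetical matching check
            break
```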
In the above embodiments, the features of the main target in each picture are extracted based on the multi-class classification model, and the features of multiple targets in each picture are extracted based on the multi-label classification model, so that the interrelations between the features of the picture stream and the targets can be deeply mined; an RNN then generates a sentence description for each picture, and the sentence descriptions finally form the paragraph description, avoiding the learning difficulties caused by long sentences. The paragraph description obtained in this way is more logically coherent and more consistent with the content of the picture stream, thereby improving the accuracy of the description.
Fig. 6 illustrates a block diagram of some embodiments of a generation apparatus of a picture description of the present disclosure.
As shown in fig. 6, the generation apparatus 6 of the picture description includes an extraction unit 61, a determination unit 62, and a generation unit 63.
The extracting unit 61 extracts a first feature of each target picture in the picture stream by the multi-class classification model, and extracts a second feature of each target picture by the multi-instance classification model or the multi-label classification model.
The determining unit 62 inputs the first feature and the second feature into a generator of the trained GAN model, and determines a word selection probability distribution of each word in the vocabulary. For example, the generator of the GAN model includes a first RNN model and a second RNN model.
In some embodiments, the determining unit 62 inputs the first feature into a first RNN model, generates a topic vector for each target picture, inputs the second feature and the topic vector into a second RNN model, and determines a word selection probability distribution for each word in the vocabulary.
The generating unit 63 selects a word with the highest probability in the word selection probability distribution, and generates a sentence description of each target picture. The sentence descriptions of the respective target pictures constitute a paragraph description of the picture stream.
In some embodiments, the discriminators of the GAN model include a first classifier and a second classifier. The first classifier is used for judging whether the sentence description accords with the actual content of the corresponding target picture, and the second classifier is used for judging whether the paragraph description accords with the language style of the pre-marked paragraph sample.
In some embodiments, the generation means 6 of the picture description further comprises a training unit 64. The training unit 64 trains the discriminators of the GAN model with the sentence descriptions and the paragraph descriptions generated by the generator of the GAN model as negative samples and with the pre-labeled sentence samples and the paragraph samples as positive samples.
In some embodiments, the generating unit 63 samples words in the vocabulary according to the word and the word selection probability distribution in the sentence description, and generates each training sample sentence corresponding to each word. The determining unit 62 inputs each training sample sentence into the discriminator of the GAN model, determines the positive class probability of each training sample sentence, and determines the first weight of each word corresponding to each training sample sentence based on the positive class probability. The training unit 64 updates the generator of the GAN model according to the first weight.
In some embodiments, the generation unit 63 selects one word of the sentence description as the target word. The generating unit 63 retains each word arranged in front of the target word in the sentence description as a word at a corresponding position in the training sample sentence; the generating unit 63 inputs the target word into a generator of the GAN model, and determines a word selection probability distribution of each word in the vocabulary; the generating unit 63 samples each word in the vocabulary according to the word selection probability distribution to determine each word arranged after the target word in the training sample sentence.
In some embodiments, the generating unit 63 generates a training sample paragraph corresponding to each word from each training sample sentence. The determining unit 62 inputs the training sample paragraphs to the discriminator of the GAN model, determines the positive class probability of each training sample paragraph, and determines the second weight of each word corresponding to each training sample paragraph based on the positive class probability of each training sample paragraph. The training unit 64 updates the generator of the GAN model according to the second weight.
In some embodiments, the generating unit 63 selects one training sample sentence as the target sentence, the target sentence as the sentence at the corresponding position in the training sample paragraph, and each of the sentence descriptions arranged before and after the target sentence as the sentence at the corresponding position in the training sample paragraph.
In the above embodiments, the features of the main target in each picture are extracted based on the multi-class classification model, and the features of multiple targets in each picture are extracted based on the multi-label classification model, so that the interrelations between the features of the picture stream and the targets can be deeply mined; an RNN then generates a sentence description for each picture, and the sentence descriptions finally form the paragraph description, avoiding the learning difficulties caused by long sentences. The paragraph description obtained in this way is more logically coherent and more consistent with the content of the picture stream, thereby improving the accuracy of the description.
Fig. 7 shows a block diagram of further embodiments of a generation apparatus of a picture description of the present disclosure.
As shown in fig. 7, the picture description generating apparatus 7 of this embodiment includes: a memory 71 and a processor 72 coupled to the memory 71, the processor 72 being configured to perform the method of generating a picture description in any one of the embodiments of the present disclosure based on instructions stored in the memory 71.
The memory 71 may include, for example, a system memory, a fixed nonvolatile storage medium, and the like. The system memory stores, for example, an operating system, application programs, a boot loader, a database, and other programs.
Fig. 8 shows a block diagram of still further embodiments of a generation apparatus of the picture descriptions of the present disclosure.
As shown in fig. 8, the picture description generating apparatus 8 of this embodiment includes: a memory 810 and a processor 820 coupled to the memory 810, the processor 820 being configured to perform the method of generating a picture description in any of the foregoing embodiments based on instructions stored in the memory 810.
Memory 810 may include, for example, system memory, a fixed nonvolatile storage medium, and the like. The system memory stores, for example, an operating system, application programs, a boot loader, and other programs.
The generation means 8 of the picture description may further comprise an input-output interface 830, a network interface 840, a storage interface 850, etc. These interfaces 830, 840, 850 and the memory 810 and the processor 820 may be connected by, for example, a bus 860. The input/output interface 830 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, a touch screen, and the like. The network interface 840 provides a connection interface for various networking devices. Storage interface 850 provides a connection interface for external storage devices such as SD cards, U-discs, and the like.
It will be appreciated by those skilled in the art that embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
Thus far, the method of generating a picture description, the device for generating a picture description, and the computer-readable storage medium according to the present disclosure have been described in detail. In order to avoid obscuring the concepts of the present disclosure, some details known in the art are not described. How to implement the solutions disclosed herein will be fully apparent to those skilled in the art from the above description.
The methods and systems of the present disclosure may be implemented in a number of ways. For example, the methods and systems of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, firmware. The above-described sequence of steps for the method is for illustration only, and the steps of the method of the present disclosure are not limited to the sequence specifically described above unless specifically stated otherwise. Furthermore, in some embodiments, the present disclosure may also be implemented as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
Although some specific embodiments of the present disclosure have been described in detail by way of example, it should be understood by those skilled in the art that the above examples are for illustration only and are not intended to limit the scope of the present disclosure. It will be appreciated by those skilled in the art that modifications may be made to the above embodiments without departing from the scope and spirit of the disclosure. The scope of the present disclosure is defined by the appended claims.

Claims (11)

1. A method of generating a picture description, comprising:
extracting first characteristics of each target picture in a picture stream through a multi-class classification model, wherein in a picture classification result of the multi-class classification model, one picture belongs to and only belongs to one of a plurality of first classes, and the plurality of first classes have mutual exclusion relations;
extracting second characteristics of each target picture through a multi-instance classification model or a multi-label classification model, wherein in a picture classification result of the multi-label classification model, one picture belongs to a plurality of second categories, and the plurality of second categories have association relations;
inputting the first characteristic and the second characteristic into a generator of a trained generative adversarial network (GAN) model, and determining a word selection probability distribution of each word in a word list;
and selecting the word with the highest probability in the word selection probability distribution, and generating sentence descriptions of each target picture, wherein the sentence descriptions of each target picture form paragraph descriptions of the picture stream.
2. The method of generating according to claim 1, wherein,
the generator of the GAN model includes a first RNN model and a second RNN model;
the determining the word selection probability distribution of each word in the word list comprises the following steps:
inputting the first feature into the first RNN model to generate a theme vector of each target picture;
and inputting the second feature and the topic vector into the second RNN model, and determining the word selection probability distribution of each word in the word list.
3. The method of generating according to claim 1, wherein,
the discriminator of the GAN model comprises a first classifier and a second classifier, wherein the first classifier is used for judging whether the sentence description accords with the actual content of the corresponding target picture, and the second classifier is used for judging whether the paragraph description accords with the language style of the pre-marked paragraph sample.
4. The generation method according to claim 1, further comprising:
and training a discriminator of the GAN model by taking sentence descriptions and paragraph descriptions generated by a generator of the GAN model as negative samples and taking pre-labeled sentence samples and paragraph samples as positive samples.
5. The generation method according to any one of claims 1 to 4, further comprising:
according to each word in the sentence description and the word selection probability distribution, sampling the words in the word list to generate each training sample sentence corresponding to each word;
inputting each training sample sentence into a discriminator of the GAN model, and determining the positive category probability of each training sample sentence;
determining a first weight of each word corresponding to each training sample sentence according to the positive category probability of each training sample sentence;
and updating a generator of the GAN model according to the first weight of each word.
6. The generating method according to claim 5, wherein the generating each training sample sentence corresponding to each word includes:
selecting one word of the sentence description as a target word, and reserving each word arranged in front of the target word in the sentence description as a word at a corresponding position in the training sample sentence;
inputting the target word into a generator of the GAN model, and determining word selection probability distribution of each word in the word list;
and according to the word selection probability distribution, each word in the word list is sampled to determine each word arranged behind the target word in the training sample sentence.
7. The generation method according to claim 5, further comprising:
generating training sample paragraphs corresponding to the words according to the training sample sentences;
inputting the training sample paragraphs into a discriminator of the GAN model, and determining positive category probabilities of the training sample paragraphs;
determining a second weight of each word corresponding to each training sample paragraph according to the positive category probability of each training sample paragraph;
and updating the generator of the GAN model according to the second weight.
8. The generating method according to claim 7, wherein the generating each training sample paragraph corresponding to each word includes:
selecting one training sample sentence as a target sentence;
taking the target sentence as a sentence at a corresponding position in the training sample paragraph;
and taking each sentence description arranged in front of and behind the target sentence in the paragraph descriptions as a sentence at a corresponding position in the training sample paragraph.
9. A picture description generation apparatus, comprising:
the extraction unit is used for extracting first characteristics of each target picture in the picture stream through a multi-class classification model, extracting second characteristics of each target picture through a multi-instance classification model or a multi-label classification model, wherein one picture belongs to and only belongs to one of a plurality of first classes in a picture classification result of the multi-class classification model, the plurality of first classes have mutual exclusion relations, and one picture belongs to a plurality of second classes in a picture classification result of the multi-label classification model, and the plurality of second classes have association relations;
the determining unit is used for inputting the first characteristic and the second characteristic into a generator of a trained generative adversarial network (GAN) model and determining the word selection probability distribution of each word in the word list;
and the generating unit is used for selecting the word with the highest probability in the word selection probability distribution, generating the sentence description of each target picture, and forming the paragraph description of the picture stream by the sentence description of each target picture.
10. A picture description generation apparatus, comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the method of generating a picture description of any of claims 1-8 based on instructions stored in the memory.
11. A computer readable storage medium having stored thereon a computer program which when executed by a processor implements the method of generating a picture description according to any of claims 1-8.
CN201910078978.2A 2019-01-28 2019-01-28 Picture description generation method, device and computer readable storage medium Active CN111488473B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910078978.2A CN111488473B (en) 2019-01-28 2019-01-28 Picture description generation method, device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910078978.2A CN111488473B (en) 2019-01-28 2019-01-28 Picture description generation method, device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111488473A (en) 2020-08-04
CN111488473B (en) 2023-11-07

Family

ID=71794049

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910078978.2A Active CN111488473B (en) 2019-01-28 2019-01-28 Picture description generation method, device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111488473B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114358203B (en) * 2022-01-11 2024-09-27 平安科技(深圳)有限公司 Training method and device for image description sentence generation module and electronic equipment
US20240143936A1 (en) * 2022-10-31 2024-05-02 Zoom Video Communications, Inc. Intelligent prediction of next step sentences from a communication session

Citations (3)

Publication number Priority date Publication date Assignee Title
CN106980867A (en) * 2016-01-15 2017-07-25 奥多比公司 Semantic concept in embedded space is modeled as distribution
CN107330444A (en) * 2017-05-27 2017-11-07 苏州科技大学 A kind of image autotext mask method based on generation confrontation network
CN108648746A (en) * 2018-05-15 2018-10-12 南京航空航天大学 A kind of open field video natural language description generation method based on multi-modal Fusion Features

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US20090290802A1 (en) * 2008-05-22 2009-11-26 Microsoft Corporation Concurrent multiple-instance learning for image categorization
US8249366B2 (en) * 2008-06-16 2012-08-21 Microsoft Corporation Multi-label multi-instance learning for image classification

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
CN106980867A (en) * 2016-01-15 2017-07-25 奥多比公司 Semantic concept in embedded space is modeled as distribution
CN107330444A (en) * 2017-05-27 2017-11-07 苏州科技大学 A kind of image autotext mask method based on generation confrontation network
CN108648746A (en) * 2018-05-15 2018-10-12 南京航空航天大学 A kind of open field video natural language description generation method based on multi-modal Fusion Features

Non-Patent Citations (1)

Title
Chen Hongjun et al. Picture description method and optimization based on CNN-RNN deep learning. Natural Science Journal of Xiangtan University, 2018, 40(2), full text. *

Also Published As

Publication number Publication date
CN111488473A (en) 2020-08-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant