CN113673349B

CN113673349B - Method, system and device for generating Chinese text from an image based on a feedback mechanism

Info

Publication number: CN113673349B
Application number: CN202110823453.4A
Authority: CN
Inventors: 陈志华; 刘斌; 徐省华; 魏文国
Original assignee: Guangdong Polytechnic Normal University
Current assignee: Guangdong Polytechnic Normal University
Priority date: 2021-07-20
Filing date: 2021-07-20
Publication date: 2022-03-11
Anticipated expiration: 2041-07-20
Also published as: CN113673349A

Abstract

The invention relates to the technical field of text generation and discloses a method, a system and a device for generating Chinese text from an image based on a feedback mechanism. The method applies the feedback mechanism while training a generative adversarial network model: a corresponding reference image is obtained from the Chinese text description output by the generator, and the distance between the reference image and the sample image is fed back to the adversarial network, so that the generative adversarial network model is gradually optimized during training and the accuracy of the Chinese text generated from the image is improved.

Description

Method, system and device for generating Chinese text from an image based on a feedback mechanism

Technical Field

The invention relates to the technical field of text generation, and in particular to a method, a system and a device for generating Chinese text from an image based on a feedback mechanism.

Background

Text generation is an important research direction in the field of natural language processing and has broad application prospects. In the related art, a generative adversarial network model is used to process an image and generate a text description corresponding to the image. A Generative Adversarial Network (GAN) includes two submodels: a generator G and a discriminator D. The generator models the distribution of the real data, and the discriminator judges whether a sample is a real sample or a generated sample. The training goal of the network is for the generator to fit the distribution of the real data so well that the discriminator cannot tell real samples from generated ones.
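For orientation, the training goal described above is commonly written as the standard GAN minimax objective shown below. This is the textbook formulation, not a formula stated in this document; the symbols $p_{\mathrm{data}}$ (real data distribution), $p_z$ (generator input distribution) and $V$ are the usual notation and are assumptions here. In the present setting the generator would be conditioned on an image rather than on noise.

```latex
\min_{G}\,\max_{D}\; V(D, G)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```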

However, existing approaches train the generative adversarial network model using only the sample images, and the text descriptions generated by the trained model have poor accuracy.

Disclosure of Invention

An object of the embodiments of the present invention is to provide a method, a system and a device for generating Chinese text from an image based on a feedback mechanism, which gradually optimize, during training, the generative adversarial network model that generates Chinese text descriptions from images, thereby improving the accuracy of the Chinese text generated from an image.

In order to achieve this object, the invention adopts the following technical solutions:

In a first aspect, the present application provides a method for generating Chinese text from an image based on a feedback mechanism, the method including:

constructing a generative adversarial network model for generating a Chinese text description from an image, wherein the generative adversarial network model comprises a generator and a discriminator;

inputting a sample image with known Chinese text description information into the generator, obtaining the Chinese text description output by the generator, and obtaining a corresponding reference image based on the output Chinese text description, wherein the image features corresponding to the reference image are the same as the image features corresponding to the Chinese text description information;

feeding back the reference image to the discriminator so that the discriminator calculates the distance between the sample image and the reference image;

and if the calculated distance is not smaller than a preset distance threshold, adding the distance to an objective function of the generative adversarial network model, and adjusting the generator and the discriminator based on the objective function, so as to guide the generator to generate a vector closer to the true value.

According to an implementable manner provided by the first aspect of the present application, the method further comprises:

constructing a first loss function of the generator according to the distance, and determining a first weighting value of the first loss function;

constructing a second loss function of the generator according to first probability information that the discriminator judges the output Chinese text to be false, and determining a second weighting value of the second loss function;

constructing a loss function of the generator based on the first loss function, the second loss function, the first weighting value and the second weighting value.

According to an implementable manner provided by the first aspect of the present application, the method further comprises:

constructing a loss function of the discriminator according to probability information that the discriminator judges the output Chinese text to be true, and constructing the objective function according to the loss function of the generator and the loss function of the discriminator.

According to an implementable manner provided by the first aspect of the present application, the discriminator uses a convolutional neural network to extract the strongest semantic information, adds an attention mechanism to the input layer of the convolutional neural network to extract context-bearing semantic information, and then determines the probability that the output Chinese text is true according to the strongest semantic information and the context-bearing semantic information.

A second aspect of the present application provides a system for generating Chinese text from an image based on a feedback mechanism, the system comprising:

a model construction module, configured to construct a generative adversarial network model for generating a Chinese text description from an image, the generative adversarial network model comprising a generator and a discriminator;

a generating module, configured to input a sample image with known Chinese text description information into the generator, obtain the Chinese text description output by the generator, and obtain a corresponding reference image based on the output Chinese text description, wherein the image features corresponding to the reference image are the same as the image features corresponding to the Chinese text description information;

a feedback module, configured to feed back the reference image to the discriminator so that the discriminator calculates the distance between the sample image and the reference image;

and an adjusting module, configured to add the distance to an objective function of the generative adversarial network model when the calculated distance is not smaller than a preset distance threshold, and to adjust the generator and the discriminator based on the objective function, so as to guide the generator to generate a vector closer to the true value.

According to an implementable manner of the second aspect of the present application, the adjusting module comprises:

a first function construction unit, configured to construct a first loss function of the generator according to the distance, and determine a first weighting value of the first loss function;

a second function construction unit, configured to construct a second loss function of the generator according to first probability information that the discriminator judges the output Chinese text to be false, and determine a second weighting value of the second loss function;

a third function construction unit, configured to construct a loss function of the generator based on the first loss function, the second loss function, the first weighting value and the second weighting value.

According to an implementable manner of the second aspect of the present application, the adjusting module further comprises:

an objective function construction unit, configured to construct a loss function of the discriminator according to probability information that the discriminator judges the output Chinese text to be true, and to construct the objective function according to the loss function of the generator and the loss function of the discriminator.

According to an implementable manner of the second aspect of the present application, the discriminator uses a convolutional neural network to extract the strongest semantic information, adds an attention mechanism to the input layer of the convolutional neural network to extract context-bearing semantic information, and then determines the probability that the output Chinese text is true according to the strongest semantic information and the context-bearing semantic information.

A third aspect of the present application provides a device for generating Chinese text from an image based on a feedback mechanism, the device comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the method for generating Chinese text from an image based on a feedback mechanism as described in any one of the above embodiments.

A fourth aspect of the present application provides a computer-readable storage medium in which a computer program is stored, wherein the computer program, when executed, implements the method for generating Chinese text from an image based on a feedback mechanism as described in any one of the above embodiments.

The embodiments disclosed in the present application have at least the following advantage:

the generative adversarial network model that generates Chinese text descriptions from images is gradually optimized during training, which improves the accuracy of the Chinese text generated from an image.

Drawings

FIG. 1 is a schematic flow chart of a preferred embodiment of the method for generating Chinese text from an image based on a feedback mechanism provided in the present application;

FIG. 2 is a schematic structural diagram of a preferred embodiment of the system for generating Chinese text from an image based on a feedback mechanism provided in the present application.

Reference numerals:

model construction module 1, generating module 2, feedback module 3, adjusting module 4.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

FIG. 1 is a schematic flow chart of a preferred embodiment of the method for generating Chinese text from an image based on a feedback mechanism according to the present invention.

As shown in fig. 1, the method includes:

S1, constructing a generative adversarial network model for generating a Chinese text description from an image, the generative adversarial network model including a generator and a discriminator.

In the embodiment of the application, the generator and the discriminator are not limited to neural networks; any functions capable of fitting the respective generation and discrimination tasks may be used, although neural network models are preferred.

S2, inputting a sample image with known Chinese text description information into the generator, obtaining the Chinese text description output by the generator, and obtaining a corresponding reference image based on the output Chinese text description, wherein the image features corresponding to the reference image are the same as the image features corresponding to the Chinese text description information.

The sample images with known Chinese text description information may be drawn from a preset training set. When the training set is constructed, images that have Chinese text description information are collected.

Before a sample image with known Chinese text description information is input into the generator, the sample image may be denoised as necessary, so that noise in the sample image does not affect the training of the generative adversarial network model.

Specifically, obtaining a corresponding reference image based on the output Chinese text description includes: inputting the output Chinese text description into a trained text-to-image model, which then generates the reference image. The text-to-image model may be a model based on a generative adversarial network, such as an existing StackGAN model, StackGAN++ model, AttnGAN model, or the like.
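The round trip from sample image to reference image can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the names `reference_image_for`, `toy_generator` and `toy_text_to_image`, and the `Image` type, are hypothetical, and the two toy functions merely stand in for a trained generator and a trained text-to-image model such as StackGAN or AttnGAN.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for an image: a 2-D grid of pixel intensities.
Image = List[List[float]]


def reference_image_for(
    sample_image: Image,
    generator: Callable[[Image], str],
    text_to_image: Callable[[str], Image],
) -> Tuple[str, Image]:
    """Caption the sample image, then render that caption back into an image."""
    # Chinese text description output by the generator.
    caption = generator(sample_image)
    # Reference image produced by the trained text-to-image model.
    reference_image = text_to_image(caption)
    return caption, reference_image


# Toy stand-ins so the sketch runs; real models would replace both callables.
def toy_generator(image: Image) -> str:
    return "草地上的两只狗"


def toy_text_to_image(caption: str) -> Image:
    # Assumption for illustration only: encode the caption length as pixel values.
    return [[float(len(caption))] * 2 for _ in range(2)]


caption, reference = reference_image_for(
    [[0.0, 1.0], [1.0, 0.0]], toy_generator, toy_text_to_image
)
```

Passing the two models in as callables keeps the feedback step independent of which text-to-image model is chosen.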

S3, feeding back the reference image to the discriminator so that the discriminator calculates the distance between the sample image and the reference image.

In this embodiment, the distance may be a cosine distance or a Euclidean distance.
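Both distances can be illustrated on feature vectors. The sketch below assumes that each image has already been reduced to a flat vector of equal length; the embodiment does not specify the representation on which the distance is computed, so that choice, like the function names, is an assumption.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """One minus the cosine similarity; 0 means the vectors point the same way."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (norm_a * norm_b)
```

The Euclidean distance is sensitive to the magnitude of the features, whereas the cosine distance depends only on their direction, which may matter when the two images differ in overall brightness or scale.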

S4, if the calculated distance is not smaller than a preset distance threshold, adding the distance to an objective function of the generative adversarial network model, and adjusting the generator and the discriminator based on the objective function, so as to guide the generator to generate a vector closer to the true value.

It should be noted that, when the calculated distance is smaller than the preset distance threshold, a preset objective function may be used as the objective function of the generative adversarial network model.
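The threshold rule of step S4 can be sketched as a small function. The way the distance enters the objective is an assumption: it is added here as a plain, unweighted term, and `objective_value`, `base_objective` and `threshold` are hypothetical names used only for illustration.

```python
def objective_value(base_objective: float, distance: float, threshold: float) -> float:
    """Return the objective, adding the feedback distance only when it is
    not smaller than the preset threshold (assumed to be a plain additive term)."""
    if distance >= threshold:
        return base_objective + distance
    # Distance already below the threshold: keep the preset objective unchanged.
    return base_objective
```

For example, with a threshold of 0.25, a distance of 0.5 is added to the objective, while a distance of 0.125 leaves the preset objective as it is.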

It should be noted that the Chinese text description mentioned above is Chinese text that describes an image. For example, if the sample image shows two dogs, the Chinese text describing the image is text about two dogs, such as "two French bulldogs on the grass".

The discriminator is equivalent to a binary classifier: it can distinguish whether an input Chinese text comes from real text or from text generated by the generator, and it can output the probability that the output Chinese text is real Chinese text. The objective function may be determined based on the loss functions of the generator and the discriminator. The generator and the discriminator can be adjusted through conventional iterative training, which improves the precision with which the generative adversarial network model generates Chinese text descriptions from images.

It should be noted that various existing methods may be used to adjust the generator and the discriminator based on the objective function, the adjustment yielding a generator that meets the expected value; the embodiment of the present invention places no limitation on this.

For image description, the Chinese text description of an image can be turned back into an image; if the distance between the two images is minimal (that is, their similarity is highest), the Chinese text description of the image is the most accurate. The feedback mechanism is built on this principle: it obtains a corresponding reference image from the Chinese text description generated for a sample image, calculates the distance between the reference image and the sample image, and adds the distance to the objective function of the generative adversarial network model when the distance is not yet satisfactory (that is, not smaller than the preset distance threshold). Through this feedback mechanism, the generative adversarial network model that generates Chinese text descriptions from images is gradually optimized during training, which improves the accuracy of the Chinese text generated from an image.

After the generative adversarial network model has been trained by the above method, a target image for which a Chinese text description is needed can be input into the trained generator to obtain the Chinese text description of the target image.

In one embodiment, the method further comprises:

The loss function of the generator is determined as the weighted sum of the first loss function and the second loss function.

The first weighting value and the second weighting value are each greater than 0 and less than 1. In some embodiments, the first weighting value and the second weighting value are both 0.5.
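The weighted sum can be sketched as follows. The exact forms of the two loss terms are not given in the text, so both are assumptions: the first loss is taken to be the image distance itself, and the second is taken to be the negative log of the probability that the discriminator does not judge the text false. The names `generator_loss` and `p_fake` are hypothetical.

```python
import math


def generator_loss(distance: float, p_fake: float, w1: float = 0.5, w2: float = 0.5) -> float:
    """Weighted sum of a distance-based loss and an adversarial loss."""
    if not (0.0 < w1 < 1.0 and 0.0 < w2 < 1.0):
        raise ValueError("both weighting values must lie strictly between 0 and 1")
    if not (0.0 <= p_fake < 1.0):
        raise ValueError("p_fake must lie in [0, 1)")
    # Assumption: the first loss is the sample/reference image distance itself.
    first_loss = distance
    # Assumption: the second loss penalises a high probability of being judged false.
    second_loss = -math.log(1.0 - p_fake)
    return w1 * first_loss + w2 * second_loss
```

With the default weights of 0.5, a distance of 0.4 and a false-probability of 0.5 give 0.5 × 0.4 + 0.5 × ln 2 as the generator loss.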

In one embodiment, the method further comprises:

When the discriminator judges whether the output Chinese text is true or false, it specifically performs the following steps:

the discriminator compares the Chinese text description output by the generator with the known Chinese text description of the corresponding sample image; if the description output by the generator is determined to be the known Chinese text description, the output Chinese text is judged to be true, and otherwise it is judged to be false.

In one embodiment, the discriminator uses a convolutional neural network to extract the strongest semantic information and adds an attention mechanism to the input layer of that network to extract context-bearing semantic information; it then determines the probability that the output Chinese text is true according to both kinds of information. With this arrangement, the discriminator network obtains richer semantic and contextual information, which improves its performance.
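One possible reading of this discriminator is sketched below in plain Python. It is an interpretation under several assumptions rather than the architecture of the embodiment: the attention is taken to be dot-product self-attention over token embeddings, the "strongest" semantic information is taken to be max-over-time pooling of convolution responses, and the two are combined by a single logistic unit. All function names, kernels and weights are hypothetical toy values.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Vector) -> Vector:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(tokens: List[Vector]) -> List[Vector]:
    """Row i holds softmax-normalised dot-product scores of token i against all tokens."""
    return [softmax([dot(q, k) for k in tokens]) for q in tokens]


def attend(tokens: List[Vector]) -> List[Vector]:
    """Input-layer attention: each token becomes a weighted mix of all tokens."""
    dim = len(tokens[0])
    return [
        [sum(w * tok[d] for w, tok in zip(row, tokens)) for d in range(dim)]
        for row in attention_weights(tokens)
    ]


def conv_max_pool(tokens: List[Vector], kernel: List[Vector]) -> float:
    """Slide one kernel over token windows and keep the largest response."""
    width = len(kernel)
    if len(tokens) < width:
        raise ValueError("text is shorter than the kernel width")
    responses = [
        sum(dot(tokens[start + i], kernel[i]) for i in range(width))
        for start in range(len(tokens) - width + 1)
    ]
    return max(responses)


def probability_real(tokens: List[Vector], kernels: List[List[Vector]],
                     weights: Vector, bias: float) -> float:
    """Combine max-pooled features and averaged context into one probability."""
    context = attend(tokens)
    pooled = [conv_max_pool(context, kernel) for kernel in kernels]
    dim = len(tokens[0])
    mean_context = [sum(tok[d] for tok in context) / len(context) for d in range(dim)]
    logit = dot(pooled + mean_context, weights) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy embeddings for a three-token text, one width-2 kernel, and toy weights.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kernels = [[[1.0, 0.0], [0.0, 1.0]]]
p_real = probability_real(tokens, kernels, weights=[0.5, -0.25, 0.25], bias=0.0)
```

Max pooling keeps only the strongest local pattern and discards position, which is why the attention step is applied beforehand to carry contextual information into each token.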

An embodiment of the second aspect of the application provides a system for generating Chinese text from an image based on a feedback mechanism.

FIG. 2 is a schematic structural diagram of a preferred embodiment of the system for generating Chinese text from an image based on a feedback mechanism according to the present invention; the system can carry out the whole process of the method for generating Chinese text from an image based on a feedback mechanism according to any of the above embodiments.

As shown in fig. 2, the system includes:

a model construction module 1, configured to construct a generative adversarial network model for generating a Chinese text description from an image, the generative adversarial network model comprising a generator and a discriminator;

a generating module 2, configured to input a sample image with known Chinese text description information into the generator, obtain the Chinese text description output by the generator, and obtain a corresponding reference image based on the output Chinese text description, wherein the image features corresponding to the reference image are the same as the image features corresponding to the Chinese text description information;

a feedback module 3, configured to feed back the reference image to the discriminator so that the discriminator calculates the distance between the sample image and the reference image;

and an adjusting module 4, configured to add the distance to an objective function of the generative adversarial network model when the calculated distance is not smaller than a preset distance threshold, and to adjust the generator and the discriminator based on the objective function, so as to guide the generator to generate a vector closer to the true value.
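How the four modules might cooperate in one training step can be sketched as a small class. This is only an illustrative arrangement under stated assumptions: the class name `FeedbackCaptioningSystem`, its field names, the `Features` type and the toy callables are all hypothetical, and the actual adjustment of generator and discriminator parameters is left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

# Hypothetical image representation: a flat feature vector.
Features = List[float]


@dataclass
class FeedbackCaptioningSystem:
    generate_caption: Callable[[Features], str]               # generator built by module 1
    render_reference: Callable[[str], Features]               # text-to-image model used by module 2
    measure_distance: Callable[[Features, Features], float]   # distance computed for module 3
    distance_threshold: float                                 # preset threshold used by module 4

    def training_signal(self, sample: Features) -> Dict[str, Union[str, float, bool]]:
        """Run one feedback pass and report whether the distance should be fed back."""
        caption = self.generate_caption(sample)
        reference = self.render_reference(caption)
        distance = self.measure_distance(sample, reference)
        return {
            "caption": caption,
            "distance": distance,
            # Module 4 adds the distance to the objective only in this case.
            "use_feedback": distance >= self.distance_threshold,
        }


# Toy callables so the sketch runs end to end.
system = FeedbackCaptioningSystem(
    generate_caption=lambda image: "两只狗",
    render_reference=lambda text: [1.0, 0.0],
    measure_distance=lambda a, b: sum(abs(x - y) for x, y in zip(a, b)),
    distance_threshold=0.5,
)
signal = system.training_signal([1.0, 1.0])
```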

According to an implementable manner of the second aspect of the embodiment of the present application, the adjusting module includes:

According to an implementable manner of the second aspect of the embodiment of the present application, the adjusting module further includes:

According to an implementable manner of the second aspect of the embodiment of the present application, the discriminator uses a convolutional neural network to extract the strongest semantic information, adds an attention mechanism to the input layer of the convolutional neural network to extract context-bearing semantic information, and then determines the probability that the output Chinese text is true according to the strongest semantic information and the context-bearing semantic information.

The functions and implementations of the modules in the system embodiment are the same as those in the embodiment of the method for generating Chinese text from an image based on a feedback mechanism; for the specific analysis, refer to that method embodiment. The details are not repeated here.

The application also provides a device for generating Chinese text from an image based on a feedback mechanism, which comprises a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the method for generating Chinese text from an image based on a feedback mechanism according to any one of the above embodiments.

The present application further provides a computer-readable storage medium in which a computer program is stored, wherein the computer program, when executed, implements the method for generating Chinese text from an image based on a feedback mechanism as described in any of the above embodiments.

The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the device for generating Chinese text from an image based on a feedback mechanism, and connects the various parts of the whole device through various interfaces and lines.

The memory may be used to store the computer programs and/or modules, and the processor implements the various functions of the device for generating Chinese text from an image based on a feedback mechanism by running or executing the computer programs and/or modules stored in the memory and calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. In addition, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.

If the integrated modules/units of the device for generating Chinese text from an image based on a feedback mechanism are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments are implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like.

The foregoing is a preferred embodiment of the present application. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present application, and such modifications and refinements are also regarded as falling within the protection scope of the present application.

Claims

1. A method for generating Chinese text from an image based on a feedback mechanism, characterized by comprising the following steps:

2. The method for generating Chinese text from an image based on a feedback mechanism as claimed in claim 1, wherein said method further comprises:

3. The method for generating Chinese text from an image based on a feedback mechanism as claimed in claim 2, wherein said method further comprises:

4. The method for generating Chinese text from an image based on a feedback mechanism as claimed in claim 3, wherein said method further comprises:

5. A system for generating Chinese text from an image based on a feedback mechanism, the system comprising:

6. The system for generating Chinese text from an image based on a feedback mechanism of claim 5, wherein the adjusting module comprises:

7. The system for generating Chinese text from an image based on a feedback mechanism of claim 6, wherein the adjusting module further comprises:

8. The system for generating Chinese text from an image based on a feedback mechanism of claim 7, wherein:

9. A device for generating Chinese text from an image based on a feedback mechanism, comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the method for generating Chinese text from an image based on a feedback mechanism as claimed in any one of claims 1-4.

10. A computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, and the computer program, when executed, implements the method for generating Chinese text from an image based on a feedback mechanism as claimed in any one of claims 1 to 4.