CN113673525A - Method, system and device for generating Chinese text based on image

Info

Publication number: CN113673525A
Application number: CN202110823454.9A
Authority: CN
Inventors: 陈志华; 黄经赢; 刘斌; 魏文国
Original assignee: Guangdong Polytechnic Normal University
Current assignee: Guangdong Polytechnic Normal University
Priority date: 2021-07-20
Filing date: 2021-07-20
Publication date: 2021-11-19

Abstract

The invention relates to the fields of computer vision and artificial intelligence, and discloses a method, a system and a device for generating Chinese text based on an image. The method reduces the number of features that the generative adversarial network (GAN) model must learn at one time, lowers the difficulty of training the model, and is applicable to more complex images.

Description

Method, system and device for generating Chinese text based on image

Technical Field

The invention relates to the fields of computer vision and artificial intelligence, and in particular to a method, a system and a device for generating Chinese text based on an image.

Background

Text extraction from natural images has very wide application. In the related art, Chinese text is generated from images using a generative adversarial network (GAN) model. A generative adversarial network is a generative model comprising a generator network and a discriminator network, which compete with each other until an equilibrium is reached.

At present, when a generative adversarial network model is used to generate Chinese text describing an image, the whole image is fed directly into the network. For a more complex image, the model must learn more features at one time, which makes the network difficult to train.

Disclosure of Invention

The embodiments of the invention aim to provide a method, a system and a device for generating Chinese text based on an image. The background region and the foreground region are separated and Chinese text is generated for each, which reduces the number of features that the generative adversarial network model must learn at one time, lowers the difficulty of training the model, and makes the approach applicable to more complex images.

To achieve this purpose, the invention adopts the following technical solution:

In a first aspect, the application provides a method for generating Chinese text from an image, the method comprising the following steps:

segmenting an image for which Chinese text is to be generated into a foreground region and a background region;

generating, by using a generative adversarial network, a first Chinese text describing the foreground region and a second Chinese text describing the background region;

and generating a Chinese text describing the image according to the first Chinese text and the second Chinese text.

According to one possible implementation of the first aspect of the present application, the first Chinese text is generated based on a trained first generative adversarial network model, and the second Chinese text is generated based on a trained second generative adversarial network model.

According to one possible implementation of the first aspect of the present application, an attention mechanism is added to the first generative adversarial network model when the first generative adversarial network model is trained, and/or

an attention mechanism is added to the second generative adversarial network model when the second generative adversarial network model is trained.

According to one possible implementation of the first aspect of the present application, generating the Chinese text describing the image from the first Chinese text and the second Chinese text comprises:

splicing the first Chinese text and the second Chinese text into a third text;

and adjusting the third text according to preset expression sentence patterns to generate an adjusted text that conforms to one of the preset expression sentence patterns, and taking the adjusted text as the Chinese text describing the image.

A second aspect of the present application provides a system for generating Chinese text from an image, the system comprising:

an image segmentation module, configured to segment an image for which Chinese text is to be generated into a foreground region and a background region;

a first generating module, configured to generate, by using a generative adversarial network, a first Chinese text describing the foreground region and a second Chinese text describing the background region;

and a second generating module, configured to generate a Chinese text describing the image according to the first Chinese text and the second Chinese text.

According to one possible implementation of the second aspect of the present application, the first generating module comprises:

a first Chinese text generation unit, configured to generate the first Chinese text based on a trained first generative adversarial network model;

and a second Chinese text generation unit, configured to generate the second Chinese text based on a trained second generative adversarial network model.

According to one possible implementation of the second aspect of the present application, the system further comprises a training module, the training module comprising:

a first training unit, configured to add an attention mechanism to the first generative adversarial network model when training the first generative adversarial network model, and/or

a second training unit, configured to add an attention mechanism to the second generative adversarial network model when training the second generative adversarial network model.

According to one possible implementation of the second aspect of the present application, the second generating module comprises:

a splicing unit, configured to splice the first Chinese text and the second Chinese text into a third text;

and an adjusting unit, configured to adjust the third text according to preset expression sentence patterns to generate an adjusted text that conforms to one of the preset expression sentence patterns, and to take the adjusted text as the Chinese text describing the image.

A third aspect of the present application provides an apparatus for generating Chinese text based on an image, the apparatus comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the method for generating Chinese text based on an image described in any one of the above embodiments.

A fourth aspect of the present application provides a computer-readable storage medium storing a computer program which, when executed, implements the method for generating Chinese text from an image described in any one of the above embodiments.

The embodiments disclosed in the present application have at least the following advantages:

The method is simple and easy to implement. By separating the background region and the foreground region and generating Chinese text for each, it reduces the number of features that the generative adversarial network model must learn at one time, lowers the difficulty of training the model, and is applicable to more complex images.

Drawings

FIG. 1 is a schematic flow chart of a preferred embodiment of a method for generating Chinese text from images according to the present invention;

FIG. 2 is a schematic structural diagram of a preferred embodiment of a system for generating Chinese text from images according to the present invention.

Reference numerals:

the image segmentation module 1, the first generation module 2 and the second generation module 3.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Fig. 1 is a schematic flow chart of a method for generating Chinese text from an image according to a preferred embodiment of the present invention.

As shown in fig. 1, the method includes:

S1: segment an image for which Chinese text is to be generated into a foreground region and a background region.

Specifically, the foreground region and the background region can be distinguished according to the color information and the brightness information of the image pixels, and the image can be segmented accordingly.
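The segmentation by color and brightness can be illustrated with a minimal sketch. The source does not specify the segmentation algorithm, so everything here is an assumption made for illustration: the image is a nested list of RGB tuples, the background is estimated from the border pixels, and the names `luminance`, `segment` and the `threshold` value are hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each 0..255
Image = List[List[Pixel]]      # rows of pixels


def luminance(p: Pixel) -> float:
    """Brightness of a pixel using the common Rec. 601 luma weights."""
    r, g, b = p
    return 0.299 * r + 0.587 * g + 0.114 * b


def segment(image: Image, threshold: float = 60.0) -> List[List[bool]]:
    """Return a mask where True marks foreground and False marks background.

    Assumption: the border of the image is mostly background, so the mean
    border color and brightness serve as the background reference.
    """
    h, w = len(image), len(image[0])
    border = [image[y][x] for y in range(h) for x in range(w)
              if y in (0, h - 1) or x in (0, w - 1)]
    n = len(border)
    bg_color = tuple(sum(p[c] for p in border) / n for c in range(3))
    bg_luma = sum(luminance(p) for p in border) / n

    mask = []
    for row in image:
        mask_row = []
        for p in row:
            # Color information: Euclidean distance to the background color.
            color_dist = sum((p[c] - bg_color[c]) ** 2 for c in range(3)) ** 0.5
            # Brightness information: difference from the background brightness.
            luma_diff = abs(luminance(p) - bg_luma)
            mask_row.append(color_dist > threshold or luma_diff > threshold)
        mask.append(mask_row)
    return mask
```

A practical system would more likely use a learned or graph-based segmenter; the sketch only shows how the two pixel cues named above can separate the regions.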

S2: generate, by using a generative adversarial network, a first Chinese text describing the foreground region and a second Chinese text describing the background region.

In one embodiment, step S2 includes:

generating the first Chinese text and the second Chinese text based on a trained first generative adversarial network model.

In another embodiment, step S2 includes:

generating the first Chinese text based on a trained first generative adversarial network model, and generating the second Chinese text based on a trained second generative adversarial network model.

The generative adversarial network model is trained adversarially on training images with known reference Chinese texts to obtain the trained model. The model comprises a generative model and a discriminative model: the generative model generates a Chinese text corresponding to an image, and the discriminative model judges whether a generated Chinese text is real data.
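The source does not give a loss function. As a hedged illustration only, the standard conditional form of the generative adversarial objective could express the competition between the two models, where the symbols are assumptions introduced here: I is an image region, T is its reference Chinese text, G is the generative model and D is the discriminative model.

```latex
\min_{G}\,\max_{D}\;
\mathbb{E}_{(I,T)\sim p_{\mathrm{data}}}\bigl[\log D(I, T)\bigr]
+ \mathbb{E}_{I\sim p_{\mathrm{data}}}\bigl[\log\bigl(1 - D(I, G(I))\bigr)\bigr]
```

Under this form, D is rewarded for scoring reference texts high and generated texts low, while G is rewarded when its generated text is judged to be real.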

In particular, the generative model may employ an image encoder and text decoder architecture, in which the image encoder comprises a Faster R-CNN neural network and the text decoder comprises a two-layer LSTM network.

Based on this other embodiment, step S2 further includes:

adding an attention mechanism to the first generative adversarial network model when training the first generative adversarial network model, and/or

adding an attention mechanism to the second generative adversarial network model when training the second generative adversarial network model.

In the present embodiment, the attention mechanism includes, for example, two aspects: deciding which part of the input needs attention, and allocating limited information processing resources to the important part. Introducing an attention mechanism on the second image feature matrix can highlight the more critical image portions of that matrix.

By introducing the attention mechanism, the embodiment of the application lets the network focus on the information in important regions when generating the Chinese text, which reduces network redundancy and speeds up the generation of the Chinese text.
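The two aspects of attention described above (deciding which parts of the input matter, then weighting them accordingly) can be sketched as soft dot-product attention over the rows of an image feature matrix. The source does not state which attention formulation is used, so the dot-product scoring, the `query` vector (standing in for a decoder state) and the function names are assumptions.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features: List[List[float]],
           query: List[float]) -> Tuple[List[float], List[float]]:
    """Weight each row of the feature matrix by its relevance to the query.

    Returns (weights, context): one weight per region, summing to 1, and
    the weighted sum of the region features.
    """
    # Aspect 1: decide which regions matter (one relevance score per region).
    scores = [sum(f * q for f, q in zip(feat, query))
              for feat in region_features]
    weights = softmax(scores)
    # Aspect 2: allocate capacity in proportion to the weights.
    dim = len(region_features[0])
    context = [sum(w * feat[d] for w, feat in zip(weights, region_features))
               for d in range(dim)]
    return weights, context
```

Regions whose features align with the query receive most of the weight, so the context vector passed on for text generation is dominated by the more critical image portions.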

S3: generate a Chinese text describing the image from the first Chinese text and the second Chinese text.

Step S3 includes:

splicing the first Chinese text and the second Chinese text into a third text;

and adjusting the third text according to preset expression sentence patterns to generate an adjusted text that conforms to one of the preset expression sentence patterns, and taking the adjusted text as the Chinese text describing the image.

Generating the Chinese text describing the image from the first Chinese text and the second Chinese text in this way is simple and convenient to implement. Adjusting the text according to the preset expression sentence patterns makes the generated text conform better to Chinese usage.

The first Chinese text and the second Chinese text can be spliced according to a preset splicing mechanism. For example, the splicing mechanism places the first Chinese text before the second Chinese text.

The preset expression sentence patterns include a subject-predicate pattern, a subject-predicate-object pattern, a subject-adverbial-predicate-object pattern and an adverbial-subject-predicate-object pattern.

Specifically, since the components of a sentence may include the subject, predicate, object, attributive and complement, in this step a preset expression sentence pattern can be constructed by combining these components in any number and in any order, as required for describing the image.

In a specific application scenario, for example, the first Chinese text is "ship sailing" and the second Chinese text is "sea". Splicing the first Chinese text and the second Chinese text gives "ship sailing sea". This spliced text then needs to be adjusted: an appropriate expression sentence pattern is selected and the text is reordered into a natural sentence, for example "ship sailing on the sea".
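The splice-then-adjust step can be sketched as follows. The source does not say how sentence components are identified or how a pattern is selected, so this sketch assumes each word already carries a grammatical role tag and picks the first preset pattern that covers every role present. The names `splice`, `choose_pattern`, `adjust` and `PATTERNS` are hypothetical, and English glosses stand in for the Chinese words.

```python
from typing import List, Sequence, Tuple

Component = Tuple[str, str]  # (grammatical role, word)

# The four preset expression sentence patterns named in the description.
PATTERNS: List[Tuple[str, ...]] = [
    ("subject", "predicate"),
    ("subject", "predicate", "object"),
    ("subject", "adverbial", "predicate", "object"),
    ("adverbial", "subject", "predicate", "object"),
]


def splice(first: Sequence[Component],
           second: Sequence[Component]) -> List[Component]:
    """Splicing mechanism: first text placed before the second text."""
    return list(first) + list(second)


def choose_pattern(third: Sequence[Component],
                   patterns: Sequence[Tuple[str, ...]]) -> Tuple[str, ...]:
    """Pick the first pattern whose roles cover every role in the text."""
    roles = {role for role, _ in third}
    for pattern in patterns:
        if roles <= set(pattern):
            return pattern
    raise ValueError("no preset pattern covers roles: %s" % sorted(roles))


def adjust(third: Sequence[Component],
           pattern: Sequence[str]) -> List[str]:
    """Reorder the words to follow the pattern; absent roles are skipped."""
    by_role = {}
    for role, word in third:
        by_role.setdefault(role, []).append(word)
    ordered: List[str] = []
    for role in pattern:
        ordered.extend(by_role.get(role, []))
    return ordered
```

With the first text tagged as `[("subject", "ship"), ("predicate", "sailing")]` and the second as `[("adverbial", "on the sea")]`, splicing keeps the original order, and adjusting with the subject-adverbial-predicate-object pattern moves the adverbial between subject and predicate, which mirrors Chinese word order.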

The embodiment of the application provides a method for generating Chinese text from an image. The method divides the image for which text is to be generated into a background region and a foreground region, uses a generative adversarial network to generate a first Chinese text corresponding to the foreground region and a second Chinese text corresponding to the background region, and then generates the Chinese text describing the image based on the first Chinese text and the second Chinese text. The method is simple and easy to implement. By separating the background region and the foreground region and generating Chinese text for each, it reduces the number of features that the generative adversarial network model must learn at one time, lowers the difficulty of training the model, and is applicable to more complex images.

The embodiment of the second aspect of the application provides a system for generating Chinese text based on images.

Fig. 2 is a schematic structural diagram of a preferred embodiment of the system for generating Chinese text from an image according to the present invention. The system can implement the entire process of the method for generating Chinese text from an image according to any of the above embodiments.

As shown in fig. 2, the system includes:

an image segmentation module 1, configured to segment an image for which Chinese text is to be generated into a foreground region and a background region;

a first generating module 2, configured to generate, by using a generative adversarial network, a first Chinese text describing the foreground region and a second Chinese text describing the background region;

and a second generating module 3, configured to generate a Chinese text describing the image according to the first Chinese text and the second Chinese text.

The functions and implementations of the modules in the system embodiment are the same as those in the method embodiment above; for the specific analysis, refer to the method embodiment, which is not repeated here.
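The way the three modules chain together can be sketched as a small pipeline. This is only an illustration of the data flow between modules 1, 2 and 3; the class name `ImageToChineseText`, its field names and the callable signatures are assumptions, and the actual segmentation and generative models are left as pluggable callables.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class ImageToChineseText:
    # Module 1: image -> (foreground region, background region)
    segment: Callable[[Any], Tuple[Any, Any]]
    # Module 2: (foreground, background) -> (first text, second text)
    describe: Callable[[Any, Any], Tuple[str, str]]
    # Module 3: (first text, second text) -> text describing the image
    compose: Callable[[str, str], str]

    def run(self, image: Any) -> str:
        """Execute steps S1, S2 and S3 in order."""
        foreground, background = self.segment(image)
        first_text, second_text = self.describe(foreground, background)
        return self.compose(first_text, second_text)
```

Keeping each module behind a plain callable lets the single-model and two-model embodiments of step S2 share the same pipeline, since only the `describe` callable differs.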

The present application further provides an apparatus for generating Chinese text based on an image, the apparatus comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the method for generating Chinese text based on an image described in any one of the above embodiments.

The present application further provides a computer-readable storage medium storing a computer program which, when executed, implements the method for generating Chinese text from an image according to any one of the embodiments described above.

The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the apparatus for generating Chinese text from an image, and connects the various parts of the entire apparatus through various interfaces and lines.

The memory may be used to store the computer programs and/or modules, and the processor implements the various functions of the apparatus for generating Chinese text from an image by running or executing the computer programs and/or modules stored in the memory and calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, a phonebook, etc.) created according to the use of the apparatus, and the like. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), at least one magnetic disk storage device, a flash memory device, or other volatile solid state storage device.

The modules/units integrated in the apparatus for generating Chinese text from an image may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as independent products. Based on this understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments are implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like.

The foregoing is a preferred embodiment of the present application. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present application, and these modifications and refinements also fall within the protection scope of the present application.

Claims

1. A method for generating Chinese text from an image, the method comprising:

2. The method of claim 1, wherein the method comprises:

3. The method of claim 2, wherein the method comprises:

4. The method of claim 1, wherein the method comprises:

splicing the first Chinese text and the second Chinese text into a third text;

5. A system for generating Chinese text from an image, the system comprising:

6. The system for generating Chinese text from an image according to claim 5, wherein the first generating module comprises:

7. The system for generating Chinese text from an image according to claim 6, further comprising a training module, the training module comprising:

8. The system of claim 5, wherein the second generating module comprises:

9. An apparatus for generating Chinese text from an image, comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the method for generating Chinese text from an image as claimed in any one of claims 1 to 4.

10. A computer-readable storage medium storing a computer program which, when executed, implements the method for generating Chinese text from an image as claimed in any one of claims 1 to 4.