CN114926835A - Text generation method and device, and model training method and device - Google Patents

Text generation method and device, and model training method and device

Info

Publication number
CN114926835A
Authority
CN
China
Prior art keywords
image
features
processed
semantic
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210563383.8A
Other languages
Chinese (zh)
Inventor
李业豪
潘滢炜
姚霆
梅涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong Technology Holding Co Ltd
Original Assignee
Jingdong Technology Holding Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdong Technology Holding Co Ltd
Priority claimed from CN202210563383.8A
Publication of CN114926835A
Legal status: Pending


Classifications

    • G - PHYSICS
        • G06 - COMPUTING; CALCULATING OR COUNTING
            • G06F - ELECTRIC DIGITAL DATA PROCESSING
                • G06F 40/00 - Handling natural language data
                    • G06F 40/30 - Semantic analysis
                        • G06F 40/35 - Discourse or dialogue representation
            • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 3/00 - Computing arrangements based on biological models
                    • G06N 3/02 - Neural networks
                        • G06N 3/04 - Architecture, e.g. interconnection topology
                            • G06N 3/047 - Probabilistic or stochastic networks
                            • G06N 3/048 - Activation functions
                        • G06N 3/08 - Learning methods
            • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
                • G06V 10/00 - Arrangements for image or video recognition or understanding
                    • G06V 10/40 - Extraction of image or video features
                        • G06V 10/42 - Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
                    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
                        • G06V 10/74 - Image or video pattern matching; Proximity measures in feature spaces
                            • G06V 10/761 - Proximity, similarity or dissimilarity measures
                        • G06V 10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
                            • G06V 10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
                                • G06V 10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
                        • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
                • G06V 20/00 - Scenes; Scene-specific elements
                    • G06V 20/70 - Labelling scene content, e.g. deriving syntactic or semantic representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a text generation method and device and a model training method and device, relating to the technical field of computer vision. The text generation method comprises the following steps: extracting visual features of an image to be processed; acquiring related text of the image to be processed; encoding the related text of the image to be processed to obtain related semantic features of the image to be processed; and generating a description text of the image to be processed according to the visual features of the image to be processed and the related semantic features of the image to be processed. Through these steps, the accuracy of the generated image description text can be improved.

Description

Text generation method and device, and model training method and device
Technical Field
The disclosure relates to the technical field of computer vision, and in particular to a text generation method and device and a model training method and device.
Background
Image description technology is one of the fundamental topics in the computer vision and language fields. Image description refers to automatically generating descriptive sentences for an image; the descriptive sentences should cover the semantic content of the image and describe that content in a proper order.
Image description mainly adopts encoding-decoding based methods. In the related art, a pre-trained object detector or classifier is often used as the encoder to extract image features, and a recurrent neural network (RNN) or an attention-based neural network model such as a Transformer is used as the decoder to decode the extracted image features and generate the image description sentence.
Disclosure of Invention
One technical problem to be solved by the present disclosure is to provide a solution that can improve the accuracy of the generated image description text.
According to a first aspect of the present disclosure, a text generation method is provided, including: extracting visual features of an image to be processed; acquiring a related text of an image to be processed; coding the related text of the image to be processed to obtain the related semantic features of the image to be processed; and generating a description text of the image to be processed according to the visual characteristic of the image to be processed and the related semantic characteristic of the image to be processed.
In some embodiments, the obtaining the text related to the image to be processed includes: determining the similarity between the image to be processed and the existing texts in the training text set; and selecting a related text of the image to be processed from the existing texts according to the similarity.
In some embodiments, determining the similarity between the image to be processed and the existing text includes: extracting global features of an image to be processed and global features of an existing text; and calculating the cosine similarity of the global features of the image to be processed and the global features of the existing text, and taking the cosine similarity as the similarity of the image to be processed and the existing text.
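As an illustration of this similarity computation, a minimal sketch in PyTorch is given below; the feature dimension and tensor shapes are assumptions for illustration, not values fixed by the present disclosure.

    import torch
    import torch.nn.functional as F

    def image_text_similarity(image_global: torch.Tensor, text_globals: torch.Tensor) -> torch.Tensor:
        # image_global: (D,) global feature of the image to be processed
        # text_globals: (N, D) global features of the existing texts
        image_global = F.normalize(image_global, dim=-1)   # unit-normalize the image feature
        text_globals = F.normalize(text_globals, dim=-1)   # unit-normalize the text features
        return text_globals @ image_global                 # dot products of unit vectors = cosine similarities

    # toy usage with random features standing in for extracted global features
    sims = image_text_similarity(torch.randn(512), torch.randn(1000, 512))
    print(sims.shape)  # torch.Size([1000])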
In some embodiments, encoding the text related to the image to be processed to obtain the related semantic features of the image to be processed includes: determining a related word sequence of the image to be processed according to the related text of the image to be processed; and encoding the related word sequence based on a semantic solver to obtain the related semantic features of the image to be processed, wherein the semantic solver is a trained neural network model based on an attention mechanism.
In some embodiments, encoding the related word sequence of the image to be processed based on the semantic solver to obtain the related semantic features of the image to be processed includes: splicing the related word sequence of the image to be processed with additional memory parameters to obtain an input word sequence; performing context encoding on the input word sequence based on an attention mechanism to obtain semantic features fused with context information; and performing semantic enhancement on the semantic features fused with context information based on a cross-attention mechanism, with the assistance of the visual features of the image to be processed, to obtain the related semantic features of the image to be processed.
In some embodiments, the method further comprises: acquiring a related word sequence of a sample image; and training a neural network model based on the attention mechanism according to the related word sequence of the sample image and a preset loss function to obtain the semantic solver, wherein the loss function is constructed with the goals of filtering out semantic words in the related word sequence of the sample image that are irrelevant to the sample image and reconstructing related semantic words that are missing.
In some embodiments, training the attention-based neural network model according to the related word sequence of the sample image and the preset loss function comprises: splicing the related word sequence of the sample image and the initialized memory parameter to obtain an input word sequence; inputting the input word sequence into a neural network model based on an attention mechanism to obtain output semantic features, wherein the output semantic features comprise a plurality of semantic word features; performing linear layer projection on the output semantic features to determine the probability distribution of each semantic word feature in the output semantic features on a semantic vocabulary; calculating the value of a loss function according to the probability distribution of each semantic word feature in the output semantic features on a semantic vocabulary; and optimizing the neural network model based on the attention mechanism according to the value of the loss function to obtain a semantic solver.
In some embodiments, encoding the text related to the image to be processed to obtain the related semantic features of the image to be processed further includes: determining the position encoding in which each semantic word feature in the semantic features output by the semantic solver participates; and fusing each semantic word feature with the position encoding in which it participates to obtain fused semantic word features, and taking the whole formed by all the fused semantic word features as the related semantic features of the image to be processed.
In some embodiments, determining the position encoding in which each semantic word feature in the semantic features output by the semantic solver participates includes: for each semantic word feature, determining the attention distribution of the semantic word feature over all position encodings in the position encoding sequence; and, according to the attention distribution, aggregating all position encodings in the position encoding sequence to obtain the position encoding in which the semantic word feature participates.
In some embodiments, generating the description text of the image to be processed according to the visual features of the image to be processed and the related semantic features of the image to be processed includes: and processing the visual features of the image to be processed and the related semantic features of the image to be processed based on a text decoder to obtain a description text of the image to be processed, wherein the text decoder is a trained neural network model adopting an attention mechanism.
In some embodiments, processing the visual features of the image to be processed and the associated semantic features of the image to be processed based on the text decoder to obtain the description text of the image to be processed includes: performing feature fusion on the text features input at the current decoding moment and predicted descriptors of the image to be processed based on an attention mechanism to obtain first semantic features; performing semantic enhancement on the text features input at the current decoding moment based on a cross attention mechanism under the assistance of the visual features of the image to be processed and the related semantic features of the image to be processed to obtain second semantic features; fusing the first semantic features and the second semantic features to obtain fused semantic features; determining probability distribution of each semantic word feature in the text features input at the current decoding moment according to the fused semantic features; determining the next descriptor of the image to be processed according to the probability distribution; and after all the descriptors of the image to be processed are obtained, taking the ordered sequence formed by all the descriptors as the description text of the image to be processed.
In some embodiments, extracting visual features of the image to be processed comprises: extracting local features and global features of an image to be processed; and determining the visual characteristics of the image to be processed according to the local characteristics and the global characteristics of the image to be processed.
In some embodiments, the local features and global features of the image to be processed are extracted using a contrastive language-image pre-training (CLIP) model.
In some embodiments, determining the visual characteristic of the image to be processed from the local characteristic and the global characteristic of the image to be processed comprises: mapping the local features and the global features of the image to be processed to a new feature space, and splicing the mapped local features and the mapped global features; and coding the spliced image features based on a visual coder to obtain the visual features of the image to be processed, wherein the visual coder is a trained neural network model stacked with a plurality of layers of coding blocks adopting a self-attention mechanism.
In some embodiments, encoding the stitched image features based on the visual encoder to obtain the visual features of the image to be processed includes: coding the spliced image features based on the coding blocks of the multilayer self-attention mechanism to obtain the local features after multilayer coding and the global features after multilayer coding; splicing and fusing the global features output by the coding blocks of each layer of the self-attention mechanism to obtain the overall global features; and splicing the overall global features and the multilayer coded local features to obtain the visual features of the image to be processed.
According to a second aspect of the present disclosure, a model training method is proposed, comprising: extracting visual features of the sample image; acquiring a related text of a sample image; coding the related text of the sample image to obtain the related semantic features of the sample image; and carrying out supervised training on the neural network model based on the attention mechanism according to the visual characteristics of the sample images and the related semantic characteristics of the sample images to obtain a text decoder, wherein the text decoder is used for generating the image description text.
According to a third aspect of the present disclosure, a text generation apparatus is provided, including: the characteristic extraction module is configured to extract visual characteristics of the image to be processed; the text acquisition module is configured to acquire a related text of the image to be processed; the text coding module is configured to code the related texts of the images to be processed so as to obtain the related semantic features of the images to be processed; and the generation module is configured to generate a description text of the image to be processed according to the visual features of the image to be processed and the related semantic features of the image to be processed.
According to a fourth aspect of the present disclosure, a model training apparatus is provided, including: a feature extraction module configured to extract visual features of the sample image; the text acquisition module is configured to acquire relevant texts of the sample image; the text coding module is configured to code the related texts of the sample images to obtain the related semantic features of the sample images; and the training module is configured to perform supervised training on the neural network model based on the attention mechanism according to the visual features of the sample images and the related semantic features of the sample images to obtain a text decoder, wherein the text decoder is used for generating the image description text.
According to a fifth aspect of the present disclosure, there is also provided a text generation apparatus, including: a memory; and a processor coupled to the memory, the processor configured to perform the text generation method as described above based on the instructions stored in the memory.
According to a sixth aspect of the present disclosure, there is also provided a model training apparatus, comprising: a memory; and a processor coupled to the memory, the processor configured to perform the model training method as described above based on instructions stored in the memory.
According to a seventh aspect of the present disclosure, a computer-readable storage medium is also proposed, on which computer program instructions are stored, which instructions, when executed by a processor, implement the text generation method or the model training method described above.
Compared with the related technology, in the embodiment of the disclosure, the related text of the image to be processed is obtained and encoded to obtain the related semantic features of the image to be processed, and the description text of the image to be processed is generated with the aid of the visual features of the image to be processed and the related semantic features of the image to be processed, so that the accuracy and the grammar consistency of the generated image description text can be improved.
Other features of the present disclosure and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure may be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:
fig. 1 is a flow diagram of a text generation method according to some embodiments of the present disclosure.
Fig. 2a is a schematic flow chart of extracting visual features of an image according to some embodiments of the present disclosure.
Fig. 2b is a schematic flow chart of obtaining text related to an image according to some embodiments of the present disclosure.
Fig. 2c is a schematic flow diagram of encoding relevant text of an image according to some embodiments of the present disclosure.
FIG. 3 is a schematic flow diagram of training a semantic solver, according to some embodiments of the present disclosure.
FIG. 4 is a flow diagram illustrating encoding of text associated with an image according to further embodiments of the present disclosure.
FIG. 5 is a flow diagram of a model training method according to some embodiments of the present disclosure.
Fig. 6 is a schematic structural diagram of a text generation apparatus according to some embodiments of the present disclosure.
FIG. 7 is a schematic diagram of a model training apparatus according to some embodiments of the present disclosure.
FIG. 8 is a schematic diagram of a structure of a text generation apparatus or a model training apparatus according to further embodiments of the present disclosure.
FIG. 9 is a block diagram of a computer system according to some embodiments of the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
To make the objects, technical solutions and advantages of the present disclosure more apparent, the present disclosure will be described in further detail below with reference to specific embodiments and the accompanying drawings.
Text generation methods in the related art rely too heavily on language prior knowledge in the training data, so the generated sentences struggle to highlight the salient semantic information in the image, and an object hallucination phenomenon sometimes occurs, i.e., semantic words are generated that do not correspond to anything appearing in the image, which limits the performance of the image description model.
Fig. 1 is a flow diagram of a text generation method according to some embodiments of the present disclosure. As shown in fig. 1, the method includes:
step S110: and extracting the visual features of the image to be processed.
In step S110, the image to be processed is encoded to obtain the visual characteristics of the image to be processed.
The visual features of an image may be represented by an ordered set of numerical values with fixed dimensions, such as a vector. For example, the visual features of an image are represented as V = {v_1, v_2, …, v_n}, where V is the visual feature vector of the image, v_1, v_2, …, v_n are its different components, and each of v_1, v_2, …, v_n may itself be a vector.
Step S130: and acquiring a related text of the image to be processed.
In some embodiments, the related text of the image to be processed is one or more sentences related to the image to be processed.
Step S150: and coding the related text of the image to be processed to obtain the related semantic features of the image to be processed.
In some embodiments, relevant sentences of the image to be processed are encoded to derive features of the relevant sentences. The features of a relevant sentence can be represented by an ordered set of numerical values with fixed dimensions, such as a vector. For example, the features of a relevant sentence are represented as S = {s_1, s_2, …, s_m}, where S is the sentence feature vector and s_1, s_2, …, s_m are the word vectors that constitute it.
Step S170: and generating a description text of the image to be processed according to the visual characteristic of the image to be processed and the related semantic characteristic of the image to be processed.
In some embodiments, the visual features of the image to be processed and the related semantic features of the image to be processed are processed based on a text decoder to generate the description text of the image to be processed. The text decoder is a trained neural network model adopting an attention mechanism. For example, a trained neural network stacked with a plurality of decoding modules based on masked multi-head attention is used as the text decoder.
The text decoder generates descriptive sentences of the images to be processed by integrating the visual features of the images to be processed and the related semantic features of the images to be processed. In the embodiment of the disclosure, the text decoder can obtain correct visual and semantic information with the assistance of the relevant semantic features and visual features of the image to be processed, reduce the dependence on language prior knowledge in training data, and improve the accuracy and grammar consistency of the generated image description text.
In some embodiments, processing the visual features of the image to be processed and the related semantic features of the image to be processed based on the text decoder to obtain the description text of the image to be processed includes: step S171 to step S176.
Step S171: and performing feature fusion on the text features input at the current decoding moment and predicted descriptors of the image to be processed based on a multi-head self-attention mechanism to obtain a first semantic feature.
In some embodiments, the input text features are obtained as follows: the sentence corresponding to each training sample image I is represented as S = {w_0, w_1, …, w_{T-1}}, where w_0, w_1, …, w_{T-1} are the words in the sentence and T denotes the length of the sentence; each word in the sentence S is encoded into a one-hot vector and further encoded to obtain the text feature vector W = {w_0, w_1, …, w_{T-1}}, where each element is the feature vector (word vector for short) of a word in the sentence. The text decoder takes the text features as input and, according to the visual features of the image to be processed and the related semantic features of the image to be processed, predicts each descriptor of the image to be processed in turn.
Illustratively, at the t-th decoding moment, the masked multi-head attention layer in the i-th decoding module, based on the previously output hidden state vector h_t^{i-1}, performs feature fusion on the text features input at the current decoding moment and the word vectors of the predicted descriptors of the image to be processed to obtain the first semantic feature h'_t^i. Specifically, multi-head self-attention may be performed according to the following formula:
h'_t^i = MultiHead(h_t^{i-1}, W_{<t}, W_{<t})
where h'_t^i denotes the first semantic feature, h_t^{i-1} denotes the previously output hidden state vector, W_{<t} denotes the word vectors of the predicted descriptors of the image to be processed, and MultiHead(·) denotes multi-head self-attention.
Step S172: and carrying out semantic enhancement on the text features input at the current decoding moment based on a multi-head cross attention mechanism under the assistance of the visual features of the image to be processed and the related semantic features of the image to be processed so as to obtain second semantic features.
In some embodiments, at the t-th decoding moment, the multi-head cross-attention layer in the i-th decoding module takes the previously output hidden state vector h_t^{i-1} and performs cross-attention with the visual features of the image to be processed and with the related semantic features of the image to be processed, respectively, to obtain the second semantic feature ĥ_t^i. Specifically, multi-head cross-attention may be performed according to the following formula:
ĥ_t^i = MultiHead(h_t^{i-1}, V, V) + MultiHead(h_t^{i-1}, S̃, S̃)
where ĥ_t^i denotes the second semantic feature, h_t^{i-1} denotes the previously output hidden state vector, V denotes the visual features of the image to be processed, S̃ denotes the related semantic features of the image to be processed, MultiHead(h_t^{i-1}, V, V) denotes multi-head cross-attention over the visual features of the image to be processed, and MultiHead(h_t^{i-1}, S̃, S̃) denotes multi-head cross-attention over the related semantic features of the image to be processed.
Step S173: and fusing the first semantic features and the second semantic features to obtain fused semantic features.
In some embodiments, at the t-th decoding moment, the i-th decoding module fuses the first semantic feature and the second semantic feature with a sigmoid gate function, thereby obtaining the output h_t^i of the i-th decoding module. The sigmoid function, also called the logistic function, is used for hidden-layer neuron outputs; it maps a real number into the interval (0, 1).
In some embodiments, the output h_t^i of the i-th decoding module is obtained according to the following formulas:
g_t^i = sigmoid(W_g [h'_t^i; ĥ_t^i])
h_t^i = Norm(h_t^{i-1} + g_t^i ⊙ h'_t^i + (1 − g_t^i) ⊙ ĥ_t^i)
where h_t^i denotes the hidden state vector output by the i-th decoding module, Norm(·) denotes the normalization operation, ĥ_t^i denotes the second semantic feature, h'_t^i denotes the first semantic feature, h_t^{i-1} denotes the previously output hidden state, sigmoid(·) denotes the sigmoid gate function, and W_g denotes a network parameter.
At the t-th decoding moment, each decoding module performs steps S171 to S173 in turn, and the hidden state vector output by the last decoding module is the fused semantic feature.
In the embodiment of the present disclosure, the text decoder can further improve the accuracy of the generated image description sentence by performing cross-attention and feature fusion on the visual features and the semantic features of the image to be processed in the manner shown in steps S171 to S173.
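For concreteness, one possible PyTorch sketch of a decoding block covering steps S171 to S173 is given below. It assumes a summed dual cross-attention and the gated residual fusion written above; the layer wiring, dimensions, and normalization placement are illustrative assumptions rather than the exact configuration of the present disclosure.

    import torch
    import torch.nn as nn

    class GatedDecodingBlock(nn.Module):
        """Illustrative decoding block: masked self-attention over the predicted words,
        cross-attention over visual and semantic features, and a sigmoid-gated fusion."""
        def __init__(self, d_model: int = 512, n_heads: int = 8):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.vis_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.sem_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.gate = nn.Linear(2 * d_model, d_model)   # plays the role of W_g
            self.norm = nn.LayerNorm(d_model)

        def forward(self, h_prev, words, vis_feats, sem_feats, causal_mask=None):
            # Step S171: first semantic feature from masked multi-head self-attention
            h_first, _ = self.self_attn(h_prev, words, words, attn_mask=causal_mask)
            # Step S172: second semantic feature from cross-attention over visual and semantic features
            h_vis, _ = self.vis_attn(h_prev, vis_feats, vis_feats)
            h_sem, _ = self.sem_attn(h_prev, sem_feats, sem_feats)
            h_second = h_vis + h_sem
            # Step S173: sigmoid-gated fusion with a residual connection and normalization
            g = torch.sigmoid(self.gate(torch.cat([h_first, h_second], dim=-1)))
            return self.norm(h_prev + g * h_first + (1.0 - g) * h_second)

    # toy usage: batch of 2, prefix length 5, 49 visual tokens, 10 semantic tokens
    block = GatedDecodingBlock()
    h = block(torch.randn(2, 5, 512), torch.randn(2, 5, 512),
              torch.randn(2, 49, 512), torch.randn(2, 10, 512))
    print(h.shape)  # torch.Size([2, 5, 512])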
Step S174: and determining the probability distribution of each semantic word feature in the text features input at the current decoding moment according to the fused semantic features.
In some embodiments, the fused semantic features are processed based on a normalization index (softmax) function to obtain a probability distribution of each semantic word feature.
Step S175: and determining the next descriptor of the image to be processed according to the probability distribution.
In step S175, the semantic word with the highest probability value is taken as the next descriptor of the image to be processed. The predicted descriptor of the image to be processed is then appended to the previously predicted partial sentence to form the updated partial sentence. Steps S171 to S175 are then executed cyclically until the end marker is predicted.
Step S176: and after all the descriptors of the image to be processed are obtained, taking the ordered sequence formed by all the descriptors as the description text of the image to be processed.
In the embodiment of the disclosure, the relevant text of the image to be processed is obtained, the relevant text of the image to be processed is encoded to obtain the relevant semantic features of the image to be processed, and the description text of the image to be processed is generated with the assistance of the visual features of the image to be processed and the relevant semantic features of the image to be processed, so that the dependence on the prior information of the text can be reduced, and the accuracy and the grammar consistency of the generated image description text can be improved.
Fig. 2a is a schematic flow chart of extracting visual features of an image according to some embodiments of the present disclosure. The flow shown in fig. 2a is an exemplary implementation of step S110. As shown in fig. 2a, a process of extracting visual features of an image according to an embodiment of the present disclosure includes:
step S111: and extracting local features and global features of the image to be processed.
The global features of the image refer to the overall features of the image, and the common global features include color features, texture features, shape features and the like. The local features of the image are features extracted from the image locally, such as edges, corners, lines, regions, and the like.
In some embodiments, a contrastive language-image pre-training (CLIP) model is used to extract the local features (such as grid features) and the global features of the image to be processed. Specifically, the local features and global features of the image to be processed are extracted using the image encoder in the CLIP model.
The CLIP model is obtained by training, with a self-supervised contrastive algorithm, on a very large-scale image-text dataset crawled from the web. The features encoded by the image encoder of this model contain richer visual information, and the model does not require a predefined label set during pre-training, so the semantic understanding capability of the visual features is not restricted; this improves the visual feature extraction for the image to be processed and thereby the accuracy of the finally generated image description sentence.
In other embodiments, local and global features of the image are extracted using a pre-trained object detector or classifier.
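A rough sketch of global feature extraction with the publicly released CLIP package is shown below. Note that CLIP's encode_image only returns the pooled global embedding; obtaining the grid (local) features would require hooking into the visual backbone, which is only indicated here with a hypothetical helper, and the image path is a placeholder.

    import clip          # pip install git+https://github.com/openai/CLIP.git
    import torch
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
    with torch.no_grad():
        global_feat = model.encode_image(image)   # pooled global feature, e.g. (1, 512)

    # Grid/local features are not returned by encode_image(); one would register a
    # forward hook on the visual backbone (or use a modified encoder) to capture the
    # per-patch token features, e.g. a hypothetical extract_grid_features(model, image).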
Step S112: and determining the visual characteristics of the image to be processed according to the local characteristics and the global characteristics of the image to be processed.
In some embodiments, step S112 includes: mapping the local features and the global features of the image to be processed to a new feature space, and splicing the mapped local features and the mapped global features; and coding the spliced image features based on a visual coder to obtain the visual features of the image to be processed, wherein the visual coder is a trained neural network model stacked with a plurality of layers of coding blocks adopting an attention mechanism.
In some embodiments, a fully connected layer is used to map the local features and global features of the image to be processed to a feature space more suitable for the image description text generation task, so as to improve the effect of the finally generated image description text.
In some embodiments, encoding the stitched image features based on the visual encoder comprises: coding the spliced image features based on the coding blocks of the multilayer self-attention mechanism to obtain multilayer coded local features and multilayer coded global features; splicing and fusing the global features output by the coding blocks of each layer of the self-attention mechanism to obtain the overall global features; and splicing the overall global features and the multilayer coded local features to obtain the visual features of the image to be processed.
For example, suppose the stitched image features are represented as Ṽ = [ṽ_g; ṽ_1, …, ṽ_{n_i}], where ṽ_g denotes the mapped global feature vector of the image and ṽ_1, …, ṽ_{n_i} denote the n_i mapped local feature vectors of the image. The stitched image features Ṽ are input into a visual encoder stacked with N_v layers of encoding blocks based on the self-attention mechanism, which perform mutual fusion among the features to obtain the fusion-improved visual features. Meanwhile, the global features output by each layer of the visual encoder are spliced and fused to obtain the overall global feature. Finally, the overall global feature and the multi-layer-encoded local features are combined into the final visual features V, i.e. the visual features of the image to be processed.
The attention mechanism can be described as a process of mapping a query vector and a series of key-value vector pairs to an output vector, where the output vector is a weighted sum of the value vectors and the weights are calculated from the query vector and the key vectors. Illustratively, the output of the attention layer may be calculated by the following matrix operation formula:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
where Attention(Q, K, V) denotes the output of the attention layer, Q denotes the query matrix, K denotes the key matrix, V denotes the value matrix, and d_k is a predefined parameter (the dimension of the key vectors).
In the visual encoder, the query vector, the key vector, and the value vector of the self-attention mechanism-based encoded block are all stitched image features.
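A possible sketch of such a visual encoder is given below; the stacking depth N_v, the feature dimensions, and the way the per-layer global features are fused are assumptions made for illustration, not the exact design of the present disclosure.

    import torch
    import torch.nn as nn

    class VisualEncoder(nn.Module):
        """Maps the global + grid features, runs N_v self-attention blocks, and
        concatenates the per-layer global outputs into an overall global feature."""
        def __init__(self, in_dim=512, d_model=512, n_layers=3, n_heads=8):
            super().__init__()
            self.proj = nn.Linear(in_dim, d_model)                      # map to a new feature space
            self.blocks = nn.ModuleList(
                [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
                 for _ in range(n_layers)])
            self.fuse_global = nn.Linear(n_layers * d_model, d_model)   # fuse per-layer global features

        def forward(self, global_feat, local_feats):
            # stitch [global; locals] after projection
            x = self.proj(torch.cat([global_feat.unsqueeze(1), local_feats], dim=1))
            per_layer_globals = []
            for block in self.blocks:
                x = block(x)                          # self-attention based encoding block
                per_layer_globals.append(x[:, 0])     # global token output of this layer
            overall_global = self.fuse_global(torch.cat(per_layer_globals, dim=-1))
            # final visual features: overall global + multi-layer encoded local tokens
            return torch.cat([overall_global.unsqueeze(1), x[:, 1:]], dim=1)

    feats = VisualEncoder()(torch.randn(2, 512), torch.randn(2, 49, 512))
    print(feats.shape)  # torch.Size([2, 50, 512])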
In the embodiment of the disclosure, the global features and the local features of the image to be processed are extracted, and the global features and the local features of the image are spliced and fused to obtain the visual features of the image to be processed, so that richer visual features can be extracted, and the accuracy of the subsequent generation of the image description text is improved.
Fig. 2b is a schematic flow chart of obtaining text related to an image according to some embodiments of the present disclosure. The flow shown in fig. 2b is an exemplary implementation of step S130. As shown in fig. 2b, the process of acquiring relevant text of an image according to the embodiment of the present disclosure includes:
step S131: and determining the similarity between the image to be processed and the existing texts in the training text set.
In some embodiments, the similarity between the image to be processed and the existing text is determined according to the following method: extracting global features of the image to be processed and global features of existing texts in the training text set; and calculating the cosine similarity of the global features of the image to be processed and the global features of the existing texts in the training text set, and taking the cosine similarity as the similarity between the image to be processed and the existing texts.
In some embodiments, the sentences in the training set are used as the existing texts, the global features of the existing texts are extracted in advance based on the CLIP model, and the global features of all the existing texts are stored. When an image to be processed is processed, extracting the global feature of the image to be processed based on a CLIP model, then calculating the cosine similarity between the global feature of the image to be processed and the global feature of the existing text, and taking the cosine similarity as the similarity between the image to be processed and the existing text.
Step S132: and selecting a related text of the image to be processed from the existing texts according to the similarity.
In some embodiments, K sentences with the highest similarity to the image to be processed are taken as relevant texts of the image to be processed, wherein K is an integer greater than or equal to 1.
In other embodiments, sentences with similarity greater than or equal to a preset threshold with the image to be processed are taken as relevant texts of the image to be processed.
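Combining the two selection strategies, retrieval over a bank of precomputed sentence features might be sketched as follows; the bank layout, the value of K, and the threshold are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def retrieve_related_sentences(image_global, text_bank, sentences, k=5, min_sim=None):
        """text_bank: (N, D) precomputed global features of the training sentences."""
        sims = F.normalize(text_bank, dim=-1) @ F.normalize(image_global, dim=-1)  # cosine similarities
        if min_sim is not None:                       # alternative: similarity threshold
            idx = (sims >= min_sim).nonzero(as_tuple=True)[0]
        else:                                         # default: top-K most similar sentences
            idx = sims.topk(k).indices
        return [sentences[i] for i in idx.tolist()]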
In the embodiment of the disclosure, the cross-modal retrieval can efficiently and accurately obtain the relevant text of the image to be processed, thereby being beneficial to improving the accuracy of the image description text generated under the assistance of the relevant text of the image to be processed.
Fig. 2c is a schematic flow diagram of encoding relevant text of an image according to some embodiments of the present disclosure. The flow shown in fig. 2c is an exemplary implementation of step S150. As shown in fig. 2c, the process of encoding the relevant text of the image according to the embodiment of the present disclosure includes:
step S151: and determining a related word sequence of the image to be processed according to the related text of the image to be processed.
In some embodiments, the relevant text of the image to be processed is one or more sentences. And processing the sentences by removing stop words and the like so as to obtain a related word sequence of the image to be processed.
Step S152: and coding the related word sequence based on a semantic solver to obtain the related semantic features of the image to be processed.
The semantic solver is a trained neural network model based on the attention mechanism, for example a trained neural network model stacked with N_s Transformer encoding blocks based on attention.
In some embodiments, a semantic solver is used to filter out semantic words in the related word sequence that are not related to the image to be processed, while reconstructing related but missing semantic words. In these embodiments, step S152 includes steps a1 to a3.
Step a 1: and splicing the related word sequence of the image to be processed and the additional memory parameters to obtain an input word sequence.
The related word sequence of the image to be processed is composed of the feature vectors of a plurality of related words. The additional memory parameters are a set of learnable query parameters (i.e., a set of slots) that are randomly initialized before model training begins and updated as the model is iteratively learned. After model training is finished, the final query parameters are stored and used as the memory parameters to be concatenated with the related word sequence of the image to be processed.
And a2, context coding the input word sequence based on the multi-head self-attention mechanism to obtain semantic features fused with context information.
In some embodiments, the query vector, the key vector, and the value vector are related word sequences of the image to be processed when context coding the input word sequence based on a multi-headed self-attention mechanism.
A3, carrying out semantic enhancement on the semantic features fused with the context information based on a multi-head cross attention mechanism with the aid of the visual features of the image to be processed to obtain the related semantic features of the image to be processed.
In some embodiments, when semantic enhancement is performed on the semantic features fused with the context information based on the multi-head cross attention mechanism, the semantic features fused with the context information are used as query vectors, and the visual features of the image to be processed are used as key vectors and value vectors.
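One way steps a1 to a3 could look in PyTorch is sketched below; the number of memory slots, the single-layer structure, and the dimensions are assumptions for illustration rather than the exact configuration of the present disclosure.

    import torch
    import torch.nn as nn

    class SemanticSolverBlock(nn.Module):
        """One encoding block: concatenate memory slots, context-encode with self-attention,
        then semantically enhance via cross-attention with the visual features."""
        def __init__(self, d_model=512, n_heads=8, n_slots=10):
            super().__init__()
            self.memory = nn.Parameter(torch.randn(n_slots, d_model))   # learnable memory slots
            self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

        def forward(self, word_feats, vis_feats):
            b = word_feats.size(0)
            # step a1: concatenate related-word features with the memory parameters
            x = torch.cat([word_feats, self.memory.unsqueeze(0).expand(b, -1, -1)], dim=1)
            # step a2: context encoding with multi-head self-attention
            ctx, _ = self.self_attn(x, x, x)
            # step a3: semantic enhancement by cross-attention over the visual features
            sem, _ = self.cross_attn(ctx, vis_feats, vis_feats)
            return sem

    sem = SemanticSolverBlock()(torch.randn(2, 12, 512), torch.randn(2, 50, 512))
    print(sem.shape)  # torch.Size([2, 22, 512])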
In the embodiment of the disclosure, the semantic solver is used to encode the related word sequence of the image to be processed, so that irrelevant semantic information in the related text of the image to be processed can be filtered out and missing semantic information can be inferred, which improves the accuracy of the image description text subsequently generated based on the related semantic features.
FIG. 3 is a schematic flow diagram of a semantic solver through training, according to some embodiments of the present disclosure. As shown in fig. 3, the process of obtaining a semantic solver by training according to the embodiment of the present disclosure includes:
step S310: and acquiring a related word sequence of the sample image.
In some embodiments, the related texts of the sample image are obtained through cross-modal retrieval, and the related word sequence of the sample image is determined according to the related texts of the sample image.
In some embodiments, the related text of the sample image is one or more sentences, and the related word sequence of the sample image is obtained by removing stop words and the like from the related sentences.
Step S320: and training the neural network model based on the attention mechanism according to the related word sequence of the sample image and a preset loss function to obtain a semantic solver.
The loss function is constructed with the goals of filtering out semantic words in the related word sequence of the sample image that are irrelevant to the sample image and reconstructing related semantic words that are missing. The process of optimizing the model based on this loss function can be expressed as a combination of a single-label classification problem and a multi-label classification problem.
In some embodiments, step S320 includes: step b1 to step b 5.
And b1, splicing the related word sequence of the sample image with the initialized memory parameters to obtain an input word sequence.
The related word sequence of the sample image is composed of feature vectors of a plurality of related words. Wherein the memory parameters are a set of learnable query parameters (i.e., a set of slots), which are initialized randomly before the model training begins and updated as the model is iteratively learned. And after the model training is finished, storing the final query parameters.
And b2, inputting the input word sequence into a neural network model based on an attention mechanism to obtain an output semantic feature. Wherein the output semantic features include a plurality of semantic word features.
And b3, performing linear layer projection on the output semantic features to determine the probability distribution of each semantic word feature in the output semantic features on the semantic vocabulary.
In some embodiments, the semantic vocabulary consists of all semantic words in the training text set plus an identifier representing an unrelated semantic word.
In this step, the semantic features finally output by the semantic solver are taken as the condition, and a linear predictor is used to estimate the probability distribution of each semantic word feature over the semantic vocabulary, thereby obtaining the semantic prediction. Specifically, a linear layer can be used to directly project each semantic word feature in the output semantic features to a D-dimensional vector, where D is the size of the predefined semantic vocabulary, and each such vector is the probability distribution p_i of the corresponding semantic word feature over the whole semantic vocabulary.
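A small sketch of this projection step, with an assumed hidden size and an assumed vocabulary size D:

    import torch
    import torch.nn as nn

    D = 1000                      # size of the predefined semantic vocabulary (assumed)
    proj = nn.Linear(512, D)      # linear layer projecting each semantic word feature

    sem_feats = torch.randn(2, 22, 512)            # output of the semantic solver (toy shapes)
    probs = proj(sem_feats).softmax(dim=-1)        # per-feature distribution over the vocabulary
    print(probs.shape)                             # torch.Size([2, 22, 1000])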
And b4, calculating the value of the loss function according to the probability distribution of each semantic word feature in the output semantic features on the semantic vocabulary.
In some embodiments, the process of filtering out irrelevant semantic words in the relevant word sequence is regarded as a single-label classification task, and a cross-entropy loss function may be used as a loss function corresponding to the single-label classification task.
After obtaining the probability distribution p_i of each semantic word in the related word sequence, a first loss value is calculated based on the cross-entropy loss function:
L_x = − Σ_i Σ_c y_{i,c} log(p_{i,c})
where L_x denotes the first loss value, y_{i,c} and p_{i,c} denote the c-th components of y_i and p_i respectively, c denotes the category, and y_i is the ground-truth label representation of the i-th semantic word.
In some embodiments, the process of inferring the missing relevant semantic words is considered a multi-label classification task, and the corresponding loss function of the multi-label classification task may employ an asymmetric loss function.
After obtaining the probability distributions p_m corresponding to the feature vectors of the memory parameters, the probability distributions are normalized based on the sigmoid activation function and then max-pooled over the memory slots to obtain the overall probability distribution P_m of the memory parameter feature vectors over the semantic vocabulary. Next, a second loss value is calculated based on the asymmetric loss function:
L_m = Asym(P_m, y_m)
where L_m denotes the second loss value, Asym denotes the asymmetric loss function, and y_m is the ground-truth label of all the missing related semantic words.
After the first loss value and the second loss value are obtained, the value of the total loss function is calculated according to the first loss value and the second loss value. Illustratively, the total loss value is calculated according to the following formula:
L_s = L_x + L_m
where L_s denotes the total loss value.
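A hedged sketch of the combined loss is given below. The asymmetric loss is written here in a simplified focal-style form with assumed focusing parameters, since its exact formulation is not reproduced above; tensor shapes are illustrative.

    import torch
    import torch.nn.functional as F

    def semantic_solver_loss(word_logits, word_labels, memory_logits, multi_labels,
                             gamma_neg=4.0, gamma_pos=1.0):
        # L_x: single-label cross-entropy over the related-word positions
        loss_x = F.cross_entropy(word_logits.reshape(-1, word_logits.size(-1)),
                                 word_labels.reshape(-1))
        # overall distribution of the memory slots: sigmoid then max-pool over slots
        p = torch.sigmoid(memory_logits).amax(dim=1)          # (batch, vocab)
        # L_m: simplified asymmetric (focal-style) multi-label loss
        pos = multi_labels * (1 - p).pow(gamma_pos) * torch.log(p.clamp_min(1e-8))
        neg = (1 - multi_labels) * p.pow(gamma_neg) * torch.log((1 - p).clamp_min(1e-8))
        loss_m = -(pos + neg).mean()
        return loss_x + loss_m                                 # L_s = L_x + L_m

    loss = semantic_solver_loss(torch.randn(2, 12, 1000), torch.randint(0, 1000, (2, 12)),
                                torch.randn(2, 10, 1000), torch.randint(0, 2, (2, 1000)).float())
    print(loss.item())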
And b5, optimizing the neural network model based on the attention mechanism according to the value of the loss function to obtain the semantic solver.
In the embodiment of the disclosure, the semantic solver is trained through the above steps, so that the performance of the semantic solver can be improved.
FIG. 4 is a flow diagram illustrating encoding of text associated with an image according to further embodiments of the present disclosure. The flow shown in fig. 4 is another exemplary implementation of step S150. As shown in fig. 4, a process of encoding relevant text of an image according to an embodiment of the present disclosure includes:
step 151; and determining a related word sequence of the image to be processed according to the related text of the image to be processed.
In some embodiments, the relevant text of the image to be processed is one or more sentences. And removing stop words and the like from the sentences so as to obtain a related word sequence of the image to be processed.
Step S152: and coding the related word sequence based on a semantic solver to output semantic features.
The semantic solver is a trained neural network model based on the attention mechanism, for example a trained neural network model stacked with N_s Transformer encoding blocks based on attention.
Step S153: and fusing the semantic features and the position coding features output by the semantic manager based on a semantic sequencer to obtain the related semantic features of the image to be processed.
In some embodiments, step S153 includes: step c1 and step c 2.
And c1, determining, based on the semantic sequencer, the position encoding in which each semantic word feature in the semantic features output by the semantic solver participates.
In some embodiments, for each semantic word feature, the attention distribution of the semantic word feature over all position encodings in the position encoding sequence is determined; then, according to this attention distribution, all position encodings in the sequence are aggregated to obtain the position encoding in which the semantic word feature participates. For example, the position encoding in which each semantic word participates is calculated by the following formula:
p_i = Σ_j α_{i,j} PE_j
where p_i denotes the position encoding in which the i-th semantic word participates, α_{i,j} denotes the attention weight of the i-th semantic word feature on the j-th position encoding PE_j, and p_i can be interpreted as a "soft" estimation of the linguistic order of the corresponding semantic word feature in the semantic features.
And c2, fusing each semantic word feature with the position encoding in which it participates to obtain the fused semantic word feature, and taking the whole formed by all the fused semantic word features as the related semantic features of the image to be processed.
For example, the fused semantic word feature is obtained by the following formula:
ŝ_i = s̃_i + p_i
where ŝ_i denotes the fused semantic word feature, s̃_i denotes the semantic word feature, and p_i denotes the position encoding in which the semantic word participates.
The whole formed by all the fused semantic word features, Ŝ = {ŝ_1, ŝ_2, …}, is taken as the related semantic features of the image to be processed, i.e. an ordered sequence of semantic words.
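The semantic sequencer of steps c1 and c2 could be sketched as follows, assuming learned position encodings, scaled dot-product attention weights, and the additive fusion written above; these choices are illustrative assumptions.

    import torch
    import torch.nn as nn

    class SemanticSequencer(nn.Module):
        """Soft-orders semantic word features by attending over a position-encoding sequence."""
        def __init__(self, d_model=512, max_positions=20):
            super().__init__()
            self.pos_enc = nn.Parameter(torch.randn(max_positions, d_model))  # position encoding sequence

        def forward(self, sem_word_feats):                    # (batch, K, d_model)
            # step c1: attention of each semantic word feature over all position encodings
            attn = torch.softmax(sem_word_feats @ self.pos_enc.t()
                                 / sem_word_feats.size(-1) ** 0.5, dim=-1)    # (batch, K, max_positions)
            attended_pos = attn @ self.pos_enc                                # aggregated position encodings
            # step c2: fuse each semantic word feature with its attended position encoding
            return sem_word_feats + attended_pos

    ordered = SemanticSequencer()(torch.randn(2, 10, 512))
    print(ordered.shape)  # torch.Size([2, 10, 512])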
In the embodiment of the present disclosure, the processing by the semantic solver and the semantic sequencer described above yields position-aware related semantic features. Furthermore, the position-aware related semantic features, as an additional language prior, can encourage the generation of relevant and coherent descriptions while improving the accuracy of the generated image description sentences, thereby helping to improve their grammatical consistency.
FIG. 5 is a flow diagram of a model training method according to some embodiments of the present disclosure. As shown in fig. 5, the model training method of the embodiment of the present disclosure includes:
step S510: and extracting the visual features of the sample image.
The training data set includes sample images and the text corresponding to each sample image. In some embodiments, the text corresponding to each sample image I is a sentence, which may be represented as S = {w_0, w_1, …, w_{T-1}}, where T denotes the length of the sentence.
In some embodiments, step S510 includes: extracting local features and global features of the sample image; and determining the visual characteristics of the sample image according to the local characteristics and the global characteristics of the sample image.
In some embodiments, local features (such as grid features) and global features of the sample image are extracted using a contrastive language-image pre-training (CLIP) model. Specifically, the local features and global features of the sample image are extracted using the image encoder in the CLIP model.
In other embodiments, local features and global features of the sample images are extracted using a pre-trained object detector or classifier.
In some embodiments, determining the visual characteristic of the sample image from the local characteristic and the global characteristic of the sample image comprises: mapping the local features and the global features of the sample image to a new feature space, and splicing the mapped local features and the mapped global features; and coding the spliced image features based on a visual coder to obtain the visual features of the sample image, wherein the visual coder is a trained neural network model stacked with a plurality of layers of coding blocks adopting a self-attention mechanism.
Step S530: acquiring the related text of the sample image.
In some embodiments, the similarity between the sample image and each text in the training data set is determined, and the relevant texts of the sample image are selected from the texts of the training data set according to the similarity.
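A minimal sketch of this retrieval step is shown below, assuming that the global feature of the sample image and the global features of the candidate texts (for example, CLIP embeddings) have already been extracted; the function name and the top-k selection are illustrative assumptions.

```python
import torch.nn.functional as F


def select_relevant_texts(image_emb, text_embs, texts, top_k=5):
    """Rank training texts by cosine similarity to the sample image.

    image_emb: (d,)   global feature of the sample image
    text_embs: (M, d) global features of the M texts in the training data set
    texts:     list of the M training sentences
    """
    sims = F.cosine_similarity(image_emb.unsqueeze(0), text_embs, dim=-1)  # (M,)
    top = sims.topk(min(top_k, len(texts))).indices
    # The highest-scoring sentences serve as the related texts of the sample image
    return [texts[i] for i in top.tolist()]
```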
Step S550: coding the related texts of the sample images to obtain the related semantic features of the sample images.
In some embodiments, a related word sequence of the sample image is determined according to the related text of the sample image; and the related word sequence is coded based on a semantic solver to obtain the related semantic features of the sample image, wherein the semantic solver is a trained attention mechanism-based neural network model.
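For illustration only, an attention-based semantic solver of this kind might be sketched as follows; the memory-token splicing and the cross-attention with visual features follow the description given elsewhere in this disclosure, while the specific modules, layer counts, and dimensions are assumptions.

```python
import torch
import torch.nn as nn


class SemanticSolver(nn.Module):
    """Hypothetical attention-based semantic solver."""

    def __init__(self, vocab_size, d_model=512, num_heads=8, num_memory=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Additional learnable memory parameters spliced to the word sequence
        self.memory = nn.Parameter(torch.randn(num_memory, d_model))
        self.self_attn = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, word_ids, visual_feats):
        # word_ids: (B, M) related word sequence; visual_feats: (B, N, d_model)
        words = self.embed(word_ids)                                  # (B, M, d_model)
        memory = self.memory.unsqueeze(0).expand(words.size(0), -1, -1)
        spliced = torch.cat([words, memory], dim=1)                   # input word sequence
        context = self.self_attn(spliced)                             # context encoding
        # Semantic enhancement with the visual features via cross-attention
        enhanced, _ = self.cross_attn(context, visual_feats, visual_feats)
        return enhanced
```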
In other embodiments, the related word sequence of the sample image is determined according to the related text of the sample image; the related word sequence is coded based on the semantic solver; and the semantic features output by the semantic solver are fused with position coding features based on a semantic sequencer to obtain the related semantic features of the sample image.
Step S570: performing supervised training on the attention mechanism-based neural network model according to the visual features of the sample images and the related semantic features of the sample images to obtain a text decoder, wherein the text decoder is used for generating image description text.
In some embodiments, step S570 comprises: performing feature fusion on the text features input at the current decoding moment and the predicted descriptor of the sample image based on an attention mechanism to obtain a first semantic feature; carrying out semantic enhancement on the text features input at the current decoding moment based on a cross attention mechanism under the assistance of the visual features of the sample images and the related semantic features of the sample images to obtain second semantic features; fusing the first semantic features and the second semantic features to obtain fused semantic features; determining probability distribution of each semantic word feature in the text features input at the current decoding moment according to the fused semantic features; determining the value of the loss function according to the probability distribution; and training the model according to the value of the loss function to obtain the text decoder.
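A highly simplified sketch of one such supervised training step is given below; the text_decoder interface, the teacher-forcing split of the caption, and the plain cross-entropy objective are assumptions for illustration, and the attention-based fusion details listed above are hidden inside the hypothetical decoder module.

```python
import torch.nn.functional as F


def training_step(text_decoder, optimizer, visual_feats, sem_feats, captions, pad_id=0):
    """One supervised training step for the text decoder (hypothetical interface).

    visual_feats: (B, N, d) visual features of the sample images
    sem_feats:    (B, M, d) related semantic features of the sample images
    captions:     (B, T)    token ids of the ground-truth sentences
    """
    inputs, targets = captions[:, :-1], captions[:, 1:]
    # The decoder fuses the input tokens with the visual and semantic features
    # via (cross-)attention and outputs a distribution over the vocabulary.
    logits = text_decoder(inputs, visual_feats, sem_feats)  # (B, T-1, vocab_size)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_id,
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```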
In the embodiment of the disclosure, the performance of the generated text decoder can be improved through the above steps, and the accuracy and the grammar consistency of the image description text generated based on the text decoder are improved.
Fig. 6 is a schematic structural diagram of a text generation apparatus according to some embodiments of the present disclosure. As shown in fig. 6, the text generation apparatus according to the embodiment of the present disclosure includes: the system comprises a feature extraction module 610, a text acquisition module 620, a text coding module 630 and a generation module 640.
A feature extraction module 610 configured to extract visual features of the image to be processed.
The feature extraction module 610 encodes the image to be processed to obtain the visual features of the image to be processed.
Therein, the visual features of an image may be represented by an ordered set of numerical values with fixed dimensions, such as a vector. For example, the visual features of an image are represented as a visual feature vector {v_1, v_2, …, v_n}, where v_1, v_2, …, v_n correspond to different dimensions of the visual feature vector, and each of v_1, v_2, …, v_n may itself be a vector.
A text acquisition module 620 configured to acquire the related text of the image to be processed.
In some embodiments, the related text of the image to be processed is one or more sentences related to the image to be processed.
The text encoding module 630 is configured to encode the related text of the image to be processed to obtain the related semantic features of the image to be processed.
In some embodiments, the text encoding module 630 encodes the relevant sentences of the image to be processed to obtain the features of the relevant sentences. The features of a relevant sentence can be represented by an ordered set of numerical values with fixed dimensions, such as a vector. For example, a relevant sentence is characterized as a sentence feature vector {s_1, s_2, …, s_m}, where s_1, s_2, …, s_m are the word vectors constituting the sentence feature vector.
The generating module 640 is configured to generate a description text of the image to be processed according to the visual features of the image to be processed and the related semantic features of the image to be processed.
In some embodiments, the visual features of the image to be processed and the related semantic features of the image to be processed are processed based on a text decoder to generate the description text of the image to be processed, wherein the text decoder is a trained neural network model adopting an attention mechanism. For example, a trained neural network stacked with a plurality of masked multi-head attention-based decoding modules is used as the text decoder.
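As an illustrative sketch of how such a text decoder could be applied at inference time, a greedy decoding loop is shown below; the decoder interface and the special token ids bos_id and eos_id are assumptions.

```python
import torch


@torch.no_grad()
def generate_caption(text_decoder, visual_feats, sem_feats, bos_id=1, eos_id=2, max_len=20):
    """Greedy decoding with a text decoder exposing a hypothetical interface."""
    tokens = torch.tensor([[bos_id]])                            # (1, 1) start token
    for _ in range(max_len):
        logits = text_decoder(tokens, visual_feats, sem_feats)   # (1, t, vocab_size)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)     # most likely next word
        tokens = torch.cat([tokens, next_id], dim=1)
        if next_id.item() == eos_id:
            break
    return tokens.squeeze(0).tolist()
```

In practice, beam search or sampling could replace the greedy choice; this sketch only shows where the visual features and the relevant semantic features enter the decoding loop.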
In the embodiment of the disclosure, with the assistance of the relevant semantic features and visual features of the image to be processed, the dependency on the language prior knowledge in the training data can be reduced when the image description text is generated, and the accuracy and the grammar consistency of the generated image description text are improved.
FIG. 7 is a schematic diagram of a model training apparatus, according to some embodiments of the present disclosure. As shown in fig. 7, the model training apparatus according to the embodiment of the present disclosure includes: a feature extraction module 710, a text acquisition module 720, a text encoding module 730, and a training module 740.
A feature extraction module 710 configured to extract visual features of the sample image.
The training data set includes a sample image and a text corresponding to the sample image. In some embodiments, the text corresponding to each sample image I is a sentence, which may be represented as S = {w_0, w_1, …, w_{T-1}}, where T represents the length of the sentence.
In some embodiments, the feature extraction module 710 extracts local features and global features of the sample image; the feature extraction module 710 determines the visual features of the sample image based on the local features and the global features of the sample image.
In some embodiments, the feature extraction module 710 extracts local features (such as grid features) and global features of the sample image using a Contrastive Language-Image Pre-training (CLIP) model. Specifically, the local features and global features of the sample image are extracted using the image encoder in the CLIP model.
In other embodiments, the feature extraction module 710 extracts local features and global features of the sample images using a pre-trained object detector or classifier.
In some embodiments, the feature extraction module 710 determines the visual features of the sample image based on the local features and the global features of the sample image by: mapping the local features and the global features of the sample image to a new feature space, and splicing the mapped local features and the mapped global features; and coding the spliced image features based on a visual coder to obtain the visual features of the sample image, wherein the visual coder is a trained neural network model stacked with a plurality of layers of coding blocks adopting a self-attention mechanism.
A text acquisition module 720 configured to acquire the related text of the sample image.
In some embodiments, the text acquisition module 720 determines the similarity between the sample image and the texts in the training data set; the text acquisition module 720 then selects the relevant texts of the sample image from the texts of the training data set according to the similarity.
A text encoding module 730 configured to encode the related text of the sample image to obtain the related semantic features of the sample image.
In some embodiments, the text encoding module 730 determines a sequence of related words of the sample image according to the related text of the sample image; the text encoding module 730 encodes the related word sequence based on a semantic solver to obtain related semantic features of the sample image, wherein the semantic solver is a trained neural network model based on an attention mechanism.
In other embodiments, the text encoding module 730 determines a related word sequence of the sample image according to the related text of the sample image; the text encoding module 730 codes the related word sequence based on the semantic solver; and the text encoding module 730 fuses the semantic features output by the semantic solver with position coding features based on a semantic sequencer to obtain the related semantic features of the sample image.
A training module 740 configured to perform supervised training on the attention mechanism based neural network model according to the visual features of the sample images and the associated semantic features of the sample images to obtain a text decoder. Wherein the text decoder is configured to generate an image description text.
In the embodiment of the disclosure, the performance of the generated text decoder can be improved through the above device, and the accuracy and the syntax consistency of the image description text generated based on the text decoder are improved.
FIG. 8 is a block diagram illustrating a text generation apparatus or a model training apparatus according to further embodiments of the present disclosure.
As shown in fig. 8, the text generation apparatus 800 or the model training apparatus 800 includes a memory 810 and a processor 820 coupled to the memory 810. The memory 810 is used for storing instructions for executing the corresponding embodiments of the text generation method or the model training method. The processor 820 is configured to perform the text generation method or the model training method in any of the embodiments of the present disclosure based on the instructions stored in the memory 810.
FIG. 9 is a block diagram illustrating a computer system for implementing some embodiments of the present disclosure.
As shown in fig. 9, computer system 900 may take the form of a general purpose computing device. Computer system 900 includes a memory 910, a processor 920, and a bus 930 that couples various system components.
The memory 910 may include, for example, system memory, non-volatile storage media, and the like. The system memory stores, for example, an operating system, an application program, a Boot Loader (Boot Loader), and other programs. The system memory may include volatile storage media such as Random Access Memory (RAM) and/or cache memory. The non-volatile storage medium stores, for example, instructions to perform corresponding embodiments of at least one of the text generation methods. Non-volatile storage media include, but are not limited to, magnetic disk storage, optical storage, flash memory, and the like.
The processor 920 may be implemented as discrete hardware components, such as general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other programmable logic devices, discrete gates or transistors, and so on. Accordingly, each of the modules, such as the feature extraction module and the generation module, may be implemented by a Central Processing Unit (CPU) executing instructions in a memory to perform the corresponding steps, or may be implemented by a dedicated circuit performing the corresponding steps.
The bus 930 may use any of a variety of bus architectures. For example, bus architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, and Peripheral Component Interconnect (PCI) bus.
The computer system 900 may further include an input/output interface 940, a network interface 950, and a storage interface 960; these interfaces 940, 950, 960, as well as the memory 910 and the processor 920, may be connected via the bus 930. The input/output interface 940 may provide a connection interface for input/output devices such as a display, a mouse, and a keyboard. The network interface 950 provides a connection interface for various networking devices. The storage interface 960 provides a connection interface for external storage devices such as a floppy disk, a USB flash drive, and an SD card.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable apparatus to produce a machine, such that the execution of the instructions by the processor results in an apparatus that implements the functions specified in the flowchart and/or block diagram block or blocks.
These computer-readable program instructions may also be stored in a computer-readable memory that can direct a computer to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instructions which implement the function specified in the flowchart and/or block diagram block or blocks.
The present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects.
With the text generation method, the model training method, and the corresponding apparatuses described above, the accuracy of the generated image description text can be improved.
Thus, the text generation method, the model training method, and the corresponding apparatuses according to the present disclosure have been described in detail. Some details well known in the art have not been described in order to avoid obscuring the concepts of the present disclosure. It will be fully apparent to those skilled in the art from the foregoing description how to practice the presently disclosed embodiments.

Claims (21)

1. A text generation method, comprising:
extracting visual features of an image to be processed;
acquiring a related text of the image to be processed;
coding the related text of the image to be processed to obtain the related semantic features of the image to be processed;
and generating a description text of the image to be processed according to the visual features of the image to be processed and the related semantic features of the image to be processed.
2. The text generation method according to claim 1, wherein the acquiring the text related to the image to be processed includes:
determining the similarity between the image to be processed and the existing texts in the training text set;
and selecting the related text of the image to be processed from the existing texts according to the similarity.
3. The text generation method of claim 2, wherein the determining the similarity of the image to be processed and the existing text comprises:
extracting global features of the image to be processed and global features of the existing text;
and calculating the cosine similarity of the global features of the image to be processed and the global features of the existing text, and taking the cosine similarity as the similarity between the image to be processed and the existing text.
4. The text generation method according to claim 1, wherein the encoding the text related to the image to be processed to obtain the semantic features related to the image to be processed comprises:
determining a related word sequence of the image to be processed according to the related text of the image to be processed;
and coding the related word sequence based on a semantic solver to obtain related semantic features of the image to be processed, wherein the semantic solver is a trained attention mechanism-based neural network model.
5. The text generation method of claim 4, wherein the semantic-based solver encoding the sequence of words associated with the image to be processed to obtain semantic features associated with the image to be processed comprises:
splicing the related word sequence of the image to be processed and additional memory parameters to obtain an input word sequence;
performing context coding on the input word sequence based on an attention mechanism to obtain semantic features fused with context information;
and carrying out semantic enhancement on the semantic features fused with the context information based on a cross attention mechanism under the assistance of the visual features of the image to be processed so as to obtain the related semantic features of the image to be processed.
6. The text generation method of claim 4, further comprising:
acquiring a related word sequence of a sample image;
training a neural network model based on an attention mechanism according to the related word sequence of the sample image and a preset loss function to obtain the semantic solver, wherein the loss function is constructed by taking semantic words which are irrelevant to the sample image in the related word sequence of the sample image and related semantic words which are missing in reconstruction as targets.
7. The text generation method of claim 6, wherein training a neural network model based on an attention mechanism according to the sequence of words related to the sample image and a preset penalty function comprises:
splicing the related word sequence of the sample image with the initialized memory parameter to obtain an input word sequence;
inputting the input word sequence into a neural network model based on an attention mechanism to obtain output semantic features, wherein the output semantic features comprise a plurality of semantic word features;
performing linear layer projection on the output semantic features to determine the probability distribution of each semantic word feature in the output semantic features on a semantic vocabulary;
calculating the value of a loss function according to the probability distribution of each semantic word feature in the output semantic features on a semantic vocabulary;
and optimizing a neural network model based on an attention mechanism according to the value of the loss function to obtain the semantic solver.
8. The text generation method according to claim 4, wherein the encoding the text related to the image to be processed to obtain the semantic features related to the image to be processed further comprises:
determining the position code participated by each semantic word feature in the semantic features output by the semantic solver;
and fusing the semantic word features and the position codes participated in the semantic word features to obtain fused semantic word features, and taking the whole formed by all the fused semantic word features as the related semantic features of the image to be processed.
9. The text generation method of claim 8, wherein determining the position code in which each semantic word feature participates in the semantic features output by the semantic solver comprises:
for each semantic word feature, determining an attention distribution of the semantic word feature over all position codes in a position code sequence;
and according to the attention distribution, aggregating all position codes in the position code sequence to obtain the position codes participated by the semantic word features.
10. The text generation method of claim 1, wherein generating the description text of the image to be processed according to the visual features of the image to be processed and the associated semantic features of the image to be processed comprises:
and processing the visual features of the image to be processed and the related semantic features of the image to be processed based on a text decoder to obtain a description text of the image to be processed, wherein the text decoder is a trained neural network model adopting an attention mechanism.
11. The text generation method of claim 10, wherein processing the visual features of the image to be processed and the associated semantic features of the image to be processed based on a text decoder to obtain the description text of the image to be processed comprises:
performing feature fusion on text features input at the current decoding moment and predicted descriptors of the image to be processed based on an attention mechanism to obtain first semantic features;
performing semantic enhancement on the text features input at the current decoding moment based on a cross attention mechanism under the assistance of the visual features of the image to be processed and the related semantic features of the image to be processed to obtain second semantic features;
fusing the first semantic features and the second semantic features to obtain fused semantic features;
determining probability distribution of each semantic word feature in the text features input at the current decoding moment according to the fused semantic features;
determining the next descriptor of the image to be processed according to the probability distribution;
and after all the descriptors of the image to be processed are obtained, taking an ordered sequence formed by all the descriptors as a description text of the image to be processed.
12. The text generation method of claim 8, wherein the extracting visual features of the image to be processed comprises:
extracting local features and global features of an image to be processed;
and determining the visual characteristics of the image to be processed according to the local characteristics and the global characteristics of the image to be processed.
13. The text generation method of claim 12, wherein the local features and the global features of the image to be processed are extracted using a text image contrast pre-trained model.
14. The text generation method of claim 12, wherein determining the visual feature of the image to be processed according to the local feature and the global feature of the image to be processed comprises:
mapping the local features and the global features of the image to be processed to a new feature space, and splicing the mapped local features and global features;
and coding the spliced image features based on a visual coder to obtain the visual features of the image to be processed, wherein the visual coder is a trained neural network model stacked with a plurality of layers of coding blocks adopting a self-attention mechanism.
15. The text generation method of claim 13, wherein encoding the stitched image features based on a visual encoder to obtain the visual features of the image to be processed comprises:
coding the spliced image features based on the coding blocks of the multilayer self-attention mechanism to obtain the local features after multilayer coding and the global features after multilayer coding;
splicing and fusing the global features output by the coding blocks of each layer of the self-attention mechanism to obtain the overall global features;
and splicing the overall global features and the multilayer coded local features to obtain the visual features of the image to be processed.
16. A model training method, comprising:
extracting visual features of the sample image;
acquiring a related text of the sample image;
coding the related texts of the sample images to obtain the related semantic features of the sample images;
and carrying out supervised training on a neural network model based on an attention mechanism according to the visual features of the sample images and the related semantic features of the sample images to obtain a text decoder, wherein the text decoder is used for generating an image description text.
17. A text generation apparatus comprising:
the characteristic extraction module is configured to extract visual characteristics of the image to be processed;
the text acquisition module is configured to acquire a related text of the image to be processed;
the text coding module is configured to code the related text of the image to be processed to obtain the related semantic features of the image to be processed;
the generating module is configured to generate a description text of the image to be processed according to the visual features of the image to be processed and the related semantic features of the image to be processed.
18. A model training apparatus comprising:
a feature extraction module configured to extract visual features of the sample image;
a text acquisition module configured to acquire a text related to the sample image;
a text encoding module configured to encode the relevant text of the sample image to obtain relevant semantic features of the sample image;
a training module configured to perform supervised training on a neural network model based on an attention mechanism according to the visual features of the sample images and the related semantic features of the sample images to obtain a text decoder, wherein the text decoder is used for generating an image description text.
19. A text generation apparatus comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the text generation method of any of claims 1-15 based on instructions stored in the memory.
20. A model training apparatus comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the model training method of claim 16 based on instructions stored in the memory.
21. A computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the text generation method of any of claims 1 to 15, or the model training method of claim 16.
CN202210563383.8A 2022-05-20 2022-05-20 Text generation method and device, and model training method and device Pending CN114926835A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210563383.8A CN114926835A (en) 2022-05-20 2022-05-20 Text generation method and device, and model training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210563383.8A CN114926835A (en) 2022-05-20 2022-05-20 Text generation method and device, and model training method and device

Publications (1)

Publication Number Publication Date
CN114926835A true CN114926835A (en) 2022-08-19

Family

ID=82810092

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210563383.8A Pending CN114926835A (en) 2022-05-20 2022-05-20 Text generation method and device, and model training method and device

Country Status (1)

Country Link
CN (1) CN114926835A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109902710A (en) * 2019-01-07 2019-06-18 南京热信软件科技有限公司 A kind of fast matching method and device of text image
CN110852368A (en) * 2019-11-05 2020-02-28 南京邮电大学 Global and local feature embedding and image-text fusion emotion analysis method and system
CN111444968A (en) * 2020-03-30 2020-07-24 哈尔滨工程大学 Image description generation method based on attention fusion
US20210319266A1 (en) * 2020-04-13 2021-10-14 Google Llc Systems and methods for contrastive learning of visual representations
CN113792113A (en) * 2020-07-31 2021-12-14 北京京东尚科信息技术有限公司 Visual language model obtaining and task processing method, device, equipment and medium
CN112614131A (en) * 2021-01-10 2021-04-06 复旦大学 Pathological image analysis method based on deformation representation learning
CN113011427A (en) * 2021-03-17 2021-06-22 中南大学 Remote sensing image semantic segmentation method based on self-supervision contrast learning
CN113963087A (en) * 2021-10-12 2022-01-21 北京百度网讯科技有限公司 Image processing method, image processing model training device and storage medium
CN114283080A (en) * 2021-12-15 2022-04-05 复旦大学 Multi-mode feature fusion text-guided image compression noise removal method

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115359323A (en) * 2022-08-31 2022-11-18 北京百度网讯科技有限公司 Image text information generation method and deep learning model training method
CN115359323B (en) * 2022-08-31 2023-04-25 北京百度网讯科技有限公司 Text information generation method of image and training method of deep learning model
CN115620303A (en) * 2022-10-13 2023-01-17 杭州京胜航星科技有限公司 Personnel file intelligent management system
CN115620303B (en) * 2022-10-13 2023-05-09 杭州京胜航星科技有限公司 Personnel file intelligent management system
CN116383671A (en) * 2023-03-27 2023-07-04 武汉大学 Text image cross-mode pedestrian retrieval method and system with implicit relation reasoning alignment
CN116383671B (en) * 2023-03-27 2024-05-28 武汉大学 Text image cross-mode pedestrian retrieval method and system with implicit relation reasoning alignment
CN116629315A (en) * 2023-05-23 2023-08-22 北京百度网讯科技有限公司 Training method, device, equipment and medium of perception model
CN116629315B (en) * 2023-05-23 2024-02-20 北京百度网讯科技有限公司 Training method, device, equipment and medium of perception model
CN116933773A (en) * 2023-07-10 2023-10-24 深圳云天励飞技术股份有限公司 Road disease interface abstract generation method and device, electronic equipment and storage medium
CN116580283A (en) * 2023-07-13 2023-08-11 平安银行股份有限公司 Image prompt word generation method and device, electronic equipment and storage medium
CN116580283B (en) * 2023-07-13 2023-09-26 平安银行股份有限公司 Image prompt word generation method and device, electronic equipment and storage medium
CN116595215A (en) * 2023-07-14 2023-08-15 先进操作系统创新中心(天津)有限公司 Method for searching images or videos by Chinese text based on multi-modal technology

Similar Documents

Publication Publication Date Title
CN114926835A (en) Text generation method and device, and model training method and device
CN111782838B (en) Image question-answering method, device, computer equipment and medium
CN110069709B (en) Intention recognition method, device, computer readable medium and electronic equipment
CN113627447B (en) Label identification method, label identification device, computer equipment, storage medium and program product
CN110704601A (en) Method for solving video question-answering task requiring common knowledge by using problem-knowledge guided progressive space-time attention network
WO2023179429A1 (en) Video data processing method and apparatus, electronic device, and storage medium
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN115223020B (en) Image processing method, apparatus, device, storage medium, and computer program product
CN110968725B (en) Image content description information generation method, electronic device and storage medium
CN111914950B (en) Unsupervised cross-modal retrieval model training method based on depth dual variational hash
EP4302234A1 (en) Cross-modal processing for vision and language
CN110852066B (en) Multi-language entity relation extraction method and system based on confrontation training mechanism
CN111967253A (en) Entity disambiguation method and device, computer equipment and storage medium
CN112749737A (en) Image classification method and device, electronic equipment and storage medium
CN116304748A (en) Text similarity calculation method, system, equipment and medium
CN113408282B (en) Method, device, equipment and storage medium for topic model training and topic prediction
CN117727043A (en) Training and image retrieval methods, devices and equipment of information reconstruction model
CN117520815A (en) Information extraction method, device, equipment and storage medium based on multiple modes
CN116975347A (en) Image generation model training method and related device
US12112524B2 (en) Image augmentation method, electronic device and readable storage medium
CN116977887A (en) Video aging classification model training method and video aging classification method
CN116186312A (en) Multi-mode data enhancement method for data sensitive information discovery model
CN115049880B (en) Image classification network generation method, device, equipment and medium
Shi et al. Dual-graph hierarchical interaction network for referring image segmentation
Phaphuangwittayakul et al. Adaptive adversarial prototyping network for few-shot prototypical translation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination