CN113468357B - Image description text generation method and device

Info

Publication number
CN113468357B
CN113468357B (application CN202110823822.XA)
Authority
CN
China
Prior art keywords
output
word
text
image
feature
Prior art date
Legal status
Active
Application number
CN202110823822.XA
Other languages
Chinese (zh)
Other versions
CN113468357A (en)
Inventor
彭海朋
刘冬瑶
李丽香
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN202110823822.XA
Publication of CN113468357A
Application granted
Publication of CN113468357B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/5866 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides an image description text generation method and device, relating to the technical field of image processing. The method comprises the following steps: detecting the target areas where targets are located; calculating average pixel values to obtain a region feature; performing feature extraction on first input information to obtain a first hidden feature; generating a weight coefficient for each target area; performing a weighted calculation on the pixel values of the pixel points at the same position in each target area to obtain first output information; performing feature extraction on second input information to obtain a second hidden feature; obtaining the output word with the highest output probability in a preset vocabulary; updating the first input information to information containing the first hidden feature, the obtained output word and the region feature, and returning to the step of obtaining the first hidden feature until the output text meets the output end condition, whereupon the output text is determined as the image description text. Applying the scheme provided by the embodiment of the invention can improve the accuracy of the generated image description text.

Description

Image description text generation method and device
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a method and an apparatus for generating an image description text.
Background
An image description text is a text obtained by converting an image into text and used to describe the image content of that image. Converting an image into an image description text helps the user understand the image content, since the user may have difficulty understanding it when merely viewing the image. In addition, for a visually impaired user, the image may be converted into an image description text and the text then played back as speech to help the user understand the image content.
In the prior art, an LSTM (Long Short-Term Memory) algorithm is mainly used to describe the image to be described, so as to obtain the image description text. The image description process using the LSTM algorithm is a loop process: one word is generated in each loop, the input of the current loop is the image information of the image to be described together with all the words generated in previous loops, and the image description text obtained after the loop ends comprises all the generated words. However, image information may be lost over time during the loop, which causes deviation between the content described in the finally generated image description text and the image content of the image to be described, so the accuracy of the generated image description text is low.
Disclosure of Invention
The embodiment of the invention aims to provide a method and a device for generating image description text so as to improve the accuracy of image description. The specific technical scheme is as follows:
in a first aspect, an embodiment of the present invention provides a method for generating an image description text, where the method includes:
detecting a target area where a target in an image to be described is located;
calculating average pixel values of pixel points at the same positions in each target area to obtain area features containing the average pixel values;
based on a long-short-term memory LSTM feature extraction mode, performing feature extraction on the first input information to obtain a first hidden feature; generating a weight coefficient of each target area based on the characteristics of the target area and the first hidden characteristics; and carrying out weighted calculation on pixel values of pixel points at the same position in each target area based on the generated weight coefficient to obtain first output information, wherein the initial value of the first input information is as follows: the regional characteristics;
determining second input information as information containing the first hidden feature, first output information and second hidden feature, wherein an initial value of the second hidden feature is an empty feature;
Performing feature extraction on the second input information based on the LSTM feature extraction mode to obtain the second hidden feature; based on the second hidden feature, obtaining an output word with highest output probability in a preset vocabulary;
updating the first input information to information containing the first hidden feature, the obtained output word and the region feature, and returning to the step of performing feature extraction on the first input information based on the long short-term memory LSTM feature extraction manner to obtain the first hidden feature, until the output text containing the obtained output words meets a preset output end condition, whereupon the output text is determined as the image description text.
In a second aspect, an embodiment of the present invention provides an image description text generating apparatus, including:
the target detection module is used for detecting a target area where a target in the image to be described is located;
the average value calculation module is used for calculating average pixel values of pixel points at the same position in each target area to obtain area characteristics containing the average pixel values;
the first feature extraction module is used for carrying out feature extraction on the first input information based on a long-short-term memory LSTM feature extraction mode to obtain a first hidden feature; generating a weight coefficient of each target area based on the characteristics of the target area and the first hidden characteristics; and carrying out weighted calculation on pixel values of pixel points at the same position in each target area based on the generated weight coefficient to obtain first output information, wherein the initial value of the first input information is as follows: the regional characteristics;
The information determining module is used for determining second input information as information containing the first hidden feature, the first output information and the second hidden feature, wherein an initial value of the second hidden feature is an empty feature;
the second feature extraction module is used for extracting features of the second input information based on the LSTM feature extraction mode to obtain second hidden features; based on the second hidden feature, obtaining an output word with highest output probability in a preset vocabulary;
and the information updating module is used for updating the first input information to information containing the first hidden feature, the obtained output word and the region feature, and returning to the step of performing feature extraction on the first input information based on the long short-term memory LSTM feature extraction manner to obtain the first hidden feature, until the output text containing the obtained output words meets the preset output end condition, whereupon the output text is determined as the image description text.
In a third aspect, an embodiment of the present invention provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
A memory for storing a computer program;
and the processor is used for realizing the steps of the image description text generation method according to any one of the first aspect when executing the program stored in the memory.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium having stored therein a computer program which, when executed by a processor, implements the image description text generation method steps of any one of the first aspects above.
The embodiment of the invention has the beneficial effects that:
in the scheme for generating the image description text provided by the embodiment of the invention, the target areas where the targets in the image to be described are located are detected; the average pixel values of the pixel points at the same positions in the target areas are calculated to obtain a region feature containing the average pixel values; taking the region feature as the initial value of the first input information, feature extraction is performed on the first input information based on the LSTM feature extraction manner to obtain a first hidden feature; a weight coefficient is generated for each target area based on the features of the target areas and the first hidden feature; based on the generated weight coefficients, a weighted calculation is performed on the pixel values of the pixel points at the same position in each target area to obtain first output information; the second input information is determined as information comprising the first hidden feature, the first output information and the second hidden feature; feature extraction is performed on the second input information based on the LSTM feature extraction manner to obtain a second hidden feature; based on the second hidden feature, the output word with the highest output probability in the preset vocabulary is obtained; the first input information is updated to information containing the first hidden feature, the obtained output word and the region feature, and the process returns to the step of performing feature extraction on the first input information based on the long short-term memory LSTM feature extraction manner to obtain the first hidden feature, until the output text containing the obtained output words meets the preset output end condition, whereupon the output text is determined as the image description text.
Each target area contains the features of a target, and the region feature is composed of the average pixel values of the pixel points at the same position in every target area, so the region feature can represent the overall image features of the image to be described and the completeness of the image information is guaranteed. When the weight coefficient of a target area is generated, both the feature of the target area and the first hidden feature are considered; the first hidden feature is obtained by performing feature extraction on the first input information, whose initial value is the region feature, and the first input information in the loop process contains the first hidden feature, the obtained output word and the region feature of the previous cycle. Therefore, in each cycle the generated weight coefficients can accurately reflect the importance of each target area in the current cycle, and each cycle produces an output word that fits both the textual context formed by the output words already obtained and the image information of the image to be described. Applying the scheme provided by the embodiment of the invention to generate the image description text can therefore improve the accuracy of the generated image description text.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings described below are only some embodiments of the invention, and that other drawings may be obtained from them by those skilled in the art.
Fig. 1 is a flowchart of a first image description text generation method according to an embodiment of the present invention;
fig. 2 is a flowchart of a second method for generating an image description text according to an embodiment of the present invention;
fig. 3 is a flowchart of a third method for generating an image description text according to an embodiment of the present invention;
fig. 4a is a schematic flow chart of a fourth image description text generation method according to an embodiment of the present invention;
fig. 4b is a schematic structural diagram of a text generation model according to an embodiment of the present invention;
fig. 5 is a flowchart of a fifth method for generating an image description text according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a first image description text generating device according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a second image description text generating device according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a first electronic device according to an embodiment of the present invention.
Detailed Description
The following describes the embodiments of the present invention clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention fall within the scope of protection of the present invention.
Referring to fig. 1, a flow chart of a first image description text generation method is provided, which includes the following steps S101-S106.
Step S101: and detecting a target area where a target in the image to be described is located.
Specifically, the target area where the target is located can be determined by performing target recognition on the image to be described. The object recognition of the image to be described can take many forms. For example, object recognition may be implemented by a matting method, an object segmentation method, a saliency detection method, a depth estimation method, or the like. In addition, a neural network model for performing target recognition can be trained in advance to realize target recognition.
Step S102: and calculating the average pixel value of the pixel points at the same position in each target area to obtain the area characteristics containing each average pixel value.
When the target area where a target in the image to be described is located is detected in step S101, the pixel points that belong to the target area in the image to be described are determined. The position of a pixel point in the image to be described can be expressed as its row number and column number in the image to be described, and likewise the position of a pixel point in a target area can be expressed as its row number and column number in the target area. The pixel points at the same position in each target area are therefore the pixel points with the same row number and column number in each target area.
The average pixel value refers to a pixel value obtained by performing average calculation on pixel values of a plurality of pixel points.
The above region feature may be a numerical matrix, where the value in each row and column of the matrix is the average pixel value of the pixel points at the corresponding position in the target areas, and the positions of the values in the matrix correspond one-to-one to the positions of the pixel points used when calculating the average pixel values.
For example, if step S101 detects 3 target areas, where the pixel values of the pixels in the first row and the first column in the target areas are 5, 41, and 83, respectively, then the average pixel value of the pixels in the first row and the first column in the 3 target areas is calculated to be 43, and then the value of the first row and the first column in the area feature is 43.
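As an illustration of this averaging step, the following is a minimal NumPy sketch, assuming the target regions have already been brought to a common size; the function and variable names are illustrative and not taken from the patent.

import numpy as np

def region_feature(target_regions):
    # target_regions: array of shape (K, H, W) -- K detected regions of
    # identical size H x W. The result has shape (H, W); each entry is the
    # average pixel value of the K pixels at that row/column position.
    return np.asarray(target_regions, dtype=np.float32).mean(axis=0)

# Toy check mirroring the example above: the (row 0, col 0) pixels of the
# three regions are 5, 41 and 83, so the region feature at (0, 0) is 43.
regions = np.zeros((3, 2, 2), dtype=np.float32)
regions[:, 0, 0] = [5.0, 41.0, 83.0]
print(region_feature(regions)[0, 0])  # 43.0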
Step S103: based on a long-short-term memory LSTM feature extraction mode, performing feature extraction on the first input information to obtain a first hidden feature; generating a weight coefficient of each target area based on the characteristics of the target area and the first hidden characteristics; and carrying out weighted calculation on pixel values of pixel points at the same position in each target area based on the generated weight coefficient to obtain first output information, wherein the initial value of the first input information is as follows: regional characteristics.
Generally, a piece of information may contain a plurality of features, and the features extracted from the information differ according to the feature extraction manner and the depth and number of extractions. Some features are easy to extract, for example the shape of a football, the tail of a cat or the limbs of a person, while other features are more difficult to extract, such as a person's emotion or interpersonal relationships. Accordingly, the first hidden feature is a feature hidden in the first input information, obtained by performing feature extraction on the first input information.
In one case, the weight coefficient of each target region may be generated, based on the feature of the target region and the first hidden feature, according to the following expressions:

a_{i,t} = w_a^T · tanh(W_va · v_i + W_ha · h_t^1)

α_t = softmax(a_t)

where α_{i,t} denotes the weight coefficient of the i-th target region, v_i denotes the feature of the i-th target region, h_t^1 denotes the first hidden feature, w_a, W_va and W_ha denote the first, second and third learning parameters respectively, α_t denotes the weight coefficients after normalization, a_t denotes the weight coefficients before normalization, and softmax() is the normalized exponential function.
In one implementation manner, the first learning parameter, the second learning parameter, and the third learning parameter may be preset parameters.
In another implementation manner, the LSTM feature extraction manner may be implemented by a pre-trained LSTM network model, where the first learning parameter, the second learning parameter, and the third learning parameter may be obtained when the LSTM network model is pre-trained.
Performing a weighted calculation on the pixel values of the pixel points at the same position in each target area based on the generated weight coefficients to obtain the first output information means: multiplying the pixel values of the pixel points in each target area by the weight coefficient corresponding to that target area to obtain weighted target areas, and then adding the pixel values of the pixel points at the same position in the weighted target areas to obtain the first output information.
The above-described weighted calculation of the first output information may be expressed as the following expression:
v̂_t = Σ_{i=1}^{K} α_{i,t} · v_i

where v̂_t denotes the first output information, K denotes the number of target regions, α_{i,t} denotes the weight coefficient corresponding to the i-th target region, and v_i denotes the feature of the i-th target region.
Step S104: and determining the second input information as information containing the first hidden feature, the first output information and the second hidden feature, wherein the initial value of the second hidden feature is a null feature.
The initial value of the second hidden feature is a null feature, that is, the initial value of the second input information is information including only the first hidden feature and the first output information.
Step S105: performing feature extraction on the second input information based on the LSTM feature extraction mode to obtain a second hidden feature; and based on the second hidden characteristic, obtaining the output word with the highest output probability in the preset vocabulary.
The second hidden feature is a feature hidden in the second input information obtained by extracting the feature of the second input information, similar to the step S103 described above.
Based on the second hidden feature, the probability that each word in the preset vocabulary is used as an output word can be obtained according to a preset probability calculation mode, so that the word with the highest probability is selected as the output word.
In one implementation manner, the above-mentioned preset probability calculation manner may be expressed as the following expression:
p(y_t | y_{1:t-1}) = softmax(W_p · h_t^2 + b_p)

p(y_{1:T}) = ∏_{t=1}^{T} p(y_t | y_{1:t-1})

where p(y_t | y_{1:t-1}) denotes the conditional probability distribution of the output word, y_t denotes the word vector corresponding to the output word, y_{1:t-1} denotes the word sequence composed of the previously obtained output words, W_p denotes a preset learned weight coefficient, h_t^2 denotes the second hidden feature, b_p denotes a preset offset, p(y_{1:T}) denotes the probability of the output word sequence, y_{1:T} denotes the word sequence composed of the output word and the previously obtained output words, and T denotes the number of words in that sequence.
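The following is a sketch of this probability computation, assuming h2 is the second hidden feature and W_p, b_p are the learned projection parameters; the toy vocabulary and all names are illustrative.

import numpy as np

def output_word_probs(h2, W_p, b_p):
    # p(y_t | y_{1:t-1}) = softmax(W_p h_t^2 + b_p): one probability for
    # every word in the preset vocabulary, given the second hidden feature.
    logits = W_p @ h2 + b_p
    e = np.exp(logits - logits.max())
    return e / e.sum()

vocab = ["a", "dog", "runs", "<end>"]
rng = np.random.default_rng(1)
probs = output_word_probs(rng.normal(size=16),
                          rng.normal(size=(len(vocab), 16)),
                          np.zeros(len(vocab)))
print(vocab[int(probs.argmax())])  # output word with the highest probability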
Step S106: updating the first input information to information containing the first hidden feature, the obtained output word and the regional feature, and returning to step S103 until the output text containing the obtained output word meets the preset output end condition, and determining the output text as the image description text.
In this scheme, the process of generating the image description text is a loop process, each time steps S103-S105 are performed, an output word is generated, and each loop generates new first hidden features and second hidden features at the same time.
The above LSTM feature extraction manner can be expressed as the following expression:
h_t = LSTM(x_t, h_{t-1})

H_t = LSTM(x_t, H_{t-1})

where, when the LSTM feature extraction manner represented by the above expressions is applied to step S103, h_t denotes the first hidden feature in the current cycle, x_t denotes the first input information in the current cycle, h_{t-1} denotes the first hidden feature in the previous cycle, H_t denotes the feature-reinforced first hidden feature in the current cycle, and H_{t-1} denotes the feature-reinforced first hidden feature in the previous cycle; when the LSTM feature extraction manner represented by the above expressions is applied to step S105, h_t denotes the second hidden feature in the current cycle, x_t denotes the second input information in the current cycle, h_{t-1} denotes the second hidden feature in the previous cycle, H_t denotes the feature-reinforced second hidden feature in the current cycle, and H_{t-1} denotes the feature-reinforced second hidden feature in the previous cycle. t denotes the cycle index, and LSTM() denotes the LSTM-based feature extraction operation.

Feature reinforcement means that the hidden features obtained in previous cycles are superimposed, so that the hidden feature obtained after superposition contains the information contained in all previous hidden features.
The updated first input information may be expressed as the following expression:

x_t^1 = [h_{t-1}^2, v̄, W_e · Π_t]

where x_t^1 denotes the first input information, h_{t-1}^2 denotes the second hidden feature obtained in the previous cycle, v̄ denotes the region feature, W_e denotes the word embedding matrix corresponding to the preset vocabulary, and Π_t denotes the one-hot encoding of the obtained output word.
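The following is a sketch of how the updated first input information could be assembled from the quantities in the expression above; the dimensions and the helper name are assumptions for illustration.

import numpy as np

def build_first_input(h2_prev, region_feature, W_e, word_index, vocab_size):
    # x_t^1 = [h_{t-1}^2, v_bar, W_e Pi_t]: concatenate the previous second
    # hidden feature, the region feature and the embedding of the output
    # word obtained in the previous cycle.
    pi_t = np.zeros(vocab_size)
    pi_t[word_index] = 1.0            # one-hot encoding of the output word
    word_embedding = W_e.T @ pi_t     # W_e has one row per vocabulary word
    return np.concatenate([h2_prev, region_feature, word_embedding])

x1 = build_first_input(np.zeros(512), np.zeros(512),
                       np.zeros((1000, 512)), word_index=7, vocab_size=1000)
print(x1.shape)  # (1536,)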
The preset output end condition may be that the number of words in the output text reaches a preset threshold; alternatively, integrity detection may be performed on the output text, and the condition is met when the output text is detected to be a complete text.
In addition, a word vector indicating the end may be set in the word embedding matrix, and in this case, step S105 may obtain the word vector having the highest output probability in the word embedding matrix, thereby obtaining the output word having the highest output probability in the vocabulary corresponding to the word embedding matrix. When the word vector obtained in step S105 is a word vector indicating an end, it indicates that a preset output end condition is satisfied.
As can be seen, in the scheme provided by the embodiment of the invention, each target area contains the features of a target, and the region feature is composed of the average pixel values of the pixel points at the same position in every target area, so the region feature can represent the overall image features of the image to be described and the completeness of the image information is guaranteed. When the weight coefficient of a target area is generated, both the feature of the target area and the first hidden feature are considered; the first hidden feature is obtained by performing feature extraction on the first input information, whose initial value is the region feature, and the first input information in the loop process contains the first hidden feature, the obtained output word and the region feature of the previous cycle. Therefore, in each cycle the generated weight coefficients can accurately reflect the importance of each target area in the current cycle, and each cycle produces an output word that fits both the textual context formed by the output words already obtained and the image information of the image to be described. Applying the scheme provided by the embodiment of the invention to generate the image description text can therefore improve the accuracy of the generated image description text.
Compared with the image description text generation scheme in the prior art, the scheme provided by the embodiment of the invention can effectively improve the descriptive performance of the image description text and overcomes the information-loss defect of the prior art. In the scheme provided by the embodiment of the invention, the LSTM feature extraction manner fuses the context information of the output text containing the obtained output words; when the output word of the current cycle is output, the roles of the first hidden feature and the second hidden feature are reinforced, so the information from previous cycles is preserved. The first hidden feature of the current cycle is obtained by transforming the first hidden feature and the second hidden feature of the previous cycle, so no information is lost in the process of generating the image description text, and a more accurate description text is generated through the weight coefficients corresponding to the target areas.
In an embodiment of the present invention, referring to fig. 2, a flowchart of a second image description text generation method is provided. Compared with the embodiment shown in fig. 1, in this embodiment, step S101 of detecting the target area where a target in the image to be described is located includes the following steps S101A-S101C.
Step S101A: and carrying out multi-layer convolution transformation on the image to be described to obtain a characteristic image.
The convolution transformation is generally used for extracting features of an image, and in this step, the image to be described is subjected to multi-layer convolution transformation, that is, the image to be described is subjected to multiple feature extraction, so that the image obtained after the multi-layer convolution transformation is the feature image corresponding to the image to be described.
In one case, the multi-layer convolution transformation of the image to be described can be implemented by using a pre-trained convolution neural network.
For example, when the VGGnet16 network is used to perform the multi-layer convolution transformation on the image to be described, the image to be described is first resized to a preset size (M×N), and the resized image is then subjected to multiple convolution, activation and pooling operations through the 13 convolution layers, 13 activation layers and 4 pooling layers included in the network, thereby obtaining a feature image of size (M/16)×(N/16).
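As a sketch, the multi-layer convolution transformation could be performed with a pretrained VGG16 backbone, for example via torchvision; the use of torchvision, the weight name and the 224×224 input size are assumptions for illustration and are not prescribed by the patent.

import torch
from PIL import Image
from torchvision import models, transforms

# Convolutional part of VGG16: the 13 conv layers with their ReLU activations
# and the first 4 pooling layers (the final pooling is dropped), so the
# spatial size is reduced by a factor of 16 as described above.
backbone = models.vgg16(weights="IMAGENET1K_V1").features[:-1].eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),   # resize to the preset size M x N
    transforms.ToTensor(),
])

image = preprocess(Image.open("image_to_describe.jpg")).unsqueeze(0)
with torch.no_grad():
    feature_image = backbone(image)
print(feature_image.shape)           # torch.Size([1, 512, 14, 14]) for a 224 x 224 input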
Step S101B: candidate regions in the feature image that contain features of the object are determined.
In one case, a region proposal network may be used to determine the candidate regions in the feature image. When determining the candidate regions with the region proposal network, the feature image obtained in step S101A is first convolved with a 3×3 convolution kernel to obtain a convolved image of the same size as the feature image. Taking each pixel point of the convolved image as a reference point, several regions of preset sizes are selected; a 1×1 convolution kernel is then used to perform a fully connected operation on the convolved image, and a softmax layer judges whether each selected region contains a target. At the same time, a bounding-box regression operation is performed on the selected regions to correct the bounding boxes by translation and scaling, and finally the regions judged to contain targets are combined with the bounding-box regression offsets to determine the candidate regions containing features of targets in the feature image.
In another case, the candidate region including the feature of the object in the feature image may be determined by determining whether the pixel value of the pixel point in the feature image is greater than a preset pixel value. In the feature image, a region composed of pixel points having pixel values larger than a preset pixel value is determined as a candidate region.
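The following is a sketch of this second, threshold-based variant: pixel points of the feature image whose value exceeds a preset threshold are kept, and their bounding box is taken as a candidate region. The threshold value and the single-bounding-box simplification are assumptions for illustration.

import numpy as np

def candidate_region_by_threshold(feature_image, threshold):
    # Returns the (row_min, row_max, col_min, col_max) bounding box of the
    # pixel points whose value is greater than the preset threshold,
    # or None if no pixel exceeds it.
    rows, cols = np.where(feature_image > threshold)
    if rows.size == 0:
        return None
    return rows.min(), rows.max(), cols.min(), cols.max()

feat = np.zeros((14, 14))
feat[3:6, 7:10] = 1.0
print(candidate_region_by_threshold(feat, 0.5))  # (3, 5, 7, 9)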
Step S101C: and determining the region corresponding to the candidate region in the image to be described as a target region where the target in the image to be described is located.
There is a correspondence between the pixel points in the image to be described and the pixel points in the feature image; usually one pixel point in the feature image corresponds to one region in the image to be described. Step S101B determines a candidate region in the feature image, that is, selects the pixel points in the candidate region. According to the correspondence between the pixel points of the image to be described and those of the feature image, the region in the image to be described corresponding to the pixel points of the candidate region can be determined, and this region in the image to be described is the target region where the target is located.
As can be seen, in the scheme provided by the embodiment of the invention, a feature image is obtained by performing multi-layer convolution transformation on the image to be described, candidate regions containing features of targets are determined in the feature image, and finally the regions in the image to be described corresponding to the candidate regions are determined as the target regions where the targets are located. Since the feature image is obtained by feature extraction from the image to be described, it clearly reflects the features of the targets, so candidate regions belonging to target features can be determined accurately in the feature image; and since a correspondence exists between the pixel points of the image to be described and those of the feature image, once candidate regions containing target features are determined in the feature image, the regions where the targets are located can be determined in the image to be described by using this correspondence.
In an embodiment of the present invention, referring to fig. 3, a flowchart of a third method for generating an image description text is provided, and in comparison with the embodiment shown in fig. 2, in this embodiment, the step S101C determines, as a target area where a target in an image to be described is located, an area corresponding to a candidate area in the image to be described, and includes the following steps S101C1 to S101C2.
Step S101C1: and performing region scaling processing on the candidate region to obtain a first region with a first preset size.
Performing region scaling on a candidate region means adjusting its size. Two cases are included: one is enlarging the candidate region, which can be achieved by upsampling, interpolation and the like; the other is shrinking the candidate region, which can be achieved by downsampling, pooling and the like. Combining these two cases, the candidate region can be adjusted to a first region of a first preset size using the methods mentioned above.
Step S101C2: and carrying out maximum pooling treatment on the first area to obtain a second area with a second preset size, and determining an area corresponding to the second area in the image to be described as a target area where a target in the image to be described is located.
First, the first area is divided into a plurality of subareas, and in one subarea, the pixel point with the largest pixel value is reserved. After the operation, each sub-region can determine a pixel point with the maximum pixel value, and the region formed by the pixel points is the second region.
For example, if the first area with the first preset size of 16×8 and the second area with the second preset size of 4×4 are obtained in the step S101C1, in this step, the first area may be first divided into 16 sub-areas with the same size, each of which has a size of 4×2, and the pixel point with the largest pixel value in the sub-area is reserved, so as to finally obtain a 4×4 second area composed of 16 pixel points.
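The following is a sketch of the maximum pooling step, mirroring the 16×8 to 4×4 example above and assuming the candidate region has already been scaled to the first preset size.

import numpy as np

def max_pool_to_size(region, out_h, out_w):
    # Divide the region into out_h x out_w equal sub-areas and keep the
    # maximum pixel value of each sub-area.
    h, w = region.shape
    sub_h, sub_w = h // out_h, w // out_w
    return region.reshape(out_h, sub_h, out_w, sub_w).max(axis=(1, 3))

# First area of the first preset size 16 x 8, max-pooled into a second area
# of the second preset size 4 x 4, i.e. 16 sub-areas of size 4 x 2 each.
first_area = np.arange(16 * 8, dtype=np.float32).reshape(16, 8)
second_area = max_pool_to_size(first_area, 4, 4)
print(second_area.shape)  # (4, 4)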
Since the second region is obtained by performing region scaling and maximum pooling on the candidate region, a correspondence exists between the pixel points of the second region and those of the candidate region; and since a correspondence also exists between the pixel points of the image to be described and those of the feature image, a correspondence exists between the pixel points of the second region and those of the image to be described. The region in the image to be described corresponding to the second region can therefore be determined as the target region where a target in the image to be described is located.
As can be seen, in the scheme provided by the embodiment of the invention, the candidate regions are converted into second regions of the same size through region scaling and maximum pooling, and the regions in the image to be described corresponding to the second regions are determined as the target regions where the targets are located. Since the size of the second region is fixed, the size of the target region corresponding to the second region in the image to be described is also fixed, so every target region determined in the image to be described has the same size, which also makes it convenient to calculate the average pixel values of the pixel points at the same position in each target region.
In an embodiment of the present invention, referring to fig. 4a, a flowchart of a fourth method for generating an image description text is provided, and compared with the embodiment shown in fig. 1, in this embodiment, step S103 performs feature extraction on first input information based on a long-short-term memory LSTM feature extraction manner, to obtain a first hidden feature; generating a weight coefficient of each target area based on the characteristics of the target area and the first hidden characteristics; based on the generated weight coefficients, the pixel values of the pixel points at the same position in each target area are weighted to obtain first output information, which includes the following step S103A.
Step S103A: inputting first input information into a first sub-model of a text generation model to obtain first hidden features and first output information, wherein the text generation model is a model which is obtained in advance and is used for generating an image description text, and the text generation model comprises: a first sub-model and a second sub-model, the first sub-model being a model employing a top-down attention mechanism.
In this step, the first input information is input into the first sub-model of the text generation model to obtain the first hidden feature and the first output information, that is, the feature extraction operation for the first input information, the weight coefficient generation operation for each target area, and the weighting calculation operation for the pixel values of the pixel points at the same position in each target area in step S103 are all completed in the first sub-model.
The top-down attention mechanism described above simulates the attention mechanism of a person's brain when viewing an image, as a person typically focuses attention on a prominent area where attention is most needed when obtaining external information, e.g., in basketball games, a spectator typically focuses attention on a player holding a ball. Therefore, the first sub-model adopting the top-down attention mechanism can set a weight coefficient for the target area in the image to be described according to the input first input information, so as to obtain the first output information.
Step S105 performs feature extraction on the second input information based on the LSTM feature extraction manner, to obtain a second hidden feature; based on the second hidden feature, an output word with the highest output probability in the preset vocabulary is obtained, which includes the following step S105A.
Step S105A: and inputting the second input information into a second sub-model to obtain a second hidden feature and an output word, wherein the second sub-model is a model obtained based on language model transformation, and the language model is a model for predicting the occurrence probability of the word in the output text according to the model input information.
In this step, the second hidden feature and the output word are obtained by inputting the second input information into the second sub-model of the text generation model, that is, the operation of extracting the feature of the second input information in step S105 and the operation of obtaining the output word with the highest output probability in the vocabulary based on the second hidden feature are completed in the second sub-model, as in step S103A.
Referring to fig. 4b, a schematic structural diagram of the text generation model is provided. In fig. 4b, the first sub-model contains a Top-Down Attention LSTM unit and an Attend unit. The Top-Down Attention LSTM unit receives the first hidden feature h_{t-1}^1 obtained in the previous cycle, the second hidden feature h_{t-1}^2 obtained in the previous cycle, the region feature v̄ and the embedding W_e·Π_t of the obtained output word, and outputs the first hidden feature h_t^1 of the current cycle. The Attend unit receives the first hidden feature of the current cycle and the target region features {v_1, …, v_k}, and outputs the first output information v̂_t. The second sub-model contains a Language LSTM unit and a softmax unit. The Language LSTM unit receives the first hidden feature h_t^1 of the current cycle, the second hidden feature h_{t-1}^2 obtained in the previous cycle and the first output information v̂_t, and outputs the second hidden feature h_t^2 of the current cycle. The softmax unit receives the second hidden feature h_t^2 of the current cycle and outputs the output word y_t of the current cycle.
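The following is a condensed PyTorch-style sketch of one decoding cycle through the structure of fig. 4b; the dimensions, module names and parameter shapes are assumptions, and the code follows the two-LSTM top-down attention layout described above rather than the patent's exact implementation.

import torch
import torch.nn as nn

class CaptioningStep(nn.Module):
    # One cycle of the text generation model: Top-Down Attention LSTM unit,
    # Attend unit, Language LSTM unit and softmax unit.
    def __init__(self, feat_dim=512, hidden_dim=512, embed_dim=512,
                 vocab_size=10000, attn_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # W_e
        self.att_lstm = nn.LSTMCell(hidden_dim + feat_dim + embed_dim, hidden_dim)
        self.W_va = nn.Linear(feat_dim, attn_dim, bias=False)
        self.W_ha = nn.Linear(hidden_dim, attn_dim, bias=False)
        self.w_a = nn.Linear(attn_dim, 1, bias=False)
        self.lang_lstm = nn.LSTMCell(hidden_dim + feat_dim, hidden_dim)
        self.W_p = nn.Linear(hidden_dim, vocab_size)       # W_p, b_p

    def forward(self, regions, v_bar, prev_word, h1, c1, h2, c2):
        # Top-Down Attention LSTM: first hidden feature of the current cycle.
        x1 = torch.cat([h2, v_bar, self.embed(prev_word)], dim=-1)
        h1, c1 = self.att_lstm(x1, (h1, c1))
        # Attend unit: weight coefficients and first output information v_hat.
        a = self.w_a(torch.tanh(self.W_va(regions) + self.W_ha(h1).unsqueeze(1))).squeeze(-1)
        alpha = torch.softmax(a, dim=-1)
        v_hat = (alpha.unsqueeze(-1) * regions).sum(dim=1)
        # Language LSTM, then softmax over the preset vocabulary.
        h2, c2 = self.lang_lstm(torch.cat([h1, v_hat], dim=-1), (h2, c2))
        return torch.softmax(self.W_p(h2), dim=-1), h1, c1, h2, c2

step, B, K = CaptioningStep(), 2, 36
regions = torch.randn(B, K, 512)
h1 = c1 = h2 = c2 = torch.zeros(B, 512)
probs, h1, c1, h2, c2 = step(regions, regions.mean(dim=1),
                             torch.zeros(B, dtype=torch.long), h1, c1, h2, c2)

In practice such a step would be applied in a loop, feeding each predicted word back in as prev_word until the word indicating the end is produced.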
When the text generation model is experimentally evaluated on the MS COCO-2015 data set, in which each image carries 5 manually annotated English description texts from different perspectives, the data are split in a ratio of 8:1:1 into 8000 images for training, 1000 images for validation and 1000 images for testing. The data are preprocessed by removing the punctuation marks from all manually annotated standard texts in the MS COCO data set and converting all letters in the texts to lower case.
The maximum text length is set to 15: standard texts longer than 15 words are truncated to 15 words, and standard texts shorter than 15 words are zero-padded. When the vocabulary is constructed, the word frequency threshold is set to 5: words whose frequency of occurrence is greater than or equal to the threshold are added to the vocabulary, and the special symbol <UNK> is added to the vocabulary to represent words whose frequency of occurrence is smaller than the threshold. The words in the vocabulary are then represented by one-hot encoding: if i is the position of a word in the vocabulary, the i-th position of the one-hot vector representing that word is set to 1 and the other positions to 0, the vector dimension being determined by the vocabulary length.
The dimension of the hidden layer of the LSTM model and the embedding dimensions of the text and the image are set to 512. The optimization function is SGD and the loss function is the cross-entropy loss; the initial learning rate is set to 5×10^-4 and, in order to avoid local minima, alleviate oscillation and shorten the training time to accelerate convergence, the learning rate is multiplied by 0.8 every 4 epochs. The batch size is set to 48, the maximum number of iterations is set to 50 epochs, and batch normalization is added. After training, the model with the highest BLEU score on the validation set is stored on a local storage device as the final output model.
During testing, the decoding part uses a beam search algorithm with the beam size set to 3: the 3 sentence fragments with the highest probability are retained at each decision point of the decoding process, decoding ends when the character <end> is reached, and the description sentence with the highest probability among the 3 complete sentences at the final decision point is selected as the final output of the LSTM model.
The scheme of the invention achieves accuracies of 0.558, 0.737 and 0.441 under the three settings of iteration number epoch = 40, 45 and 50 respectively, which is substantially better than the training effect obtained without the attention mechanism and the context LSTM model.
Therefore, in the scheme provided by the embodiment of the invention, the generation of the image description text is completed by using the text generation model. Because the text generation model is a pre-trained model, the accuracy of the image description text description image generated by the first sub-model and the second sub-model of the text generation model is higher.
In an embodiment of the present invention, referring to fig. 5, a flowchart of a fifth image description text generating method is provided, and compared with the embodiment shown in fig. 1, in this embodiment, step S106 updates the first input information to information including the first hidden feature, the obtained output word and the region feature, and returns to step S103 until the output text including the obtained output word meets the preset output end condition, and the output text is determined as the image description text, including the following steps S106A to S106K.
Step S106A: updating the first input information to information containing the first hidden feature, the obtained output word and the regional feature, and returning to step S103 until the output text containing the obtained output word meets the preset output end condition.
As can be seen from the above step S106, each time the steps S103-S105 are executed, an output word is generated, and a plurality of output words are obtained by executing the steps S103-S105 for a plurality of times, until the output text containing the obtained output word meets the preset output end condition, the cycle is ended, and the final obtained output text is composed of all the obtained output words.
Step S106B: based on a word embedding matrix corresponding to a preset vocabulary, obtaining word embedding vectors corresponding to each output word in the output text in a word embedding mode.
The above-mentioned preset vocabulary corresponds to the word embedding matrix, words are included in the vocabulary, vectors are included in the word embedding matrix, and each word in the vocabulary corresponds to one vector of the word embedding matrix, that is, the vector in the word embedding matrix is the vector format of the words in the vocabulary.
The word embedding method is that a vector corresponding to the output word contained in the output text is found in the word embedding matrix, and the vector is determined to be a word embedding vector corresponding to the output word.
Step S106C: for each output word in the output text, the output word is encoded according to an encoding mode corresponding to the position information of the output word, and a first word vector containing the position encoding information of the output word is obtained, wherein the encoding mode is sine encoding or cosine encoding.
The positions of the output words in the output text include the output words of the odd number part and the words of the even number part, the output words of the output text, such as the first word, the third word, the fifth word, etc., with the word positions of the odd number belonging to the output words of the odd number part, and the output words of the output text, such as the second word, the fourth word, the sixth word, etc., with the word positions of the even number belonging to the output words of the even number part.
In this case, the output word of the odd part may be encoded in a sine encoding manner, and the output word of the even part may be encoded in a cosine encoding manner.
For example, if the dimension of the word embedding vector is d and the output text contains m different words, a matrix PE_{m×d} can be constructed whose elements are:

PE_(pos,2i) = sin(pos / 10000^(2i/d))

where PE_(pos,2i) denotes an element of the matrix, pos denotes the position information of the output word in the output text, and i denotes information used to distinguish the odd-part and even-part words in the output text.

In one case, this matrix is the position encoding matrix.

The output word is position-encoded, and the obtained first word vector can be expressed as:

p_{1×d} = pe + w_{1×m} · PE_{m×d}

where p_{1×d} denotes the first word vector, pe denotes the word embedding vector corresponding to the output word, w_{1×m} denotes the one-hot encoding vector corresponding to the output word, and PE_{m×d} denotes the matrix constructed above.
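The following is a sketch of the position-encoding construction following the sine formula above; the cosine counterpart used for the other part is an assumption mirroring the common sinusoidal scheme, and the dimension d is assumed to be even.

import numpy as np

def position_encoding_matrix(m, d):
    # PE matrix of shape (m, d): sine encoding for one set of columns and
    # the analogous cosine encoding for the other, as described above.
    pe = np.zeros((m, d))
    pos = np.arange(m)[:, None]
    col = np.arange(0, d, 2)[None, :]
    pe[:, 0::2] = np.sin(pos / 10000 ** (col / d))   # PE_(pos, 2i)
    pe[:, 1::2] = np.cos(pos / 10000 ** (col / d))   # assumed cosine counterpart
    return pe

def first_word_vector(word_embedding, one_hot_position, pe_matrix):
    # p_{1xd} = pe + w_{1xm} PE_{mxd}: word embedding plus its position code.
    return word_embedding + one_hot_position @ pe_matrix

m, d = 15, 8
PE = position_encoding_matrix(m, d)
w = np.eye(m)[3]   # one-hot vector of an output word at position 3
p = first_word_vector(np.random.default_rng(2).normal(size=d), w, PE)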
Step S106D: for each output word in the output text, multiplying the first word vector of the output word with a first preset matrix to obtain a first vector for representing the output word, multiplying the first word vector of the output word with a second preset matrix to obtain a second vector for representing the label of the output word, and multiplying the first word vector of the output word with a third preset matrix to obtain a third vector for representing the word meaning of the output word.
In one implementation manner, the first preset matrix, the second preset matrix and the third preset matrix may be preset matrices.
In another implementation manner, the steps S106A to S106K in this embodiment may be implemented by a pre-trained network model, where the first preset matrix, the second preset matrix, and the third preset matrix are matrices obtained by training the network model.
The first vector is used to represent the output word, i.e., the first vector is an expression of the vector form of the output word; the second vector is used to represent the label of the output word, that is, the second vector characterizes the class of the output word; the third vector is used to represent the word meaning of the output word, i.e., the third vector is an expression of the word meaning of the output word in the form of a vector.
Step S106E: and multiplying the first vector corresponding to the last output word in the output text by the second vector corresponding to each output word in the output text respectively to obtain a plurality of first calculation results.
Step S106D yields a first vector, a second vector and a third vector for each output word in the output text. Multiplying the first vector corresponding to the last output word in the output text by the second vector corresponding to one output word yields one first calculation result; multiplying it by the second vector of every output word in turn therefore yields as many first calculation results as there are output words in the output text.
The first calculation result obtained by multiplying the first vector by the second vector may be expressed as the following expression:
score = Q^T · K

where score denotes the first calculation result, Q denotes the first vector corresponding to the last output word in the output text, and K denotes the second vector corresponding to an output word in the output text.
Further, the obtained first calculation results may be normalized according to the following expression to obtain standard calculation results:

score_1 = score / √d

where score_1 denotes the standard calculation result and d denotes the dimension of the first word vector.
The normalized calculation results can be obtained by normalizing the standard calculation results with a softmax function, which can be expressed as the following expression:

softmax(score_1^(i)) = e^(score_1^(i)) / (e^(score_1^(1)) + e^(score_1^(2)) + … + e^(score_1^(n)))

where softmax(score_1^(i)) denotes the normalized calculation result corresponding to the i-th output word in the output text, score_1^(i) denotes the standard calculation result corresponding to the i-th output word in the output text, score_1^(1) denotes the standard calculation result corresponding to the first output word in the output text, and n denotes the number of output words in the output text.
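The following is a sketch of steps S106D-S106F viewed as a scaled dot-product attention over the output words; the projection matrices W_q, W_k and W_v stand in for the first, second and third preset matrices, and all shapes are illustrative.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fourth_vector(word_vectors, W_q, W_k, W_v):
    # word_vectors: (n, d) first word vectors of the n output words.
    # Q comes from the last output word; K (second vectors) and V (third
    # vectors) come from every output word; the result is the fourth vector.
    Q = word_vectors[-1] @ W_q
    K = word_vectors @ W_k
    V = word_vectors @ W_v
    d = word_vectors.shape[1]
    scores = softmax((K @ Q) / np.sqrt(d))   # first calculation results, scaled and normalized
    return scores @ V                        # sum of the second calculation results

rng = np.random.default_rng(3)
n, d = 4, 8
out = fourth_vector(rng.normal(size=(n, d)), rng.normal(size=(d, d)),
                    rng.normal(size=(d, d)), rng.normal(size=(d, d)))
print(out.shape)  # (d,)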
Step S106F: and multiplying the first calculation result corresponding to each output word in the output text by a third vector to obtain a plurality of second calculation results, and adding the plurality of second calculation results to obtain a fourth vector.
The step S106E obtains a plurality of first calculation results, where each output word in the output text corresponds to a first calculation result, each output word corresponds to a third vector, the first calculation result corresponding to each output word is multiplied by the third vector to obtain a second calculation result corresponding to the output word, each output word corresponds to a second calculation result, and the obtained plurality of second calculation results are added to obtain a fourth vector.
For example, suppose the output text contains 2 output words: the first output word corresponds to a first calculation result of 0.4 and a third vector v_1, and the second output word corresponds to a first calculation result of 0.5 and a third vector v_2. The second calculation result obtained by multiplying the first calculation result corresponding to the first output word by its third vector is 0.4·v_1, the second calculation result for the second output word is 0.5·v_2, and adding the two second calculation results gives the fourth vector 0.4·v_1 + 0.5·v_2.
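For illustration, steps S106D-S106F together amount to a scaled dot-product attention computation over the output words of the output text. The following is a minimal numpy sketch of that computation under the notation used above, not the claimed implementation; the matrices Wq, Wk and Wv stand in for the first, second and third preset matrices, and the word vectors and dimension d = 4 are hypothetical.

```python
import numpy as np

def attend_to_output_words(word_vectors, Wq, Wk, Wv):
    """Scaled dot-product attention over the output words (cf. steps S106D-S106F).

    word_vectors: (n_words, d) first word vectors (embedding plus position coding)
    Wq, Wk, Wv:   (d, d) stand-ins for the first / second / third preset matrices
    Returns the fourth vector used to score the vocabulary.
    """
    Q = word_vectors @ Wq                      # first vectors, one per output word
    K = word_vectors @ Wk                      # second vectors (labels)
    V = word_vectors @ Wv                      # third vectors (word meanings)

    d = word_vectors.shape[1]
    q_last = Q[-1]                             # first vector of the last output word
    scores = K @ q_last / np.sqrt(d)           # standard calculation results
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax-normalized calculation results
    return weights @ V                         # weighted sum of the third vectors

# toy usage with two output words and d = 4
rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(2, 4))
Wq, Wk, Wv = rng.normal(size=(3, 4, 4))
print(attend_to_output_words(word_vectors, Wq, Wk, Wv))
```

In this sketch the softmax-normalized scores play the role of the calculation results that step S106F multiplies by the third vectors.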
Step S106G: multiplying the fourth vector by the preset word embedding matrix to obtain the probability that the words contained in the preset vocabulary are used as the supplementary words of the output text.
The supplementary word is appended after the last character of the output text. For example, if the output text is "The weather today is good." and the supplementary word is "I", the output text after adding the supplementary word is "The weather today is good. I".
The preset word embedding matrix contains one vector for each word in the vocabulary. Multiplying the fourth vector by the preset word embedding matrix means multiplying each vector in the word embedding matrix by the fourth vector; each such multiplication yields one numerical value, and each value corresponds to one word in the preset vocabulary. Finally, the probability that each word contained in the preset vocabulary is used as a supplementary word of the output text is generated from the obtained values.
In one implementation, the probability of a word being a supplemental word to the output text may be determined by calculating the ratio of the value corresponding to the word to the sum of all values.
For example, if the preset vocabulary includes three words, the values corresponding to the words are 21, 14, and 15, respectively, then based on the obtained values, the probability corresponding to the first word in the preset vocabulary is 21/(21+14+15) =0.42, the probability corresponding to the second word is 0.28, and the probability corresponding to the third word is 0.3.
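A minimal sketch of step S106G and the ratio-based normalization described above, for illustration only; the three-word vocabulary, the embedding matrix and the fourth vector are hypothetical values.

```python
import numpy as np

# hypothetical three-word vocabulary and word embedding matrix (one row per word)
vocab = ["today", "weather", "I"]
W_e = np.array([[2.0, 1.0],
                [1.0, 2.0],
                [3.0, 0.0]])
fourth_vector = np.array([4.0, 3.0])

values = W_e @ fourth_vector        # one numerical value per vocabulary word
probs = values / values.sum()       # ratio of each value to the sum of all values
print(dict(zip(vocab, np.round(probs, 3))))

# reproducing the worked example above: values 21, 14 and 15
print(np.array([21.0, 14.0, 15.0]) / 50.0)   # -> [0.42 0.28 0.3]
```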
Step S106H: based on the obtained word probabilities, complementary words of the output text are determined.
In one case, the word corresponding to the highest of the word probabilities may be selected as the supplementary word of the output text.

In another case, one of the words whose word probability is greater than a preset probability threshold may be selected as the supplementary word of the output text.

In a third case, the supplementary word may be determined at random according to the word probabilities, where the greater a word's probability, the more likely that word is to be determined as the supplementary word of the output text.
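The three selection strategies of step S106H can be sketched as follows; the vocabulary, the probability values and the threshold are illustrative only, and the function is a hypothetical helper rather than part of the claimed method.

```python
import numpy as np

def pick_supplementary_word(vocab, probs, strategy="argmax", threshold=0.3, rng=None):
    """Pick a supplementary word according to one of the three cases described above."""
    probs = np.asarray(probs, dtype=float)
    if strategy == "argmax":                       # case 1: highest probability wins
        return vocab[int(np.argmax(probs))]
    if strategy == "threshold":                    # case 2: any word above the threshold
        candidates = [w for w, p in zip(vocab, probs) if p > threshold]
        # arbitrarily take the first qualifying word; fall back to argmax if none qualify
        return candidates[0] if candidates else vocab[int(np.argmax(probs))]
    if strategy == "sample":                       # case 3: random draw weighted by probability
        if rng is None:
            rng = np.random.default_rng()
        return str(rng.choice(vocab, p=probs / probs.sum()))
    raise ValueError(f"unknown strategy: {strategy}")

vocab = ["today", "weather", "I"]
probs = [0.42, 0.28, 0.30]
print(pick_supplementary_word(vocab, probs, "argmax"))     # -> "today"
print(pick_supplementary_word(vocab, probs, "threshold"))  # -> "today" (only word above 0.3)
print(pick_supplementary_word(vocab, probs, "sample"))     # random, weighted by the probabilities
```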
Step S106I: judging whether the output text added with the supplementary word meets the preset supplementary ending condition, if not, executing step S106J; if yes, step S106K is executed.
The supplementary completion condition may be a predetermined number of words, or may be a supplementary word indicating completion.
In addition, it may be determined whether the output text after adding the supplementary word satisfies the preset supplementary end condition, or whether the instruction for adding the supplementary word is received within the preset time interval, and step S106J may be executed only when the instruction for adding the supplementary word is received within the preset time interval, or step S106K may be executed.
Step S106J: the output text is updated to the output text with the supplementary word added, and the process returns to step S106B.
When the output text with the supplementary word added does not meet the preset supplementary ending condition, a further supplementary word still needs to be added to the output text. In this case, the output text is updated to the output text with the supplementary word added, and the process returns to the step of obtaining, based on the word embedding matrix corresponding to the preset vocabulary, the word embedding vector corresponding to each output word in the output text in a word embedding mode; that is, steps S106B-S106H are executed again on the basis of the output text with the supplementary word added, so as to determine a new supplementary word.
Step S106K: the output text after adding the supplementary word is determined as the image description text.
When the output text with the supplementary words added meets the preset supplementary ending condition, no further supplementary words need to be added to the output text, and the output text with the supplementary words added is therefore determined as the image description text.
In one embodiment of the present invention, the above steps S106A-S106K may be implemented by a GPT-2 model. The GPT-2 model is a language model comprising multiple layers of transformer decoders that employ a masked self-attention mechanism, with the output of each decoder layer serving as the input of the next decoder layer; the model processes input data containing position information. The process of implementing the scheme provided by this embodiment using the GPT-2 model can be expressed as the following expressions:
h_0 = U·W_e + W_p

h_l = transformer_block(h_{l-1}), l ∈ [1, n]

P(u) = softmax(h_n · W_e^T)

wherein h_0 represents the embedded input text fed into the GPT-2 model, W_p represents the position encoding matrix of the input text, W_e represents the word embedding matrix of the input text, U represents the one-hot encoding of the input text, h_l represents the decoding information obtained after the input is decoded by the l-th decoder layer, h_{l-1} represents the decoding information obtained after the input is decoded by the (l-1)-th decoder layer, l represents the decoder layer index, n represents the number of decoder layers in the GPT-2 model, transformer_block(·) represents the decoding operation performed by a decoder layer on its input, P(u) represents the probability of the predicted next word, and h_n represents the decoding information obtained after the input is decoded by the n-th decoder layer.
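Read together, these three expressions describe an ordinary decoder-stack forward pass. The numpy sketch below is for illustration only; transformer_block is a hypothetical callable standing in for one masked self-attention decoder layer, not an implementation of it, and the toy sizes are assumptions.

```python
import numpy as np

def gpt2_forward(U, W_e, W_p, transformer_blocks):
    """Forward pass following h_0 = U·W_e + W_p, h_l = transformer_block(h_{l-1}), P(u) = softmax(h_n·W_e^T).

    U:   (seq_len, vocab_size) one-hot encoding of the input text
    W_e: (vocab_size, d) word embedding matrix
    W_p: (seq_len, d) position encoding matrix
    transformer_blocks: list of n callables, each mapping (seq_len, d) -> (seq_len, d)
    """
    h = U @ W_e + W_p                          # h_0: embedded input plus position codes
    for block in transformer_blocks:           # h_l = transformer_block(h_{l-1}), l = 1..n
        h = block(h)
    logits = h @ W_e.T                         # output projection tied to the embedding matrix
    logits -= logits.max(axis=-1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)   # P(u): next-word probabilities per position

# toy usage: sequence of 3 tokens, vocabulary of 5 words, two identity "decoder layers"
rng = np.random.default_rng(0)
U = np.eye(5)[[0, 2, 4]]
W_e, W_p = rng.normal(size=(5, 8)), rng.normal(size=(3, 8))
print(gpt2_forward(U, W_e, W_p, [lambda h: h, lambda h: h]).shape)   # (3, 5)
```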
Before the GPT-2 model is used to implement the above steps S106A-S106K, the model needs to be pre-trained. When the model is then fine-tuned in a supervised manner using a labeled data set C, each sample in the data set consists of a word sequence {x^1, …, x^m}, and the process of training the GPT-2 model with one sample can be expressed as the following expressions:
P(y | x^1, …, x^m) = softmax(h_l^m · W_y)

L_2(C) = Σ_{(x, y)} log P(y | x^1, …, x^m)

L_3(C) = L_2(C) + λ·L_1(C)

wherein P(y | x^1, …, x^m) represents the conditional probability of the class label y, h_l^m represents the output of the transformer decoder stack for the input word sequence, W_y represents the parameters contained in the linear output layer of the model to be trained, L_1(C) represents a first preset loss function, L_2(C) represents a second preset loss function, L_3(C) represents a third preset loss function, and λ represents an optimization parameter.
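A minimal sketch of how the combined objective L_3(C) = L_2(C) + λ·L_1(C) is assembled, assuming the two component terms have already been computed as sums of log-probabilities; the concrete forms of L_1(C) and L_2(C) depend on the training setup, so the inputs below are placeholders.

```python
import numpy as np

def combined_fine_tune_objective(log_p_labels, log_p_tokens, lam):
    """L_3(C) = L_2(C) + lam * L_1(C), with both terms written as objectives to maximize.

    log_p_labels: log P(y | x^1..x^m) for each labelled sample (supervised term L_2)
    log_p_tokens: per-token log-probabilities from the auxiliary objective L_1
    lam:          the optimization parameter lambda
    """
    L2 = float(np.sum(log_p_labels))
    L1 = float(np.sum(log_p_tokens))
    return L2 + lam * L1

# toy usage with placeholder log-probabilities
print(combined_fine_tune_objective([-0.2, -0.5], [-1.1, -0.7, -0.3], lam=0.5))   # -> -1.75
```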
Therefore, in the scheme provided by this embodiment of the present invention, when the output text containing the obtained output words meets the preset output ending condition, supplementary words are added to that output text, and the output text with the supplementary words added is determined as the image description text only once it meets the supplementary ending condition. This improves the description accuracy of the image description text while making its content richer and the described image content more detailed.
Corresponding to the image description text generation method, the embodiment of the invention also provides an image description text generation device.
Referring to fig. 6, there is provided a schematic structural view of a first image description text generating apparatus, the apparatus comprising:
the target detection module 601 is configured to detect a target area where a target in an image to be described is located;
the average value calculating module 602 is configured to calculate average pixel values of pixel points at the same position in each target area, so as to obtain an area feature including each average pixel value;
a first feature extraction module 603, configured to perform feature extraction on the first input information based on the long short-term memory LSTM feature extraction manner to obtain a first hidden feature; generate a weight coefficient of each target area based on the features of the target area and the first hidden feature; and perform weighted calculation on the pixel values of the pixel points at the same position in each target area based on the generated weight coefficients to obtain first output information, wherein the initial value of the first input information is the area feature;
an information determining module 604, configured to determine that the second input information is information including a first hidden feature, first output information, and a second hidden feature, where an initial value of the second hidden feature is a null feature;
A second feature extraction module 605, configured to perform feature extraction on the second input information based on the LSTM feature extraction manner, to obtain a second hidden feature; based on the second hidden feature, obtaining an output word with highest output probability in a preset vocabulary;
an information updating module 606, configured to update the first input information to information containing the first hidden feature, the obtained output word and the regional feature, and to trigger the first feature extraction module 603 again, until the output text containing the obtained output words meets a preset output end condition, whereupon the output text is determined as the image description text.
Therefore, in the scheme provided by this embodiment of the present invention, each target area contains the features of a target, and the area feature consists of the average pixel values of the pixel points at the same position in each target area, so the area feature can represent the overall image features of the image to be described and the integrity of the image information is preserved. When the weight coefficient of a target area is generated, both the features of the target area and the first hidden feature are taken into account; the first hidden feature is obtained by feature extraction on the first input information, whose initial value is the area feature, and in each subsequent cycle the first input information contains the first hidden feature, the obtained output word and the area feature from the previous cycle, so the generated weight coefficients can accurately reflect the importance of each target area in the current cycle. In this way, each cycle yields an output word that fits both the text context formed by the output words obtained so far and the image information of the image to be described. Consequently, applying the scheme provided by this embodiment of the present invention to generate image description text can improve the accuracy of the generated image description text.
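For illustration, the following minimal numpy sketch covers the area feature computed by the average value calculating module 602 and the weighted calculation performed by the first feature extraction module 603, assuming the target areas have already been brought to a common size; the weight coefficients are passed in as an already-computed vector, since their generation from the first hidden feature is not reproduced here.

```python
import numpy as np

def area_feature(target_areas):
    """Average pixel value at each position across the target areas.

    target_areas: (num_areas, H, W, C) pixel values of equally sized target areas
    """
    return target_areas.mean(axis=0)

def weighted_first_output(target_areas, weight_coefficients):
    """Weight each target area by its coefficient and sum pixel values at the same position."""
    w = np.asarray(weight_coefficients, dtype=float).reshape(-1, 1, 1, 1)
    return (w * target_areas).sum(axis=0)

# toy usage: three 7x7 target areas with 3 channels and hypothetical weight coefficients
areas = np.random.default_rng(0).random((3, 7, 7, 3))
print(area_feature(areas).shape)                            # (7, 7, 3)
print(weighted_first_output(areas, [0.2, 0.5, 0.3]).shape)  # (7, 7, 3)
```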
In an embodiment of the present invention, referring to fig. 7, a schematic structural diagram of a second image description text generating device is provided. Compared with the embodiment shown in fig. 6, in this embodiment the target detection module 601 includes:
the image transformation submodule 601A is used for carrying out multi-layer convolution transformation on an image to be described to obtain a characteristic image;
a region determining submodule 601B, configured to determine a candidate region including a feature of the target in the feature image;
the target determining submodule 601C is configured to determine an area corresponding to the candidate area in the image to be described as a target area where the target in the image to be described is located.
Therefore, in the scheme provided by this embodiment of the present invention, the image to be described is subjected to multi-layer convolution transformation to obtain a feature image, a candidate region containing the features of the target is determined in the feature image, and finally the region corresponding to the candidate region in the image to be described is determined as the target region where the target in the image to be described is located. Since the feature image is obtained by feature extraction from the image to be described, it clearly reflects the features of the target, so the candidate region containing the features of the target can be determined accurately in the feature image; and since there is a correspondence between the pixel points of the image to be described and those of the feature image, the region where the target is located can be determined in the image to be described from that candidate region by using the correspondence.
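A minimal sketch of the multi-layer convolution transformation, for illustration only: a stack of plain 2-D convolutions with ReLU activations applied to a single-channel image. The kernels, the kernel size and the ReLU non-linearity are illustrative assumptions rather than the network actually used.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution of a single-channel image with one kernel."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def feature_image(image, kernels):
    """Apply several convolution layers (each followed by ReLU) to obtain a feature image."""
    x = image
    for k in kernels:
        x = np.maximum(conv2d(x, k), 0.0)
    return x

img = np.random.default_rng(0).random((32, 32))
kernels = [np.random.default_rng(i).normal(size=(3, 3)) for i in range(3)]
print(feature_image(img, kernels).shape)   # (26, 26)
```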
In one embodiment of the present invention, the target determination submodule 601C is specifically configured to:
performing region scaling treatment on the candidate region to obtain a first region with a first preset size;
and carrying out maximum pooling treatment on the first area to obtain a second area with a second preset size, and determining an area corresponding to the second area in the image to be described as a target area where a target in the image to be described is located.
Therefore, in the scheme provided by this embodiment of the present invention, the candidate region is converted, through the region scaling treatment and the maximum pooling treatment, into a second region of fixed size, and the region corresponding to the second region in the image to be described is determined as the target region where the target in the image to be described is located. Since the size of the second region is fixed, the size of the corresponding target region in the image to be described is also fixed, so every target region determined in the image to be described has the same size, which in turn makes it convenient to calculate the average pixel value of the pixel points at the same position in each target region.
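A minimal sketch of the two-stage resizing described above: scale the candidate region to a first preset size, then max-pool it down to a second preset size. The nearest-neighbour scaling and the concrete sizes (14×14, then 7×7) are illustrative assumptions.

```python
import numpy as np

def scale_region(region, out_h, out_w):
    """Nearest-neighbour scaling of a candidate region to the first preset size."""
    h, w = region.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return region[rows][:, cols]

def max_pool(region, pool):
    """Non-overlapping max pooling down to the second preset size."""
    h, w = region.shape[:2]
    cropped = region[: h - h % pool, : w - w % pool]
    return cropped.reshape(h // pool, pool, w // pool, pool, -1).max(axis=(1, 3))

candidate = np.random.default_rng(0).random((23, 37, 3))   # candidate region of arbitrary size
first_region = scale_region(candidate, 14, 14)             # first preset size: 14 x 14
second_region = max_pool(first_region, 2)                  # second preset size: 7 x 7
print(first_region.shape, second_region.shape)             # (14, 14, 3) (7, 7, 3)
```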
In one embodiment of the present invention, the first feature extraction module 603 is specifically configured to:
Inputting first input information into a first sub-model of a text generation model to obtain the first hidden feature and the first output information, wherein the text generation model is a model trained in advance for generating image description text, and the text generation model comprises: a first sub-model and a second sub-model, the first sub-model being a model employing a top-down attention mechanism;
the second feature extraction module 605 is specifically configured to:
and inputting the second input information into a second sub-model to obtain a second hidden feature and an output word, wherein the second sub-model is a model obtained based on language model transformation, and the language model is a model for predicting the occurrence probability of the word in the output text according to the model input information.
Therefore, in the scheme provided by this embodiment of the present invention, the generation of the image description text is completed by using the text generation model. Because the text generation model is a pre-trained model, the image description text generated by its first sub-model and second sub-model describes the image with higher accuracy.
In one embodiment of the present invention, the information updating module 606 is specifically configured to:
Updating the first input information to information containing the first hidden feature, the obtained output word and the regional feature, and triggering the first feature extraction module 603 until the output text containing the obtained output word meets the preset output ending condition;
based on a word embedding matrix corresponding to a preset vocabulary, obtaining word embedding vectors corresponding to each output word in the output text in a word embedding mode;
for each output word in the output text, encoding the output word according to an encoding mode corresponding to the position information of the output word to obtain a first word vector containing the position encoding information of the output word, wherein the encoding mode is sine encoding or cosine encoding (a sketch of such position coding is given after this module description);
for each output word in the output text, multiplying a first word vector of the output word with a first preset matrix to obtain a first vector for representing the output word, multiplying the first word vector of the output word with a second preset matrix to obtain a second vector for representing a label of the output word, and multiplying the first word vector of the output word with a third preset matrix to obtain a third vector for representing the word meaning of the output word;
Multiplying a first vector corresponding to the last output word in the output text by a second vector corresponding to each output word in the output text respectively to obtain a plurality of first calculation results;
multiplying the first calculation result corresponding to each output word in the output text by a third vector to obtain a plurality of second calculation results, and adding the plurality of second calculation results to obtain a fourth vector;
multiplying the fourth vector with a preset word embedding matrix to obtain the probability that the words contained in the preset vocabulary are used as supplementary words of the output text;
determining a supplemental word of the output text based on the obtained word probabilities;
judging whether the output text added with the supplementary word meets a preset supplementary ending condition or not;
if not, updating the output text to the output text with the supplementary words added, and returning to the step of obtaining, based on the word embedding matrix corresponding to the preset vocabulary, the word embedding vector corresponding to each output word in the output text in a word embedding mode;
if yes, determining the output text with the supplementary words added as the image description text.

Therefore, in the scheme provided by this embodiment of the present invention, when the output text containing the obtained output words meets the preset output ending condition, supplementary words are added to that output text, and the output text with the supplementary words added is determined as the image description text only once it meets the supplementary ending condition, so that the content of the image description text is richer and the described image content is more detailed.
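As referenced in the module description above, the sine/cosine position coding of the output words can be sketched as follows; the sinusoidal frequency schedule is a common convention and is an assumption here, since the description only fixes that the coding mode is sine coding or cosine coding.

```python
import numpy as np

def position_codes(num_words, d, mode="sine"):
    """Per-position sine or cosine codes to be added to the word embedding vectors (d must be even)."""
    pos = np.arange(num_words)[:, None]                 # word positions 0 .. num_words-1
    freq = 1.0 / (10000 ** (np.arange(0, d, 2) / d))    # one frequency per pair of dimensions
    angles = pos * freq
    codes = np.zeros((num_words, d))
    if mode == "sine":
        codes[:, 0::2] = np.sin(angles)
        codes[:, 1::2] = np.cos(angles)
    else:                                               # "cosine": swap the two halves
        codes[:, 0::2] = np.cos(angles)
        codes[:, 1::2] = np.sin(angles)
    return codes

# first word vectors = word embedding vectors + position codes (5 output words, d = 8)
embeddings = np.random.default_rng(0).normal(size=(5, 8))
first_word_vectors = embeddings + position_codes(5, 8)
print(first_word_vectors.shape)   # (5, 8)
```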
The embodiment of the present invention further provides an electronic device, as shown in fig. 8, including a processor 801, a communication interface 802, a memory 803, and a communication bus 804, where the processor 801, the communication interface 802, and the memory 803 complete communication with each other through the communication bus 804,
a memory 803 for storing a computer program;
the processor 801, when executing the program stored in the memory 803, implements the following steps:
detecting a target area where a target in an image to be described is located;
calculating average pixel values of pixel points at the same positions in each target area to obtain area features containing the average pixel values;
based on a long-short-term memory LSTM feature extraction mode, performing feature extraction on the first input information to obtain a first hidden feature; generating a weight coefficient of each target area based on the characteristics of the target area and the first hidden characteristics; and carrying out weighted calculation on pixel values of pixel points at the same position in each target area based on the generated weight coefficient to obtain first output information, wherein the initial value of the first input information is as follows: regional characteristics;
determining second input information as information comprising a first hidden feature, first output information and a second hidden feature, wherein an initial value of the second hidden feature is a null feature;
Performing feature extraction on the second input information based on the LSTM feature extraction mode to obtain a second hidden feature; based on the second hidden feature, obtaining an output word with highest output probability in a preset vocabulary;
updating the first input information to information containing the first hidden feature, the obtained output word and the regional feature, and returning to the step of performing feature extraction on the first input information based on the long short-term memory LSTM feature extraction manner to obtain the first hidden feature, until the output text containing the obtained output words meets a preset output ending condition, whereupon the output text is determined as the image description text.
The communication bus mentioned above for the electronic device may be a peripheral component interconnect standard (Peripheral Component Interconnect, PCI) bus or an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, etc. The communication bus may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, only one bold line is shown in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The Memory may include random access Memory (Random Access Memory, RAM) or may include Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; but also digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
In yet another embodiment of the present invention, there is also provided a computer-readable storage medium having stored therein a computer program which, when executed by a processor, implements the steps of any of the image description text generation methods described above.
In yet another embodiment of the present invention, there is also provided a computer program product containing instructions that, when run on a computer, cause the computer to perform any of the image description text generation methods of the above embodiments.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, produces a flow or function in accordance with embodiments of the present invention, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in or transmitted from one computer-readable storage medium to another, for example, by wired (e.g., coaxial cable, optical fiber, digital Subscriber Line (DSL)), or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid State Disk (SSD)), etc.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In this specification, the embodiments are described in a related manner; identical and similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, for the apparatus, electronic device, computer-readable storage medium and computer program product embodiments, the description is relatively brief because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (10)

1. A method of generating an image description text, the method comprising:
detecting a target area where a target in an image to be described is located;
calculating average pixel values of pixel points at the same positions in each target area to obtain area features containing the average pixel values;
based on a long-short-term memory LSTM feature extraction mode, performing feature extraction on the first input information to obtain a first hidden feature; generating a weight coefficient of each target area based on the characteristics of the target area and the first hidden characteristics; and carrying out weighted calculation on pixel values of pixel points at the same position in each target area based on the generated weight coefficient to obtain first output information, wherein the initial value of the first input information is as follows: the regional characteristics;
determining second input information as information containing the first hidden feature, first output information and second hidden feature, wherein an initial value of the second hidden feature is an empty feature;
Performing feature extraction on the second input information based on the LSTM feature extraction mode to obtain the second hidden feature; based on the second hidden feature, obtaining an output word with highest output probability in a preset vocabulary;
updating the first input information to information containing the first hidden feature, the obtained output word and the regional feature, and returning to the step of performing feature extraction on the first input information based on the long short-term memory LSTM feature extraction manner to obtain the first hidden feature, until the output text containing the obtained output words meets a preset output ending condition, whereupon the output text is determined as the image description text.
2. The method according to claim 1, wherein detecting the target area where the target is located in the image to be described includes:
performing multi-layer convolution transformation on the image to be described to obtain a characteristic image;
determining candidate areas containing the characteristics of the target in the characteristic image;
and determining the region corresponding to the candidate region in the image to be described as a target region where the target in the image to be described is located.
3. The method according to claim 2, wherein determining the region of the image to be described corresponding to the candidate region as the target region of the image to be described where the target is located, comprises:
Performing region scaling on the candidate region to obtain a first region with a first preset size;
and carrying out maximum pooling treatment on the first region to obtain a second region with a second preset size, and determining a region corresponding to the second region in the image to be described as a target region where a target in the image to be described is located.
4. The method of claim 1, wherein the feature extraction is performed on the first input information based on a long-short-term memory LSTM feature extraction method to obtain a first hidden feature; generating a weight coefficient of each target area based on the characteristics of the target area and the first hidden characteristics; based on the generated weight coefficient, carrying out weighted calculation on pixel values of pixel points at the same position in each target area to obtain first output information, wherein the first output information comprises:
inputting the first input information into a first sub-model of a text generation model to obtain the first hidden feature and the first output information, wherein the text generation model is a model which is obtained through training in advance and is used for generating image description text, and the text generation model comprises: a first sub-model and a second sub-model, the first sub-model being a model employing a top-down attention mechanism;
wherein the performing feature extraction on the second input information based on the LSTM feature extraction manner to obtain the second hidden feature, and obtaining, based on the second hidden feature, an output word with the highest output probability in a preset vocabulary, comprises:
and inputting the second input information into the second sub-model to obtain the second hidden feature and the output word, wherein the second sub-model is a model obtained based on language model transformation, and the language model is a model for predicting the occurrence probability of the word in the model output text according to the model input information.
5. The method according to any one of claims 1-4, wherein determining the output text as image description text until the output text containing the obtained output word meets a preset output end condition, comprises:
if the output text containing the obtained output words meets the preset output ending condition, obtaining word embedding vectors corresponding to the output words in the output text in a word embedding mode based on a word embedding matrix corresponding to the preset vocabulary;
for each output word in the output text, coding the output word according to a coding mode corresponding to the position information of the output word to obtain a first word vector containing the position coding information of the output word, wherein the coding mode is sine coding or cosine coding;
Multiplying a first word vector of each output word in the output text with a first preset matrix to obtain a first vector used for representing the output word, multiplying the first word vector of the output word with a second preset matrix to obtain a second vector used for representing the label of the output word, and multiplying the first word vector of the output word with a third preset matrix to obtain a third vector used for representing the word meaning of the output word;
multiplying a first vector corresponding to the last output word in the output text by a second vector corresponding to each output word in the output text respectively to obtain a plurality of first calculation results;
multiplying the first calculation result corresponding to each output word in the output text by a third vector to obtain a plurality of second calculation results, and adding the plurality of second calculation results to obtain a fourth vector;
multiplying the fourth vector by the preset word embedding matrix to obtain the probability that the words contained in the preset vocabulary are used as the supplementary words of the output text;
determining a supplemental word of the output text based on the obtained word probabilities;
judging whether the output text added with the supplementary word meets a preset supplementary ending condition or not;
if not, updating the output text to the output text with the supplementary words added, and returning to the step of obtaining, based on the word embedding matrix corresponding to the preset vocabulary, the word embedding vector corresponding to each output word in the output text in a word embedding mode;
and if so, determining the output text added with the supplementary word as the image description text.
6. An image description text generation apparatus, the apparatus comprising:
the target detection module is used for detecting a target area where a target in the image to be described is located;
the average value calculation module is used for calculating average pixel values of pixel points at the same position in each target area to obtain area characteristics containing the average pixel values;
the first feature extraction module is used for carrying out feature extraction on the first input information based on a long-short-term memory LSTM feature extraction mode to obtain a first hidden feature; generating a weight coefficient of each target area based on the characteristics of the target area and the first hidden characteristics; and carrying out weighted calculation on pixel values of pixel points at the same position in each target area based on the generated weight coefficient to obtain first output information, wherein the initial value of the first input information is as follows: the regional characteristics;
The information determining module is used for determining second input information as information containing the first hidden feature, the first output information and the second hidden feature, wherein an initial value of the second hidden feature is an empty feature;
the second feature extraction module is used for extracting features of the second input information based on the LSTM feature extraction mode to obtain the second hidden features; based on the second hidden feature, obtaining an output word with highest output probability in a preset vocabulary;
and an information updating module, which is used for updating the first input information to information containing the first hidden feature, the obtained output word and the regional feature, and for returning to the step of performing feature extraction on the first input information based on the long short-term memory LSTM feature extraction manner to obtain the first hidden feature, until the output text containing the obtained output words meets a preset output ending condition, whereupon the output text is determined as the image description text.
7. The apparatus of claim 6, wherein the first feature extraction module is specifically configured to: inputting the first input information into a first sub-model of a text generation model to obtain the first hidden feature and the first output information, wherein the text generation model is a model which is obtained through training in advance and is used for generating image description text, and the text generation model comprises: a first sub-model and a second sub-model, the first sub-model being a model employing a top-down attention mechanism;
The second feature extraction module is specifically configured to: and inputting the second input information into the second sub-model to obtain the second hidden feature and the output word, wherein the second sub-model is a model obtained based on language model transformation, and the language model is a model for predicting the occurrence probability of the word in the model output text according to the model input information.
8. The apparatus according to claim 6 or 7, wherein the information updating module is specifically configured to:
updating the first input information to information containing the first hidden feature, the obtained output word and the regional feature, and returning to the step of performing feature extraction on the first input information based on the long short-term memory LSTM feature extraction manner to obtain the first hidden feature, until the output text containing the obtained output words meets a preset output ending condition;
based on the word embedding matrix corresponding to the preset vocabulary, obtaining word embedding vectors corresponding to each output word in the output text in a word embedding mode;
for each output word in the output text, coding the output word according to a coding mode corresponding to the position information of the output word to obtain a first word vector containing the position coding information of the output word, wherein the coding mode is sine coding or cosine coding;
Multiplying a first word vector of each output word in the output text with a first preset matrix to obtain a first vector used for representing the output word, multiplying the first word vector of the output word with a second preset matrix to obtain a second vector used for representing the label of the output word, and multiplying the first word vector of the output word with a third preset matrix to obtain a third vector used for representing the word meaning of the output word;
multiplying a first vector corresponding to the last output word in the output text by a second vector corresponding to each output word in the output text respectively to obtain a plurality of first calculation results;
multiplying the first calculation result corresponding to each output word in the output text by a third vector to obtain a plurality of second calculation results, and adding the plurality of second calculation results to obtain a fourth vector;
multiplying the fourth vector by the preset word embedding matrix to obtain the probability that the words contained in the preset vocabulary are used as the supplementary words of the output text;
determining a supplemental word of the output text based on the obtained word probabilities;
judging whether the output text added with the supplementary word meets a preset supplementary ending condition or not;
if not, updating the output text to the output text with the supplementary words added, and returning to the step of obtaining, based on the word embedding matrix corresponding to the preset vocabulary, the word embedding vector corresponding to each output word in the output text in a word embedding mode;
and if so, determining the output text added with the supplementary word as the image description text.
9. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor for carrying out the method steps of any one of claims 1-5 when executing a program stored on a memory.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored therein a computer program which, when executed by a processor, implements the method steps of any of claims 1-5.