CN113609326A - Image description generation method based on external knowledge and target relation - Google Patents

Image description generation method based on external knowledge and target relation

Info

Publication number
CN113609326A
CN113609326A
Authority
CN
China
Prior art keywords
image
layer
similarity
data set
vectors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110982666.1A
Other languages
Chinese (zh)
Other versions
CN113609326B (en)
Inventor
李志欣
陈天宇
张灿龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Normal University
Original Assignee
Guangxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Normal University filed Critical Guangxi Normal University
Priority to CN202110982666.1A priority Critical patent/CN113609326B/en
Publication of CN113609326A publication Critical patent/CN113609326A/en
Application granted granted Critical
Publication of CN113609326B publication Critical patent/CN113609326B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image description generation method based on external knowledge and the relationships between targets, which comprises the following stages: 1) classifying the data sets; 2) extracting external semantic knowledge; 3) extracting target-region features with Faster R-CNN; 4) processing the input features in the encoder; 5) processing the encoder output in the decoder; 6) generating description sentences for test images. The method integrates the visual, image-position, and semantic relationships among different targets in an image and, through these inter-target relationships and human common-sense knowledge, mines higher-level and more abstract image features, so that more vivid and accurate image description sentences are generated; computing the relationships from multiple angles makes the mining of inter-target relationships more sufficient and reasonable.

Description

Image description generation method based on external knowledge and target relation
Technical Field
The invention relates to the technical field of image description generation, in particular to an image description generation method based on external knowledge and the relation between targets.
Background
With the popularization of networks and digital devices, image data from various media are growing rapidly, and automatic image description generation has very broad application prospects, such as early childhood education and visual assistance for the blind. The task spans the two fields of computer vision and natural language processing and therefore has very important research significance.
Image description generation has been a very active research area since the 1960s; the early, widely used technologies mainly included search-based methods and template-based methods. In a search-based method, given an image, a similar image and its description sentence are retrieved from an existing image library; the drawback is obvious, namely poor robustness to new images. A template-based method splits the sentence into templates such as <subject, predicate, object> and fills in the sentence content according to the image content; its main drawback is that the generated sentences are rigid and inflexible.
Later, with the rapid development of deep learning, many innovative methods applying deep learning to image description generation appeared. Inspired by natural language processing, Fei-Fei Li et al. proposed the NIC model in 2015, which applies the encoder-decoder model from the NLP field to image description generation: the model encodes the image information into a fixed-length vector and passes it to a decoder that generates words one by one. However, not every word requires the information of the whole image. To better refine the image features used for sentence generation, the attention mechanism was later fused into the NIC model so that the model can focus on the most useful part of the image when generating different words; this not only greatly improved performance but also set off a wave of work fusing attention mechanisms into image description. Many later works are improvements on this basis, such as the semantic attention mechanism, which uses the target semantic information in the image as the object of attention allocation. To overcome the problem that earlier attention mechanisms mostly allocate attention weights over uniform image blocks, a combined bottom-up and top-down attention mechanism was later proposed. Its core innovation is to use a target detection module such as Faster R-CNN to extract target-region features: the bottom-up mechanism allocates attention over target features, the top-down mechanism allocates attention over uniform blocks of the whole image, and the decoder uses a two-layer long short-term memory unit to combine the hidden states produced by the two mechanisms. This work is considered a further milestone in image description generation.
Later, as the Transformer model became popular in the NLP field, many Transformer-based image description methods were developed and showed better performance than most conventional methods. Compared with the Transformer model used for natural language processing, these methods improve the input position coding and the encoder part of the attention-based model to better suit image input.
However, current methods cannot integrate the abstract, high-level feature of the relationships between image targets into the attention mechanism. According to human common sense, the relationships between the targets in an image also contain important information; for example, when an image contains a football, people are likely to appear around it. How to exploit such semantic information, which has important guiding significance for sentence generation, is well worth studying.
Disclosure of Invention
Aiming at the defect that conventional image description generation methods cannot effectively utilize the semantic relationships between image targets, the invention provides an image description generation method based on external knowledge and the relationships between targets. The method integrates the visual, image-position, and semantic relationships among different targets in the image and, through these inter-target relationships and human common-sense knowledge, mines higher-level and more abstract image features, so that more vivid and accurate image description sentences are generated; computing the relationships from multiple angles makes the mining of inter-target relationships more sufficient and reasonable.
The technical scheme for realizing the purpose of the invention is as follows:
the image description generation method based on the external knowledge and the relationship between the targets comprises the following steps:
1) classifying the data set: dividing the data set into two main classes, wherein the first main class is a knowledge data set for extracting external knowledge; the second large class is an experimental data set and is divided into 3 subclasses, namely a training data set, a verification data set and a test data set;
2) external semantic knowledge extraction stage:
2.1) obtaining the 3000 most frequent categories in the knowledge data set through a statistical algorithm, and then counting how often every two target categories co-occur to obtain a 3000 × 3000 category co-occurrence probability matrix;
2.2) selecting the 200 most frequent attribute categories in the knowledge data set to obtain an attribute matrix for the 3000 target categories with dimensionality 3000 × 200, and then calculating the JS divergence between every two categories as their attribute similarity to obtain a 3000 × 3000 attribute similarity matrix;
2.3) normalizing the category co-occurrence probability matrix and the attribute similarity matrix row-wise, i.e. so that each row sums to 1;
2.4) performing category replacement on the experimental data sets, i.e. each category in an experimental data set is given an index in the knowledge data set, where the category information of the co-occurrence probability matrix obtained in step 2.1) and of the attribute similarity matrix obtained in step 2.2) is represented by synsets;
3) extracting target-region features with Faster R-CNN:
3.1) pre-training Faster R-CNN (from GitHub) on the training split of the experimental data set, extracting the image features of the training data set with the ResNet-101 in the pre-trained model, discarding the last two fully connected layers of ResNet-101, and taking the image features from the penultimate layer of ResNet-101 as the input of the next step;
3.2) feeding the features obtained in step 3.1) through an RPN to generate candidate boxes for the target regions in the image and the category of each candidate box, where the category is binary, background or foreground (the foreground being a target object), and deleting candidate boxes whose overlap rate exceeds 0.7 by non-maximum suppression;
3.3) uniformly converting the remaining candidate boxes into 14 × 2048 vectors through the RoI pooling layer in Faster R-CNN, and then feeding them into an additional CNN layer to predict the category of each region box and a refined target region box;
3.4) generating 2048-dimensional feature vectors using average pooling as input to the encoder;
4) the encoder processes the input features:
4.1) reducing the 2048-dimensional image features to 512 dimensions through a fully connected layer, then passing the 512-dimensional image features through a ReLU activation layer and a dropout layer;
4.2) converting the input image features into the three vectors Q, K, and V through three linear matrices and performing the multi-head calculation with the number of heads set to 8, which is used in the next step to calculate the similarity between the features of the target regions;
4.3) for each of the 8 heads, calculating the visual similarity ω^A_mn with the Q-K vector similarity calculation of the traditional Transformer model;
4.4) transforming the coordinates of every pair of target boxes in the image to obtain the relative position information λ(m, n) of the two targets, where λ(m, n) encodes the positional correlation between them;
4.5) embedding the λ(m, n) obtained in step 4.4) with a sinusoid function, multiplying it by the linear transformation layer W_G, and then applying the nonlinear activation function ReLU to obtain the image position similarity ω^G_mn of target m and target n;
4.6) labeling and storing every target category detected by Faster R-CNN, finding the corresponding rows in the category co-occurrence probability matrix and the attribute similarity matrix obtained in step 2), and obtaining the semantic similarity ω^E_mn of every two categories in the image;
4.7) fusing ω^A_mn and ω^G_mn into ω^AG_mn through a softmax operation, then adding an attention coefficient a to ω^E_mn so that ω^AG_mn receives attention (1 - a); a suitable value of a is obtained through the subsequent training process, giving the pairwise category similarity w_mn that fuses visual information, position information, and external knowledge;
4.8) multiplying the similarity matrix w_mn calculated with the Q and K vectors by the V vectors to obtain weighted region features that fuse the target relationships in the image, and then concatenating the 8 V vectors;
4.9) after a residual connection and layer normalization, feeding the V vector obtained in step 4.8) into a feedforward neural network formed by two fully connected layers; after another residual connection and layer normalization, the output of the feedforward neural network is fed into the next encoder layer, and after the operations of 6 encoder layers in total the output is transmitted to the decoder;
5) the decoder processes the output from the encoder:
5.1) applying positional encoding to the word information of the ground-truth description sentence corresponding to each training picture in the training data set;
5.2) passing the position-encoded word vectors obtained in step 5.1) through Masked Multi-Head Attention to obtain weighted sentence word feature vectors, which serve as the V vectors of the multi-head self-attention in the next step of the first layer;
5.3) converting the output of the last layer of the encoder into Q and K vectors through two linear conversion layers, and then carrying out multi-head self-attention operation on the Q and K vectors and the V vectors obtained in the step 5.2) to obtain the V vectors after the similarity information is fused;
5.4) after a residual connection and layer normalization, passing the V vector obtained in step 5.3) to a feedforward neural network, and after another residual connection and layer normalization taking its output as the input of the next decoder layer;
5.5) unlike the first layer, the second decoder layer has no Masked Multi-Head Attention operation but directly performs the multi-head self-attention calculation, with the Q, K, and V vectors all obtained by transforming the output of the previous decoder layer with three linear matrices;
5.6) after the operations of 6 decoder layers in total, the output vector passes through a linear layer and a softmax layer to obtain the probability vector of the next word;
6) generating description sentences for test images:
6.1) inputting a test set image, extracting image target area characteristics from the trained model and calculating similarity;
6.2) taking the image characteristics weighted by the similarity coefficient as the input of an encoder-decoder framework, and gradually outputting the description sentence word probability of each decoded image;
6.3) decoding with beam search with the beam size set to 2, finally obtaining the evaluation-metric score of each output sentence and taking the highest-scoring sentence as the test result.
For each data set, the number of training images is far greater than the number of test images. After each training epoch, the validation-set images are used for an interim evaluation, the results are recorded, and a checkpoint is saved so that, if training is interrupted, it can resume next time from where it stopped. When training is finished, the model selected is not simply the one with the most training rounds but the intermediate model with the best validation effect, as sketched below.
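A minimal PyTorch-style sketch of this checkpoint-and-validation loop is shown below; model, optimizer, train_one_epoch, and evaluate are generic placeholders rather than the exact components of the embodiment, and the checkpoint file names are illustrative.

import torch

def train_with_checkpoints(model, optimizer, train_one_epoch, evaluate,
                           num_epochs, ckpt_path="checkpoint.pth",
                           best_path="best_model.pth"):
    """Hypothetical training loop: save a resumable checkpoint every epoch
    and keep the intermediate model with the best validation score."""
    best_score = float("-inf")
    for epoch in range(num_epochs):
        train_one_epoch(model, optimizer)          # one pass over the training set
        score = evaluate(model)                    # interim check on the validation set
        # Save everything needed to resume if training is interrupted.
        torch.save({"epoch": epoch,
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "val_score": score}, ckpt_path)
        # Keep the best intermediate model rather than simply the last epoch.
        if score > best_score:
            best_score = score
            torch.save(model.state_dict(), best_path)
    return best_score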
Compared with the prior art, the technical scheme has the following characteristics:
(1) an encoder-decoder structure based on the Transformer framework innovatively introduces the Transformer approach into the field of image description generation, and the results are found to be much better than those of conventional methods;
(2) external knowledge based on human common sense is introduced; according to human common sense, many objects in an image appear in pairs, such as 'person' and 'football', and under the guidance of this external knowledge the accuracy of the description-sentence words can be greatly improved and the sentences become more human-like;
(3) unlike the conventional Q-K vector similarity calculation of the Transformer architecture, the method not only computes the visual relationships of different regions but also integrates the positional and semantic relationships.
The method integrates the visual, image-position, and semantic relationships among different targets in the image and, through these inter-target relationships and human common-sense knowledge, mines higher-level and more abstract image features, so that more vivid and accurate image description sentences are generated; computing the relationships from multiple angles makes the mining of inter-target relationships more sufficient and reasonable.
Drawings
FIG. 1 is a schematic diagram of the overall framework of the embodiment;
fig. 2 is a schematic diagram of a self-attention computing process.
Detailed Description
The invention will be further elucidated with reference to the drawings and examples, without however being limited thereto.
Referring to fig. 1, an image description generation method based on external knowledge and a relationship between objects includes the steps of:
1) classifying the data sets: the data sets are divided into two major classes; the first is the knowledge data set, Visual Genome, used for extracting external knowledge, and the second is the experimental data set (such as MSCOCO, Flickr8K and the like), which is divided into 3 subsets, namely a training data set, a validation data set, and a test data set. In this embodiment, the Karpathy split of MSCOCO2014 is adopted for training, validation, and testing: the training set contains images and their corresponding description sentences and is used to train the parameters of the model; the validation set is used after each training epoch to verify the effect of that round so as to keep the best result over all rounds; and the test set is used to obtain the performance of the final model;
2) external semantic knowledge extraction stage:
2.1) obtaining the 3000 most frequent categories in the Visual Genome data set through a statistical algorithm (the number of categories in other data sets is generally within 3000), then counting how often every two target categories co-occur using the relationship branch of the Visual Genome data set, giving the 3000 × 3000 category co-occurrence probability matrix W_cls;
2.2) selecting the 200 most frequent attribute categories in the Visual Genome data set to obtain an attribute matrix for the 3000 target categories with dimensionality 3000 × 200, then calculating the JS divergence between every two categories as their attribute similarity to obtain the 3000 × 3000 attribute similarity matrix W_att;
2.3) normalizing the category co-occurrence probability matrix and the attribute similarity matrix row-wise so that each row sums to 1; specifically, the values of each row are summed to form a denominator, and every value in the row is divided by this denominator to obtain the normalized result;
2.4) performing category replacement on the target data set, taking the MSCOCO data set as the example here: each category in MSCOCO is given an index in Visual Genome. MSCOCO has 81 categories including the background, and category names with similar meanings are substituted for one another, for example 'ball' in Visual Genome is replaced by 'sports_ball' in MSCOCO; the category information of the co-occurrence probability matrix obtained in step 2.1) and of the attribute similarity matrix obtained in step 2.2) is represented by synsets;
3) the step of extracting the characteristic of the target area by fast R-CNN:
3.1) pre-training Faster R-CNN (from GitHub) on the MSCOCO data set, extracting the image features of the training data set with the ResNet-101 in the pre-trained model, discarding the last two fully connected layers of ResNet-101, and taking the image features of the remaining last layer of ResNet-101 as the input of the next step;
3.2) feeding the features obtained in step 3.1) through an RPN to generate candidate boxes for the target regions in the image and the category of each candidate box, where the category is binary, background or foreground (the foreground being a target object), and deleting candidate boxes whose overlap rate exceeds 0.7 by non-maximum suppression;
3.3) uniformly converting the remaining candidate boxes into 14 × 2048 vectors through the RoI pooling layer in Faster R-CNN, and then feeding them into an additional CNN layer to predict the category of each region box and a refined target region box;
3.4) generating 2048-dimensional feature vectors using average pooling as input to the encoder;
4) the encoder processes the input features:
4.1) reducing the 2048-dimensional image features to 512 dimensions through a fully connected layer, then passing the 512-dimensional image features through a ReLU activation layer and a dropout layer;
4.2) converting the input image features into the three vectors Q, K, and V through three linear matrices and performing the multi-head calculation with the number of heads set to 8, which is used in the next step to calculate the similarity between the features of the target regions;
4.3) for each of the 8 heads, calculating the similarity of the Q and K vectors of all target-region features of the image; the visual similarity is calculated first, i.e. the region feature vectors are dot-multiplied pairwise and then normalized, giving the visual similarity matrix ω^A_mn;
4.4) transforming the coordinates of every pair of target boxes in the image to obtain the relative position information λ(m, n) of the two targets, where λ(m, n) encodes the positional correlation between them;
4.5) embedding the λ(m, n) obtained in step 4.4) with a sinusoid function, multiplying it by the linear transformation layer W_G, and then applying the nonlinear activation function ReLU to obtain the image position similarity ω^G_mn of target m and target n;
4.6) labeling and storing every target category detected by Faster R-CNN, finding the corresponding rows in the category co-occurrence probability matrix and the attribute similarity matrix obtained in step 2), and obtaining the semantic similarity ω^E_mn of every two categories in the image;
4.7) fusing ω^A_mn and ω^G_mn into ω^AG_mn through a softmax operation, then adding an attention coefficient a to ω^E_mn so that ω^AG_mn receives attention (1 - a); a suitable value of a is obtained through the subsequent training process, giving the pairwise category similarity w_mn that fuses visual information, position information, and external knowledge;
4.8) multiplying the similarity matrix w_mn calculated with the Q and K vectors by the V vectors to obtain weighted region features that fuse the target relationships in the image, and then concatenating the 8 V vectors;
4.9) after a residual connection and layer normalization, feeding the V vector obtained in step 4.8) into a feedforward neural network formed by two fully connected layers; after another residual connection and layer normalization, the output of the feedforward neural network is fed into the next encoder layer, and after the operations of 6 encoder layers in total the output is transmitted to the decoder;
5) the decoder processes the output from the encoder:
5.1) applying positional encoding to the word information of the ground-truth description sentence corresponding to each training picture in the training data set;
5.2) passing the position-encoded word vectors obtained in step 5.1) through Masked Multi-Head Attention to obtain weighted sentence word feature vectors, which serve as the V vectors of the multi-head self-attention in the next step of the first layer;
5.3) converting the output of the last layer of the encoder into Q and K vectors through two linear conversion layers, and then carrying out multi-head self-attention operation on the Q and K vectors and the V vectors obtained in the step 5.2) to obtain the V vectors after the similarity information is fused;
5.4) after a residual connection and layer normalization, passing the V vector obtained in step 5.3) to a feedforward neural network, and after another residual connection and layer normalization taking its output as the input of the next decoder layer;
5.5) unlike the first layer, the second decoder layer has no Masked Multi-Head Attention operation but directly performs the multi-head self-attention calculation, with the Q, K, and V vectors all obtained by transforming the output of the previous decoder layer with three linear matrices;
5.6) after the operations of 6 decoder layers in total, the output vector passes through a linear layer and a softmax layer to obtain the probability vector of the next word;
6) generating description sentences for test images:
6.1) inputting a test set image, extracting image target area characteristics from the trained model and calculating similarity;
6.2) taking the image characteristics weighted by the similarity coefficient as the input of an encoder-decoder framework, and gradually outputting the description sentence word probability of each decoded image;
6.3) decoding with beam search with the beam size set to 2, finally obtaining the evaluation-metric score of each output sentence and taking the highest-scoring sentence as the test result (a minimal beam-search sketch is given after these steps).
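A minimal sketch of the beam-search decoding of step 6.3) (beam size 2) is shown below; step_logprobs, the start and end token ids, and the maximum length are illustrative placeholders, not the exact interface of the embodiment.

import torch

def beam_search(step_logprobs, bos_id, eos_id, beam_size=2, max_len=20):
    """Hypothetical beam-search decoder: step_logprobs(seq) returns a 1-D
    tensor of log-probabilities for the next word given the partial sequence."""
    beams = [([bos_id], 0.0)]                       # (partial sentence, accumulated log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:                   # finished sentences are set aside
                finished.append((seq, score))
                continue
            logp = step_logprobs(seq)               # log-probabilities over the vocabulary
            top_logp, top_idx = torch.topk(logp, beam_size)
            for lp, idx in zip(top_logp.tolist(), top_idx.tolist()):
                candidates.append((seq + [idx], score + lp))
        if not candidates:
            break
        # keep only the beam_size best partial sentences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    finished.extend(beams)
    # return the highest-scoring sentence, as in step 6.3)
    return max(finished, key=lambda c: c[1])[0]

# toy usage: a dummy predictor over a 5-word vocabulary with eos id 2
dummy = lambda seq: torch.log_softmax(torch.tensor([0.1, 0.2, 0.3, 2.0, 0.0]), dim=0)
sentence = beam_search(dummy, bos_id=0, eos_id=2, beam_size=2, max_len=5)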
The external-knowledge acquisition stage of this example relies mainly on the Visual Genome data set; human common sense can be regarded as a linguistic form of explicit knowledge. The most representative explicit forms of knowledge are attribute knowledge, such as 'an apple is red', and pairwise relationship knowledge, such as 'riding a bicycle'. The first step in this example is to write a python script to download the attribute branch and the target-relationship branch from the official Visual Genome data set branches; the attribute annotations usually take a form such as 'a red round apple'. A statistical script extracts the 200 most frequent attributes, giving a 3000 × 200 category-attribute matrix, and the similarity between every two 1 × 200 attribute vectors is then computed with the pairwise JS divergence. Unlike the KL divergence adopted by most conventional methods, which does not yield a symmetric matrix, the JS divergence gives a symmetric 3000 × 3000 measurement matrix of attribute similarity that represents very high-order abstract semantic information. At the same time, according to this attribute similarity matrix, objects of the same kind are certainly more similar to each other than to other objects, so embedding this similarity information into the model can improve its accuracy; the attribute similarity matrix obtained in this step is the W_att defined above.
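A small numpy sketch of this attribute-similarity computation, following steps 2.2) and 2.3), is given below; it assumes that a class-by-attribute count matrix has already been built from the Visual Genome attribute annotations, and all variable names are illustrative.

import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions (symmetric)."""
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def attribute_similarity_matrix(attr_counts):
    """attr_counts: (num_classes, num_attrs) attribute occurrence counts per class.
    Following steps 2.2)-2.3): pairwise JS divergence as the attribute similarity,
    then row-normalization so that each row sums to 1."""
    n = attr_counts.shape[0]
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            sim[i, j] = sim[j, i] = js_divergence(attr_counts[i], attr_counts[j])
    return sim / (sim.sum(axis=1, keepdims=True) + 1e-12)

# toy usage: 4 classes described by 5 attributes
W_att = attribute_similarity_matrix(np.random.rand(4, 5))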
In addition to the attribute similarity information, there is also more direct semantic similarity information, namely the category co-occurrence probability matrix, which directly represents the probability that two objects appear together in an image; for example, according to human common sense, a person and a vehicle are more likely to appear in the same image than a horse and a vehicle. In this example, the pairwise co-occurrence probabilities of the 3000 most frequent classes are counted from the Visual Genome data set, and this information can be merged in when the encoder computes the Q-K similarity of different categories. In this embodiment, the attribute similarity and the category co-occurrence probability similarity are averaged and then multiplied by an attention coefficient a; the model learns the optimal value of a during training, so that the proportions of the visual, position, and semantic similarities between different targets can be allocated reasonably. The co-occurrence probability similarity matrix obtained in this step is the W_cls defined above.
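A corresponding sketch for the category co-occurrence probability matrix of steps 2.1) and 2.3) is given below; it assumes that per-image lists of object class indices have already been extracted from the Visual Genome annotations, and the counting scheme shown is one reasonable reading of the pairwise co-occurrence statistic, not necessarily the patent's exact procedure.

import numpy as np

def cooccurrence_matrix(image_class_lists, num_classes=3000):
    """image_class_lists: for each image, the list of annotated class indices.
    Returns the row-normalized class co-occurrence probability matrix W_cls,
    as in steps 2.1) and 2.3)."""
    counts = np.zeros((num_classes, num_classes))
    for classes in image_class_lists:
        unique = sorted(set(classes))
        for a in unique:                      # count every ordered pair appearing together
            for b in unique:
                if a != b:
                    counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(row_sums, 1)   # each row sums to 1 (or stays 0 if unseen)

# toy usage: three "images" over a 6-class vocabulary
W_cls = cooccurrence_matrix([[0, 2, 3], [0, 2], [1, 4, 5]], num_classes=6)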
After the semantic information representing human common sense is obtained, this example replaces the categories of the different data sets. Because the attribute similarity matrix and the category co-occurrence probability matrix are both indexed by the synsets of the category labels of the Visual Genome data set, index replacement is performed for the different data sets, and categories with similar meanings in other data sets are assigned the synset of the same category, so that the external knowledge of this example can be applied in a targeted way to data sets of different domains.
When the encoder processes the input features from Faster R-CNN, the key first step in this example is to convert the input 512-dimensional features X into the three vectors Q, K, and V with the three transformation matrices W_Q, W_K, W_V, as shown in formula (1):
Q = W_Q X,  K = W_K X,  V = W_V X   (1)
The input X is a 512-dimensional vector and Q, K, V are 64-dimensional vectors, so the three transformation matrices have dimensions 512 × 64. Then, in the stage of calculating the Q-K similarity of the features of each region, the conventional method is to dot-multiply the two region feature vectors and scale by the dimension d_k = 64, as shown in formula (2):

ω^A_mn = (W_Q x_m) · (W_K x_n) / √d_k   (2),
where Ω_A is an N × N matrix whose elements ω^A_mn are the relation coefficients between target m and target n; for example, if there are 50 targets in an image, N is 50. The key of the next problem is therefore to calculate the relation coefficients between different categories. The visual similarity of two categories is calculated with formula (2); next comes the calculation of the position similarity. The position information obtained by Faster R-CNN is (x, y, w, h), where the four values represent the center coordinates (x, y), the width w, and the height h of the target box. The position coordinates of each pair of boxes m and n are then transformed according to formula (3):

λ(m, n) = ( log(|x_m - x_n| / w_m),  log(|y_m - y_n| / h_m),  log(w_n / w_m),  log(h_n / h_m) )   (3)

λ(m, n) is then embedded into 64 dimensions, multiplied by a linear matrix W_G, and finally activated by the nonlinear function ReLU, as shown in formula (4), giving the image position similarity of target m and target n:

ω^G_mn = ReLU( Emb(λ(m, n)) W_G )   (4)
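The following sketch illustrates the position-similarity computation of formulas (3) and (4); the 4-dimensional form of λ(m, n) and the sinusoid embedding scheme are assumptions reconstructed from the surrounding description (the relative-geometry feature commonly used with this kind of relation module), and W_G is modeled as a learned 64 × 1 linear layer.

import torch
import torch.nn.functional as F

def relative_geometry(box_m, box_n, eps=1e-6):
    """Boxes given as (x, y, w, h) with (x, y) the centre, as described above.
    Returns the 4-d relative position feature lambda(m, n) of formula (3)."""
    xm, ym, wm, hm = box_m
    xn, yn, wn, hn = box_n
    return torch.log(torch.tensor([abs(xm - xn) / wm + eps,
                                   abs(ym - yn) / hm + eps,
                                   wn / wm, hn / hm]))

def sinusoid_embed(feat, dim=64, wave=1000.0):
    """Embed each scalar of the 4-d geometry feature with sines and cosines
    and concatenate, giving a dim-dimensional vector (dim divisible by 8)."""
    steps = torch.arange(dim // 8, dtype=torch.float32)
    freqs = wave ** (8.0 * steps / dim)
    angles = feat.unsqueeze(-1) / freqs                               # (4, dim//8)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten()  # (dim,)

# omega^G_mn = ReLU(Emb(lambda(m, n)) W_G), formula (4); W_G is a learned 64 x 1 layer
W_G = torch.nn.Linear(64, 1, bias=False)
lam = relative_geometry((10.0, 12.0, 5.0, 8.0), (14.0, 9.0, 6.0, 7.0))
omega_G = F.relu(W_G(sinusoid_embed(lam)))   # scalar position similarity for the pair (m, n)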
Next comes the semantic similarity calculation stage. Taking the MSCOCO data set as an example, its total number of classes is 81, so the co-occurrence probability matrix W_cls and the attribute similarity matrix W_att extracted from the external knowledge have size 81 × 81. Suppose the object class detected by Faster R-CNN is m and the other object classes in the image are n1 to n30; the rows corresponding to these classes are found in the two external-knowledge matrices, and the co-occurrence probability value ω^cls_mn and the attribute similarity value ω^att_mn corresponding to each of the other classes are read out. They are then averaged as shown in formula (5):

ω^E_mn = (ω^cls_mn + ω^att_mn) / 2   (5)
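A short sketch of this semantic-similarity lookup and the averaging of formula (5) is shown below; it assumes that W_cls and W_att have already been restricted to the categories of the experimental data set, and the variable names are illustrative.

import numpy as np

def semantic_similarity(W_cls, W_att, class_ids):
    """class_ids: category indices of the targets detected in one image.
    Looks up the co-occurrence and attribute rows for every pair and averages
    them as in formula (5), giving the matrix of omega^E_mn for this image."""
    idx = np.asarray(class_ids)
    omega_cls = W_cls[np.ix_(idx, idx)]     # co-occurrence probabilities for detected pairs
    omega_att = W_att[np.ix_(idx, idx)]     # attribute similarities for detected pairs
    return 0.5 * (omega_cls + omega_att)    # omega^E_mn

# toy usage with random 81 x 81 knowledge matrices and 4 detected targets
omega_E = semantic_similarity(np.random.rand(81, 81), np.random.rand(81, 81), [3, 17, 17, 60])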
The multi-level relation information between the targets is then integrated. First, the visual similarity ω^A_mn and the position similarity ω^G_mn are integrated by a softmax operation into ω^AG_mn, with the specific operation shown in formula (6):

ω^AG_mn = ω^G_mn · exp(ω^A_mn) / Σ_k ω^G_mk · exp(ω^A_mk)   (6)

Then an attention coefficient a is added to ω^E_mn so that ω^AG_mn receives attention (1 - a); a suitable value of a is obtained through the subsequent training process, giving the pairwise category similarity w_mn that fuses visual information, position information, and external knowledge, as shown in formula (7):

w_mn = a · ω^E_mn + (1 - a) · ω^AG_mn   (7)
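The sketch below shows one attention head that fuses the three similarities according to formulas (6) and (7) and then weights the V vectors; treating the attention coefficient a as a sigmoid-bounded learnable scalar is an implementation assumption, not something stated in the patent.

import torch
import torch.nn as nn

class RelationFusedAttention(nn.Module):
    """One attention head that fuses visual (omega_A), position (omega_G) and
    semantic (omega_E) similarities as in formulas (6)-(7), then weights V."""
    def __init__(self):
        super().__init__()
        # learnable attention coefficient a, kept in (0, 1) via a sigmoid
        self.a_logit = nn.Parameter(torch.zeros(1))

    def forward(self, omega_A, omega_G, omega_E, V):
        # formula (6): softmax over the visual scores, weighted by the position scores
        omega_AG = omega_G * torch.exp(omega_A)
        omega_AG = omega_AG / (omega_AG.sum(dim=-1, keepdim=True) + 1e-12)
        a = torch.sigmoid(self.a_logit)
        # formula (7): external knowledge gets weight a, visual + position gets (1 - a)
        w = a * omega_E + (1.0 - a) * omega_AG
        return w @ V                         # weighted region features, one head

# toy usage: 5 detected regions, 64-d value vectors
N, d = 5, 64
head = RelationFusedAttention()
out = head(torch.rand(N, N), torch.rand(N, N), torch.rand(N, N), torch.rand(N, d))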
the method is characterized in that the similarity information of three relationships of vision, position and semantics is fused to measure the relationship between different targets, and then the similarity information is multiplied by the V vector of each target.
After each encoder layer obtains its V vectors, the V vectors of the 8 heads are concatenated, and after a residual connection and layer normalization the resulting vector is transmitted to the next encoder layer. The idea of the residual connection follows the classical ResNet: residual connections help parameter propagation and increase the training speed. The number of encoder layers is set to six, and each layer takes the output of the previous layer as input, that is, the V vector output by the previous layer is used as the X vector of the next self-attention operation. The overall multi-head self-attention operation is shown in formula (8):

MultiHead(X) = Concat(head_1, …, head_8),  head_i = Ω^(i) V^(i)   (8)

where Ω^(i) = [w_mn] is the fused similarity matrix of the i-th head and V^(i) is that head's value matrix.
After the V vector output by each layer is obtained, the feedforward neural network layer is computed element-wise, with the specific formula shown in (9):

FFN(x) = max(0, xW_1 + b_1) W_2 + b_2   (9),

where W_1, W_2, b_1, b_2 are the weight parameters and biases of the two fully connected layers. When the computation and propagation through the 6 layers are finished, the output of the last encoder layer is used as the input of the first decoder layer.
For the data-processing stage of the decoder, every word of the real description sentence corresponding to each picture in the training set is encoded into vector form. Step 1) first builds the word dictionary of the data set and encodes each word as a one-hot vector; such a representation is too high-dimensional to handle, so word embedding is required. Step 2) embeds the high-dimensional word vectors into low-dimensional ones: the words are embedded into 512-dimensional vectors with the word2vec method. Because the words in this example are generated one by one, the information of the third word cannot be known in advance when the second word is generated; therefore, unlike the multi-head self-attention of the encoder, Masked Multi-Head Attention is adopted in the first decoder layer, and the word information after each training time step is masked with an upper-triangular matrix, as shown in FIG. 2. Step 3) takes the word vectors after Masked Multi-Head Attention as the V vectors; the output of the last encoder layer is converted into Q and K vectors by linear transformations, the Q-K similarity is computed in the conventional way with formula (2) and multiplied by the V vectors, and after a residual connection and layer normalization the result is passed to the feedforward neural network, whose output becomes the input of the next decoder layer. Step 4) lets the second decoder layer take the output of the first layer, omit the Masked Multi-Head Attention of the first layer, and directly perform multi-head self-attention following formula (2): the output of the previous layer is taken as X, linearly transformed, the similarity coefficients are computed and multiplied by the V vectors, and the V vector formed by concatenating the 8 heads is the input of the next layer. The V vector output by the last layer passes through a linear layer and a softmax layer to output the probability vector of the next word, as shown in FIG. 2.
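The sketch below illustrates the upper-triangular masking used by the Masked Multi-Head Attention of the first decoder layer; the interface is illustrative and shows only the masking of the raw Q-K scores.

import torch

def causal_mask(seq_len):
    """Upper-triangular mask: position t may only attend to words 1..t,
    so information about later words is blocked during training."""
    # True above the diagonal marks the forbidden (future) positions
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

def masked_attention_scores(scores):
    """scores: (seq_len, seq_len) raw Q-K similarities for one head.
    Future positions are set to -inf before the softmax, as in FIG. 2."""
    mask = causal_mask(scores.size(0))
    return torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)

# toy usage: a 4-word ground-truth sentence
attn = masked_attention_scores(torch.rand(4, 4))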
In the testing stage, the test is conducted on the test-set pictures of the classic Karpathy split of MSCOCO and their description sentences. In the training stage the model is trained for 60 epochs to obtain the final model and the best model, which are stored as checkpoints; the test metrics are BLEU-1, BLEU-2, BLEU-3, BLEU-4, METEOR, ROUGE-L, and SPICE. In the testing stage a brand-new image is given, without the guidance of ground-truth description words, and the model generates a description sentence by itself. The parameters obtained by training play the key role here: the trained model is equivalent to an experienced image processor that has learned the relationships within a large number of images, can reasonably model the relationships between image targets and embed them into the classical Transformer framework, and, with the help of residual connections, layer normalization, and the feedforward neural network, obtains a relation representation of all targets in each image, which strongly guides the decoder when it decodes the sentence word vectors.
In addition, a positional encoding operation is required in the first layer at both the encoder end and the decoder end. For the encoder, image regions carrying two-dimensional position information are encoded; because an image has the two dimensions of height and width, the encoding differs from the conventional Transformer positional encoding. After encoding, the two-dimensional image-region features can be transformed into a one-dimensional representation similar to a sentence sequence. A sinusoid function is used for the encoding, as shown in formula (10):

PE(pos, 2i) = sin(pos / 10000^(2i/d)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d))   (10)

In formula (10), pos represents the position of the image region and i represents the dimension, with each dimension corresponding to a sinusoid, so that the two-dimensional image can be represented in one-dimensional sequence form. The positional encoding at the decoder end is slightly different: the decoder input is already a one-dimensional word sequence, so pos at the decoder end is the actual position of the word in the whole sentence.
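A short sketch of the sinusoid positional encoding of formula (10) is given below; flattening the image regions row by row into a one-dimensional sequence is one illustrative choice for the encoder side.

import torch

def sinusoid_position_encoding(num_positions, d_model=512):
    """Formula (10): PE(pos, 2i) = sin(pos / 10000^(2i/d)),
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d))."""
    pos = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)   # (P, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)                  # even dimensions
    angle = pos / torch.pow(torch.tensor(10000.0), i / d_model)           # (P, d/2)
    pe = torch.zeros(num_positions, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

# encoder side: image regions flattened (e.g. row by row) into a 1-D sequence of length P;
# decoder side: pos is simply the word's position in the sentence
pe = sinusoid_position_encoding(num_positions=50)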
In the concrete implementation stage, the experiments are carried out on the PyTorch platform; training and testing of the model are completed on an Ubuntu 16.04 system with an NVIDIA 1060 Ti graphics card. The model parameters are set as follows: d is set to 512 dimensions, and each original input image is extracted into 1024-dimensional feature vectors by the ResNet in Faster R-CNN before being fed to the encoder of the model. In the encoder, the visual similarity, position similarity, and semantic similarity coefficients are all set to 64 dimensions. The batch size during training is set to 10; 30 epochs are trained with the traditional cross-entropy objective, followed by reinforcement learning training. The loss function in the cross-entropy training mode is shown in formula (11):
L_XE(θ) = - Σ_{t=1..T} log p_θ(y*_t | y*_1, …, y*_{t-1})   (11)
the goal of the cross-entropy training phase is to minimize the penalty function, i.e., to make the probability p in the above formula as close to 1 as possible, the meaning of the probability p in the formula being the probability of generating the next group struc word from the t-1 words previously generated. After 30 epochs are trained in a cross entropy mode, a reinforcement learning phase is started, namely, a sentence generation is regarded as a reinforcement learning problem by adopting a sampling method, the training aim is to maximize an incentive expectation function, and a formula is shown as (12):
R(θ) = E_{y_1:T ~ p_θ}[ r(y_1:T) ]   (12)
where θ denotes the parameters of the model. After 30 epochs of reinforcement learning, the model can be tested and achieves its best results.
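The sketch below shows the two training objectives in simplified form: the cross-entropy loss of formula (11) and a REINFORCE-style estimate of the expected reward of formula (12); the reward values and the greedy baseline are placeholders, and using a self-critical baseline is a common choice assumed here rather than taken from the patent.

import torch
import torch.nn.functional as F

def xe_loss(logits, targets, pad_id=0):
    """Formula (11): negative log-likelihood of each ground-truth word given
    the preceding words. logits: (T, vocab), targets: (T,)."""
    return F.cross_entropy(logits, targets, ignore_index=pad_id)

def rl_loss(sample_logprobs, sample_reward, baseline_reward):
    """Formula (12) in REINFORCE form: maximize the expected reward of sampled
    sentences. sample_logprobs: (T,) log-probs of the sampled words; the baseline
    (e.g. the reward of a greedily decoded sentence) reduces variance, a common
    self-critical choice assumed here rather than taken from the patent."""
    advantage = sample_reward - baseline_reward
    return -(advantage * sample_logprobs.sum())

# toy usage
T, V = 6, 100
loss_xe = xe_loss(torch.randn(T, V), torch.randint(1, V, (T,)))
sampled = torch.randint(0, V, (T,))
logprobs = torch.log_softmax(torch.randn(T, V), dim=-1)[torch.arange(T), sampled]
loss_rl = rl_loss(logprobs, sample_reward=0.8, baseline_reward=0.6)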

Claims (1)

1. An image description generation method based on external knowledge and the relationships between targets, characterized by comprising the following steps:
1) classifying the data set: dividing the data set into two main classes, wherein the first main class is a knowledge data set for extracting external knowledge; the second large class is an experimental data set and is divided into 3 subclasses, namely a training data set, a verification data set and a test data set;
2) external semantic knowledge extraction stage:
2.1) obtaining the 3000 most frequent categories in the knowledge data set through a statistical algorithm, and then counting how often every two target categories co-occur to obtain a 3000 × 3000 category co-occurrence probability matrix;
2.2) selecting the 200 most frequent attribute categories in the knowledge data set to obtain an attribute matrix for the 3000 target categories with dimensionality 3000 × 200, and then calculating the JS divergence between every two categories as their attribute similarity to obtain a 3000 × 3000 attribute similarity matrix;
2.3) normalizing the category co-occurrence probability matrix and the attribute similarity matrix row-wise, i.e. so that each row sums to 1;
2.4) performing category replacement on the experimental data sets, i.e. each category in an experimental data set is given an index in the knowledge data set, where the category information of the co-occurrence probability matrix obtained in step 2.1) and of the attribute similarity matrix obtained in step 2.2) is represented by synsets;
3) extracting target-region features with Faster R-CNN:
3.1) pre-training Faster R-CNN (from GitHub) on the training data set, extracting the image features of the training data set with the ResNet-101 in the pre-trained model, discarding the last two fully connected layers of ResNet-101, and taking the image features from the penultimate layer of ResNet-101 as the input of the next step;
3.2) feeding the image features obtained in step 3.1) through an RPN to generate candidate boxes for the target regions in the image and the category of each candidate box, where the category is binary, background or foreground (the foreground being a target object), and deleting candidate boxes whose overlap rate exceeds 0.7 by non-maximum suppression;
3.3) uniformly converting the remaining candidate boxes into 14 × 2048 vectors through the RoI pooling layer in Faster R-CNN, and then feeding them into an additional CNN layer to predict the category of each region box and a refined target region box;
3.4) generating 2048-dimensional feature vectors using average pooling as input to the encoder;
4) the encoder processes the input features:
4.1) reducing the 2048-dimensional image features to 512 dimensions through a fully connected layer, then passing the 512-dimensional image features through a ReLU activation layer and a dropout layer;
4.2) converting the input image features into the three vectors Q, K, and V through three linear matrices and performing the multi-head calculation with the number of heads set to 8;
4.3) for each of the 8 heads, calculating the visual similarity ω^A_mn with the Q-K vector similarity calculation of the traditional Transformer model;
4.4) transforming the coordinates of every pair of target boxes in the image to obtain the relative position information λ(m, n) of the two targets, where λ(m, n) encodes the positional correlation between them;
4.5) embedding the λ(m, n) obtained in step 4.4) with a sinusoid function, multiplying it by the linear transformation layer W_G, and then applying the nonlinear activation function ReLU to obtain the image position similarity ω^G_mn of target m and target n;
4.6) labeling and storing every target category detected by Faster R-CNN, finding the corresponding rows in the category co-occurrence probability matrix and the attribute similarity matrix obtained in step 2), and obtaining the semantic similarity ω^E_mn of every two categories in the image;
4.7) fusing ω^A_mn and ω^G_mn into ω^AG_mn through a softmax operation, then adding an attention coefficient a to ω^E_mn so that ω^AG_mn receives attention (1 - a); a suitable value of a is obtained through the subsequent training process, giving the pairwise category similarity w_mn that fuses visual information, position information, and external knowledge;
4.8) multiplying the similarity matrix w_mn calculated with the Q and K vectors by the V vectors to obtain weighted region features that fuse the target relationships in the image, and then concatenating the 8 V vectors;
4.9) after a residual connection and layer normalization, feeding the V vector obtained in step 4.8) into a feedforward neural network formed by two fully connected layers; after another residual connection and layer normalization, the output of the feedforward neural network is fed into the next encoder layer, and after the operations of 6 encoder layers in total the output is transmitted to the decoder;
5) the decoder processes the output from the encoder:
5.1) applying positional encoding to the word information of the ground-truth description sentence corresponding to each training picture in the training data set;
5.2) passing the position-encoded word vectors obtained in step 5.1) through Masked Multi-Head Attention to obtain weighted sentence word feature vectors, which serve as the V vectors of the multi-head self-attention in the next step of the first layer;
5.3) converting the output of the last layer of the encoder into Q and K vectors through two linear conversion layers, and then carrying out multi-head self-attention operation on the Q and K vectors and the V vectors obtained in the step 5.2) to obtain the V vectors after the similarity information is fused;
5.4) after a residual connection and layer normalization, passing the V vector obtained in step 5.3) to a feedforward neural network, and after another residual connection and layer normalization taking its output as the input of the next decoder layer;
5.5) unlike the first layer, the second decoder layer has no Masked Multi-Head Attention operation but directly performs the multi-head self-attention calculation, with the Q, K, and V vectors all obtained by transforming the output of the previous decoder layer with three linear matrices;
5.6) after the operations of 6 decoder layers in total, the output vector passes through a linear layer and a softmax layer to obtain the probability vector of the next word;
6) generating description sentences for test images:
6.1) inputting a test set image, extracting image target area characteristics from the trained model and calculating similarity;
6.2) taking the image characteristics weighted by the similarity coefficient as the input of an encoder-decoder framework, and gradually outputting the description sentence word probability of each decoded image;
6.3) decoding with beam search with the beam size set to 2, finally obtaining the evaluation-metric score of each output sentence and taking the highest-scoring sentence as the test result.
CN202110982666.1A 2021-08-25 2021-08-25 Image description generation method based on relationship between external knowledge and target Active CN113609326B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110982666.1A CN113609326B (en) 2021-08-25 2021-08-25 Image description generation method based on relationship between external knowledge and target

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110982666.1A CN113609326B (en) 2021-08-25 2021-08-25 Image description generation method based on relationship between external knowledge and target

Publications (2)

Publication Number Publication Date
CN113609326A (en) 2021-11-05
CN113609326B CN113609326B (en) 2023-04-28

Family

ID=78341994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110982666.1A Active CN113609326B (en) 2021-08-25 2021-08-25 Image description generation method based on relationship between external knowledge and target

Country Status (1)

Country Link
CN (1) CN113609326B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190303458A1 (en) * 2018-04-02 2019-10-03 International Business Machines Corporation Juxtaposing contextually similar cross-generation images
CN111160467A (en) * 2019-05-31 2020-05-15 北京理工大学 Image description method based on conditional random field and internal semantic attention
CN112784848A (en) * 2021-02-04 2021-05-11 东北大学 Image description generation method based on multiple attention mechanisms and external knowledge
CN113298151A (en) * 2021-05-26 2021-08-24 中国电子科技集团公司第五十四研究所 Remote sensing image semantic description method based on multi-level feature fusion
CN113220891A (en) * 2021-06-15 2021-08-06 北京邮电大学 Unsupervised concept-to-sentence based generation confrontation network image description algorithm

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114417046A (en) * 2022-03-31 2022-04-29 腾讯科技(深圳)有限公司 Training method of feature extraction model, image retrieval method, device and equipment
CN116012685A (en) * 2022-12-20 2023-04-25 中国科学院空天信息创新研究院 Image description generation method based on fusion of relation sequence and visual sequence
CN116012685B (en) * 2022-12-20 2023-06-16 中国科学院空天信息创新研究院 Image description generation method based on fusion of relation sequence and visual sequence

Also Published As

Publication number Publication date
CN113609326B (en) 2023-04-28

Similar Documents

Publication Publication Date Title
Dong et al. Predicting visual features from text for image and video caption retrieval
Zhang et al. Image captioning with transformer and knowledge graph
Karpathy et al. Deep visual-semantic alignments for generating image descriptions
CN112990296B (en) Image-text matching model compression and acceleration method and system based on orthogonal similarity distillation
Li et al. Recurrent attention and semantic gate for remote sensing image captioning
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN110516530A (en) A kind of Image Description Methods based on the enhancing of non-alignment multiple view feature
CN113094484A (en) Text visual question-answering implementation method based on heterogeneous graph neural network
CN112527993B (en) Cross-media hierarchical deep video question-answer reasoning framework
CN110991290A (en) Video description method based on semantic guidance and memory mechanism
CN113609326B (en) Image description generation method based on relationship between external knowledge and target
CN113297370A (en) End-to-end multi-modal question-answering method and system based on multi-interaction attention
CN113743099A (en) Self-attention mechanism-based term extraction system, method, medium and terminal
Puscasiu et al. Automated image captioning
CN114254645A (en) Artificial intelligence auxiliary writing system
CN116029305A (en) Chinese attribute-level emotion analysis method, system, equipment and medium based on multitask learning
CN115062174A (en) End-to-end image subtitle generating method based on semantic prototype tree
Deng et al. A position-aware transformer for image captioning
CN117437317A (en) Image generation method, apparatus, electronic device, storage medium, and program product
CN114661874B (en) Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels
CN116434058A (en) Image description generation method and system based on visual text alignment
CN116579347A (en) Comment text emotion analysis method, system, equipment and medium based on dynamic semantic feature fusion
CN116127954A (en) Dictionary-based new work specialized Chinese knowledge concept extraction method
CN113642630A (en) Image description method and system based on dual-path characteristic encoder
Alabduljabbar et al. Image Captioning based on Feature Refinement and Reflective Decoding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant