CN113609326B - Image description generation method based on external knowledge and inter-target relationships


Info

Publication number
CN113609326B
Authority
CN
China
Prior art keywords
image
layer
similarity
data set
vectors
Prior art date
Legal status
Active
Application number
CN202110982666.1A
Other languages
Chinese (zh)
Other versions
CN113609326A (en)
Inventor
李志欣
陈天宇
张灿龙
Current Assignee
Guangxi Normal University
Original Assignee
Guangxi Normal University
Priority date
Filing date
Publication date
Application filed by Guangxi Normal University
Priority to CN202110982666.1A
Publication of CN113609326A
Application granted
Publication of CN113609326B
Legal status: Active


Classifications

    • G06F16/583: Information retrieval of still image data; retrieval characterised by using metadata automatically derived from the content
    • G06F18/213: Pattern recognition; feature extraction, e.g. by transforming the feature space
    • G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/22: Pattern recognition; matching criteria, e.g. proximity measures
    • G06F18/24: Pattern recognition; classification techniques
    • G06N3/044: Neural networks; recurrent networks, e.g. Hopfield networks
    • G06N3/08: Neural networks; learning methods


Abstract

The invention discloses an image description generation method based on external knowledge and inter-target relationships, which comprises the following stages: 1) data set classification; 2) external semantic knowledge extraction; 3) target region feature extraction with Faster R-CNN; 4) encoder processing of the input features; 5) decoder processing of the encoder output; 6) generation of description sentences for test images. The method fuses the visual, positional and semantic relationships among the different targets in an image; by exploiting the inter-target relationships together with common human knowledge, it mines higher-level, more abstract image features and thus generates more vivid and accurate description sentences, and computing the relationships from multiple angles makes the mining of inter-target relationships more complete and reasonable.

Description

Image description generation method based on external knowledge and inter-target relationships
Technical Field
The invention relates to the technical field of image description generation, and in particular to an image description generation method based on external knowledge and inter-target relationships.
Background
With the spread of networks and digital devices, media image data of all kinds are growing rapidly, and automatic image description generation has very broad application prospects, such as early childhood education and assistance for the visually impaired. The task spans the two fields of computer vision and natural language processing and therefore has very important research significance.
Image description generation has been a very active research area since the 1960s. Early and relatively widely used techniques mainly included retrieval-based methods and template-based methods. Retrieval-based methods, given an image, find similar images and their description sentences in an existing image library; their drawback is obvious, namely poor robustness to new images. Template-based methods split a sentence into slots such as subject, predicate and object and fill them according to the image content; their main drawback is that the generated sentences are rigid and inflexible.
With the later rapid development of deep learning, many innovative methods applying deep learning to image description generation have emerged. Inspired by natural language processing, Li Feifei et al. in 2015 applied the encoder-decoder model of the NLP field to image description generation and proposed the NIC model, which encodes the image information into a fixed-length vector that is then passed to a decoder to generate words one by one. However, not all of the image information is needed when generating every word. To better refine the image features used for sentence generation, attention mechanisms were fused into the NIC model so that the model attends to the most useful part of the image when generating different words; this not only greatly improved model performance but also set off a wave of work integrating attention mechanisms into image description. Many later works are improvements on this basis, such as the semantic attention mechanism, which treats the semantic information in the image as the object of attention allocation. To overcome the defect that earlier attention mechanisms mainly distributed attention weights over uniform image blocks, the combined bottom-up and top-down attention mechanism was later proposed. Its core innovation is to use a target detection module such as Faster R-CNN to extract target-region features: attention is distributed over the target features in the bottom-up mechanism and over uniform blocks of the whole image in the top-down mechanism, and the decoder uses a two-layer long short-term memory unit to combine the hidden states produced by the two mechanisms. This is generally regarded as a further milestone in image description generation.
Later, as the Transformer model became popular in the NLP field, many Transformer-based image description methods were developed in succession and showed better performance than most conventional methods. Compared with the Transformer for natural language processing, these methods modify the input position encoding and the attention modules in the encoder so as to better fit image input.
However, existing methods cannot integrate the abstract, high-level feature of inter-target relationships in an image into the attention mechanism. According to human common sense, the relationships among the targets in an image also carry important information; for example, when an image contains a football, people are very likely to appear around it. How to exploit such semantic information, which has important guiding significance for sentence generation, is a problem well worth studying.
Disclosure of Invention
Aiming at the defect that conventional image description generation methods cannot effectively exploit the semantic relationships among image targets, the invention provides an image description generation method based on external knowledge and inter-target relationships. The method fuses the visual, positional and semantic relationships among the different targets in an image; by exploiting the inter-target relationships together with common human knowledge, it mines higher-level, more abstract image features and thus generates more vivid and accurate description sentences, and computing the relationships from multiple angles makes the mining of inter-target relationships more complete and reasonable.
The technical solution for achieving the aim of the invention is as follows:
The image description generation method based on external knowledge and inter-target relationships comprises the following steps:
1) Data set classification: dividing the data sets into two main categories, the first being a knowledge data set used to extract external knowledge, and the second being an experimental data set, which is further divided into 3 subsets, namely a training set, a validation set and a test set;
2) External semantic knowledge extraction stage:
2.1) Obtaining the 3000 most frequent categories in the knowledge data set with a statistical algorithm, then counting how often every pair of target categories co-occurs, yielding a 3000 × 3000 category co-occurrence probability matrix;
2.2) Selecting the 200 most frequent attribute categories in the knowledge data set to obtain an attribute matrix for the 3000 target categories, of dimension 3000 × 200, then computing the JS divergence between every two categories as their attribute similarity, yielding a 3000 × 3000 attribute similarity matrix;
2.3) Normalizing the category co-occurrence probability matrix and the attribute similarity matrix row-wise, i.e. so that each row sums to 1;
2.4) Performing category replacement on the experimental data sets, i.e. assigning each category of each experimental data set its number in the knowledge data set, where the category information of the category co-occurrence probability matrix obtained in step 2.1) and of the attribute similarity matrix obtained in step 2.2) is represented by synsets;
3) Target region feature extraction with Faster R-CNN:
3.1) Pre-training on the training split of the experimental data set with the publicly available Faster R-CNN implementation on GitHub, extracting the image features of the training data set with the ResNet-101 of the pre-trained model, discarding the last two fully connected layers of ResNet-101 and taking the image features of its last remaining layers as input to the next step;
3.2) Feeding the features obtained in step 3.1) to the RPN (region proposal network) to generate candidate boxes of the target regions in the image together with the category information of each candidate box, where each candidate box is classified as background or foreground (the foreground being a target object), and deleting candidate boxes whose overlap rate exceeds 0.7 by non-maximum suppression;
3.3) Uniformly converting the remaining candidate boxes into 14 × 14 × 2048 feature maps through the RoI pooling layer of Faster R-CNN, and then feeding them into an additional CNN layer to predict the category of each region box and refine the target region boxes;
3.4) Using average pooling to generate 2048-dimensional feature vectors as input to the encoder;
4) Encoder processing of the input features:
4.1) First reducing the 2048-dimensional image features to 512 dimensions with a fully connected layer, and passing the 512-dimensional image features through a ReLU activation layer and a dropout layer;
4.2) Converting the input image features into the three vectors Q, K and V through three linear matrices and performing multi-head computation, with the number of heads set to 8, for computing the similarity of the target-region features in the next step;
4.3) For each of the 8 heads, computing the visual similarity w^A_mn with the Q, K vector similarity computation of the conventional Transformer model;
4.4) Transforming the coordinates of every pair of targets in the image to obtain their relative position information λ(m, n), which encodes the positional correlation between the two targets;
4.5) Embedding the λ(m, n) obtained in step 4.4) with a sinusoid function, multiplying the embedded λ(m, n) by a linear transformation layer W_G, and then passing the result through the nonlinear activation function ReLU to obtain the image position similarity w^G_mn of target m and target n;
4.6) Storing each target category label detected by Faster R-CNN, finding the corresponding rows in the category co-occurrence probability matrix and attribute similarity matrix obtained in step 2), and thereby obtaining the semantic similarity w^R_mn of every pair of categories in the image;
4.7) Fusing w^A_mn and w^G_mn into w^AG_mn through a softmax operation, then introducing an attention coefficient a so that w^R_mn receives attention weight a and w^AG_mn receives (1 - a); a suitable value of a is learned in the subsequent training process, yielding the pairwise similarity w_mn that fuses the visual information, the position information and the external knowledge;
4.8) Multiplying the similarity matrix w_mn computed from the Q, K vectors by the V vectors to obtain weighted region features that incorporate the inter-target relationships of the image, and then concatenating the V vectors of the 8 heads;
4.9) Passing the V vector obtained in step 4.8), after a residual connection and normalization, through a feedforward network composed of two fully connected layers, then passing the output of the feedforward network, after another residual connection and normalization, to the next encoder layer; after 6 encoder layers in total, the output is passed to the decoder;
5) Decoder processing of the encoder output:
5.1) First position-encoding the word information of the ground-truth description sentence corresponding to each training picture in the training data set;
5.2) Passing the position-encoded word vectors of step 5.1) through Masked Multi-Head Attention to obtain weighted sentence word feature vectors, which serve as the V vectors of the multi-head self-attention in the next step of the first layer;
5.3) Converting the output of the last encoder layer into Q and K vectors through two linear transformation layers, then performing multi-head self-attention with the V vectors obtained in step 5.2) to obtain V vectors fused with the similarity information;
5.4) Passing the V vectors obtained in step 5.3), after a residual connection and normalization, to a feedforward network, and taking the output, after another residual connection and normalization, as the input of the next decoder layer;
5.5) The second decoder layer does not start with the Masked Multi-Head Attention of the first layer but directly performs the multi-head self-attention computation, with the Q, K and V vectors all obtained from the output of the previous decoder layer through three linear matrix transformations;
5.6) After 6 decoder layers in total, passing the output vector through a linear layer and a softmax layer to obtain the probability vector of the next word;
6) Test image description sentence stage:
6.1) Inputting a test-set image, extracting its target-region features with the trained model and computing the similarities;
6.2) Taking the image features weighted by the similarity coefficients as input to the encoder-decoder framework and outputting, step by step, the word probabilities of each decoded image description sentence;
6.3) Using beam search with a beam size of 2, finally obtaining the evaluation metric score of each output sentence and taking the highest score as the test result.
For each data set the number of training images far exceeds the number of test images. After each training epoch, a temporary validation is run on the validation-set images, the result is recorded and a checkpoint is saved, so that if training is interrupted it can resume from where it stopped; in the final test step the model from the last training epoch is not necessarily selected, and the intermediate model with the best validation result may be chosen instead; a minimal sketch of such a loop is given below.
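By way of illustration only (this is not part of the patent's own text), a minimal PyTorch-style sketch of such an epoch loop with per-epoch validation and checkpointing could look as follows; the train_one_epoch and evaluate callables are hypothetical placeholders for the actual training and validation routines.
import torch

def fit(model, optimizer, train_one_epoch, evaluate, num_epochs, ckpt_path="checkpoint.pt"):
    # train_one_epoch(model, optimizer) and evaluate(model) -> float are caller-supplied
    # stand-ins for the actual training pass and validation-set evaluation.
    best_score, best_state = float("-inf"), None
    for epoch in range(num_epochs):
        train_one_epoch(model, optimizer)              # one pass over the training split
        score = evaluate(model)                        # temporary validation after this epoch
        torch.save({"epoch": epoch,                    # checkpoint: interrupted training can
                    "model": model.state_dict(),       # resume from here
                    "optimizer": optimizer.state_dict(),
                    "val_score": score}, ckpt_path)
        if score > best_score:                         # keep the best intermediate model,
            best_score = score                         # not necessarily the last epoch's
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
    return best_score, best_state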
Compared with the prior art, the technical solution has the following characteristics:
(1) An encoder-decoder structure based on the Transformer framework is used innovatively, introducing the Transformer approach into the field of image description generation, where it is found to perform much better than conventional methods;
(2) External knowledge based on human common sense is introduced: according to human common sense many objects in an image appear in pairs, for example a person and a football, and the guidance of such external knowledge greatly improves the accuracy of the words in the description sentences and makes them more human-like;
(3) Unlike the conventional Transformer computation of Q and K vector similarity, the method computes not only the visual relationships between different regions but also integrates the positional and semantic relationships.
The method fuses the visual, positional and semantic relationships among the different targets in an image; by exploiting the inter-target relationships together with common human knowledge, it mines higher-level, more abstract image features and thus generates more vivid and accurate description sentences, and computing the relationships from multiple angles makes the mining of inter-target relationships more complete and reasonable.
Drawings
FIG. 1 is a schematic diagram of the overall framework of the embodiment;
FIG. 2 is a schematic diagram of the self-attention computation process.
Detailed Description
The present invention will now be further illustrated, but not limited, by the following figures and examples.
Referring to FIG. 1, the image description generation method based on external knowledge and inter-target relationships comprises the following steps:
1) Data set classification: the data sets are divided into two main categories, the first being the knowledge data set Visual Genome used to extract external knowledge, and the second being an experimental data set (such as MSCOCO or Flickr8k), which is further divided into 3 subsets, namely a training set, a validation set and a test set; the embodiment adopts the Karpathy split of MSCOCO2014 for training, validation and testing, where the training set comprises images and their corresponding description sentences and is used to train the parameters of the model, the validation set is used to verify the training effect after each epoch so as to keep the best model of all epochs, and the test set is used to obtain the performance of the final model;
2) External semantic knowledge extraction stage:
2.1) The 3000 most frequent categories in the Visual Genome data set are obtained with a statistical algorithm (the number of categories of the other data sets is generally within 3000), and the frequency with which every pair of target categories co-occurs is then counted from the relationship branch of the Visual Genome data set, yielding a 3000 × 3000 category co-occurrence probability matrix W_cls;
2.2) The 200 most frequent attribute categories in the Visual Genome data set are selected to obtain an attribute matrix for the 3000 target categories, of dimension 3000 × 200; the JS divergence between every two categories is then computed as their attribute similarity, yielding a 3000 × 3000 attribute similarity matrix W_att;
2.3) The category co-occurrence probability matrix and the attribute similarity matrix are normalized row-wise so that each row sums to 1; concretely, the values of each row are summed to obtain a denominator, and every value in the row is divided by that denominator to obtain the normalized result;
2.4) For the target data sets, taking the MSCOCO data set as an example, each MSCOCO category is given its number in Visual Genome; MSCOCO has 81 categories including the background category, and category names with similar meanings are substituted, for example "ball" in Visual Genome is replaced by "sports_ball" in MSCOCO, while the category information of the category co-occurrence probability matrix obtained in step 2.1) and of the attribute similarity matrix obtained in step 2.2) is represented by synsets;
3) Target region feature extraction with Faster R-CNN:
3.1) The publicly available Faster R-CNN implementation on GitHub is pre-trained on the MSCOCO data set, the image features of the training data set are extracted with the ResNet-101 of the pre-trained model, the last two fully connected layers of ResNet-101 are discarded, and the image features of its last remaining layers are taken as input to the next step;
3.2) The features input from step 3.1) are fed to the RPN (region proposal network) to generate candidate boxes of the target regions in the image together with the category information of each candidate box, where each candidate box is classified as background or foreground (the foreground being a target object), and candidate boxes whose overlap rate exceeds 0.7 are deleted by non-maximum suppression;
3.3) The remaining candidate boxes are uniformly converted into 14 × 14 × 2048 feature maps through the RoI pooling layer of Faster R-CNN and then fed into an additional CNN layer to predict the category of each region box and refine the target region boxes;
3.4) Average pooling is used to generate 2048-dimensional feature vectors as input to the encoder;
4) Encoder processing of the input features:
4.1) The 2048-dimensional image features are first reduced to 512 dimensions by a fully connected layer, and the 512-dimensional image features are passed through a ReLU activation layer and a dropout layer;
4.2) The input image features are converted into the three vectors Q, K and V through three linear matrices and multi-head computation is performed, with the number of heads set to 8, in order to compute the similarity of the target-region features in the next step;
4.3) For each of the 8 heads, the Q, K vector similarity of all target-region features of the image is computed; the first computation is the visual similarity, i.e. the visual similarity matrix w^A_mn is obtained by taking the dot product of the region feature vectors and normalizing;
4.4) The coordinates of every pair of targets in the image are transformed to obtain their relative position information λ(m, n), which encodes the positional correlation between the two targets;
4.5) The λ(m, n) obtained in step 4.4) is embedded with a sinusoid function, the embedded λ(m, n) is multiplied by a linear transformation layer W_G, and the result is passed through the nonlinear activation function ReLU to obtain the image position similarity w^G_mn of target m and target n;
4.6) Each target category label detected by Faster R-CNN is stored, the corresponding rows are found in the category co-occurrence probability matrix and attribute similarity matrix obtained in step 2), and the semantic similarity w^R_mn of every pair of categories in the image is thereby obtained;
4.7) w^A_mn and w^G_mn are fused into w^AG_mn through a softmax operation, and an attention coefficient a is then introduced so that w^R_mn receives attention weight a and w^AG_mn receives (1 - a); a suitable value of a is learned in the subsequent training process, yielding the pairwise similarity w_mn that fuses the visual information, the position information and the external knowledge;
4.8) The similarity matrix w_mn computed from the Q, K vectors is multiplied by the V vectors to obtain weighted region features that incorporate the inter-target relationships of the image, and the V vectors of the 8 heads are then concatenated;
4.9) The V vector obtained in step 4.8) is passed, after a residual connection and normalization, through a feedforward network composed of two fully connected layers; the output of the feedforward network, after another residual connection and normalization, is fed to the next encoder layer, and after 6 encoder layers in total the output is passed to the decoder;
5) Decoder processing of the encoder output:
5.1) The word information of the ground-truth description sentence corresponding to each training picture in the training data set is first position-encoded;
5.2) The position-encoded word vectors of step 5.1) pass through Masked Multi-Head Attention to obtain weighted sentence word feature vectors, which serve as the V vectors of the multi-head self-attention in the next step of the first layer;
5.3) The output of the last encoder layer is converted into Q and K vectors through two linear transformation layers, and multi-head self-attention is then performed with the V vectors obtained in step 5.2) to obtain V vectors fused with the similarity information;
5.4) The V vectors obtained in step 5.3) are passed, after a residual connection and normalization, to a feedforward network, and the output, after another residual connection and normalization, is taken as the input of the next decoder layer;
5.5) The second decoder layer does not start with the Masked Multi-Head Attention of the first layer but directly performs the multi-head self-attention computation, with the Q, K and V vectors all obtained from the output of the previous decoder layer through three linear matrix transformations;
5.6) After 6 decoder layers in total, the output vector passes through a linear layer and a softmax layer to obtain the probability vector of the next word;
6) Test image description sentence stage:
6.1) A test-set image is input, its target-region features are extracted with the trained model and the similarities are computed;
6.2) The image features weighted by the similarity coefficients are taken as input to the encoder-decoder framework, and the word probabilities of each decoded image description sentence are output step by step;
6.3) Beam search with a beam size of 2 is used, the evaluation metric score of each output sentence is finally obtained, and the highest score is taken as the test result.
The external knowledge acquisition stage relies mainly on the Visual Genome data set; human common-sense knowledge is generally regarded as the linguistic form of explicit knowledge. The most representative explicit forms of knowledge are attribute knowledge, for example that an apple is red, and pairwise relationship knowledge, for example that a person rides a bicycle. The first step of this example is to write a python script and download the attribute branch and the inter-target relation branch of the Visual Genome data set; an attribute annotation usually has the form "a red round apple". A statistical algorithm then extracts the 200 most frequent attributes, giving a 3000 × 200 category-attribute matrix, and the similarity of every two 1 × 200 attribute vectors is computed with the JS divergence. Unlike most previous methods, which adopt the KL divergence, the JS divergence yields a symmetric matrix representation, so a symmetric 3000 × 3000 matrix measuring the attribute similarity of the 3000 categories is obtained; it represents semantic information at a very high level of abstraction. According to this attribute similarity matrix of the detected objects, similar objects clearly have higher similarity than other objects, so embedding this similarity information into the model improves its accuracy; the attribute similarity matrix obtained from these statistics is denoted W_att in the embodiment.
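As an illustrative sketch only, the attribute similarity computation described above could be implemented roughly as follows; how the JS divergence is mapped to a similarity value (here 1 - JS/ln 2) is an assumption, since the text only states that the JS divergence is used as the attribute similarity.
import numpy as np

def js_divergence(p, q, eps=1e-12):
    # Jensen-Shannon divergence between two discrete distributions; symmetric,
    # unlike the KL divergence, so the resulting similarity matrix is symmetric.
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def attribute_similarity_matrix(attr_counts):
    # attr_counts: (num_classes, num_attributes) attribute frequency matrix,
    # e.g. 3000 x 200 as in the embodiment; written for clarity, not speed.
    n = attr_counts.shape[0]
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            d = js_divergence(attr_counts[i].astype(float), attr_counts[j].astype(float))
            sim[i, j] = sim[j, i] = 1.0 - d / np.log(2)   # assumed mapping of divergence to [0, 1]
    return sim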
Besides the attribute similarity information there is more direct semantic similarity information, namely the category co-occurrence probability matrix, which directly represents the probability that two objects appear together in an image; for example, according to human common sense, the probability that a person and a car appear in the same image is higher than the probability that a horse and a car do. In this example the pairwise co-occurrence probabilities of the 3000 most frequent categories are counted from the Visual Genome data set, and the information of this matrix is integrated when the encoder later computes the similarity of the Q and K vectors of different categories. In the example, the attribute similarity and the category co-occurrence probability similarity are averaged and weighted by an attention coefficient a, whose optimal value is learned by the model during training, so that the proportions of the visual, positional and semantic similarities between different targets can be allocated reasonably; the co-occurrence probability matrix obtained in this step is denoted W_cls.
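A minimal sketch of how such a row-normalized co-occurrence matrix could be built from per-image object annotations is shown below; the input format (a list of class-index lists per image) is assumed for illustration.
import numpy as np
from itertools import combinations

def cooccurrence_matrix(image_objects, num_classes):
    # image_objects: per-image lists of class indices (0 .. num_classes-1), e.g. parsed
    # from the Visual Genome relationship/object annotations.
    counts = np.zeros((num_classes, num_classes))
    for objs in image_objects:
        for a, b in combinations(set(objs), 2):        # each unordered pair counted once per image
            counts[a, b] += 1
            counts[b, a] += 1
    row_sums = counts.sum(axis=1, keepdims=True)       # row-wise normalisation (step 2.3):
    row_sums[row_sums == 0] = 1.0                      # every row sums to 1
    return counts / row_sums

# Toy usage: three images annotated with class indices.
W_cls = cooccurrence_matrix([[0, 1], [0, 1, 2], [1, 2]], num_classes=3)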
After the semantic information representing human common sense has been obtained, this example performs category replacement for the different data sets: because the attribute similarity matrix and the category co-occurrence probability matrix of this example are indexed by the synsets of the category labels of the Visual Genome data set, the indices are remapped for the different data sets, and the synsets of the Visual Genome categories are assigned to the categories with similar meanings in the other data sets, so that the external knowledge of this example can be applied in a targeted way to data sets from different domains.
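The category replacement could be sketched as follows; the mapping table and the helper below are hypothetical and only illustrate the idea of indexing the Visual Genome knowledge matrices by synsets matched to the experimental data set's categories.
# Hypothetical mapping from MSCOCO category names to Visual Genome synsets; the real
# table is built by matching category names with similar meanings, e.g. "sports_ball"
# in MSCOCO to "ball" in Visual Genome.
COCO_TO_VG_SYNSET = {
    "person": "person.n.01",
    "sports_ball": "ball.n.01",
    "bicycle": "bicycle.n.01",
}

def remap_rows(coco_classes, vg_synset_index, knowledge_matrix):
    # Select, for every MSCOCO class, the row/column of the Visual Genome knowledge
    # matrix (W_cls or W_att) that belongs to its matched synset.
    rows = [vg_synset_index[COCO_TO_VG_SYNSET[c]] for c in coco_classes]
    return knowledge_matrix[rows][:, rows]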
For the stage in which the encoder processes the features input by Faster R-CNN, the key step of this example is first to pass the input 512-dimensional feature X through the three transformation matrices W_Q, W_K and W_V to obtain the three vectors Q, K and V, as shown in formula (1):
Q = W_Q X,  K = W_K X,  V = W_V X   (1)
The input X is a 512-dimensional vector and Q, K and V are 64-dimensional vectors, so the three transformation matrices have dimension 512 × 64. Next comes the stage of computing the Q, K vector similarity of each pair of region features; the conventional method, shown in formula (2), takes the dot product of the feature vectors of the two regions and divides it by the square root of the dimension d_k = 64:
Ω_A = [w^A_mn],  w^A_mn = (Q_m · K_n) / sqrt(d_k)   (2)
where Ω_A is an N × N matrix whose elements are the relationship coefficients w^A_mn between target m and target n; for example, if an image contains 50 targets in total, the value of N is 50. The key of the problem is therefore to compute the relationship coefficients between different categories. The visual similarity of two categories is computed with the method of formula (2); the position similarity computation is introduced next. The position coordinate information obtained from Faster R-CNN is (x, y, w, h), where (x, y) are the center coordinates of the target box and w and h are its width and height in the image. The position coordinates are then transformed according to formula (3):
λ(m, n) = ( log(|x_m - x_n| / w_m),  log(|y_m - y_n| / h_m),  log(w_n / w_m),  log(h_n / h_m) )   (3)
then, the lambda (m, n) is first dimension embedded, the thought of the lambda (m, n) is converted into 64 dimensions, and then multiplied by a linear matrix W G Finally, the image position relationship similarity of the object m and the object n is obtained by activating the non-linear function Relu as shown in the formula (4)
Figure BDA0003229728980000083
Figure BDA0003229728980000084
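A rough sketch of the position-similarity computation of formulas (3) and (4) is given below, under the assumption of a standard sinusoid embedding of the four relative coordinates (the exact embedding frequencies are not specified in the text).
import torch
import torch.nn.functional as F

def relative_geometry(boxes):
    # boxes: (N, 4) tensor of (x, y, w, h) box centres and sizes from Faster R-CNN;
    # returns the (N, N, 4) tensor of lambda(m, n) from formula (3).
    x, y, w, h = boxes.unbind(-1)
    dx = torch.log(torch.clamp((x[:, None] - x[None, :]).abs(), min=1e-3) / w[:, None])
    dy = torch.log(torch.clamp((y[:, None] - y[None, :]).abs(), min=1e-3) / h[:, None])
    dw = torch.log(w[None, :] / w[:, None])
    dh = torch.log(h[None, :] / h[:, None])
    return torch.stack([dx, dy, dw, dh], dim=-1)

def sinusoid_embed(lam, d_model=64, wave=1000.0):
    # Assumed embedding: each of the 4 relative coordinates is expanded with sin/cos at
    # several frequencies so that every target pair gets a d_model-dimensional vector.
    n_freq = d_model // 8
    freqs = torch.pow(torch.tensor(wave, dtype=lam.dtype),
                      torch.arange(n_freq, dtype=lam.dtype) / n_freq)
    angles = lam[..., None] / freqs                              # (N, N, 4, n_freq)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(-2)   # (N, N, d_model)

def position_similarity(boxes, W_G):
    # Formula (4): w^G_mn = ReLU(W_G * E_G(lambda(m, n))); W_G is a learned (d_model, 1) matrix.
    emb = sinusoid_embed(relative_geometry(boxes))
    return F.relu(emb @ W_G).squeeze(-1)                         # (N, N) matrix of w^G_mn

boxes = torch.tensor([[10., 20., 30., 40.], [15., 25., 20., 20.]])
wG = position_similarity(boxes, torch.randn(64, 1))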
Next is the semantic similarity computation stage. Taking the MSCOCO data set as an example, the total number of categories is 81, so the co-occurrence probability matrix W_cls and the attribute similarity matrix W_att extracted from the external knowledge both have size 81 × 81. Assuming the target category detected by Faster R-CNN is m and the other target categories in the image are n1 to n30, the row corresponding to the target category is first located in the two external knowledge matrices, then the co-occurrence probability value w^cls_mn and the attribute similarity value w^att_mn corresponding to each other category are looked up, and the two are averaged as shown in formula (5):
w^R_mn = ( w^cls_mn + w^att_mn ) / 2   (5)
then, the multi-level relation information between the targets is integrated, and the visual similarity is firstly calculated
Figure BDA0003229728980000094
And position similarity
Figure BDA0003229728980000095
Integration into +.>
Figure BDA0003229728980000096
The specific operation is shown in the formula (6): />
Figure BDA0003229728980000097
An attention coefficient a is then introduced so that w^R_mn receives attention weight a and w^AG_mn receives (1 - a); a suitable value of a is learned during the training process, giving the pairwise similarity w_mn that fuses the visual information, the position information and the external knowledge, as shown in formula (7):
w_mn = (1 - a) · w^AG_mn + a · w^R_mn   (7)
the similarity information of three kinds of relations of vision, position and semantics is fused to measure the relation among different targets, and then the V vector of each target is multiplied.
After each encoder layer has obtained its V vectors, the V vectors of the 8 heads are concatenated; the resulting vector is then passed, after residual connection and layer normalization, to the next encoder layer. The idea of the residual connection follows the classical ResNet: residual connections help parameter propagation and speed up training. The number of encoder layers is set to six, each layer taking the output of the previous layer as input, i.e. the V vector output by the previous layer is treated as the X vector of the next self-attention operation. The overall operation of multi-head self-attention is shown in formula (8):
MultiHead(Q, K, V) = Concat(head_1, ..., head_8) W_O,  head_i = w^(i)_mn V^(i)   (8)
where w^(i)_mn and V^(i) are the fused similarity matrix and the V vectors of head i, and W_O is the output linear matrix.
after the V vector output by each layer is obtained, the V vector passes through a feedforward neural network layer which is calculated according to elements, and the specific calculation formula is shown as (9):
FFN(x)=max(0,xW 1 +b 1 )W 2 +b 2 (9),
wherein W is 1 ,W 2 ,b 1 ,b 2 The weight parameters and the deviation of the two full-connection layers are respectively, and after the calculation and the transmission of the 6-layer data are finished, the output of the last layer of the encoder is used as the input of the first layer of the decoder.
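One encoder layer as described (multi-head attention, residual connections, layer normalisation and the feedforward network of formula (9)) might be sketched as below; the standard nn.MultiheadAttention is used here only as a stand-in for the relation-aware attention, and the feedforward width and dropout rate are assumed values.
import torch
import torch.nn as nn

class RelationEncoderLayer(nn.Module):
    # One encoder layer: multi-head attention, residual connection + layer normalisation,
    # then the position-wise feedforward network of formula (9) with another residual
    # connection and normalisation.
    def __init__(self, d_model=512, d_ff=2048, attn=None, dropout=0.1):
        super().__init__()
        self.attn = attn or nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(                      # FFN(x) = max(0, x W1 + b1) W2 + b2
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        # x: (batch, N, d_model) region features; the attention module here replaces
        # the relation-aware attention described above.
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + self.drop(a))
        return self.norm2(x + self.drop(self.ffn(x)))

encoder = nn.Sequential(*[RelationEncoderLayer() for _ in range(6)])   # 6 encoder layers
out = encoder(torch.randn(2, 36, 512))                                 # 36 regions per image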
For the stage in which the decoder processes the data, each word of the real description sentence corresponding to each training picture is first encoded in vector form. Step 1): each word is first encoded as a one-hot vector according to the word dictionary of the data set, but such vectors have too high a dimension to be handled directly, so word embedding is required. Step 2): the high-dimensional word vectors are embedded into low-dimensional ones, the word2vec method embedding each word into a 512-dimensional vector. Because the words in this example are generated one by one, the information of the third word cannot be known in advance when the second word is being generated; therefore, unlike the multi-head self-attention of the encoder, the first layer of the decoder adopts Masked Multi-Head Attention, masking the word information after each training time step with an upper triangular matrix, as shown in FIG. 2. Step 3): the word vectors output by the Masked Multi-Head Attention are taken as V vectors, the output of the last encoder layer is linearly transformed into Q and K vectors, their similarity is computed in the conventional way with formula (2), the result is multiplied by the V vectors, and after residual connection and layer normalization the vectors are passed to a feedforward network, whose output serves as the input of the next decoder layer. Step 4): the second decoder layer takes the output of the first layer and omits the Masked Multi-Head Attention of the first layer, directly performing the multi-head self-attention computation of formula (2): the output of the previous layer is taken as X, linearly transformed to compute the similarity coefficients and multiplied by the V vectors, and the V vectors formed by concatenating the 8 heads are used as the input of the next layer; the V vector output by the last layer passes through a linear layer and a softmax layer to output the probability vector of the next word, as shown in FIG. 2.
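A minimal sketch of the upper-triangular masking used by the Masked Multi-Head Attention of the first decoder layer (shown here for a single head):
import torch

def causal_mask(T):
    # Upper triangular mask: position t may only attend to positions <= t, masking the
    # word information after each training time step (FIG. 2).
    return torch.triu(torch.ones(T, T), diagonal=1).bool()

def masked_self_attention(Q, K, V):
    # Q, K, V: (T, d_k) word vectors of one ground-truth sentence, single head.
    d_k = Q.size(-1)
    scores = Q @ K.t() / d_k ** 0.5
    scores = scores.masked_fill(causal_mask(Q.size(0)), float("-inf"))
    return torch.softmax(scores, dim=-1) @ V

words = torch.randn(7, 64)          # 7 embedded words, 64 dimensions per head
out = masked_self_attention(words, words, words)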
In the test stage, the test-set pictures of the classical Karpathy split of MSCOCO and their description sentences are used for evaluation. After 60 epochs of training, the final model and the best model are obtained and stored in checkpoints; the test metrics include BLEU-1, BLEU-2, BLEU-3, BLEU-4, METEOR, ROUGE-L and SPICE. The test stage has no word guidance from a ground-truth description sentence: given a completely new image, the model generates a description sentence by itself, and the parameters obtained by training play the key role in this process. The trained parameters are equivalent to an experienced image processor that has learned the relationships among a large number of targets in images, can reasonably model the inter-target relationships of an image and embed them into the classical Transformer framework, and, with the help of residual connections, layer normalization and the feedforward network, obtains the relationship representation of all targets in each image, which provides strong guidance when the decoder decodes the word vectors of the sentences.
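For illustration, a generic beam search with beam size 2 over a per-step log-probability function could look as follows; the step function standing in for one decoder forward pass is a hypothetical placeholder.
import torch

def beam_search(step_fn, bos_id, eos_id, max_len=20, beam_size=2):
    # step_fn(seq) -> (vocab,) log-probabilities of the next word given the word ids
    # generated so far; it stands in for one forward pass of the trained decoder.
    beams = [([bos_id], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:                      # finished sentences are kept as they are
                candidates.append((seq, score))
                continue
            top_lp, top_ids = step_fn(seq).topk(beam_size)
            for lp, wid in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((seq + [wid], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return max(beams, key=lambda c: c[1])[0]           # highest-scoring sentence

# Toy usage with a random stand-in for the decoder over a 10-word vocabulary.
dummy_step = lambda seq: torch.log_softmax(torch.randn(10), dim=-1)
sentence = beam_search(dummy_step, bos_id=0, eos_id=1)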
Both the encoder and the decoder perform a position encoding operation in their first layer. For the encoder, image regions carrying two-dimensional position information are encoded; because an image has the two dimensions of width and height, the encoding differs from the traditional Transformer position encoding. After encoding, the two-dimensional image region features can be converted into a one-dimensional representation similar to a sentence sequence. The encoding uses the sinusoid functions shown in formula (10):
PE(pos, 2i) = sin( pos / 10000^(2i / d_model) ),  PE(pos, 2i+1) = cos( pos / 10000^(2i / d_model) )   (10)
where pos represents the position of the image region and i indexes the dimensions, each dimension corresponding to a single pixel, so that a two-dimensional image can be represented as a one-dimensional sequence; the position encoding at the decoder end is slightly different.
In the concrete implementation stage, the experiments are carried out on the PyTorch platform, and training and testing of the model are completed on an Ubuntu 16.04 system with an NVIDIA 1060Ti graphics card. The model parameters are set as follows: d is set to 512 dimensions, and an original input image is first extracted into 1024-dimensional feature vectors by the ResNet in Faster R-CNN and then fed to the encoder of the model. In the encoder, the visual similarity, position similarity and semantic similarity coefficients are all set to 64 dimensions. The batch size during training is set to 10; 30 epochs are first trained in the traditional cross-entropy manner, followed by reinforcement-learning training. The loss function of the cross-entropy training mode is shown in formula (11):
L_XE(θ) = - Σ_{t=1..T} log p_θ( y*_t | y*_1, ..., y*_{t-1} )   (11)
the goal of the cross entropy training phase is to minimize the loss function, i.e. to have the probability p in the above formula word as close to 1 as possible, where the meaning of the probability p is the probability of generating the next ground word from the t-1 words generated previously. The reinforcement learning stage is started after the training of 30 epochs in a cross entropy mode, namely, sentence generation is regarded as a reinforcement learning problem by adopting a sampling method, the training aims at maximizing a reward expectation function, and the formula is shown as (12):
Figure BDA0003229728980000113
in the above formula, θ is a parameter of the model, and after reinforcement learning of 30 epochs, the model can be tested with the best round of results.

Claims (1)

1. The image description generation method based on external knowledge and inter-target relationships is characterized by comprising the following steps:
1) Data set classification: dividing the data sets into two main categories, the first being a knowledge data set used to extract external knowledge, and the second being an experimental data set, which is further divided into 3 subsets, namely a training set, a validation set and a test set;
2) External semantic knowledge extraction stage:
2.1) Obtaining the 3000 most frequent categories in the knowledge data set with a statistical algorithm, then counting how often every pair of target categories co-occurs, yielding a 3000 × 3000 category co-occurrence probability matrix;
2.2) Selecting the 200 most frequent attribute categories in the knowledge data set to obtain an attribute matrix for the 3000 target categories, of dimension 3000 × 200, then computing the JS divergence between every two categories as their attribute similarity, yielding a 3000 × 3000 attribute similarity matrix;
2.3) Normalizing the category co-occurrence probability matrix and the attribute similarity matrix row-wise, i.e. so that each row sums to 1;
2.4) Performing category replacement on the experimental data sets, i.e. assigning each category of each experimental data set its number in the knowledge data set, where the category information of the category co-occurrence probability matrix obtained in step 2.1) and of the attribute similarity matrix obtained in step 2.2) is represented by synsets;
3) Target region feature extraction with Faster R-CNN:
3.1) Pre-training on a training data set with the publicly available Faster R-CNN implementation on GitHub, extracting the image features of the training data set with the ResNet-101 of the pre-trained model, discarding the last two fully connected layers of ResNet-101 and taking the image features of its last remaining layers as input to the next step;
3.2) Feeding the image features obtained in step 3.1) to the RPN (region proposal network) to generate candidate boxes of the target regions in the image together with the category information of each candidate box, where each candidate box is classified as background or foreground (the foreground being a target object), and deleting candidate boxes whose overlap rate exceeds 0.7 by non-maximum suppression;
3.3) Uniformly converting the remaining candidate boxes into 14 × 14 × 2048 feature maps through the RoI pooling layer of Faster R-CNN, and then feeding them into an additional CNN layer to predict the category of each region box and refine the target region boxes;
3.4) Using average pooling to generate 2048-dimensional feature vectors as input to the encoder;
4) Encoder processing of the input features:
4.1) First reducing the 2048-dimensional image features to 512 dimensions with a fully connected layer, and passing the 512-dimensional image features through a ReLU activation layer and a dropout layer;
4.2) Converting the input image features into the three vectors Q, K and V through three linear matrices and performing multi-head computation, with the number of heads set to 8;
4.3) For each of the 8 heads, computing the visual similarity w^A_mn with the Q, K vector similarity computation of the conventional Transformer model;
4.4) Transforming the coordinates of every pair of targets in the image to obtain their relative position information λ(m, n), which encodes the positional correlation between the two targets;
4.5) Embedding the λ(m, n) obtained in step 4.4) with a sinusoid function, multiplying the embedded λ(m, n) by a linear transformation layer W_G, and then passing the result through the nonlinear activation function ReLU to obtain the image position similarity w^G_mn of target m and target n;
4.6) Storing each target category label detected by Faster R-CNN, finding the corresponding rows in the category co-occurrence probability matrix and attribute similarity matrix obtained in step 2), and thereby obtaining the semantic similarity w^R_mn of every pair of categories in the image;
4.7) Fusing w^A_mn and w^G_mn into w^AG_mn through a softmax operation, then introducing an attention coefficient a so that w^R_mn receives attention weight a and w^AG_mn receives (1 - a); a suitable value of a is learned in the subsequent training process, yielding the pairwise similarity w_mn that fuses the visual information, the position information and the external knowledge;
4.8) Multiplying the similarity matrix w_mn computed from the Q, K vectors by the V vectors to obtain weighted region features that incorporate the inter-target relationships of the image, and then concatenating the V vectors of the 8 heads;
4.9) Passing the V vector obtained in step 4.8), after a residual connection and normalization, through a feedforward network composed of two fully connected layers, then passing the output of the feedforward network, after another residual connection and normalization, to the next encoder layer; after 6 encoder layers in total, the output is passed to the decoder;
5) Decoder processing of the encoder output:
5.1) First position-encoding the word information of the ground-truth description sentence corresponding to each training picture in the training data set;
5.2) Passing the position-encoded word vectors of step 5.1) through Masked Multi-Head Attention to obtain weighted sentence word feature vectors, which serve as the V vectors of the multi-head self-attention in the next step of the first layer;
5.3) Converting the output of the last encoder layer into Q and K vectors through two linear transformation layers, then performing multi-head self-attention with the V vectors obtained in step 5.2) to obtain V vectors fused with the similarity information;
5.4) Passing the V vectors obtained in step 5.3), after a residual connection and normalization, to a feedforward network, and taking the output, after another residual connection and normalization, as the input of the next decoder layer;
5.5) The second decoder layer does not start with the Masked Multi-Head Attention of the first layer but directly performs the multi-head self-attention computation, with the Q, K and V vectors all obtained from the output of the previous decoder layer through three linear matrix transformations;
5.6) After 6 decoder layers in total, passing the output vector through a linear layer and a softmax layer to obtain the probability vector of the next word;
6) Test image description sentence stage:
6.1) Inputting a test-set image, extracting its target-region features with the trained model and computing the similarities;
6.2) Taking the image features weighted by the similarity coefficients as input to the encoder-decoder framework and outputting, step by step, the word probabilities of each decoded image description sentence;
6.3) Using beam search with a beam size of 2, finally obtaining the evaluation metric score of each output sentence and taking the highest score as the test result.
CN202110982666.1A 2021-08-25 2021-08-25 Image description generation method based on relationship between external knowledge and target Active CN113609326B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110982666.1A CN113609326B (en) 2021-08-25 2021-08-25 Image description generation method based on relationship between external knowledge and target

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110982666.1A CN113609326B (en) 2021-08-25 2021-08-25 Image description generation method based on relationship between external knowledge and target

Publications (2)

Publication Number Publication Date
CN113609326A CN113609326A (en) 2021-11-05
CN113609326B true CN113609326B (en) 2023-04-28

Family

ID=78341994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110982666.1A Active CN113609326B (en) 2021-08-25 2021-08-25 Image description generation method based on relationship between external knowledge and target

Country Status (1)

Country Link
CN (1) CN113609326B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114417046B (en) * 2022-03-31 2022-07-12 腾讯科技(深圳)有限公司 Training method of feature extraction model, image retrieval method, device and equipment
CN116012685B (en) * 2022-12-20 2023-06-16 中国科学院空天信息创新研究院 Image description generation method based on fusion of relation sequence and visual sequence

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160467A (en) * 2019-05-31 2020-05-15 北京理工大学 Image description method based on conditional random field and internal semantic attention
CN112784848A (en) * 2021-02-04 2021-05-11 东北大学 Image description generation method based on multiple attention mechanisms and external knowledge
CN113220891A (en) * 2021-06-15 2021-08-06 北京邮电大学 Unsupervised concept-to-sentence based generation confrontation network image description algorithm
CN113298151A (en) * 2021-05-26 2021-08-24 中国电子科技集团公司第五十四研究所 Remote sensing image semantic description method based on multi-level feature fusion

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10678845B2 (en) * 2018-04-02 2020-06-09 International Business Machines Corporation Juxtaposing contextually similar cross-generation images


Also Published As

Publication number Publication date
CN113609326A (en) 2021-11-05


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant