CN113609326B - Image description generation method based on external knowledge and inter-target relationships


Info

Publication number
CN113609326B
Authority
CN
China
Prior art keywords
image
layer
similarity
data set
vectors
Prior art date
Legal status
Active
Application number
CN202110982666.1A
Other languages
Chinese (zh)
Other versions
CN113609326A (en)
Inventor
李志欣
陈天宇
张灿龙
Current Assignee
Guangxi Normal University
Original Assignee
Guangxi Normal University
Priority date
Filing date
Publication date
Application filed by Guangxi Normal University
Priority to CN202110982666.1A
Publication of CN113609326A
Application granted
Publication of CN113609326B
Legal status: Active


Classifications

    • G06F16/583: Information retrieval of still image data; retrieval characterised by using metadata automatically derived from the content
    • G06F18/213: Pattern recognition; feature extraction, e.g. by transforming the feature space
    • G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/22: Pattern recognition; matching criteria, e.g. proximity measures
    • G06F18/24: Pattern recognition; classification techniques
    • G06N3/044: Neural networks; recurrent networks, e.g. Hopfield networks
    • G06N3/08: Neural networks; learning methods


Abstract

The invention discloses an image description generation method based on external knowledge and inter-target relationships, which comprises the following stages: 1) data set classification; 2) external semantic knowledge extraction; 3) target region feature extraction with Faster R-CNN; 4) encoder processing of the input features; 5) decoder processing of the encoder output; 6) generation of description sentences for test images. The method fuses the visual, positional and semantic relationships among the different targets in an image; by exploiting the inter-target relationships together with common human knowledge, it mines higher-level, more abstract image features and thus generates more vivid and accurate description sentences, and computing the relationships from multiple angles makes the mining of inter-target relationships more complete and reasonable.

Description

Image description generation method based on external knowledge and inter-target relationships
Technical Field
The invention relates to the technical field of image description generation, and in particular to an image description generation method based on external knowledge and inter-target relationships.
Background
With the spread of networks and digital devices, media image data of all kinds are growing rapidly, and automatic image description generation has very broad application prospects, such as early childhood education and assistance for the visually impaired. The task spans the two fields of computer vision and natural language processing and therefore has very important research significance.
Image description generation has been a very active research area since the 1960s. Early and relatively widely used techniques mainly included retrieval-based methods and template-based methods. Retrieval-based methods, given an image, find similar images and their description sentences in an existing image library; their drawback is obvious, namely poor robustness to new images. Template-based methods split a sentence into slots such as subject, predicate and object and fill them according to the image content; their main drawback is that the generated sentences are rigid and inflexible.
With the later rapid development of deep learning, many innovative methods applying deep learning to image description generation have emerged. Inspired by natural language processing, Li Feifei et al. in 2015 applied the encoder-decoder model of the NLP field to image description generation and proposed the NIC model, which encodes the image information into a fixed-length vector that is then passed to a decoder to generate words one by one. However, not all of the image information is needed when generating every word. To better refine the image features used for sentence generation, attention mechanisms were fused into the NIC model so that the model attends to the most useful part of the image when generating different words; this not only greatly improved model performance but also set off a wave of work integrating attention mechanisms into image description. Many later works are improvements on this basis, such as the semantic attention mechanism, which treats the semantic information in the image as the object of attention allocation. To overcome the defect that earlier attention mechanisms mainly distributed attention weights over uniform image blocks, the combined bottom-up and top-down attention mechanism was later proposed. Its core innovation is to use a target detection module such as Faster R-CNN to extract target-region features: attention is distributed over the target features in the bottom-up mechanism and over uniform blocks of the whole image in the top-down mechanism, and the decoder uses a two-layer long short-term memory unit to combine the hidden states produced by the two mechanisms. This is generally regarded as a further milestone in image description generation.
Later, as the Transformer model became popular in the NLP field, many Transformer-based image description methods were developed in succession and showed better performance than most conventional methods. Compared with the Transformer for natural language processing, these methods modify the input position encoding and the attention modules in the encoder so as to better fit image input.
However, existing methods cannot integrate the abstract, high-level feature of inter-target relationships in an image into the attention mechanism. According to human common sense, the relationships among the targets in an image also carry important information; for example, when an image contains a football, people are very likely to appear around it. How to exploit such semantic information, which has important guiding significance for sentence generation, is a problem well worth studying.
Disclosure of Invention
Aiming at the defect that conventional image description generation methods cannot effectively exploit the semantic relationships among image targets, the invention provides an image description generation method based on external knowledge and inter-target relationships. The method fuses the visual, positional and semantic relationships among the different targets in an image; by exploiting the inter-target relationships together with common human knowledge, it mines higher-level, more abstract image features and thus generates more vivid and accurate description sentences, and computing the relationships from multiple angles makes the mining of inter-target relationships more complete and reasonable.
The technical solution for achieving the aim of the invention is as follows:
The image description generation method based on external knowledge and inter-target relationships comprises the following steps:
1) Data set classification: dividing the data sets into two main categories, the first being a knowledge data set used to extract external knowledge, and the second being an experimental data set, which is further divided into 3 subsets, namely a training set, a validation set and a test set;
2) External semantic knowledge extraction stage:
2.1) Obtaining the 3000 most frequent categories in the knowledge data set with a statistical algorithm, then counting how often every pair of target categories co-occurs, yielding a 3000 × 3000 category co-occurrence probability matrix;
2.2) Selecting the 200 most frequent attribute categories in the knowledge data set to obtain an attribute matrix for the 3000 target categories, of dimension 3000 × 200, then computing the JS divergence between every two categories as their attribute similarity, yielding a 3000 × 3000 attribute similarity matrix;
2.3) Normalizing the category co-occurrence probability matrix and the attribute similarity matrix row-wise, i.e. so that each row sums to 1;
2.4) Performing category replacement on the experimental data sets, i.e. assigning each category of each experimental data set its number in the knowledge data set, where the category information of the category co-occurrence probability matrix obtained in step 2.1) and of the attribute similarity matrix obtained in step 2.2) is represented by synsets;
3) Target region feature extraction with Faster R-CNN:
3.1) Pre-training on the training split of the experimental data set with the publicly available Faster R-CNN implementation on GitHub, extracting the image features of the training data set with the ResNet-101 of the pre-trained model, discarding the last two fully connected layers of ResNet-101 and taking the image features of its last remaining layers as input to the next step;
3.2) Feeding the features obtained in step 3.1) to the RPN (region proposal network) to generate candidate boxes of the target regions in the image together with the category information of each candidate box, where each candidate box is classified as background or foreground (the foreground being a target object), and deleting candidate boxes whose overlap rate exceeds 0.7 by non-maximum suppression;
3.3) Uniformly converting the remaining candidate boxes into 14 × 14 × 2048 feature maps through the RoI pooling layer of Faster R-CNN, and then feeding them into an additional CNN layer to predict the category of each region box and refine the target region boxes;
3.4) Using average pooling to generate 2048-dimensional feature vectors as input to the encoder;
4) Encoder processing of the input features:
4.1) First reducing the 2048-dimensional image features to 512 dimensions with a fully connected layer, and passing the 512-dimensional image features through a ReLU activation layer and a dropout layer;
4.2) Converting the input image features into the three vectors Q, K and V through three linear matrices and performing multi-head computation, with the number of heads set to 8, for computing the similarity of the target-region features in the next step;
4.3) For each of the 8 heads, computing the visual similarity w^A_mn with the Q, K vector similarity computation of the conventional Transformer model;
4.4) Transforming the coordinates of every pair of targets in the image to obtain their relative position information λ(m, n), which encodes the positional correlation between the two targets;
4.5) Embedding the λ(m, n) obtained in step 4.4) with a sinusoid function, multiplying the embedded λ(m, n) by a linear transformation layer W_G, and then passing the result through the nonlinear activation function ReLU to obtain the image position similarity w^G_mn of target m and target n;
4.6) Storing each target category label detected by Faster R-CNN, finding the corresponding rows in the category co-occurrence probability matrix and attribute similarity matrix obtained in step 2), and thereby obtaining the semantic similarity w^R_mn of every pair of categories in the image;
4.7) Fusing w^A_mn and w^G_mn into w^AG_mn through a softmax operation, then introducing an attention coefficient a so that w^R_mn receives attention weight a and w^AG_mn receives (1 - a); a suitable value of a is learned in the subsequent training process, yielding the pairwise similarity w_mn that fuses the visual information, the position information and the external knowledge;
4.8) Multiplying the similarity matrix w_mn computed from the Q, K vectors by the V vectors to obtain weighted region features that incorporate the inter-target relationships of the image, and then concatenating the V vectors of the 8 heads;
4.9) Passing the V vector obtained in step 4.8), after a residual connection and normalization, through a feedforward network composed of two fully connected layers, then passing the output of the feedforward network, after another residual connection and normalization, to the next encoder layer; after 6 encoder layers in total, the output is passed to the decoder;
5) Decoder processing of the encoder output:
5.1) First position-encoding the word information of the ground-truth description sentence corresponding to each training picture in the training data set;
5.2) Passing the position-encoded word vectors of step 5.1) through Masked Multi-Head Attention to obtain weighted sentence word feature vectors, which serve as the V vectors of the multi-head self-attention in the next step of the first layer;
5.3) Converting the output of the last encoder layer into Q and K vectors through two linear transformation layers, then performing multi-head self-attention with the V vectors obtained in step 5.2) to obtain V vectors fused with the similarity information;
5.4) Passing the V vectors obtained in step 5.3), after a residual connection and normalization, to a feedforward network, and taking the output, after another residual connection and normalization, as the input of the next decoder layer;
5.5) The second decoder layer does not start with the Masked Multi-Head Attention of the first layer but directly performs the multi-head self-attention computation, with the Q, K and V vectors all obtained from the output of the previous decoder layer through three linear matrix transformations;
5.6) After 6 decoder layers in total, passing the output vector through a linear layer and a softmax layer to obtain the probability vector of the next word;
6) Test image description sentence stage:
6.1) Inputting a test-set image, extracting its target-region features with the trained model and computing the similarities;
6.2) Taking the image features weighted by the similarity coefficients as input to the encoder-decoder framework and outputting, step by step, the word probabilities of each decoded image description sentence;
6.3) Using beam search with a beam size of 2, finally obtaining the evaluation metric score of each output sentence and taking the highest score as the test result.
For each data set the number of training images far exceeds the number of test images. After each training epoch, a temporary validation is run on the validation-set images, the result is recorded and a checkpoint is saved, so that if training is interrupted it can resume from where it stopped; in the final test step the model from the last training epoch is not necessarily selected, and the intermediate model with the best validation result may be chosen instead; a minimal sketch of such a loop is given below.
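By way of illustration only (this is not part of the patent's own text), a minimal PyTorch-style sketch of such an epoch loop with per-epoch validation and checkpointing could look as follows; the train_one_epoch and evaluate callables are hypothetical placeholders for the actual training and validation routines.
import torch

def fit(model, optimizer, train_one_epoch, evaluate, num_epochs, ckpt_path="checkpoint.pt"):
    # train_one_epoch(model, optimizer) and evaluate(model) -> float are caller-supplied
    # stand-ins for the actual training pass and validation-set evaluation.
    best_score, best_state = float("-inf"), None
    for epoch in range(num_epochs):
        train_one_epoch(model, optimizer)              # one pass over the training split
        score = evaluate(model)                        # temporary validation after this epoch
        torch.save({"epoch": epoch,                    # checkpoint: interrupted training can
                    "model": model.state_dict(),       # resume from here
                    "optimizer": optimizer.state_dict(),
                    "val_score": score}, ckpt_path)
        if score > best_score:                         # keep the best intermediate model,
            best_score = score                         # not necessarily the last epoch's
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
    return best_score, best_state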
Compared with the prior art, the technical solution has the following characteristics:
(1) An encoder-decoder structure based on the Transformer framework is used innovatively, introducing the Transformer approach into the field of image description generation, where it is found to perform much better than conventional methods;
(2) External knowledge based on human common sense is introduced: according to human common sense many objects in an image appear in pairs, for example a person and a football, and the guidance of such external knowledge greatly improves the accuracy of the words in the description sentences and makes them more human-like;
(3) Unlike the conventional Transformer computation of Q and K vector similarity, the method computes not only the visual relationships between different regions but also integrates the positional and semantic relationships.
The method fuses the visual, positional and semantic relationships among the different targets in an image; by exploiting the inter-target relationships together with common human knowledge, it mines higher-level, more abstract image features and thus generates more vivid and accurate description sentences, and computing the relationships from multiple angles makes the mining of inter-target relationships more complete and reasonable.
Drawings
FIG. 1 is a schematic diagram of the overall framework of the embodiment;
FIG. 2 is a schematic diagram of the self-attention computation process.
Detailed Description
The present invention will now be further illustrated, but not limited, by the following figures and examples.
Referring to FIG. 1, the image description generation method based on external knowledge and inter-target relationships comprises the following steps:
1) Data set classification: the data sets are divided into two main categories, the first being the knowledge data set Visual Genome used to extract external knowledge, and the second being an experimental data set (such as MSCOCO or Flickr8k), which is further divided into 3 subsets, namely a training set, a validation set and a test set; the embodiment adopts the Karpathy split of MSCOCO2014 for training, validation and testing, where the training set comprises images and their corresponding description sentences and is used to train the parameters of the model, the validation set is used to verify the training effect after each epoch so as to keep the best model of all epochs, and the test set is used to obtain the performance of the final model;
2) External semantic knowledge extraction stage:
2.1) The 3000 most frequent categories in the Visual Genome data set are obtained with a statistical algorithm (the number of categories of the other data sets is generally within 3000), and the frequency with which every pair of target categories co-occurs is then counted from the relationship branch of the Visual Genome data set, yielding a 3000 × 3000 category co-occurrence probability matrix W_cls;
2.2) The 200 most frequent attribute categories in the Visual Genome data set are selected to obtain an attribute matrix for the 3000 target categories, of dimension 3000 × 200; the JS divergence between every two categories is then computed as their attribute similarity, yielding a 3000 × 3000 attribute similarity matrix W_att;
2.3) The category co-occurrence probability matrix and the attribute similarity matrix are normalized row-wise so that each row sums to 1; concretely, the values of each row are summed to obtain a denominator, and every value in the row is divided by that denominator to obtain the normalized result;
2.4) For the target data sets, taking the MSCOCO data set as an example, each MSCOCO category is given its number in Visual Genome; MSCOCO has 81 categories including the background category, and category names with similar meanings are substituted, for example "ball" in Visual Genome is replaced by "sports_ball" in MSCOCO, while the category information of the category co-occurrence probability matrix obtained in step 2.1) and of the attribute similarity matrix obtained in step 2.2) is represented by synsets;
3) Target region feature extraction with Faster R-CNN:
3.1) The publicly available Faster R-CNN implementation on GitHub is pre-trained on the MSCOCO data set, the image features of the training data set are extracted with the ResNet-101 of the pre-trained model, the last two fully connected layers of ResNet-101 are discarded, and the image features of its last remaining layers are taken as input to the next step;
3.2) The features input from step 3.1) are fed to the RPN (region proposal network) to generate candidate boxes of the target regions in the image together with the category information of each candidate box, where each candidate box is classified as background or foreground (the foreground being a target object), and candidate boxes whose overlap rate exceeds 0.7 are deleted by non-maximum suppression;
3.3) The remaining candidate boxes are uniformly converted into 14 × 14 × 2048 feature maps through the RoI pooling layer of Faster R-CNN and then fed into an additional CNN layer to predict the category of each region box and refine the target region boxes;
3.4) Average pooling is used to generate 2048-dimensional feature vectors as input to the encoder;
4) Encoder processing of the input features:
4.1) The 2048-dimensional image features are first reduced to 512 dimensions by a fully connected layer, and the 512-dimensional image features are passed through a ReLU activation layer and a dropout layer;
4.2) The input image features are converted into the three vectors Q, K and V through three linear matrices and multi-head computation is performed, with the number of heads set to 8, in order to compute the similarity of the target-region features in the next step;
4.3) For each of the 8 heads, the Q, K vector similarity of all target-region features of the image is computed; the first computation is the visual similarity, i.e. the visual similarity matrix w^A_mn is obtained by taking the dot product of the region feature vectors and normalizing;
4.4) The coordinates of every pair of targets in the image are transformed to obtain their relative position information λ(m, n), which encodes the positional correlation between the two targets;
4.5) The λ(m, n) obtained in step 4.4) is embedded with a sinusoid function, the embedded λ(m, n) is multiplied by a linear transformation layer W_G, and the result is passed through the nonlinear activation function ReLU to obtain the image position similarity w^G_mn of target m and target n;
4.6) Each target category label detected by Faster R-CNN is stored, the corresponding rows are found in the category co-occurrence probability matrix and attribute similarity matrix obtained in step 2), and the semantic similarity w^R_mn of every pair of categories in the image is thereby obtained;
4.7) w^A_mn and w^G_mn are fused into w^AG_mn through a softmax operation, and an attention coefficient a is then introduced so that w^R_mn receives attention weight a and w^AG_mn receives (1 - a); a suitable value of a is learned in the subsequent training process, yielding the pairwise similarity w_mn that fuses the visual information, the position information and the external knowledge;
4.8) The similarity matrix w_mn computed from the Q, K vectors is multiplied by the V vectors to obtain weighted region features that incorporate the inter-target relationships of the image, and the V vectors of the 8 heads are then concatenated;
4.9) The V vector obtained in step 4.8) is passed, after a residual connection and normalization, through a feedforward network composed of two fully connected layers; the output of the feedforward network, after another residual connection and normalization, is fed to the next encoder layer, and after 6 encoder layers in total the output is passed to the decoder;
5) Decoder processing of the encoder output:
5.1) The word information of the ground-truth description sentence corresponding to each training picture in the training data set is first position-encoded;
5.2) The position-encoded word vectors of step 5.1) pass through Masked Multi-Head Attention to obtain weighted sentence word feature vectors, which serve as the V vectors of the multi-head self-attention in the next step of the first layer;
5.3) The output of the last encoder layer is converted into Q and K vectors through two linear transformation layers, and multi-head self-attention is then performed with the V vectors obtained in step 5.2) to obtain V vectors fused with the similarity information;
5.4) The V vectors obtained in step 5.3) are passed, after a residual connection and normalization, to a feedforward network, and the output, after another residual connection and normalization, is taken as the input of the next decoder layer;
5.5) The second decoder layer does not start with the Masked Multi-Head Attention of the first layer but directly performs the multi-head self-attention computation, with the Q, K and V vectors all obtained from the output of the previous decoder layer through three linear matrix transformations;
5.6) After 6 decoder layers in total, the output vector passes through a linear layer and a softmax layer to obtain the probability vector of the next word;
6) Test image description sentence stage:
6.1) A test-set image is input, its target-region features are extracted with the trained model and the similarities are computed;
6.2) The image features weighted by the similarity coefficients are taken as input to the encoder-decoder framework, and the word probabilities of each decoded image description sentence are output step by step;
6.3) Beam search with a beam size of 2 is used, the evaluation metric score of each output sentence is finally obtained, and the highest score is taken as the test result.
The external knowledge acquisition stage relies mainly on the Visual Genome data set; human common-sense knowledge is generally regarded as the linguistic form of explicit knowledge. The most representative explicit forms of knowledge are attribute knowledge, for example that an apple is red, and pairwise relationship knowledge, for example that a person rides a bicycle. The first step of this example is to write a python script and download the attribute branch and the inter-target relation branch of the Visual Genome data set; an attribute annotation usually has the form "a red round apple". A statistical algorithm then extracts the 200 most frequent attributes, giving a 3000 × 200 category-attribute matrix, and the similarity of every two 1 × 200 attribute vectors is computed with the JS divergence. Unlike most previous methods, which adopt the KL divergence, the JS divergence yields a symmetric matrix representation, so a symmetric 3000 × 3000 matrix measuring the attribute similarity of the 3000 categories is obtained; it represents semantic information at a very high level of abstraction. According to this attribute similarity matrix of the detected objects, similar objects clearly have higher similarity than other objects, so embedding this similarity information into the model improves its accuracy; the attribute similarity matrix obtained from these statistics is denoted W_att in the embodiment.
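As an illustrative sketch only, the attribute similarity computation described above could be implemented roughly as follows; how the JS divergence is mapped to a similarity value (here 1 - JS/ln 2) is an assumption, since the text only states that the JS divergence is used as the attribute similarity.
import numpy as np

def js_divergence(p, q, eps=1e-12):
    # Jensen-Shannon divergence between two discrete distributions; symmetric,
    # unlike the KL divergence, so the resulting similarity matrix is symmetric.
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def attribute_similarity_matrix(attr_counts):
    # attr_counts: (num_classes, num_attributes) attribute frequency matrix,
    # e.g. 3000 x 200 as in the embodiment; written for clarity, not speed.
    n = attr_counts.shape[0]
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            d = js_divergence(attr_counts[i].astype(float), attr_counts[j].astype(float))
            sim[i, j] = sim[j, i] = 1.0 - d / np.log(2)   # assumed mapping of divergence to [0, 1]
    return sim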
Besides the attribute similarity information there is more direct semantic similarity information, namely the category co-occurrence probability matrix, which directly represents the probability that two objects appear together in an image; for example, according to human common sense, the probability that a person and a car appear in the same image is higher than the probability that a horse and a car do. In this example the pairwise co-occurrence probabilities of the 3000 most frequent categories are counted from the Visual Genome data set, and the information of this matrix is integrated when the encoder later computes the similarity of the Q and K vectors of different categories. In the example, the attribute similarity and the category co-occurrence probability similarity are averaged and weighted by an attention coefficient a, whose optimal value is learned by the model during training, so that the proportions of the visual, positional and semantic similarities between different targets can be allocated reasonably; the co-occurrence probability matrix obtained in this step is denoted W_cls.
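A minimal sketch of how such a row-normalized co-occurrence matrix could be built from per-image object annotations is shown below; the input format (a list of class-index lists per image) is assumed for illustration.
import numpy as np
from itertools import combinations

def cooccurrence_matrix(image_objects, num_classes):
    # image_objects: per-image lists of class indices (0 .. num_classes-1), e.g. parsed
    # from the Visual Genome relationship/object annotations.
    counts = np.zeros((num_classes, num_classes))
    for objs in image_objects:
        for a, b in combinations(set(objs), 2):        # each unordered pair counted once per image
            counts[a, b] += 1
            counts[b, a] += 1
    row_sums = counts.sum(axis=1, keepdims=True)       # row-wise normalisation (step 2.3):
    row_sums[row_sums == 0] = 1.0                      # every row sums to 1
    return counts / row_sums

# Toy usage: three images annotated with class indices.
W_cls = cooccurrence_matrix([[0, 1], [0, 1, 2], [1, 2]], num_classes=3)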
After the semantic information representing human common sense has been obtained, this example performs category replacement for the different data sets: because the attribute similarity matrix and the category co-occurrence probability matrix of this example are indexed by the synsets of the category labels of the Visual Genome data set, the indices are remapped for the different data sets, and the synsets of the Visual Genome categories are assigned to the categories with similar meanings in the other data sets, so that the external knowledge of this example can be applied in a targeted way to data sets from different domains.
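The category replacement could be sketched as follows; the mapping table and the helper below are hypothetical and only illustrate the idea of indexing the Visual Genome knowledge matrices by synsets matched to the experimental data set's categories.
# Hypothetical mapping from MSCOCO category names to Visual Genome synsets; the real
# table is built by matching category names with similar meanings, e.g. "sports_ball"
# in MSCOCO to "ball" in Visual Genome.
COCO_TO_VG_SYNSET = {
    "person": "person.n.01",
    "sports_ball": "ball.n.01",
    "bicycle": "bicycle.n.01",
}

def remap_rows(coco_classes, vg_synset_index, knowledge_matrix):
    # Select, for every MSCOCO class, the row/column of the Visual Genome knowledge
    # matrix (W_cls or W_att) that belongs to its matched synset.
    rows = [vg_synset_index[COCO_TO_VG_SYNSET[c]] for c in coco_classes]
    return knowledge_matrix[rows][:, rows]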
For the stage in which the encoder processes the features input by Faster R-CNN, the key step of this example is first to pass the input 512-dimensional feature X through the three transformation matrices W_Q, W_K and W_V to obtain the three vectors Q, K and V, as shown in formula (1):
Q = W_Q X,  K = W_K X,  V = W_V X   (1)
The input X is a 512-dimensional vector and Q, K and V are 64-dimensional vectors, so the three transformation matrices have dimension 512 × 64. Next comes the stage of computing the Q, K vector similarity of each pair of region features; the conventional method, shown in formula (2), takes the dot product of the feature vectors of the two regions and divides it by the square root of the dimension d_k = 64:
Ω_A = [w^A_mn],  w^A_mn = (Q_m · K_n) / sqrt(d_k)   (2)
where Ω_A is an N × N matrix whose elements are the relationship coefficients w^A_mn between target m and target n; for example, if an image contains 50 targets in total, the value of N is 50. The key of the problem is therefore to compute the relationship coefficients between different categories. The visual similarity of two categories is computed with the method of formula (2); the position similarity computation is introduced next. The position coordinate information obtained from Faster R-CNN is (x, y, w, h), where (x, y) are the center coordinates of the target box and w and h are its width and height in the image. The position coordinates are then transformed according to formula (3):
λ(m, n) = ( log(|x_m - x_n| / w_m),  log(|y_m - y_n| / h_m),  log(w_n / w_m),  log(h_n / h_m) )   (3)
then, the lambda (m, n) is first dimension embedded, the thought of the lambda (m, n) is converted into 64 dimensions, and then multiplied by a linear matrix W G Finally, the image position relationship similarity of the object m and the object n is obtained by activating the non-linear function Relu as shown in the formula (4)
Figure BDA0003229728980000083
Figure BDA0003229728980000084
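A rough sketch of the position-similarity computation of formulas (3) and (4) is given below, under the assumption of a standard sinusoid embedding of the four relative coordinates (the exact embedding frequencies are not specified in the text).
import torch
import torch.nn.functional as F

def relative_geometry(boxes):
    # boxes: (N, 4) tensor of (x, y, w, h) box centres and sizes from Faster R-CNN;
    # returns the (N, N, 4) tensor of lambda(m, n) from formula (3).
    x, y, w, h = boxes.unbind(-1)
    dx = torch.log(torch.clamp((x[:, None] - x[None, :]).abs(), min=1e-3) / w[:, None])
    dy = torch.log(torch.clamp((y[:, None] - y[None, :]).abs(), min=1e-3) / h[:, None])
    dw = torch.log(w[None, :] / w[:, None])
    dh = torch.log(h[None, :] / h[:, None])
    return torch.stack([dx, dy, dw, dh], dim=-1)

def sinusoid_embed(lam, d_model=64, wave=1000.0):
    # Assumed embedding: each of the 4 relative coordinates is expanded with sin/cos at
    # several frequencies so that every target pair gets a d_model-dimensional vector.
    n_freq = d_model // 8
    freqs = torch.pow(torch.tensor(wave, dtype=lam.dtype),
                      torch.arange(n_freq, dtype=lam.dtype) / n_freq)
    angles = lam[..., None] / freqs                              # (N, N, 4, n_freq)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(-2)   # (N, N, d_model)

def position_similarity(boxes, W_G):
    # Formula (4): w^G_mn = ReLU(W_G * E_G(lambda(m, n))); W_G is a learned (d_model, 1) matrix.
    emb = sinusoid_embed(relative_geometry(boxes))
    return F.relu(emb @ W_G).squeeze(-1)                         # (N, N) matrix of w^G_mn

boxes = torch.tensor([[10., 20., 30., 40.], [15., 25., 20., 20.]])
wG = position_similarity(boxes, torch.randn(64, 1))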
Next is the semantic similarity computation stage. Taking the MSCOCO data set as an example, the total number of categories is 81, so the co-occurrence probability matrix W_cls and the attribute similarity matrix W_att extracted from the external knowledge both have size 81 × 81. Assuming the target category detected by Faster R-CNN is m and the other target categories in the image are n1 to n30, the row corresponding to the target category is first located in the two external knowledge matrices, then the co-occurrence probability value w^cls_mn and the attribute similarity value w^att_mn corresponding to each other category are looked up, and the two are averaged as shown in formula (5):
w^R_mn = ( w^cls_mn + w^att_mn ) / 2   (5)
then, the multi-level relation information between the targets is integrated, and the visual similarity is firstly calculated
Figure BDA0003229728980000094
And position similarity
Figure BDA0003229728980000095
Integration into +.>
Figure BDA0003229728980000096
The specific operation is shown in the formula (6): />
Figure BDA0003229728980000097
An attention coefficient a is then introduced so that w^R_mn receives attention weight a and w^AG_mn receives (1 - a); a suitable value of a is learned during the training process, giving the pairwise similarity w_mn that fuses the visual information, the position information and the external knowledge, as shown in formula (7):
w_mn = (1 - a) · w^AG_mn + a · w^R_mn   (7)
the similarity information of three kinds of relations of vision, position and semantics is fused to measure the relation among different targets, and then the V vector of each target is multiplied.
After each encoder layer has obtained its V vectors, the V vectors of the 8 heads are concatenated; the resulting vector is then passed, after residual connection and layer normalization, to the next encoder layer. The idea of the residual connection follows the classical ResNet: residual connections help parameter propagation and speed up training. The number of encoder layers is set to six, each layer taking the output of the previous layer as input, i.e. the V vector output by the previous layer is treated as the X vector of the next self-attention operation. The overall operation of multi-head self-attention is shown in formula (8):
MultiHead(Q, K, V) = Concat(head_1, ..., head_8) W_O,  head_i = w^(i)_mn V^(i)   (8)
where w^(i)_mn and V^(i) are the fused similarity matrix and the V vectors of head i, and W_O is the output linear matrix.
after the V vector output by each layer is obtained, the V vector passes through a feedforward neural network layer which is calculated according to elements, and the specific calculation formula is shown as (9):
FFN(x)=max(0,xW 1 +b 1 )W 2 +b 2 (9),
wherein W is 1 ,W 2 ,b 1 ,b 2 The weight parameters and the deviation of the two full-connection layers are respectively, and after the calculation and the transmission of the 6-layer data are finished, the output of the last layer of the encoder is used as the input of the first layer of the decoder.
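One encoder layer as described (multi-head attention, residual connections, layer normalisation and the feedforward network of formula (9)) might be sketched as below; the standard nn.MultiheadAttention is used here only as a stand-in for the relation-aware attention, and the feedforward width and dropout rate are assumed values.
import torch
import torch.nn as nn

class RelationEncoderLayer(nn.Module):
    # One encoder layer: multi-head attention, residual connection + layer normalisation,
    # then the position-wise feedforward network of formula (9) with another residual
    # connection and normalisation.
    def __init__(self, d_model=512, d_ff=2048, attn=None, dropout=0.1):
        super().__init__()
        self.attn = attn or nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(                      # FFN(x) = max(0, x W1 + b1) W2 + b2
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        # x: (batch, N, d_model) region features; the attention module here replaces
        # the relation-aware attention described above.
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + self.drop(a))
        return self.norm2(x + self.drop(self.ffn(x)))

encoder = nn.Sequential(*[RelationEncoderLayer() for _ in range(6)])   # 6 encoder layers
out = encoder(torch.randn(2, 36, 512))                                 # 36 regions per image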
For the stage in which the decoder processes the data, each word of the real description sentence corresponding to each training picture is first encoded in vector form. Step 1): each word is first encoded as a one-hot vector according to the word dictionary of the data set, but such vectors have too high a dimension to be handled directly, so word embedding is required. Step 2): the high-dimensional word vectors are embedded into low-dimensional ones, the word2vec method embedding each word into a 512-dimensional vector. Because the words in this example are generated one by one, the information of the third word cannot be known in advance when the second word is being generated; therefore, unlike the multi-head self-attention of the encoder, the first layer of the decoder adopts Masked Multi-Head Attention, masking the word information after each training time step with an upper triangular matrix, as shown in FIG. 2. Step 3): the word vectors output by the Masked Multi-Head Attention are taken as V vectors, the output of the last encoder layer is linearly transformed into Q and K vectors, their similarity is computed in the conventional way with formula (2), the result is multiplied by the V vectors, and after residual connection and layer normalization the vectors are passed to a feedforward network, whose output serves as the input of the next decoder layer. Step 4): the second decoder layer takes the output of the first layer and omits the Masked Multi-Head Attention of the first layer, directly performing the multi-head self-attention computation of formula (2): the output of the previous layer is taken as X, linearly transformed to compute the similarity coefficients and multiplied by the V vectors, and the V vectors formed by concatenating the 8 heads are used as the input of the next layer; the V vector output by the last layer passes through a linear layer and a softmax layer to output the probability vector of the next word, as shown in FIG. 2.
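A minimal sketch of the upper-triangular masking used by the Masked Multi-Head Attention of the first decoder layer (shown here for a single head):
import torch

def causal_mask(T):
    # Upper triangular mask: position t may only attend to positions <= t, masking the
    # word information after each training time step (FIG. 2).
    return torch.triu(torch.ones(T, T), diagonal=1).bool()

def masked_self_attention(Q, K, V):
    # Q, K, V: (T, d_k) word vectors of one ground-truth sentence, single head.
    d_k = Q.size(-1)
    scores = Q @ K.t() / d_k ** 0.5
    scores = scores.masked_fill(causal_mask(Q.size(0)), float("-inf"))
    return torch.softmax(scores, dim=-1) @ V

words = torch.randn(7, 64)          # 7 embedded words, 64 dimensions per head
out = masked_self_attention(words, words, words)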
In the test stage, the test-set pictures of the classical Karpathy split of MSCOCO and their description sentences are used for evaluation. After 60 epochs of training, the final model and the best model are obtained and stored in checkpoints; the test metrics include BLEU-1, BLEU-2, BLEU-3, BLEU-4, METEOR, ROUGE-L and SPICE. The test stage has no word guidance from a ground-truth description sentence: given a completely new image, the model generates a description sentence by itself, and the parameters obtained by training play the key role in this process. The trained parameters are equivalent to an experienced image processor that has learned the relationships among a large number of targets in images, can reasonably model the inter-target relationships of an image and embed them into the classical Transformer framework, and, with the help of residual connections, layer normalization and the feedforward network, obtains the relationship representation of all targets in each image, which provides strong guidance when the decoder decodes the word vectors of the sentences.
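For illustration, a generic beam search with beam size 2 over a per-step log-probability function could look as follows; the step function standing in for one decoder forward pass is a hypothetical placeholder.
import torch

def beam_search(step_fn, bos_id, eos_id, max_len=20, beam_size=2):
    # step_fn(seq) -> (vocab,) log-probabilities of the next word given the word ids
    # generated so far; it stands in for one forward pass of the trained decoder.
    beams = [([bos_id], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:                      # finished sentences are kept as they are
                candidates.append((seq, score))
                continue
            top_lp, top_ids = step_fn(seq).topk(beam_size)
            for lp, wid in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((seq + [wid], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return max(beams, key=lambda c: c[1])[0]           # highest-scoring sentence

# Toy usage with a random stand-in for the decoder over a 10-word vocabulary.
dummy_step = lambda seq: torch.log_softmax(torch.randn(10), dim=-1)
sentence = beam_search(dummy_step, bos_id=0, eos_id=1)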
Both the encoder and the decoder perform a position encoding operation in their first layer. For the encoder, image regions carrying two-dimensional position information are encoded; because an image has the two dimensions of width and height, the encoding differs from the traditional Transformer position encoding. After encoding, the two-dimensional image region features can be converted into a one-dimensional representation similar to a sentence sequence. The encoding uses the sinusoid functions shown in formula (10):
PE(pos, 2i) = sin( pos / 10000^(2i / d_model) ),  PE(pos, 2i+1) = cos( pos / 10000^(2i / d_model) )   (10)
where pos represents the position of the image region and i indexes the dimensions, each dimension corresponding to a single pixel, so that a two-dimensional image can be represented as a one-dimensional sequence; the position encoding at the decoder end is slightly different.
In the concrete implementation stage, the experiments are carried out on the PyTorch platform, and training and testing of the model are completed on an Ubuntu 16.04 system with an NVIDIA 1060Ti graphics card. The model parameters are set as follows: d is set to 512 dimensions, and an original input image is first extracted into 1024-dimensional feature vectors by the ResNet in Faster R-CNN and then fed to the encoder of the model. In the encoder, the visual similarity, position similarity and semantic similarity coefficients are all set to 64 dimensions. The batch size during training is set to 10; 30 epochs are first trained in the traditional cross-entropy manner, followed by reinforcement-learning training. The loss function of the cross-entropy training mode is shown in formula (11):
L_XE(θ) = - Σ_{t=1..T} log p_θ( y*_t | y*_1, ..., y*_{t-1} )   (11)
the goal of the cross entropy training phase is to minimize the loss function, i.e. to have the probability p in the above formula word as close to 1 as possible, where the meaning of the probability p is the probability of generating the next ground word from the t-1 words generated previously. The reinforcement learning stage is started after the training of 30 epochs in a cross entropy mode, namely, sentence generation is regarded as a reinforcement learning problem by adopting a sampling method, the training aims at maximizing a reward expectation function, and the formula is shown as (12):
Figure BDA0003229728980000113
in the above formula, θ is a parameter of the model, and after reinforcement learning of 30 epochs, the model can be tested with the best round of results.

Claims (1)

1. The image description generation method based on external knowledge and inter-target relationships is characterized by comprising the following steps:
1) Data set classification: dividing the data sets into two main categories, the first being a knowledge data set used to extract external knowledge, and the second being an experimental data set, which is further divided into 3 subsets, namely a training set, a validation set and a test set;
2) External semantic knowledge extraction stage:
2.1) Obtaining the 3000 most frequent categories in the knowledge data set with a statistical algorithm, then counting how often every pair of target categories co-occurs, yielding a 3000 × 3000 category co-occurrence probability matrix;
2.2) Selecting the 200 most frequent attribute categories in the knowledge data set to obtain an attribute matrix for the 3000 target categories, of dimension 3000 × 200, then computing the JS divergence between every two categories as their attribute similarity, yielding a 3000 × 3000 attribute similarity matrix;
2.3) Normalizing the category co-occurrence probability matrix and the attribute similarity matrix row-wise, i.e. so that each row sums to 1;
2.4) Performing category replacement on the experimental data sets, i.e. assigning each category of each experimental data set its number in the knowledge data set, where the category information of the category co-occurrence probability matrix obtained in step 2.1) and of the attribute similarity matrix obtained in step 2.2) is represented by synsets;
3) Target region feature extraction with Faster R-CNN:
3.1) Pre-training on a training data set with the publicly available Faster R-CNN implementation on GitHub, extracting the image features of the training data set with the ResNet-101 of the pre-trained model, discarding the last two fully connected layers of ResNet-101 and taking the image features of its last remaining layers as input to the next step;
3.2) Feeding the image features obtained in step 3.1) to the RPN (region proposal network) to generate candidate boxes of the target regions in the image together with the category information of each candidate box, where each candidate box is classified as background or foreground (the foreground being a target object), and deleting candidate boxes whose overlap rate exceeds 0.7 by non-maximum suppression;
3.3) Uniformly converting the remaining candidate boxes into 14 × 14 × 2048 feature maps through the RoI pooling layer of Faster R-CNN, and then feeding them into an additional CNN layer to predict the category of each region box and refine the target region boxes;
3.4) Using average pooling to generate 2048-dimensional feature vectors as input to the encoder;
4) Encoder processing of the input features:
4.1) First reducing the 2048-dimensional image features to 512 dimensions with a fully connected layer, and passing the 512-dimensional image features through a ReLU activation layer and a dropout layer;
4.2) Converting the input image features into the three vectors Q, K and V through three linear matrices and performing multi-head computation, with the number of heads set to 8;
4.3) For each of the 8 heads, computing the visual similarity w^A_mn with the Q, K vector similarity computation of the conventional Transformer model;
4.4) Transforming the coordinates of every pair of targets in the image to obtain their relative position information λ(m, n), which encodes the positional correlation between the two targets;
4.5) Embedding the λ(m, n) obtained in step 4.4) with a sinusoid function, multiplying the embedded λ(m, n) by a linear transformation layer W_G, and then passing the result through the nonlinear activation function ReLU to obtain the image position similarity w^G_mn of target m and target n;
4.6) Storing each target category label detected by Faster R-CNN, finding the corresponding rows in the category co-occurrence probability matrix and attribute similarity matrix obtained in step 2), and thereby obtaining the semantic similarity w^R_mn of every pair of categories in the image;
4.7) Fusing w^A_mn and w^G_mn into w^AG_mn through a softmax operation, then introducing an attention coefficient a so that w^R_mn receives attention weight a and w^AG_mn receives (1 - a); a suitable value of a is learned in the subsequent training process, yielding the pairwise similarity w_mn that fuses the visual information, the position information and the external knowledge;
4.8) Multiplying the similarity matrix w_mn computed from the Q, K vectors by the V vectors to obtain weighted region features that incorporate the inter-target relationships of the image, and then concatenating the V vectors of the 8 heads;
4.9) Passing the V vector obtained in step 4.8), after a residual connection and normalization, through a feedforward network composed of two fully connected layers, then passing the output of the feedforward network, after another residual connection and normalization, to the next encoder layer; after 6 encoder layers in total, the output is passed to the decoder;
5) Decoder processing of the encoder output:
5.1) First position-encoding the word information of the ground-truth description sentence corresponding to each training picture in the training data set;
5.2) Passing the position-encoded word vectors of step 5.1) through Masked Multi-Head Attention to obtain weighted sentence word feature vectors, which serve as the V vectors of the multi-head self-attention in the next step of the first layer;
5.3) Converting the output of the last encoder layer into Q and K vectors through two linear transformation layers, then performing multi-head self-attention with the V vectors obtained in step 5.2) to obtain V vectors fused with the similarity information;
5.4) Passing the V vectors obtained in step 5.3), after a residual connection and normalization, to a feedforward network, and taking the output, after another residual connection and normalization, as the input of the next decoder layer;
5.5) The second decoder layer does not start with the Masked Multi-Head Attention of the first layer but directly performs the multi-head self-attention computation, with the Q, K and V vectors all obtained from the output of the previous decoder layer through three linear matrix transformations;
5.6) After 6 decoder layers in total, passing the output vector through a linear layer and a softmax layer to obtain the probability vector of the next word;
6) Test image description sentence stage:
6.1) Inputting a test-set image, extracting its target-region features with the trained model and computing the similarities;
6.2) Taking the image features weighted by the similarity coefficients as input to the encoder-decoder framework and outputting, step by step, the word probabilities of each decoded image description sentence;
6.3) Using beam search with a beam size of 2, finally obtaining the evaluation metric score of each output sentence and taking the highest score as the test result.
CN202110982666.1A 2021-08-25 2021-08-25 Image description generation method based on relationship between external knowledge and target Active CN113609326B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110982666.1A CN113609326B (en) 2021-08-25 2021-08-25 Image description generation method based on relationship between external knowledge and target

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110982666.1A CN113609326B (en) 2021-08-25 2021-08-25 Image description generation method based on relationship between external knowledge and target

Publications (2)

Publication Number Publication Date
CN113609326A CN113609326A (en) 2021-11-05
CN113609326B true CN113609326B (en) 2023-04-28

Family

ID=78341994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110982666.1A Active CN113609326B (en) 2021-08-25 2021-08-25 Image description generation method based on relationship between external knowledge and target

Country Status (1)

Country Link
CN (1) CN113609326B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114417046B (en) * 2022-03-31 2022-07-12 腾讯科技(深圳)有限公司 Training method of feature extraction model, image retrieval method, device and equipment
CN116012685B (en) * 2022-12-20 2023-06-16 中国科学院空天信息创新研究院 Image description generation method based on fusion of relation sequence and visual sequence

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160467A (en) * 2019-05-31 2020-05-15 北京理工大学 Image description method based on conditional random field and internal semantic attention
CN112784848A (en) * 2021-02-04 2021-05-11 东北大学 Image description generation method based on multiple attention mechanisms and external knowledge
CN113220891A (en) * 2021-06-15 2021-08-06 北京邮电大学 Unsupervised concept-to-sentence based generation confrontation network image description algorithm
CN113298151A (en) * 2021-05-26 2021-08-24 中国电子科技集团公司第五十四研究所 Remote sensing image semantic description method based on multi-level feature fusion

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10678845B2 (en) * 2018-04-02 2020-06-09 International Business Machines Corporation Juxtaposing contextually similar cross-generation images


Also Published As

Publication number Publication date
CN113609326A (en) 2021-11-05


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant