CN113609326A - Image description generation method based on external knowledge and target relation - Google Patents

Image description generation method based on external knowledge and target relation

Info

Publication number
CN113609326A
CN113609326A
Authority
CN
China
Prior art keywords
image
layer
similarity
data set
vectors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110982666.1A
Other languages
Chinese (zh)
Other versions
CN113609326B (en)
Inventor
李志欣
陈天宇
张灿龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Normal University
Original Assignee
Guangxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Normal University filed Critical Guangxi Normal University
Priority to CN202110982666.1A priority Critical patent/CN113609326B/en
Publication of CN113609326A publication Critical patent/CN113609326A/en
Application granted granted Critical
Publication of CN113609326B publication Critical patent/CN113609326B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image description generation method based on external knowledge and the relationships between targets, which comprises the following stages: 1) classifying the data sets; 2) extracting external semantic knowledge; 3) extracting target-region features with Faster R-CNN; 4) processing the input features in the encoder; 5) processing the encoder output in the decoder; 6) generating description sentences for test images. The method integrates the visual, image-position, and semantic relationships among different targets in an image and, through these inter-target relationships and human common-sense knowledge, mines higher-level and more abstract image features, so that more vivid and accurate image description sentences are generated; computing the relationships from multiple angles makes the mining of inter-target relationships more sufficient and reasonable.

Description

Image description generation method based on external knowledge and target relation
Technical Field
The invention relates to the technical field of image description generation, in particular to an image description generation method based on external knowledge and the relation between targets.
Background
With the popularization of networks and digital devices, image data from various media are growing rapidly, and automatic image description generation has very broad application prospects, such as early childhood education and visual assistance for the blind. The task spans the two fields of computer vision and natural language processing and therefore has very important research significance.
Image description generation has been a very active research area since the 1960s; the early, widely used technologies mainly included search-based methods and template-based methods. In a search-based method, given an image, a similar image and its description sentence are retrieved from an existing image library; the drawback is obvious, namely poor robustness to new images. A template-based method splits the sentence into templates such as <subject, predicate, object> and fills in the sentence content according to the image content; its main drawback is that the generated sentences are rigid and inflexible.
Later, with the rapid development of deep learning, many innovative methods applying deep learning to image description generation appeared. Inspired by natural language processing, Fei-Fei Li et al. proposed the NIC model in 2015, which applies the encoder-decoder model from the NLP field to image description generation: the model encodes the image information into a fixed-length vector and passes it to a decoder that generates words one by one. However, not every word requires the information of the whole image. To better refine the image features used for sentence generation, the attention mechanism was later fused into the NIC model so that the model can focus on the most useful part of the image when generating different words; this not only greatly improved performance but also set off a wave of work fusing attention mechanisms into image description. Many later works are improvements on this basis, such as the semantic attention mechanism, which uses the target semantic information in the image as the object of attention allocation. To overcome the problem that earlier attention mechanisms mostly allocate attention weights over uniform image blocks, a combined bottom-up and top-down attention mechanism was later proposed. Its core innovation is to use a target detection module such as Faster R-CNN to extract target-region features: the bottom-up mechanism allocates attention over target features, the top-down mechanism allocates attention over uniform blocks of the whole image, and the decoder uses a two-layer long short-term memory unit to combine the hidden states produced by the two mechanisms. This work is considered a further milestone in image description generation.
Later, as the Transformer model became popular in the NLP field, many Transformer-based image description methods were developed and showed better performance than most conventional methods. Compared with the Transformer model used for natural language processing, these methods improve the input position coding and the encoder part of the attention-based model to better suit image input.
However, current methods cannot integrate the abstract, high-level feature of the relationships between image targets into the attention mechanism. According to human common sense, the relationships between the targets in an image also contain important information; for example, when an image contains a football, people are likely to appear around it. How to exploit such semantic information, which has important guiding significance for sentence generation, is well worth studying.
Disclosure of Invention
Aiming at the defect that conventional image description generation methods cannot effectively utilize the semantic relationships between image targets, the invention provides an image description generation method based on external knowledge and the relationships between targets. The method integrates the visual, image-position, and semantic relationships among different targets in the image and, through these inter-target relationships and human common-sense knowledge, mines higher-level and more abstract image features, so that more vivid and accurate image description sentences are generated; computing the relationships from multiple angles makes the mining of inter-target relationships more sufficient and reasonable.
The technical scheme for realizing the purpose of the invention is as follows:
the image description generation method based on the external knowledge and the relationship between the targets comprises the following steps:
1) classifying the data set: dividing the data set into two main classes, wherein the first main class is a knowledge data set for extracting external knowledge; the second large class is an experimental data set and is divided into 3 subclasses, namely a training data set, a verification data set and a test data set;
2) external semantic knowledge extraction stage:
2.1) obtaining the 3000 most frequent categories in the knowledge data set through a statistical algorithm, and then counting how often every two target categories co-occur to obtain a 3000 × 3000 category co-occurrence probability matrix;
2.2) selecting the 200 most frequent attribute categories in the knowledge data set to obtain an attribute matrix for the 3000 target categories with dimensionality 3000 × 200, and then calculating the JS divergence between every two categories as their attribute similarity to obtain a 3000 × 3000 attribute similarity matrix;
2.3) normalizing the category co-occurrence probability matrix and the attribute similarity matrix row-wise, i.e. so that each row sums to 1;
2.4) performing category replacement on the experimental data sets, i.e. each category in an experimental data set is given an index in the knowledge data set, where the category information of the co-occurrence probability matrix obtained in step 2.1) and of the attribute similarity matrix obtained in step 2.2) is represented by synsets;
3) extracting target-region features with Faster R-CNN:
3.1) pre-training Faster R-CNN (from GitHub) on the training split of the experimental data set, extracting the image features of the training data set with the ResNet-101 in the pre-trained model, discarding the last two fully connected layers of ResNet-101, and taking the image features from the penultimate layer of ResNet-101 as the input of the next step;
3.2) feeding the features obtained in step 3.1) through an RPN to generate candidate boxes for the target regions in the image and the category of each candidate box, where the category is binary, background or foreground (the foreground being a target object), and deleting candidate boxes whose overlap rate exceeds 0.7 by non-maximum suppression;
3.3) uniformly converting the remaining candidate boxes into 14 × 2048 vectors through the RoI pooling layer in Faster R-CNN, and then feeding them into an additional CNN layer to predict the category of each region box and a refined target region box;
3.4) generating 2048-dimensional feature vectors using average pooling as input to the encoder;
4) the encoder processes the input features:
4.1) reducing the 2048-dimensional image features to 512 dimensions through a fully connected layer, then passing the 512-dimensional image features through a ReLU activation layer and a dropout layer;
4.2) converting the input image features into the three vectors Q, K, and V through three linear matrices and performing the multi-head calculation with the number of heads set to 8, which is used in the next step to calculate the similarity between the features of the target regions;
4.3) for each of the 8 heads, calculating the visual similarity ω^A_mn with the Q-K vector similarity calculation of the traditional Transformer model;
4.4) transforming the coordinates of every pair of target boxes in the image to obtain the relative position information λ(m, n) of the two targets, where λ(m, n) encodes the positional correlation between them;
4.5) embedding the λ(m, n) obtained in step 4.4) with a sinusoid function, multiplying it by the linear transformation layer W_G, and then applying the nonlinear activation function ReLU to obtain the image position similarity ω^G_mn of target m and target n;
4.6) labeling and storing every target category detected by Faster R-CNN, finding the corresponding rows in the category co-occurrence probability matrix and the attribute similarity matrix obtained in step 2), and obtaining the semantic similarity ω^E_mn of every two categories in the image;
4.7) fusing ω^A_mn and ω^G_mn into ω^AG_mn through a softmax operation, then adding an attention coefficient a to ω^E_mn so that ω^AG_mn receives attention (1 - a); a suitable value of a is obtained through the subsequent training process, giving the pairwise category similarity w_mn that fuses visual information, position information, and external knowledge;
4.8) multiplying the similarity matrix w_mn calculated with the Q and K vectors by the V vectors to obtain weighted region features that fuse the target relationships in the image, and then concatenating the 8 V vectors;
4.9) after a residual connection and layer normalization, feeding the V vector obtained in step 4.8) into a feedforward neural network formed by two fully connected layers; after another residual connection and layer normalization, the output of the feedforward neural network is fed into the next encoder layer, and after the operations of 6 encoder layers in total the output is transmitted to the decoder;
5) the decoder processes the output from the encoder:
5.1) applying positional encoding to the word information of the ground-truth description sentence corresponding to each training picture in the training data set;
5.2) passing the position-encoded word vectors obtained in step 5.1) through Masked Multi-Head Attention to obtain weighted sentence word feature vectors, which serve as the V vectors of the multi-head self-attention in the next step of the first layer;
5.3) converting the output of the last layer of the encoder into Q and K vectors through two linear conversion layers, and then carrying out multi-head self-attention operation on the Q and K vectors and the V vectors obtained in the step 5.2) to obtain the V vectors after the similarity information is fused;
5.4) after a residual connection and layer normalization, passing the V vector obtained in step 5.3) to a feedforward neural network, and after another residual connection and layer normalization taking its output as the input of the next decoder layer;
5.5) unlike the first layer, the second decoder layer has no Masked Multi-Head Attention operation but directly performs the multi-head self-attention calculation, with the Q, K, and V vectors all obtained by transforming the output of the previous decoder layer with three linear matrices;
5.6) after the operations of 6 decoder layers in total, the output vector passes through a linear layer and a softmax layer to obtain the probability vector of the next word;
6) generating description sentences for test images:
6.1) inputting a test set image, extracting image target area characteristics from the trained model and calculating similarity;
6.2) taking the image characteristics weighted by the similarity coefficient as the input of an encoder-decoder framework, and gradually outputting the description sentence word probability of each decoded image;
6.3) decoding with beam search with the beam size set to 2, finally obtaining the evaluation-metric score of each output sentence and taking the highest-scoring sentence as the test result.
For each data set, the number of training images is far greater than the number of test images. After each training epoch, the validation-set images are used for an interim evaluation, the results are recorded, and a checkpoint is saved so that, if training is interrupted, it can resume next time from where it stopped. When training is finished, the model selected is not simply the one with the most training rounds but the intermediate model with the best validation effect, as sketched below.
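A minimal PyTorch-style sketch of this checkpoint-and-validation loop is shown below; model, optimizer, train_one_epoch, and evaluate are generic placeholders rather than the exact components of the embodiment, and the checkpoint file names are illustrative.

import torch

def train_with_checkpoints(model, optimizer, train_one_epoch, evaluate,
                           num_epochs, ckpt_path="checkpoint.pth",
                           best_path="best_model.pth"):
    """Hypothetical training loop: save a resumable checkpoint every epoch
    and keep the intermediate model with the best validation score."""
    best_score = float("-inf")
    for epoch in range(num_epochs):
        train_one_epoch(model, optimizer)          # one pass over the training set
        score = evaluate(model)                    # interim check on the validation set
        # Save everything needed to resume if training is interrupted.
        torch.save({"epoch": epoch,
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "val_score": score}, ckpt_path)
        # Keep the best intermediate model rather than simply the last epoch.
        if score > best_score:
            best_score = score
            torch.save(model.state_dict(), best_path)
    return best_score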
Compared with the prior art, the technical scheme has the following characteristics:
(1) an encoder-decoder structure based on the Transformer framework innovatively introduces the Transformer approach into the field of image description generation, and the results are found to be much better than those of conventional methods;
(2) external knowledge based on human common sense is introduced; according to human common sense, many objects in an image appear in pairs, such as 'person' and 'football', and under the guidance of this external knowledge the accuracy of the description-sentence words can be greatly improved and the sentences become more human-like;
(3) unlike the conventional Q-K vector similarity calculation of the Transformer architecture, the method not only computes the visual relationships of different regions but also integrates the positional and semantic relationships.
The method integrates the visual, image-position, and semantic relationships among different targets in the image and, through these inter-target relationships and human common-sense knowledge, mines higher-level and more abstract image features, so that more vivid and accurate image description sentences are generated; computing the relationships from multiple angles makes the mining of inter-target relationships more sufficient and reasonable.
Drawings
FIG. 1 is a schematic diagram of the overall framework of the embodiment;
fig. 2 is a schematic diagram of a self-attention computing process.
Detailed Description
The invention will be further elucidated with reference to the drawings and examples, without however being limited thereto.
Referring to fig. 1, an image description generation method based on external knowledge and a relationship between objects includes the steps of:
1) classifying the data sets: the data sets are divided into two major classes; the first is the knowledge data set, Visual Genome, used for extracting external knowledge, and the second is the experimental data set (such as MSCOCO, Flickr8K and the like), which is divided into 3 subsets, namely a training data set, a validation data set, and a test data set. In this embodiment, the Karpathy split of MSCOCO2014 is adopted for training, validation, and testing: the training set contains images and their corresponding description sentences and is used to train the parameters of the model; the validation set is used after each training epoch to verify the effect of that round so as to keep the best result over all rounds; and the test set is used to obtain the performance of the final model;
2) external semantic knowledge extraction stage:
2.1) obtaining the 3000 most frequent categories in the Visual Genome data set through a statistical algorithm (the number of categories in other data sets is generally within 3000), then counting how often every two target categories co-occur using the relationship branch of the Visual Genome data set, giving the 3000 × 3000 category co-occurrence probability matrix W_cls;
2.2) selecting the 200 most frequent attribute categories in the Visual Genome data set to obtain an attribute matrix for the 3000 target categories with dimensionality 3000 × 200, then calculating the JS divergence between every two categories as their attribute similarity to obtain the 3000 × 3000 attribute similarity matrix W_att;
2.3) normalizing the category co-occurrence probability matrix and the attribute similarity matrix row-wise so that each row sums to 1; specifically, the values of each row are summed to form a denominator, and every value in the row is divided by this denominator to obtain the normalized result;
2.4) performing category replacement on the target data set, taking the MSCOCO data set as the example here: each category in MSCOCO is given an index in Visual Genome. MSCOCO has 81 categories including the background, and category names with similar meanings are substituted for one another, for example 'ball' in Visual Genome is replaced by 'sports_ball' in MSCOCO; the category information of the co-occurrence probability matrix obtained in step 2.1) and of the attribute similarity matrix obtained in step 2.2) is represented by synsets;
3) the step of extracting the characteristic of the target area by fast R-CNN:
3.1) pre-training Faster R-CNN (from GitHub) on the MSCOCO data set, extracting the image features of the training data set with the ResNet-101 in the pre-trained model, discarding the last two fully connected layers of ResNet-101, and taking the image features of the remaining last layer of ResNet-101 as the input of the next step;
3.2) feeding the features obtained in step 3.1) through an RPN to generate candidate boxes for the target regions in the image and the category of each candidate box, where the category is binary, background or foreground (the foreground being a target object), and deleting candidate boxes whose overlap rate exceeds 0.7 by non-maximum suppression;
3.3) uniformly converting the remaining candidate boxes into 14 × 2048 vectors through the RoI pooling layer in Faster R-CNN, and then feeding them into an additional CNN layer to predict the category of each region box and a refined target region box;
3.4) generating 2048-dimensional feature vectors using average pooling as input to the encoder;
4) the encoder processes the input features:
4.1) reducing the 2048-dimensional image features to 512 dimensions through a fully connected layer, then passing the 512-dimensional image features through a ReLU activation layer and a dropout layer;
4.2) converting the input image features into the three vectors Q, K, and V through three linear matrices and performing the multi-head calculation with the number of heads set to 8, which is used in the next step to calculate the similarity between the features of the target regions;
4.3) for each of the 8 heads, calculating the similarity of the Q and K vectors of all target-region features of the image; the visual similarity is calculated first, i.e. the region feature vectors are dot-multiplied pairwise and then normalized, giving the visual similarity matrix ω^A_mn;
4.4) transforming the coordinates of every pair of target boxes in the image to obtain the relative position information λ(m, n) of the two targets, where λ(m, n) encodes the positional correlation between them;
4.5) embedding the λ(m, n) obtained in step 4.4) with a sinusoid function, multiplying it by the linear transformation layer W_G, and then applying the nonlinear activation function ReLU to obtain the image position similarity ω^G_mn of target m and target n;
4.6) labeling and storing every target category detected by Faster R-CNN, finding the corresponding rows in the category co-occurrence probability matrix and the attribute similarity matrix obtained in step 2), and obtaining the semantic similarity ω^E_mn of every two categories in the image;
4.7) fusing ω^A_mn and ω^G_mn into ω^AG_mn through a softmax operation, then adding an attention coefficient a to ω^E_mn so that ω^AG_mn receives attention (1 - a); a suitable value of a is obtained through the subsequent training process, giving the pairwise category similarity w_mn that fuses visual information, position information, and external knowledge;
4.8) multiplying the similarity matrix w_mn calculated with the Q and K vectors by the V vectors to obtain weighted region features that fuse the target relationships in the image, and then concatenating the 8 V vectors;
4.9) after a residual connection and layer normalization, feeding the V vector obtained in step 4.8) into a feedforward neural network formed by two fully connected layers; after another residual connection and layer normalization, the output of the feedforward neural network is fed into the next encoder layer, and after the operations of 6 encoder layers in total the output is transmitted to the decoder;
5) the decoder processes the output from the encoder:
5.1) applying positional encoding to the word information of the ground-truth description sentence corresponding to each training picture in the training data set;
5.2) passing the position-encoded word vectors obtained in step 5.1) through Masked Multi-Head Attention to obtain weighted sentence word feature vectors, which serve as the V vectors of the multi-head self-attention in the next step of the first layer;
5.3) converting the output of the last layer of the encoder into Q and K vectors through two linear conversion layers, and then carrying out multi-head self-attention operation on the Q and K vectors and the V vectors obtained in the step 5.2) to obtain the V vectors after the similarity information is fused;
5.4) after a residual connection and layer normalization, passing the V vector obtained in step 5.3) to a feedforward neural network, and after another residual connection and layer normalization taking its output as the input of the next decoder layer;
5.5) unlike the first layer, the second decoder layer has no Masked Multi-Head Attention operation but directly performs the multi-head self-attention calculation, with the Q, K, and V vectors all obtained by transforming the output of the previous decoder layer with three linear matrices;
5.6) after the operations of 6 decoder layers in total, the output vector passes through a linear layer and a softmax layer to obtain the probability vector of the next word;
6) generating description sentences for test images:
6.1) inputting a test set image, extracting image target area characteristics from the trained model and calculating similarity;
6.2) taking the image characteristics weighted by the similarity coefficient as the input of an encoder-decoder framework, and gradually outputting the description sentence word probability of each decoded image;
6.3) decoding with beam search with the beam size set to 2, finally obtaining the evaluation-metric score of each output sentence and taking the highest-scoring sentence as the test result (a minimal beam-search sketch is given after these steps).
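A minimal sketch of the beam-search decoding of step 6.3) (beam size 2) is shown below; step_logprobs, the start and end token ids, and the maximum length are illustrative placeholders, not the exact interface of the embodiment.

import torch

def beam_search(step_logprobs, bos_id, eos_id, beam_size=2, max_len=20):
    """Hypothetical beam-search decoder: step_logprobs(seq) returns a 1-D
    tensor of log-probabilities for the next word given the partial sequence."""
    beams = [([bos_id], 0.0)]                       # (partial sentence, accumulated log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:                   # finished sentences are set aside
                finished.append((seq, score))
                continue
            logp = step_logprobs(seq)               # log-probabilities over the vocabulary
            top_logp, top_idx = torch.topk(logp, beam_size)
            for lp, idx in zip(top_logp.tolist(), top_idx.tolist()):
                candidates.append((seq + [idx], score + lp))
        if not candidates:
            break
        # keep only the beam_size best partial sentences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    finished.extend(beams)
    # return the highest-scoring sentence, as in step 6.3)
    return max(finished, key=lambda c: c[1])[0]

# toy usage: a dummy predictor over a 5-word vocabulary with eos id 2
dummy = lambda seq: torch.log_softmax(torch.tensor([0.1, 0.2, 0.3, 2.0, 0.0]), dim=0)
sentence = beam_search(dummy, bos_id=0, eos_id=2, beam_size=2, max_len=5)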
The external-knowledge acquisition stage of this example relies mainly on the Visual Genome data set; human common sense can be regarded as a linguistic form of explicit knowledge. The most representative explicit forms of knowledge are attribute knowledge, such as 'an apple is red', and pairwise relationship knowledge, such as 'riding a bicycle'. The first step in this example is to write a python script to download the attribute branch and the target-relationship branch from the official Visual Genome data set branches; the attribute annotations usually take a form such as 'a red round apple'. A statistical script extracts the 200 most frequent attributes, giving a 3000 × 200 category-attribute matrix, and the similarity between every two 1 × 200 attribute vectors is then computed with the pairwise JS divergence. Unlike the KL divergence adopted by most conventional methods, which does not yield a symmetric matrix, the JS divergence gives a symmetric 3000 × 3000 measurement matrix of attribute similarity that represents very high-order abstract semantic information. At the same time, according to this attribute similarity matrix, objects of the same kind are certainly more similar to each other than to other objects, so embedding this similarity information into the model can improve its accuracy; the attribute similarity matrix obtained in this step is the W_att defined above.
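A small numpy sketch of this attribute-similarity computation, following steps 2.2) and 2.3), is given below; it assumes that a class-by-attribute count matrix has already been built from the Visual Genome attribute annotations, and all variable names are illustrative.

import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions (symmetric)."""
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def attribute_similarity_matrix(attr_counts):
    """attr_counts: (num_classes, num_attrs) attribute occurrence counts per class.
    Following steps 2.2)-2.3): pairwise JS divergence as the attribute similarity,
    then row-normalization so that each row sums to 1."""
    n = attr_counts.shape[0]
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            sim[i, j] = sim[j, i] = js_divergence(attr_counts[i], attr_counts[j])
    return sim / (sim.sum(axis=1, keepdims=True) + 1e-12)

# toy usage: 4 classes described by 5 attributes
W_att = attribute_similarity_matrix(np.random.rand(4, 5))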
In addition to the attribute similarity information, there is also more direct semantic similarity information, namely the category co-occurrence probability matrix, which directly represents the probability that two objects appear together in an image; for example, according to human common sense, a person and a vehicle are more likely to appear in the same image than a horse and a vehicle. In this example, the pairwise co-occurrence probabilities of the 3000 most frequent classes are counted from the Visual Genome data set, and this information can be merged in when the encoder computes the Q-K similarity of different categories. In this embodiment, the attribute similarity and the category co-occurrence probability similarity are averaged and then multiplied by an attention coefficient a; the model learns the optimal value of a during training, so that the proportions of the visual, position, and semantic similarities between different targets can be allocated reasonably. The co-occurrence probability similarity matrix obtained in this step is the W_cls defined above.
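A corresponding sketch for the category co-occurrence probability matrix of steps 2.1) and 2.3) is given below; it assumes that per-image lists of object class indices have already been extracted from the Visual Genome annotations, and the counting scheme shown is one reasonable reading of the pairwise co-occurrence statistic, not necessarily the patent's exact procedure.

import numpy as np

def cooccurrence_matrix(image_class_lists, num_classes=3000):
    """image_class_lists: for each image, the list of annotated class indices.
    Returns the row-normalized class co-occurrence probability matrix W_cls,
    as in steps 2.1) and 2.3)."""
    counts = np.zeros((num_classes, num_classes))
    for classes in image_class_lists:
        unique = sorted(set(classes))
        for a in unique:                      # count every ordered pair appearing together
            for b in unique:
                if a != b:
                    counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(row_sums, 1)   # each row sums to 1 (or stays 0 if unseen)

# toy usage: three "images" over a 6-class vocabulary
W_cls = cooccurrence_matrix([[0, 2, 3], [0, 2], [1, 4, 5]], num_classes=6)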
After the semantic information representing human common sense is obtained, this example replaces the categories of the different data sets. Because the attribute similarity matrix and the category co-occurrence probability matrix are both indexed by the synsets of the category labels of the Visual Genome data set, index replacement is performed for the different data sets, and categories with similar meanings in other data sets are assigned the synset of the same category, so that the external knowledge of this example can be applied in a targeted way to data sets of different domains.
When the encoder processes the input features from Faster R-CNN, the key first step in this example is to convert the input 512-dimensional features X into the three vectors Q, K, and V with the three transformation matrices W_Q, W_K, W_V, as shown in formula (1):
Q = W_Q X,  K = W_K X,  V = W_V X   (1)
The input X is a 512-dimensional vector and Q, K, V are 64-dimensional vectors, so the three transformation matrices have dimensions 512 × 64. Then, in the stage of calculating the Q-K similarity of the features of each region, the conventional method is to dot-multiply the two region feature vectors and scale by the dimension d_k = 64, as shown in formula (2):

ω^A_mn = (W_Q x_m) · (W_K x_n) / √d_k   (2),
where Ω_A is an N × N matrix whose elements ω^A_mn are the relation coefficients between target m and target n; for example, if there are 50 targets in an image, N is 50. The key of the next problem is therefore to calculate the relation coefficients between different categories. The visual similarity of two categories is calculated with formula (2); next comes the calculation of the position similarity. The position information obtained by Faster R-CNN is (x, y, w, h), where the four values represent the center coordinates (x, y), the width w, and the height h of the target box. The position coordinates of each pair of boxes m and n are then transformed according to formula (3):

λ(m, n) = ( log(|x_m - x_n| / w_m),  log(|y_m - y_n| / h_m),  log(w_n / w_m),  log(h_n / h_m) )   (3)

λ(m, n) is then embedded into 64 dimensions, multiplied by a linear matrix W_G, and finally activated by the nonlinear function ReLU, as shown in formula (4), giving the image position similarity of target m and target n:

ω^G_mn = ReLU( Emb(λ(m, n)) W_G )   (4)
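The following sketch illustrates the position-similarity computation of formulas (3) and (4); the 4-dimensional form of λ(m, n) and the sinusoid embedding scheme are assumptions reconstructed from the surrounding description (the relative-geometry feature commonly used with this kind of relation module), and W_G is modeled as a learned 64 × 1 linear layer.

import torch
import torch.nn.functional as F

def relative_geometry(box_m, box_n, eps=1e-6):
    """Boxes given as (x, y, w, h) with (x, y) the centre, as described above.
    Returns the 4-d relative position feature lambda(m, n) of formula (3)."""
    xm, ym, wm, hm = box_m
    xn, yn, wn, hn = box_n
    return torch.log(torch.tensor([abs(xm - xn) / wm + eps,
                                   abs(ym - yn) / hm + eps,
                                   wn / wm, hn / hm]))

def sinusoid_embed(feat, dim=64, wave=1000.0):
    """Embed each scalar of the 4-d geometry feature with sines and cosines
    and concatenate, giving a dim-dimensional vector (dim divisible by 8)."""
    steps = torch.arange(dim // 8, dtype=torch.float32)
    freqs = wave ** (8.0 * steps / dim)
    angles = feat.unsqueeze(-1) / freqs                               # (4, dim//8)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten()  # (dim,)

# omega^G_mn = ReLU(Emb(lambda(m, n)) W_G), formula (4); W_G is a learned 64 x 1 layer
W_G = torch.nn.Linear(64, 1, bias=False)
lam = relative_geometry((10.0, 12.0, 5.0, 8.0), (14.0, 9.0, 6.0, 7.0))
omega_G = F.relu(W_G(sinusoid_embed(lam)))   # scalar position similarity for the pair (m, n)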
Next comes the semantic similarity calculation stage. Taking the MSCOCO data set as an example, its total number of classes is 81, so the co-occurrence probability matrix W_cls and the attribute similarity matrix W_att extracted from the external knowledge have size 81 × 81. Suppose the object class detected by Faster R-CNN is m and the other object classes in the image are n1 to n30; the rows corresponding to these classes are found in the two external-knowledge matrices, and the co-occurrence probability value ω^cls_mn and the attribute similarity value ω^att_mn corresponding to each of the other classes are read out. They are then averaged as shown in formula (5):

ω^E_mn = (ω^cls_mn + ω^att_mn) / 2   (5)
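A short sketch of this semantic-similarity lookup and the averaging of formula (5) is shown below; it assumes that W_cls and W_att have already been restricted to the categories of the experimental data set, and the variable names are illustrative.

import numpy as np

def semantic_similarity(W_cls, W_att, class_ids):
    """class_ids: category indices of the targets detected in one image.
    Looks up the co-occurrence and attribute rows for every pair and averages
    them as in formula (5), giving the matrix of omega^E_mn for this image."""
    idx = np.asarray(class_ids)
    omega_cls = W_cls[np.ix_(idx, idx)]     # co-occurrence probabilities for detected pairs
    omega_att = W_att[np.ix_(idx, idx)]     # attribute similarities for detected pairs
    return 0.5 * (omega_cls + omega_att)    # omega^E_mn

# toy usage with random 81 x 81 knowledge matrices and 4 detected targets
omega_E = semantic_similarity(np.random.rand(81, 81), np.random.rand(81, 81), [3, 17, 17, 60])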
The multi-level relation information between the targets is then integrated. First, the visual similarity ω^A_mn and the position similarity ω^G_mn are integrated by a softmax operation into ω^AG_mn, with the specific operation shown in formula (6):

ω^AG_mn = ω^G_mn · exp(ω^A_mn) / Σ_k ω^G_mk · exp(ω^A_mk)   (6)

Then an attention coefficient a is added to ω^E_mn so that ω^AG_mn receives attention (1 - a); a suitable value of a is obtained through the subsequent training process, giving the pairwise category similarity w_mn that fuses visual information, position information, and external knowledge, as shown in formula (7):

w_mn = a · ω^E_mn + (1 - a) · ω^AG_mn   (7)
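The sketch below shows one attention head that fuses the three similarities according to formulas (6) and (7) and then weights the V vectors; treating the attention coefficient a as a sigmoid-bounded learnable scalar is an implementation assumption, not something stated in the patent.

import torch
import torch.nn as nn

class RelationFusedAttention(nn.Module):
    """One attention head that fuses visual (omega_A), position (omega_G) and
    semantic (omega_E) similarities as in formulas (6)-(7), then weights V."""
    def __init__(self):
        super().__init__()
        # learnable attention coefficient a, kept in (0, 1) via a sigmoid
        self.a_logit = nn.Parameter(torch.zeros(1))

    def forward(self, omega_A, omega_G, omega_E, V):
        # formula (6): softmax over the visual scores, weighted by the position scores
        omega_AG = omega_G * torch.exp(omega_A)
        omega_AG = omega_AG / (omega_AG.sum(dim=-1, keepdim=True) + 1e-12)
        a = torch.sigmoid(self.a_logit)
        # formula (7): external knowledge gets weight a, visual + position gets (1 - a)
        w = a * omega_E + (1.0 - a) * omega_AG
        return w @ V                         # weighted region features, one head

# toy usage: 5 detected regions, 64-d value vectors
N, d = 5, 64
head = RelationFusedAttention()
out = head(torch.rand(N, N), torch.rand(N, N), torch.rand(N, N), torch.rand(N, d))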
the method is characterized in that the similarity information of three relationships of vision, position and semantics is fused to measure the relationship between different targets, and then the similarity information is multiplied by the V vector of each target.
After each encoder layer obtains its V vectors, the V vectors of the 8 heads are concatenated, and after a residual connection and layer normalization the resulting vector is transmitted to the next encoder layer. The idea of the residual connection follows the classical ResNet: residual connections help parameter propagation and increase the training speed. The number of encoder layers is set to six, and each layer takes the output of the previous layer as input, that is, the V vector output by the previous layer is used as the X vector of the next self-attention operation. The overall multi-head self-attention operation is shown in formula (8):

MultiHead(X) = Concat(head_1, …, head_8),  head_i = Ω^(i) V^(i)   (8)

where Ω^(i) = [w_mn] is the fused similarity matrix of the i-th head and V^(i) is that head's value matrix.
After the V vector output by each layer is obtained, the feedforward neural network layer is computed element-wise, with the specific formula shown in (9):

FFN(x) = max(0, xW_1 + b_1) W_2 + b_2   (9),

where W_1, W_2, b_1, b_2 are the weight parameters and biases of the two fully connected layers. When the computation and propagation through the 6 layers are finished, the output of the last encoder layer is used as the input of the first decoder layer.
For the data-processing stage of the decoder, every word of the real description sentence corresponding to each picture in the training set is encoded into vector form. Step 1) first builds the word dictionary of the data set and encodes each word as a one-hot vector; such a representation is too high-dimensional to handle, so word embedding is required. Step 2) embeds the high-dimensional word vectors into low-dimensional ones: the words are embedded into 512-dimensional vectors with the word2vec method. Because the words in this example are generated one by one, the information of the third word cannot be known in advance when the second word is generated; therefore, unlike the multi-head self-attention of the encoder, Masked Multi-Head Attention is adopted in the first decoder layer, and the word information after each training time step is masked with an upper-triangular matrix, as shown in FIG. 2. Step 3) takes the word vectors after Masked Multi-Head Attention as the V vectors; the output of the last encoder layer is converted into Q and K vectors by linear transformations, the Q-K similarity is computed in the conventional way with formula (2) and multiplied by the V vectors, and after a residual connection and layer normalization the result is passed to the feedforward neural network, whose output becomes the input of the next decoder layer. Step 4) lets the second decoder layer take the output of the first layer, omit the Masked Multi-Head Attention of the first layer, and directly perform multi-head self-attention following formula (2): the output of the previous layer is taken as X, linearly transformed, the similarity coefficients are computed and multiplied by the V vectors, and the V vector formed by concatenating the 8 heads is the input of the next layer. The V vector output by the last layer passes through a linear layer and a softmax layer to output the probability vector of the next word, as shown in FIG. 2.
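The sketch below illustrates the upper-triangular masking used by the Masked Multi-Head Attention of the first decoder layer; the interface is illustrative and shows only the masking of the raw Q-K scores.

import torch

def causal_mask(seq_len):
    """Upper-triangular mask: position t may only attend to words 1..t,
    so information about later words is blocked during training."""
    # True above the diagonal marks the forbidden (future) positions
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

def masked_attention_scores(scores):
    """scores: (seq_len, seq_len) raw Q-K similarities for one head.
    Future positions are set to -inf before the softmax, as in FIG. 2."""
    mask = causal_mask(scores.size(0))
    return torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)

# toy usage: a 4-word ground-truth sentence
attn = masked_attention_scores(torch.rand(4, 4))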
In the testing stage, the test is conducted on the test-set pictures of the classic Karpathy split of MSCOCO and their description sentences. In the training stage the model is trained for 60 epochs to obtain the final model and the best model, which are stored as checkpoints; the test metrics are BLEU-1, BLEU-2, BLEU-3, BLEU-4, METEOR, ROUGE-L, and SPICE. In the testing stage a brand-new image is given, without the guidance of ground-truth description words, and the model generates a description sentence by itself. The parameters obtained by training play the key role here: the trained model is equivalent to an experienced image processor that has learned the relationships within a large number of images, can reasonably model the relationships between image targets and embed them into the classical Transformer framework, and, with the help of residual connections, layer normalization, and the feedforward neural network, obtains a relation representation of all targets in each image, which strongly guides the decoder when it decodes the sentence word vectors.
In addition, a positional encoding operation is required in the first layer at both the encoder end and the decoder end. For the encoder, image regions carrying two-dimensional position information are encoded; because an image has the two dimensions of height and width, the encoding differs from the conventional Transformer positional encoding. After encoding, the two-dimensional image-region features can be transformed into a one-dimensional representation similar to a sentence sequence. A sinusoid function is used for the encoding, as shown in formula (10):

PE(pos, 2i) = sin(pos / 10000^(2i/d)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d))   (10)

In formula (10), pos represents the position of the image region and i represents the dimension, with each dimension corresponding to a sinusoid, so that the two-dimensional image can be represented in one-dimensional sequence form. The positional encoding at the decoder end is slightly different: the decoder input is already a one-dimensional word sequence, so pos at the decoder end is the actual position of the word in the whole sentence.
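A short sketch of the sinusoid positional encoding of formula (10) is given below; flattening the image regions row by row into a one-dimensional sequence is one illustrative choice for the encoder side.

import torch

def sinusoid_position_encoding(num_positions, d_model=512):
    """Formula (10): PE(pos, 2i) = sin(pos / 10000^(2i/d)),
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d))."""
    pos = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)   # (P, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)                  # even dimensions
    angle = pos / torch.pow(torch.tensor(10000.0), i / d_model)           # (P, d/2)
    pe = torch.zeros(num_positions, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

# encoder side: image regions flattened (e.g. row by row) into a 1-D sequence of length P;
# decoder side: pos is simply the word's position in the sentence
pe = sinusoid_position_encoding(num_positions=50)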
In the concrete implementation stage, the experiments are carried out on the PyTorch platform; training and testing of the model are completed on an Ubuntu 16.04 system with an NVIDIA 1060 Ti graphics card. The model parameters are set as follows: d is set to 512 dimensions, and each original input image is extracted into 1024-dimensional feature vectors by the ResNet in Faster R-CNN before being fed to the encoder of the model. In the encoder, the visual similarity, position similarity, and semantic similarity coefficients are all set to 64 dimensions. The batch size during training is set to 10; 30 epochs are trained with the traditional cross-entropy objective, followed by reinforcement learning training. The loss function in the cross-entropy training mode is shown in formula (11):
L_XE(θ) = - Σ_{t=1..T} log p_θ(y*_t | y*_1, …, y*_{t-1})   (11)
the goal of the cross-entropy training phase is to minimize the penalty function, i.e., to make the probability p in the above formula as close to 1 as possible, the meaning of the probability p in the formula being the probability of generating the next group struc word from the t-1 words previously generated. After 30 epochs are trained in a cross entropy mode, a reinforcement learning phase is started, namely, a sentence generation is regarded as a reinforcement learning problem by adopting a sampling method, the training aim is to maximize an incentive expectation function, and a formula is shown as (12):
R(θ) = E_{y_1:T ~ p_θ}[ r(y_1:T) ]   (12)
where θ denotes the parameters of the model. After 30 epochs of reinforcement learning, the model can be tested and achieves its best results.
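The sketch below shows the two training objectives in simplified form: the cross-entropy loss of formula (11) and a REINFORCE-style estimate of the expected reward of formula (12); the reward values and the greedy baseline are placeholders, and using a self-critical baseline is a common choice assumed here rather than taken from the patent.

import torch
import torch.nn.functional as F

def xe_loss(logits, targets, pad_id=0):
    """Formula (11): negative log-likelihood of each ground-truth word given
    the preceding words. logits: (T, vocab), targets: (T,)."""
    return F.cross_entropy(logits, targets, ignore_index=pad_id)

def rl_loss(sample_logprobs, sample_reward, baseline_reward):
    """Formula (12) in REINFORCE form: maximize the expected reward of sampled
    sentences. sample_logprobs: (T,) log-probs of the sampled words; the baseline
    (e.g. the reward of a greedily decoded sentence) reduces variance, a common
    self-critical choice assumed here rather than taken from the patent."""
    advantage = sample_reward - baseline_reward
    return -(advantage * sample_logprobs.sum())

# toy usage
T, V = 6, 100
loss_xe = xe_loss(torch.randn(T, V), torch.randint(1, V, (T,)))
sampled = torch.randint(0, V, (T,))
logprobs = torch.log_softmax(torch.randn(T, V), dim=-1)[torch.arange(T), sampled]
loss_rl = rl_loss(logprobs, sample_reward=0.8, baseline_reward=0.6)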

Claims (1)

1. An image description generation method based on external knowledge and the relationships between targets, characterized by comprising the following steps:
1) classifying the data set: dividing the data set into two main classes, wherein the first main class is a knowledge data set for extracting external knowledge; the second large class is an experimental data set and is divided into 3 subclasses, namely a training data set, a verification data set and a test data set;
2) external semantic knowledge extraction stage:
2.1) obtaining the 3000 most frequent categories in the knowledge data set through a statistical algorithm, and then counting how often every two target categories co-occur to obtain a 3000 × 3000 category co-occurrence probability matrix;
2.2) selecting the 200 most frequent attribute categories in the knowledge data set to obtain an attribute matrix for the 3000 target categories with dimensionality 3000 × 200, and then calculating the JS divergence between every two categories as their attribute similarity to obtain a 3000 × 3000 attribute similarity matrix;
2.3) normalizing the category co-occurrence probability matrix and the attribute similarity matrix row-wise, i.e. so that each row sums to 1;
2.4) performing category replacement on the experimental data sets, i.e. each category in an experimental data set is given an index in the knowledge data set, where the category information of the co-occurrence probability matrix obtained in step 2.1) and of the attribute similarity matrix obtained in step 2.2) is represented by synsets;
3) extracting target-region features with Faster R-CNN:
3.1) pre-training Faster R-CNN (from GitHub) on the training data set, extracting the image features of the training data set with the ResNet-101 in the pre-trained model, discarding the last two fully connected layers of ResNet-101, and taking the image features from the penultimate layer of ResNet-101 as the input of the next step;
3.2) feeding the image features obtained in step 3.1) through an RPN to generate candidate boxes for the target regions in the image and the category of each candidate box, where the category is binary, background or foreground (the foreground being a target object), and deleting candidate boxes whose overlap rate exceeds 0.7 by non-maximum suppression;
3.3) uniformly converting the remaining candidate boxes into 14 × 2048 vectors through the RoI pooling layer in Faster R-CNN, and then feeding them into an additional CNN layer to predict the category of each region box and a refined target region box;
3.4) generating 2048-dimensional feature vectors using average pooling as input to the encoder;
4) the encoder processes the input features:
4.1) reducing the 2048-dimensional image features to 512 dimensions through a fully connected layer, then passing the 512-dimensional image features through a ReLU activation layer and a dropout layer;
4.2) converting the input image features into the three vectors Q, K, and V through three linear matrices and performing the multi-head calculation with the number of heads set to 8;
4.3) for each of the 8 heads, calculating the visual similarity ω^A_mn with the Q-K vector similarity calculation of the traditional Transformer model;
4.4) transforming the coordinates of every pair of target boxes in the image to obtain the relative position information λ(m, n) of the two targets, where λ(m, n) encodes the positional correlation between them;
4.5) embedding the λ(m, n) obtained in step 4.4) with a sinusoid function, multiplying it by the linear transformation layer W_G, and then applying the nonlinear activation function ReLU to obtain the image position similarity ω^G_mn of target m and target n;
4.6) labeling and storing every target category detected by Faster R-CNN, finding the corresponding rows in the category co-occurrence probability matrix and the attribute similarity matrix obtained in step 2), and obtaining the semantic similarity ω^E_mn of every two categories in the image;
4.7) fusing ω^A_mn and ω^G_mn into ω^AG_mn through a softmax operation, then adding an attention coefficient a to ω^E_mn so that ω^AG_mn receives attention (1 - a); a suitable value of a is obtained through the subsequent training process, giving the pairwise category similarity w_mn that fuses visual information, position information, and external knowledge;
4.8) multiplying the similarity matrix w_mn calculated with the Q and K vectors by the V vectors to obtain weighted region features that fuse the target relationships in the image, and then concatenating the 8 V vectors;
4.9) after a residual connection and layer normalization, feeding the V vector obtained in step 4.8) into a feedforward neural network formed by two fully connected layers; after another residual connection and layer normalization, the output of the feedforward neural network is fed into the next encoder layer, and after the operations of 6 encoder layers in total the output is transmitted to the decoder;
5) the decoder processes the output from the encoder:
5.1) applying positional encoding to the word information of the ground-truth description sentence corresponding to each training picture in the training data set;
5.2) passing the position-encoded word vectors obtained in step 5.1) through Masked Multi-Head Attention to obtain weighted sentence word feature vectors, which serve as the V vectors of the multi-head self-attention in the next step of the first layer;
5.3) converting the output of the last layer of the encoder into Q and K vectors through two linear conversion layers, and then carrying out multi-head self-attention operation on the Q and K vectors and the V vectors obtained in the step 5.2) to obtain the V vectors after the similarity information is fused;
5.4) after a residual connection and layer normalization, passing the V vector obtained in step 5.3) to a feedforward neural network, and after another residual connection and layer normalization taking its output as the input of the next decoder layer;
5.5) unlike the first layer, the second decoder layer has no Masked Multi-Head Attention operation but directly performs the multi-head self-attention calculation, with the Q, K, and V vectors all obtained by transforming the output of the previous decoder layer with three linear matrices;
5.6) after the operations of 6 decoder layers in total, the output vector passes through a linear layer and a softmax layer to obtain the probability vector of the next word;
6) generating description sentences for test images:
6.1) inputting a test set image, extracting image target area characteristics from the trained model and calculating similarity;
6.2) taking the image characteristics weighted by the similarity coefficient as the input of an encoder-decoder framework, and gradually outputting the description sentence word probability of each decoded image;
6.3) decoding with beam search with the beam size set to 2, finally obtaining the evaluation-metric score of each output sentence and taking the highest-scoring sentence as the test result.
CN202110982666.1A 2021-08-25 2021-08-25 Image description generation method based on relationship between external knowledge and target Active CN113609326B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110982666.1A CN113609326B (en) 2021-08-25 2021-08-25 Image description generation method based on relationship between external knowledge and target

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110982666.1A CN113609326B (en) 2021-08-25 2021-08-25 Image description generation method based on relationship between external knowledge and target

Publications (2)

Publication Number Publication Date
CN113609326A (en) 2021-11-05
CN113609326B CN113609326B (en) 2023-04-28

Family

ID=78341994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110982666.1A Active CN113609326B (en) 2021-08-25 2021-08-25 Image description generation method based on relationship between external knowledge and target

Country Status (1)

Country Link
CN (1) CN113609326B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190303458A1 (en) * 2018-04-02 2019-10-03 International Business Machines Corporation Juxtaposing contextually similar cross-generation images
CN111160467A (en) * 2019-05-31 2020-05-15 北京理工大学 Image description method based on conditional random field and internal semantic attention
CN112784848A (en) * 2021-02-04 2021-05-11 东北大学 Image description generation method based on multiple attention mechanisms and external knowledge
CN113298151A (en) * 2021-05-26 2021-08-24 中国电子科技集团公司第五十四研究所 Remote sensing image semantic description method based on multi-level feature fusion
CN113220891A (en) * 2021-06-15 2021-08-06 北京邮电大学 Unsupervised concept-to-sentence based generation confrontation network image description algorithm

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114417046A (en) * 2022-03-31 2022-04-29 腾讯科技(深圳)有限公司 Training method of feature extraction model, image retrieval method, device and equipment
CN116012685A (en) * 2022-12-20 2023-04-25 中国科学院空天信息创新研究院 Image description generation method based on fusion of relation sequence and visual sequence
CN116012685B (en) * 2022-12-20 2023-06-16 中国科学院空天信息创新研究院 Image description generation method based on fusion of relation sequence and visual sequence

Also Published As

Publication number Publication date
CN113609326B (en) 2023-04-28

Similar Documents

Publication Publication Date Title
Dong et al. Predicting visual features from text for image and video caption retrieval
Zhang et al. Image captioning with transformer and knowledge graph
Karpathy et al. Deep visual-semantic alignments for generating image descriptions
CN112990296B (en) Image-text matching model compression and acceleration method and system based on orthogonal similarity distillation
Li et al. Recurrent attention and semantic gate for remote sensing image captioning
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN110516530A (en) A kind of Image Description Methods based on the enhancing of non-alignment multiple view feature
CN113094484A (en) Text visual question-answering implementation method based on heterogeneous graph neural network
CN112527993B (en) Cross-media hierarchical deep video question-answer reasoning framework
CN110991290A (en) Video description method based on semantic guidance and memory mechanism
CN113609326B (en) Image description generation method based on relationship between external knowledge and target
CN113297370A (en) End-to-end multi-modal question-answering method and system based on multi-interaction attention
CN113743099A (en) Self-attention mechanism-based term extraction system, method, medium and terminal
Puscasiu et al. Automated image captioning
CN114254645A (en) Artificial intelligence auxiliary writing system
CN116029305A (en) Chinese attribute-level emotion analysis method, system, equipment and medium based on multitask learning
CN115062174A (en) End-to-end image subtitle generating method based on semantic prototype tree
Deng et al. A position-aware transformer for image captioning
CN117437317A (en) Image generation method, apparatus, electronic device, storage medium, and program product
CN114661874B (en) Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels
CN116434058A (en) Image description generation method and system based on visual text alignment
CN116579347A (en) Comment text emotion analysis method, system, equipment and medium based on dynamic semantic feature fusion
CN116127954A (en) Dictionary-based new work specialized Chinese knowledge concept extraction method
CN113642630A (en) Image description method and system based on dual-path characteristic encoder
Alabduljabbar et al. Image Captioning based on Feature Refinement and Reflective Decoding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant