CN114547235B - Construction method of image text matching model based on priori knowledge graph - Google Patents

Info

Publication number
CN114547235B
Authority
CN
China
Prior art keywords
text
image
layer
word
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210060418.6A
Other languages
Chinese (zh)
Other versions
CN114547235A (en)
Inventor
郭军
解煜晨
肖云
任鹏真
任哲
王淑文
董智强
许鹏飞
陈晓江
房鼎益
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NORTHWEST UNIVERSITY
Original Assignee
NORTHWEST UNIVERSITY
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NORTHWEST UNIVERSITY filed Critical NORTHWEST UNIVERSITY
Priority to CN202210060418.6A priority Critical patent/CN114547235B/en
Publication of CN114547235A publication Critical patent/CN114547235A/en
Application granted granted Critical
Publication of CN114547235B publication Critical patent/CN114547235B/en


Classifications

    • G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F16/3344: Query execution using natural language analysis
    • G06F40/216: Parsing using statistical methods
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30: Semantic analysis
    • G06N3/045: Combinations of networks
    • G06N3/08: Learning methods
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention relates to a method for constructing an image-text matching model based on a prior knowledge graph. The constructed model comprises a prior knowledge graph module, an image-text matching module, and an integration module; the prior knowledge graph module and the image-text matching module are each connected to the integration module. The method builds an external prior knowledge graph to guide image-text matching, greatly enhancing the model's understanding of real scenes. Graph convolution is used to build the relationships within the prior knowledge graph, and local attention relationships between image regions and text fragments are derived from it rather than computed pairwise with a cross-attention mechanism; this reduces computation and parameter counts and improves the model's training and inference speed. A Transformer self-attention mechanism aggregates the attention relationships between image regions; the pre-trained BERT model extracts text feature vectors, and an attention mechanism then aggregates the attention relationships between words in the text. The accuracy of image-text matching is effectively improved.

Description

Construction method of image text matching model based on priori knowledge graph
Technical Field
The invention relates to the field of computer vision and natural language processing, in particular to a method for constructing an image text matching model based on a priori knowledge graph.
Background
Vision and language are the two most important modalities of information about the outside world, and many popular applications need to combine them, for example man-machine interaction, advertisement recommendation systems, and search engines. Image-text matching is a key technique for these tasks; it aims to measure the semantic similarity between an image and a text. Concretely, given an image, a model must retrieve the most relevant text from a text database, or, given a text, retrieve the most relevant image from an image database.
In recent years, image-text matching methods based on deep learning have made great breakthroughs; they can be roughly divided into one-to-one global matching methods and many-to-many fine-grained matching methods. One-to-one global matching methods generally extract global feature representations of the image and the sentence, embed them into a joint space, and measure their similarity by the feature distance in that space; however, a single global feature cannot capture the fine-grained correspondences between image regions and words.
To address this, many-to-many fine-grained matching methods were proposed. Most of them adopt a stacked cross-attention mechanism to find all alignment relationships between salient local regions of a picture and the words of a sentence. This achieves good performance, but exhaustively computing the similarity between all possible image-region and text-fragment pairs consumes enormous computing power and slows the model's inference.
In recent years, external prior knowledge has been applied to many mainstream deep learning tasks. Using prior knowledge effectively can accelerate a model's inference, and building an effective prior knowledge relation graph can provide the model with rich prior scene information, expand its semantic concepts, and enhance its generalization ability. However, external prior knowledge has not been well exploited in the field of image-text matching. At the same time, semantic relationships on the text side can be better captured with the pre-trained BERT model and 1D-CNNs.
Disclosure of Invention
Aiming at the defects of existing image-text matching methods, the invention provides a method for constructing an image-text matching model based on a prior knowledge graph; the model constructed by this method improves both the accuracy and the inference speed of image-text matching.
In order to achieve the above task, the present invention adopts the following technical solutions:
the method for constructing the image text matching model based on the prior knowledge graph is characterized in that the constructed image text matching model based on the prior knowledge graph comprises a prior knowledge graph module, an image text matching module and an integration module; the prior knowledge graph module and the image text matching module are respectively connected with the integration module, and the specific construction steps are as follows:
step 1, constructing a priori knowledge graph module:
extracting meaningful words from a text corpus by a statistical method, performing a word-embedding operation on the extracted words with GloVe, and representing the words as word feature vectors, which are called the prior knowledge; constructing a relation graph of the prior knowledge according to the co-occurrence statistics of words in the corpus; and learning the interdependencies between items of prior knowledge using graph convolution;
step 2, constructing an image text matching module:
after the image data and the text data are given, obtaining image feature vectors with a pre-trained Faster R-CNN model and text feature vectors with a pre-trained BERT model; performing intra-modality context aggregation on the image feature vectors with a self-attention mechanism to obtain the first-layer image features; and performing intra-modality context aggregation on the text feature vectors with a self-attention mechanism to obtain the first-layer text features;
step 3, construction of an integration module:
the prior knowledge learned by the graph convolution is utilized to guide the first-layer image features and the first-layer text features, and the second-layer image features and the second-layer text features guided by the prior knowledge graph are output; the second layer image features and the first layer image features are weighted and combined to obtain third layer image features of the integration module; the second layer text feature and the first layer text feature are weighted and combined to obtain a third layer text feature of the integration module;
step 4, constructing a loss function by utilizing the first layer image text characteristics and the third layer image text characteristics;
and 5, training and testing to obtain an image text matching model based on the priori knowledge graph.
Further, in step 1, the constructing the prior knowledge graph module further includes:
the extraction of words from the text corpus comprises: deleting rare words from the corpus and selecting words of three parts of speech: nouns, verbs, and adjectives; according to the statistical frequency of words in the corpus, the proportion of selected nouns, verbs, and adjectives is strictly limited to 7:2:1; a word-embedding operation is performed on the selected words with GloVe, representing each word as a word feature vector; these vectors are called the prior knowledge.
The construction of the prior knowledge relation graph comprises: modeling the relation graph in the form of a conditional probability matrix, with the specific formula:

$$P_{ij} = \frac{W_{ij}}{W_i}$$

where $W_i$ is the number of occurrences of word $i$ in the corpus, $W_{ij}$ is the number of texts in which words $i$ and $j$ co-occur, and $P_{ij}$ is the probability of word $j$ co-occurring with word $i$;
the convolution includes: and taking the word feature vector obtained by the glove technology as a node, taking the constructed prior knowledge relation graph as a correlation matrix, inputting the correlation matrix into a graph convolution network for training, and finally obtaining the feature vector of the prior knowledge graph.
Specifically, in step 2, the construction of the image text matching module further includes:
the image data and text data feature extraction includes: extracting 36 salient regions of each input image by using a Faster-RCNN pre-training model, and representing each salient region as an image feature vector through a full connection layer; extracting feature vectors of each text by using a BERT pre-training model, wherein the text feature vectors output by the BERT aggregate word segmentation features, semantic features and position features;
the self-attention mechanism includes: aggregating the attention relation among the image areas by using a transducer model, obtaining three inputs Q, K, V of a transducer by using three different full-connection layers of specific image area-level feature vectors, and finally obtaining the first layer of image features after the transducer is aggregated; an implementation of the text self-attention mechanism is: the context information of the sentences is explored by using three one-dimensional convolution networks with different sizes, so that the information of phrases with different lengths in the sentences can be captured, and finally, the text characteristics of the first layer are obtained.
Preferably, the loss function constructed from the first-layer and third-layer image-text features is implemented as follows:
a triplet loss function is used, with the basic formula:

$$L(\Omega, T) = [\alpha - S(\Omega, T) + S(\Omega, \hat{T})]_+ + [\alpha - S(\Omega, T) + S(\hat{\Omega}, T)]_+$$

where $\alpha$ is a predefined margin parameter, $S$ is a similarity function between image-text pairs (e.g. cosine similarity), $S(\Omega, T)$ is the similarity score of a positive (matching) image-text pair, and $S(\Omega, \hat{T})$ and $S(\hat{\Omega}, T)$ are the similarity scores of negative pairs formed by replacing the text and the image, respectively;
in the experiments, only the negative pairs within a mini-batch are used rather than accumulating all negative samples, and the triplet loss is applied to both the first-layer and the third-layer image-text feature pairs; in addition, a relative-entropy term on the importance scores of the semantic concepts is added to further strengthen the image-text similarity measure, giving the final loss:

$$L_{final} = \lambda_1 L(\Omega_1, T_1) + \lambda_2 L(\Omega_3, T_3) + \lambda_3 L_{KL}$$

where $\lambda_1, \lambda_2, \lambda_3$ are weight parameters balancing the different losses.
Compared with the prior art, the method for constructing the image text matching model based on the priori knowledge graph has the following technical effects:
1. An external prior knowledge graph is constructed to guide image-text matching, which expands the model's semantic concepts, greatly enhances its understanding of real scenes, and gives it good generalization ability. Meanwhile, the relationships within the prior knowledge graph are built with graph convolution, so local attention relationships between image regions and text fragments need not be computed pair by pair with a cross-attention mechanism; this greatly reduces the model's computation and parameter counts and improves its training and inference speed.
2. The attention relationships between image regions are aggregated with a Transformer self-attention mechanism; text feature vectors are extracted with the pre-trained BERT model, and an attention mechanism then aggregates the attention relationships between words in the text. In this way, both the image and text modalities are well represented.
Drawings
FIG. 1 is a schematic flow diagram of a method for constructing an image text matching model based on a priori knowledge map;
FIG. 2 is a block diagram of a constructed prior knowledge graph-based image text matching model structure;
FIG. 3 is a schematic diagram of the Transformer self-attention mechanism;
the invention is described in further detail below with reference to the drawings and examples.
Detailed Description
Referring to fig. 1 and fig. 2, the present embodiment provides a method for constructing an image text matching model based on a priori knowledge map, where the constructed image text matching model based on the priori knowledge map includes a priori knowledge map module, an image text matching module and an integration module; the prior knowledge graph module and the image text matching module are respectively connected with the integration module, and the specific construction steps are as follows:
step 1, constructing a priori knowledge graph module:
the method utilizes a statistical method to extract meaningful words from a text corpus, and the specific extraction strategy is as follows: deleting rare words in a text corpus, selecting names, verbs and adjectives with high occurrence frequency as priori knowledge, strictly limiting the ratio of three parts of speech to 7:2:1 according to the statistical probability of the three parts of speech in the corpus, marking the selected words as word labels, and recording as W tag
A word-embedding operation is performed on the extracted words with GloVe, representing them as word feature vectors; these vectors form a matrix $K$, which is called the prior knowledge;
According to the statistics of word co-occurrence in the corpus, the prior knowledge relation graph is built in the form of a conditional probability matrix:

$$P_{ij} = \frac{W_{ij}}{W_i}$$

where $W_i$ is the number of occurrences of word $i$ in the corpus, $W_{ij}$ is the number of texts in which words $i$ and $j$ co-occur, and $P_{ij}$ is the probability of word $j$ co-occurring with word $i$. However, the co-occurrence relationships counted in the corpus deviate from the statistics of real scenes, which can cause the model to overfit the training set and hurt its generalization ability. To avoid this, the matrix $P$ is binarized: a threshold $\psi$ filters out noisy edges, giving the final prior knowledge relation graph

$$A_{ij} = \begin{cases} 1, & P_{ij} \ge \psi \\ 0, & P_{ij} < \psi \end{cases}$$
the basic idea of graph convolution, which utilizes graph convolution to learn the interdependence relationship between priori knowledge, is to continuously update node characteristic representations by propagating neighborhood information between nodes, specifically: given priori knowledge K as node information of graph convolution and a priori knowledge relation graph A ij As a correlation matrix of graph convolution, the calculation process of the first layer of graph convolution is as follows:
wherein H is 0 =K,Is to use the relation matrix A ij Normalized symmetric matrix, W l Is a transfer matrix to be learned, sigma is a nonlinear activation function ReLU;
obtaining output H of last layer of graph convolution F And obtaining the final prior knowledge graph feature vector.
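One layer of this computation can be sketched in numpy as follows (a sketch under the standard symmetric-normalization convention with self-loops; the patent does not spell out its normalization):

```python
import numpy as np

def gcn_layer(H, A, W):
    """One graph-convolution layer: H' = ReLU(A_hat @ H @ W).

    A_hat is the symmetrically normalized adjacency D^{-1/2}(A+I)D^{-1/2}.
    H: (n_nodes, d_in) node features; W: (d_in, d_out) learnable weights.
    """
    A_tilde = A + np.eye(A.shape[0])            # add self-loops
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # symmetric normalization
    return np.maximum(A_hat @ H @ W, 0.0)       # ReLU activation
```

Stacking such layers with $H^0 = K$ and taking the last output gives the prior knowledge graph features $H^F$.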
Step 2, constructing an image text matching module:
After the image data and text data are given, the pre-trained Faster R-CNN model yields the image feature vectors $\Omega = \{\omega_1, \omega_2, \ldots, \omega_n\}$, where $\omega_i$ is the region feature vector of the $i$-th region of a picture. The pre-trained BERT model yields the text feature vectors $T = \{t_1, t_2, \ldots, t_e\}$, where $t_i$ is the feature vector of the $i$-th word of a text;
for the image feature vectors $\Omega$, intra-modality context aggregation is performed with a Transformer self-attention mechanism, as shown in FIG. 3, where the self-attention formula is:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$$

To further enhance the representational power of the model, the Transformer aggregates context information from different subspaces with multiple parallel self-attention heads:

$$head_i = \text{Attention}(Q_i, K_i, V_i)$$

where $head_i$ is the output of the $i$-th head, and $Q_i, K_i, V_i$ are the results of passing the image feature vectors $\Omega$ through different fully connected layers:

$$Q_i = \Omega W_i^{Q},\quad K_i = \Omega W_i^{K},\quad V_i = \Omega W_i^{V}$$

The Transformer model then concatenates the heads to obtain the first-layer image features $\Omega_1$:

$$\Omega_1 = \text{MultiHead}(\Omega) = \text{concat}(head_1, head_2, \ldots, head_n)\,W^{O}$$
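The multi-head aggregation above can be sketched in numpy (weights here are placeholders for the learned fully connected layers):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def multi_head(X, Wq, Wk, Wv, Wo):
    """Multi-head self-attention over region features X (n_regions x dim).

    Wq/Wk/Wv: lists with one projection matrix per head; Wo: output projection.
    """
    heads = [attention(X @ wq, X @ wk, X @ wv) for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo
```

Applied to the 36 region vectors of an image, the output plays the role of the first-layer image features $\Omega_1$.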
For the text feature vectors $T$, three one-dimensional convolutional networks with different kernel sizes explore the context of sentences. The one-dimensional convolution of a kernel of size $m$ applied at the $k$-th word is:

$$p_{m,k} = \text{ReLU}(W_m t_{k:k+m-1} + b_m),\quad m \in \{1, 2, 3\}$$

where $W_m$ is a learnable convolution-kernel parameter, $b_m$ is a bias term, and $t_{k:k+m-1}$ is the feature vector of the $k$-th to $(k+m-1)$-th words. After the one-dimensional convolution outputs are obtained, max pooling is applied over all word positions:

$$q_m = \max\{p_{m,1}, p_{m,2}, \ldots, p_{m,e}\},\quad m \in \{1, 2, 3\}$$

Finally, $q_1, q_2, q_3$ are concatenated and passed through a fully connected layer with an $\ell_2$ regularization term, yielding the first-layer text features $T_1$.
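A simplified numpy sketch of the multi-scale convolution and pooling (one filter per window size for brevity; in practice each size would use many filters and a final fully connected layer):

```python
import numpy as np

def phrase_features(T, kernels, biases):
    """Multi-scale 1-D convolution over word features, then max-pool.

    T: (n_words, dim) word vectors; kernels[m]: (m, dim) filter for window
    size m in {1, 2, 3}; biases: one scalar bias per window size.
    Returns the pooled responses [q_1, q_2, q_3].
    """
    pooled = []
    for (m, W), b in zip(kernels.items(), biases):
        # p_{m,k} = ReLU(W_m . t_{k:k+m-1} + b_m) at every window position k
        p = [max((W * T[k:k + m]).sum() + b, 0.0) for k in range(len(T) - m + 1)]
        pooled.append(max(p))            # max-pool over all positions
    return np.array(pooled)
```

Window sizes 1, 2, and 3 correspond to unigram, bigram, and trigram phrase information.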
step 3, construction of an integration module:
The prior knowledge learned by graph convolution guides the first-layer image features and first-layer text features, and the second-layer image features and text features guided by the prior knowledge graph are output.
The guided second-layer image features are computed as:

$$A^{\Omega} = \text{softmax}(\lambda\, \Omega_1 (H^F)^{\top}),\qquad \Omega_2 = A^{\Omega} H^F$$

where $H^F$ is the prior knowledge graph feature vector, $A^{\Omega}$ is the importance score of each item of prior knowledge with respect to the first-layer image features $\Omega_1$, $\lambda$ is the smoothing parameter of the softmax function, and $\Omega_2$ is the resulting second-layer image feature vector.
For text, since the prior knowledge graph is obtained from the text corpus, the first-layer text features are guided jointly by the filtered word labels and the prior knowledge learned by graph convolution:

$$A^{T} = \text{softmax}(\lambda\, T_1 (H^F)^{\top}),\qquad T_2 = A^{T}\left(\omega\, W^{tag} + (1-\omega)\, H^F\right)$$

where $W^{tag}$ is the word-label matrix, $A^{T}$ is the importance score of each item of prior knowledge with respect to the first-layer text features $T_1$, $\lambda$ is the smoothing parameter of the softmax function, $\omega$ is an optimization parameter controlling the ratio of word labels to prior knowledge, and $T_2$ is the resulting second-layer text feature vector.
The second layer image features and the first layer image features are weighted and combined to obtain third layer image features of the integration module; and carrying out weighted combination on the second layer text features and the first layer text features to obtain third layer text features of the integration module, wherein the third layer text features comprise:
$$\Omega_3 = \delta\,\Omega_1 + (1-\delta)\,\Omega_2$$

$$T_3 = \delta\,T_1 + (1-\delta)\,T_2$$

where $\delta$ is an optimization parameter controlling the ratio of first-layer to second-layer features, and $\Omega_3$ and $T_3$ are the third-layer image and text feature vectors, respectively.
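The guidance and weighted combination steps can be sketched together in numpy (the exact score formula is an assumption reconstructed from the surrounding definitions, since the patent's equation images are not reproduced in the text; `lam` and `delta` values are placeholders):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def knowledge_guided(F1, HF, lam=9.0, delta=0.7):
    """Guide first-layer features F1 with knowledge-graph features HF.

    scores: softmax (smoothed by lam) over the similarity between each
    feature and every knowledge node; F2 is an importance-weighted sum of
    knowledge vectors; the third-layer output mixes the two layers:
    F3 = delta * F1 + (1 - delta) * F2.
    """
    scores = softmax(lam * F1 @ HF.T)   # importance of each knowledge node
    F2 = scores @ HF                    # knowledge-guided second-layer features
    return delta * F1 + (1 - delta) * F2
```

With `delta = 1` the knowledge guidance is disabled and the first-layer features pass through unchanged, which makes $\delta$ a convenient knob for ablation.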
Step 4, constructing the loss function from the first-layer and third-layer image-text features. The loss function used is the triplet loss function common in the field of image-text matching:

$$L(\Omega, T) = [\alpha - S(\Omega, T) + S(\Omega, \hat{T})]_+ + [\alpha - S(\Omega, T) + S(\hat{\Omega}, T)]_+$$

where $\alpha$ is a predefined margin parameter, $S$ is a similarity function between image-text pairs (e.g. cosine similarity), $S(\Omega, T)$ is the similarity score of a positive image-text pair, and $S(\Omega, \hat{T})$ and $S(\hat{\Omega}, T)$ are the similarity scores of negative pairs formed by replacing the text and the image, respectively.
In the experiments, only the negative pairs within a mini-batch are used rather than accumulating all negative samples, and the triplet loss is applied to both the first-layer and the third-layer image-text feature pairs. To further strengthen the image-text similarity measure, a relative-entropy term on the importance scores of the semantic concepts is added to the final loss:

$$L_{KL} = D_{KL}\!\left(A^{\Omega} \,\|\, A^{T}\right)$$

The final loss function is:

$$L_{final} = \lambda_1 L(\Omega_1, T_1) + \lambda_2 L(\Omega_3, T_3) + \lambda_3 L_{KL}$$
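A numpy sketch of the in-batch triplet loss (summing over all in-batch negatives; the patent does not state whether it sums or takes the hardest negative, so the summed variant here is an assumption):

```python
import numpy as np

def triplet_loss(S, alpha=0.2):
    """Hinge triplet loss over a mini-batch similarity matrix.

    S[i, j]: similarity of image i with text j; diagonal entries are the
    positive pairs, off-diagonal entries the in-batch negatives.
    """
    pos = np.diag(S)[:, None]                    # positive-pair scores, column
    cost_txt = np.maximum(alpha - pos + S, 0.0)  # negative texts per image
    cost_img = np.maximum(alpha - pos.T + S, 0.0)  # negative images per text
    n = S.shape[0]
    mask = 1.0 - np.eye(n)                       # exclude the positive pairs
    return ((cost_txt + cost_img) * mask).sum() / n
```

When every positive pair beats all its negatives by at least the margin $\alpha$, the loss is exactly zero.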
and 5, training and testing to obtain an image text matching model based on the priori knowledge graph.
The following are specific examples given by the inventors.
Examples:
experiment platform: the experiment was run on a NVIDIA TITAN RTX workstation using the Pytorch framework tool.
Datasets: the experiments adopt two benchmark datasets widely used in image-text matching: Flickr30k and MSCOCO. Flickr30k contains 31,783 pictures, each with 5 text captions; 1,000 pictures are used for validation, 1,000 for testing, and the rest for training. MSCOCO contains 123,287 pictures, each with 5 human-written text descriptions; 113,287 pictures are used for training, 5,000 for validation, and 5,000 for testing, with the MSCOCO test data evaluated in both a 1K and a 5K setting.
The inventors performed comparison experiments on the MSCOCO 1K and Flickr30K datasets; Table 1 compares the image-text matching model based on the prior knowledge graph constructed in this embodiment with strong image-text matching models of recent years.
Table 1: comparison of model results on MSCOCO1K and Flickr30K datasets
As Table 1 shows, the image-text matching model based on the prior knowledge graph constructed in this embodiment effectively improves the accuracy of image-text matching; at the same time, because the construction method does not compute pairwise local attention relationships between image regions and text fragments, the constructed model trains and infers faster.

Claims (4)

1. The method for constructing the image text matching model based on the prior knowledge graph is characterized in that the constructed image text matching model based on the prior knowledge graph comprises a prior knowledge graph module, an image text matching module and an integration module; the prior knowledge graph module and the image text matching module are respectively connected with the integration module, and the specific construction steps are as follows:
step 1, constructing a priori knowledge graph module:
extracting meaningful words from a text corpus by using a statistical method, performing word embedding operation on the extracted words by using a glove technology, and representing the words as word feature vectors, which are called prior knowledge; constructing a priori knowledge relation graph according to the co-occurrence statistical probability of the words in the corpus; learning the interdependencies between prior knowledge using graph convolution;
step 2, constructing an image text matching module:
after the image data and the text data are given, obtaining an image feature vector by using a pre-trained fast-RCNN model, and obtaining a text feature vector by using a pre-trained BERT model; carrying out intra-mode context information aggregation on the image feature vector by using a self-attention mechanism to obtain a first layer of image features; carrying out intra-mode context information aggregation on the text feature vector by using a self-attention mechanism to obtain a first-layer text feature;
step 3, construction of an integration module:
the prior knowledge learned by the graph convolution is utilized to guide the first-layer image features and the first-layer text features, and the second-layer image features and the second-layer text features guided by the prior knowledge graph are output;
the second layer image features and the first layer image features are weighted and combined to obtain third layer image features of the integration module;
the second layer text feature and the first layer text feature are weighted and combined to obtain a third layer text feature of the integration module;
step 4, constructing a loss function by utilizing the first layer image text characteristics and the third layer image text characteristics;
and 5, training and testing to obtain an image text matching model based on the priori knowledge graph.
2. The method of building as claimed in claim 1, wherein in step 1, the building of the prior knowledge graph module further comprises:
the extraction of words from the text corpus comprises: deleting rare words from the corpus and selecting words of three parts of speech: nouns, verbs, and adjectives; according to the statistical frequency of words in the corpus, the proportion of selected nouns, verbs, and adjectives is strictly limited to 7:2:1; a word-embedding operation is performed on the selected words with GloVe, representing each word as a word feature vector, called the prior knowledge;
the construction of the prior knowledge relation graph comprises the following steps: modeling a relation diagram in the form of a conditional probability matrix, wherein the specific formula is as follows:
in which W is i Representing the number of occurrences of word i in the corpus, W ij Representing the number of times word i and word j co-occur in a text in the corpus, then P ij Representing the probability of co-occurrence of word i and word j;
the convolution includes: and taking the word feature vector obtained by the glove technology as a node, taking the constructed prior knowledge relation graph as a correlation matrix, inputting the correlation matrix into a graph convolution network for training, and finally obtaining the feature vector of the prior knowledge graph.
3. The method of claim 1, wherein in step 2, the construction of the image text matching module further comprises:
the image data and text data feature extraction includes: extracting 36 salient regions of each input image by using a Faster-RCNN pre-training model, and representing each salient region as an image feature vector through a full connection layer; extracting feature vectors of each text by using a BERT pre-training model, wherein the text feature vectors output by the BERT aggregate word segmentation features, semantic features and position features;
the self-attention mechanism includes: aggregating the attention relation among the image areas by using a transducer model, obtaining three inputs Q, K, V of a transducer by using three different full-connection layers of specific image area-level feature vectors, and finally obtaining the first layer of image features after the transducer is aggregated; an implementation of the text self-attention mechanism is: the context information of the sentences is explored by using three one-dimensional convolution networks with different sizes, so that the information of phrases with different lengths in the sentences can be captured, and finally, the text characteristics of the first layer are obtained.
4. The construction method according to claim 1, wherein in step 4, the construction of the loss function using the first layer image text feature and the third layer image text feature is implemented by:
a triplet loss function is used, whose basic formula is:

L_triplet = [α − S(I, T) + S(I, T')]+ + [α − S(I, T) + S(I', T)]+

where [x]+ = max(x, 0), α is a predefined margin parameter, S is the similarity function for image-text pairs, S(I, T) denotes the similarity score of a positively matched image-text pair, and S(I, T') and S(I', T) denote the similarity scores of mismatched image-to-text and text-to-image pairs, respectively;
in the experiments, mismatched pairs are drawn within each mini-batch, and the triplet loss function is applied to both the first-layer and the third-layer image-text feature pairs;
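The triplet loss with the hardest in-batch mismatched pairs can be sketched as follows; the 2×2 similarity matrix and margin value are invented for illustration:

```python
import numpy as np

def triplet_loss(sim, alpha=0.2):
    """Hinge-based triplet loss over a similarity matrix sim[i, j] =
    S(image_i, text_j); diagonal entries are the matching pairs.
    Uses the hardest mismatched pair in the mini-batch per direction."""
    n = sim.shape[0]
    pos = np.diag(sim)
    mask = np.eye(n, dtype=bool)
    neg = np.where(mask, -np.inf, sim)
    hard_t = neg.max(axis=1)   # hardest non-matching text per image
    hard_i = neg.max(axis=0)   # hardest non-matching image per text
    return (np.maximum(0.0, alpha - pos + hard_t) +
            np.maximum(0.0, alpha - pos + hard_i)).mean()

sim = np.array([[0.9, 0.7],
                [0.6, 0.8]])
loss = triplet_loss(sim, alpha=0.2)
print(round(loss, 4))   # 0.05: only one hinge term is active
```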
the relative entropy is added on the importance scores of the semantic concepts to further strengthen the image-text similarity measure; the final loss function is:

L = λ1·L1 + λ2·L3 + λ3·L_KL

where λ1, λ2 and λ3 are weight parameters balancing the different losses, L1 and L3 are the triplet losses on the first-layer and third-layer features, and L_KL is the relative entropy term.
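A sketch of the combined objective follows. The weight values and the two triplet-loss inputs are placeholders; only the relative-entropy term is computed here, from a pair of concept importance-score distributions:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """Relative entropy KL(p || q) between two importance-score
    distributions over semantic concepts."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def total_loss(l1, l3, img_scores, txt_scores, lam=(1.0, 1.0, 0.1)):
    """L = lam1*L1 + lam2*L3 + lam3*L_KL, combining the first-layer and
    third-layer triplet losses with the relative-entropy term."""
    return lam[0] * l1 + lam[1] * l3 + lam[2] * kl_divergence(img_scores, txt_scores)

img = np.array([0.5, 0.3, 0.2])   # image-side concept importance scores
txt = np.array([0.5, 0.3, 0.2])   # text-side concept importance scores
final = total_loss(0.4, 0.2, img, txt)
print(round(final, 4))   # 0.6: KL of identical distributions contributes 0
```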
CN202210060418.6A 2022-01-19 2022-01-19 Construction method of image text matching model based on priori knowledge graph Active CN114547235B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210060418.6A CN114547235B (en) 2022-01-19 2022-01-19 Construction method of image text matching model based on priori knowledge graph


Publications (2)

Publication Number Publication Date
CN114547235A CN114547235A (en) 2022-05-27
CN114547235B true CN114547235B (en) 2024-04-16

Family

ID=81672097

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210060418.6A Active CN114547235B (en) 2022-01-19 2022-01-19 Construction method of image text matching model based on priori knowledge graph

Country Status (1)

Country Link
CN (1) CN114547235B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020143137A1 (en) * 2019-01-07 2020-07-16 北京大学深圳研究生院 Multi-step self-attention cross-media retrieval method based on restricted text space and system
CN111741236A (en) * 2020-08-24 2020-10-02 浙江大学 Method and device for generating positioning natural image subtitles based on consensus diagram characteristic reasoning
CN112084358A (en) * 2020-09-04 2020-12-15 中国石油大学(华东) Image-text matching method based on regional enhanced network with theme constraint
CN113191357A (en) * 2021-05-18 2021-07-30 中国石油大学(华东) Multilevel image-text matching method based on graph attention network


Non-Patent Citations (1)

Title
Image caption generation combining visual features and scene semantics; Li Zhixin, Wei Haiyang, Huang Feicheng, Zhang Canlong, Ma Huifang, Shi Zhongzhi; Chinese Journal of Computers; 2020-09-15 (09); full text *

Also Published As

Publication number Publication date
CN114547235A (en) 2022-05-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant