CN115248877B - Multi-mode-based track text matching method - Google Patents
- Publication number: CN115248877B (granted from application CN202211155406.8A)
- Authority: CN (China)
- Prior art keywords: vector, track, node, word, text
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/732 — Information retrieval of video data; query formulation
- G06F16/332 — Information retrieval of unstructured textual data; query formulation
- G06F16/383 — Retrieval of textual data using metadata automatically derived from the content
- G06F16/7844 — Retrieval of video data using metadata automatically derived from textual content or transcripts of audio data
- G06F40/295 — Natural language analysis; named entity recognition
- G06F40/30 — Semantic analysis
- G06N3/04 — Neural network architectures
- G06V20/41 — Higher-level semantic clustering, classification or understanding of video scenes
Abstract
The invention discloses a multimodal track-text matching method comprising the following steps: extracting track features and computing track feature matrices at different semantic levels; semantically annotating a text, dividing its words into different semantic levels according to the semantic structure, constructing a semantic topological graph, aggregating the nodes of the semantic topological graph with an RGCN model, and updating the nodes with an attention mechanism to obtain word feature vectors; and computing the similarity between the track feature matrix and the word feature vectors at each corresponding level and weighting and summing these similarities into an overall similarity score. Because the text is divided into semantic levels that are encoded independently, both the detail information of the text and the topological structure of the track and text are preserved, improving the accuracy of matching and retrieval; at the same time, an attention mechanism and parameter sharing are introduced, reducing the number of computed parameters and improving computational efficiency.
Description
Technical Field
The invention relates to the technical field of machine learning, and in particular to a multimodal track-text matching method.
Background
With the rapid development of information technology, multimodal data has become the main form of data resource. Multimodal data is multi-source and heterogeneous yet internally semantically consistent when describing the same object: each modality describes a feature of the object from a particular angle. Research on multimodal learning methods is therefore of great value for enabling computing systems to handle multi-source heterogeneous mass data. In recent years, many highly effective multimodal deep learning algorithms have been applied to practical scenarios. For example, the image-vector retrieval over video frames and face frames used by Youku Video, and the "Only Watch TA" feature proposed by iQIYI, would be impossible without multimodal recognition technology; the "snap-to-shop" features of the large e-commerce platforms fuse information across modalities such as images, text and high-level semantic attributes to realize high-precision image-based search.
The existing cross-modal retrieval methods for video and text fall mainly into two categories:
The first is global encoding, which extracts video and text features with different algorithms, encodes the video and the corresponding text each into a single overall vector, and measures the similarity between video and text as the distance between the two vectors in a common space under a defined distance function. However, encoding the video and the corresponding text directly into one overall vector can lose their detail information: a passage of text may contain several actions, each corresponding to a different entity. Simply embedding the text into one vector can lose these details, so the correspondence between video and text is not well captured and the model performs poorly.
The second is local encoding, which encodes each frame of the video and each word of the text into its own vector, computes the similarity of every word to every video frame, and sums these similarities into the overall similarity. However, local encoding greatly increases the number of model parameters, lowering the efficiency of learning and match detection; moreover, these sequential vectors representing video and text ignore the topological structure within them, so vector matching may be trapped in local matches while global matches are ignored.
To overcome the defects of the prior art, and aiming at the urgent problem in big-data processing of recommending suitable natural-language descriptions for massive track data, the invention provides a multimodal track-text matching method.
Disclosure of Invention
In view of the above, the invention provides a multimodal track-text matching method that divides a text into three semantic levels and encodes each independently, so that both the detail information of the text and the topological structure of the track and text are preserved, improving the accuracy of matching and retrieval; at the same time, an attention mechanism and a parameter-sharing method are introduced, reducing the number of computed parameters and improving computational efficiency.
In order to achieve the purpose, the invention adopts the following technical scheme:
a multi-modal-based track text matching method comprises the following steps:
S1: extracting track features and calculating track feature matrices at different semantic levels;
S2: performing semantic annotation on a text, dividing the words of the text into different semantic levels according to the semantic structure, constructing a semantic topological graph, aggregating the nodes of the semantic topological graph with an RGCN model, and updating the nodes with an attention mechanism to obtain word feature vectors;
S3: calculating the similarity between the track feature matrix and the word feature vectors at the corresponding level, and performing weighted summation to obtain an overall similarity score.
Preferably, S1 comprises:
S11: extracting three types of dimension information data from the track data, and performing weighted fusion on the extracted data;
S12: calculating the feature vector of each track point in one piece of track information, aggregating the feature vectors of all track points into a global vector at the global layer, into an action vector group at the action layer, and into an entity vector group at the entity layer, and obtaining the track feature matrices of the different semantic layers.
Preferably, S11 specifically comprises:
the three types of dimension information are the track space dimension, the track state dimension and the track attribute dimension;
the three types of dimension information data of the track data are extracted as

$$h_k = W_k x_k, \quad k \in \{1, 2, 3\}$$

and the weighted fusion formula is

$$e_i = f(\lambda_1 h_1 + \lambda_2 h_2 + \lambda_3 h_3)$$

where $h_k$ represents the learning result of the three types of dimension information data, $W_k$ the three separate learning parameter matrices, $x_k$ the three types of dimension information data, $f$ the fusion function, and $\lambda_k$ the different fusion parameters.
Preferably, S12 specifically comprises:
S121: calculating the feature vector $e_i$ of each track point in one piece of track information to obtain the vector group of one track, $e = [e_1, e_2, \ldots, e_m]$, where m represents the number of track points in the track, e is the vector group of the track, and $e_i$ is the vector of the i-th track point;
S122: to match texts at different semantic levels, three independent weight parameter matrices $W^g$, $W^a$ and $W^o$ are applied to the track point feature vectors to obtain three independently expressed track feature vector groups:

$$e_i^g = W^g e_i, \quad e_i^a = W^a e_i, \quad e_i^o = W^o e_i$$

where $W^g$, $W^a$ and $W^o$ are three different learnable weight parameter matrices and $e_i^g$, $e_i^a$ and $e_i^o$ are the track point feature vectors of the global layer, the action layer and the entity layer respectively;
S123: aggregating all track point feature vectors into a global vector $g$ at the global layer, into the action vector group $a = [a_1, \ldots, a_m]$ at the action layer, and into the entity vector group $o = [o_1, \ldots, o_m]$ at the entity layer, obtaining the track feature matrices of the different semantic layers.
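As an illustration of S11–S12, the following NumPy sketch fuses three dimension-information vectors per track point and projects the result into the three semantic levels. All sizes, the tanh fusion function and the mean-pooling aggregation at the global layer are assumptions for the sketch, not details fixed by the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d = 4, 8        # raw dimension-info size, fused feature size (illustrative)
m = 5                 # number of track points in one track

# Three separate learning parameter matrices W_1..W_3, one per dimension type
# (space, state, attribute).
W = [rng.normal(size=(d, d_in)) for _ in range(3)]
lam = np.array([0.5, 0.3, 0.2])          # fusion parameters (assumed fixed here)

def point_feature(x_space, x_state, x_attr):
    """Weighted fusion of the three dimension-information vectors (S11)."""
    h = [Wk @ xk for Wk, xk in zip(W, (x_space, x_state, x_attr))]
    return np.tanh(sum(l * hk for l, hk in zip(lam, h)))  # tanh as fusion f

# Feature vectors e_1..e_m of one track (S121)
e = np.stack([point_feature(*rng.normal(size=(3, d_in))) for _ in range(m)])

# Three independent projections for the global / action / entity levels (S122)
Wg, Wa, Wo = (rng.normal(size=(d, d)) for _ in range(3))
eg, ea, eo = e @ Wg.T, e @ Wa.T, e @ Wo.T

# Aggregate into per-level representations (S123); mean pooling is a stand-in
g = eg.mean(axis=0)    # global vector
a = ea                 # action-level matrix [a_1, ..., a_m]
ent = eo               # entity-level matrix [o_1, ..., o_m]

assert g.shape == (d,) and a.shape == (m, d) and ent.shape == (m, d)
```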
Preferably, S2 specifically comprises:
S21: analyzing the verbs and noun phrases in each sentence and the relations between each verb and its corresponding nouns;
S22: taking the whole sentence as the root node, connecting each verb to the root node as an action node, connecting each noun phrase to its corresponding action node as an entity node, and recording the semantic relation on the edge connecting an action node and an entity node, thereby forming the different semantic hierarchies;
S23: embedding the action nodes and entity nodes of each sentence into vectors with a bidirectional LSTM model, and recording the word vectors of the words in the sentence as $W = [w_1, \ldots, w_N]$;
S24: aggregating the word vectors of all words into a global event node vector through an attention mechanism;
S25: taking the word vectors obtained by embedding the action nodes and entity nodes, and applying a max-pooling operation to obtain the verb node vectors and entity node vectors;
S26: constructing the semantic topological graph from the global event node vector, the verb node vectors and the entity node vectors, inputting them into the RGCN model as the initialized node vectors, and updating the nodes with the attention mechanism to obtain the word feature vectors.
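The graph construction of S21–S22 can be sketched with plain dictionaries. The semantic-role frames below are hand-written placeholders standing in for the output of a real semantic analysis tool; the sentence and role labels are purely illustrative.

```python
# Build the three-level semantic topology for one annotated sentence:
# root (whole sentence) -> action nodes (verbs) -> entity nodes (noun phrases),
# with the semantic relation recorded on each action-entity edge.
sentence = "the aircraft climbs quickly and turns left"
srl = [  # hypothetical semantic-role-labelling output
    {"verb": "climbs", "args": {"ARG0": "the aircraft", "ARGM-MNR": "quickly"}},
    {"verb": "turns",  "args": {"ARG0": "the aircraft", "ARG1": "left"}},
]

graph = {"root": sentence, "actions": []}
for frame in srl:
    action = {"verb": frame["verb"], "entities": []}
    for role, phrase in frame["args"].items():
        # the edge label records the semantic relation (S22)
        action["entities"].append({"phrase": phrase, "relation": role})
    graph["actions"].append(action)

assert len(graph["actions"]) == 2
assert graph["actions"][0]["entities"][0]["relation"] == "ARG0"
```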
Preferably, the action nodes and entity nodes are embedded into vectors by

$$\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}(E c_i, \overrightarrow{h_{i-1}}; \overrightarrow{b}), \qquad \overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}(E c_i, \overleftarrow{h_{i+1}}; \overleftarrow{b}), \qquad w_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$$

where $E$ is the embedding matrix of the word vectors, $\overrightarrow{b}$ and $\overleftarrow{b}$ are the learnable bias parameters of the forward and reverse LSTM networks, $\overrightarrow{h_i}$ and $\overleftarrow{h_i}$ are the word vectors of the word at sentence position i computed forward and in reverse (from positions i-1 and i+1 respectively), and $w_i$ is the final word vector representation;
the global event node vector is aggregated as

$$q = \sum_{j=1}^{N} \alpha_j w_j, \qquad \alpha_j = \mathrm{softmax}_j(U w_j)$$

where $U$ represents a learning parameter matrix, $w_j$ represents the word vector of a word with a connection relation to the root node in the semantic topological graph, and N represents the number of words in the whole semantic topological graph with such a connection relation;
the verb node vectors are $v = [v_1, \ldots, v_{N_a}]$ and the entity node vectors are $u = [u_1, \ldots, u_{N_e}]$, where $N_a$ and $N_e$ represent the numbers of verb nodes and entity nodes respectively.
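A minimal sketch of the attention aggregation in S24, assuming a single learnable vector as the attention parameter; the word vectors here simply stand in for bidirectional-LSTM outputs.

```python
import numpy as np

rng = np.random.default_rng(1)
n_words, d = 6, 8
W_words = rng.normal(size=(n_words, d))  # word vectors w_1..w_N from the bi-LSTM

U = rng.normal(size=(d,))  # learnable attention parameter (assumed to be a vector)

# softmax attention weights over all words, then a weighted sum
# gives the global event node vector (S24)
scores = W_words @ U
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()
g_event = alpha @ W_words

assert np.isclose(alpha.sum(), 1.0)
assert g_event.shape == (d,)
```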
Preferably, S26 specifically comprises:
S261: representing the initialized node vectors as $n_i^{(0)}$ and inputting them into the RGCN model; the relation-aware input along an edge between nodes i and j is

$$\hat{n}_j = n_j^{(0)} + W_r r_{ij}$$

where $r_{ij}$ is the one-hot vector of the semantic relation between nodes i and j, $n_j^{(0)}$ is the initialized node vector, and $W_r$ is the relation transformation matrix;
S262: suppose the output of the i-th node at the l-th RGCN layer is $n_i^{(l)}$; after adopting the attention mechanism, the attention coefficients are computed as

$$\alpha_{ij} = \mathrm{softmax}_{j \in \mathcal{N}_i}\!\left(\frac{(P_1 n_i^{(l)})^{\top}(P_2 n_j^{(l)})}{\sqrt{D}}\right)$$

where $\mathcal{N}_i$ represents the neighbor nodes of node i, $P_1$ and $P_2$ are the attention mechanism parameters, D represents the number of nodes, and $\alpha_{ij}$ represents the attention coefficient between the i-th word and the j-th word;
according to the operation rule of the RGCN, the update rule of the nodes is

$$n_i^{(l+1)} = \sigma\Bigl(\sum_{j \in \mathcal{N}_i} \alpha_{ij} W^{(l+1)} \bigl(n_j^{(l)} + W_r r_{ij}\bigr)\Bigr)$$

where $W^{(l+1)}$ is the learnable parameter matrix of the (l+1)-th layer network;
assuming the RGCN has L layers, the node vectors output by the L-th layer are the word feature vectors, in which $q$ represents the global node vector, $v$ the set of action node vectors, and $u$ the set of entity node vectors.
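One RGCN layer with a single shared transformation matrix and scaled dot-product attention can be sketched in NumPy as below. The exact way the relation one-hot enters the message (here added through a small relation projection) is an assumption of the sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
D, d, n_rel = 5, 8, 3        # nodes, feature size, relation types (illustrative)

H = rng.normal(size=(D, d))                 # node vectors at layer l
rel = rng.integers(0, n_rel, size=(D, D))   # relation type between node pairs
adj = rng.random((D, D)) < 0.5              # neighbour mask
np.fill_diagonal(adj, False)

W_shared = rng.normal(size=(d, d))  # ONE transformation matrix shared by all relations
R = np.eye(n_rel)                   # one-hot relation vectors r_ij
W_rel = rng.normal(size=(n_rel, d)) # maps a relation one-hot into feature space
P1, P2 = rng.normal(size=(2, d))    # attention mechanism parameters

def softmax(x):
    x = np.exp(x - x.max())
    return x / x.sum()

H_next = np.zeros_like(H)
for i in range(D):
    nbrs = np.flatnonzero(adj[i])
    if nbrs.size == 0:
        H_next[i] = H[i]             # isolated node: carry the vector through
        continue
    # attention coefficients between node i and its neighbours
    att = softmax(np.array([(H[i] @ P1) * (H[j] @ P2) for j in nbrs]) / np.sqrt(D))
    # relation-aware messages, all passed through the single shared matrix
    msgs = np.stack([W_shared @ (H[j] + R[rel[i, j]] @ W_rel) for j in nbrs])
    H_next[i] = np.tanh(att @ msgs)  # attention-weighted sum + nonlinearity

assert H_next.shape == (D, d)
```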
Preferably, S3 specifically comprises:
S31: at the global layer, computing the similarity between the track global vector $g$ and the global event node vector $q$ with a cosine function to obtain the global matching score $S_g = \cos(g, q)$;
at the action layer and the entity layer, matching the track and the text locally to obtain the action matching score and the entity matching score;
S32: weighting and summing the global matching score, the action matching score and the entity matching score into the overall similarity score.
The action matching score and the entity matching score are computed as follows:
for each text node vector $t_i^x$ and each track point vector $p_j^x$, with $x \in \{a, e\}$, the local cross-modal matching score $s_{ij}^x = \cos(t_i^x, p_j^x)$ is computed and normalized as

$$\beta_{ij}^x = \frac{\exp(\lambda s_{ij}^x)}{\sum_{k} \exp(\lambda s_{ik}^x)}$$

where $\beta_{ij}^x$ is an attention parameter indicating whether word node i is related to track node j, and $\lambda$ represents a learnable neural network parameter;
the matching score of word node i against the whole track is then

$$s_i^x = \sum_{j} \beta_{ij}^x s_{ij}^x$$

and the local matching score is the average of $s_i^x$ over the word nodes. When x = a, the local matching score is the action matching score $S_a$; when x = e, it is the entity matching score $S_e$.
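The scoring of S31–S32 can be sketched as follows; the softmax temperature, the fusion weights and the averaging over word nodes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, n = 8, 5, 4     # feature size, track points, word nodes per level

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Global level: one cosine similarity between the two global vectors (S31)
g_track, g_text = rng.normal(size=(2, d))
S_g = cos(g_track, g_text)

def local_score(text_nodes, track_points, lam=2.0):
    """Normalised local matching: each word node attends over the track points."""
    sims = np.array([[cos(t, p) for p in track_points] for t in text_nodes])
    # softmax-normalise each word's scores over the track points (temperature lam)
    att = np.exp(lam * sims - (lam * sims).max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)
    # score of each word against the whole track, then average over words
    return float((att * sims).sum(axis=1).mean())

S_a = local_score(rng.normal(size=(n, d)), rng.normal(size=(m, d)))  # action level
S_e = local_score(rng.normal(size=(n, d)), rng.normal(size=(m, d)))  # entity level

w = np.array([0.5, 0.25, 0.25])            # fusion weights (assumption)
S = float(w @ np.array([S_g, S_a, S_e]))   # overall similarity score (S32)

assert -1.0 <= S_g <= 1.0
```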
The invention has the following advantages:
(1) Using the natural-language-processing technique of semantic annotation, all words of a text are divided into three different semantic levels and a semantic topological graph is constructed from the semantic role relations. This fully extracts the fine-grained information of the text while keeping its topological structure, so the learned text features are more accurate.
(2) Tracks and text are matched in both the global and the local dimension, and the matching scores are weighted and summed. The global score captures the matching degree between the track and the whole text, while the local scores capture the matching degree between track points and words, which improves matching accuracy and adds interpretability to track-text matching.
(3) Adopting the idea of multimodal representation learning and mapping track and text into the same vector space fuses the multi-target interaction information contained in the text on top of fully mined track features, which can effectively improve the accuracy of track anomaly detection. Meanwhile, matching tracks and text in both the global and the local dimension improves matching accuracy and makes the matching results interpretable.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only embodiments of the present invention; for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
Fig. 1 is a schematic flow chart of a multi-modal-based track text matching method provided by the invention.
Fig. 2 is a schematic diagram illustrating the track feature extraction provided by the present invention.
Fig. 3 is a schematic diagram illustrating text layering embedding provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The embodiment of the invention discloses a multimodal track-text matching method comprising the following steps:
S1: extracting track features and calculating track feature matrices at different semantic levels;
S2: performing semantic annotation on a text, dividing the words of the text into different semantic levels according to the semantic structure, constructing a semantic topological graph, aggregating the nodes of the semantic topological graph with an RGCN (relational graph convolutional network) model, and updating the nodes with an attention mechanism to obtain word feature vectors;
S3: calculating the similarity between the track feature matrix and the word feature vectors at the corresponding level, and performing weighted summation to obtain an overall similarity score.
In this embodiment, as shown in fig. 2, the specific implementation process of S1 is as follows:
S11: three types of dimension information data are extracted from the track data: the track space dimension (time, longitude, latitude, altitude, etc.), the track state dimension (course, speed, etc.) and the track attribute dimension (flight number, call sign, departure, destination, etc.). The extracted data are weighted and fused; the feature extraction formulas and the weighted fusion formula are shown as formulas (1), (2), (3) and (4):

$$h_k = W_k x_k, \quad k \in \{1, 2, 3\} \tag{1-3}$$

$$e_i = f(\lambda_1 h_1 + \lambda_2 h_2 + \lambda_3 h_3) \tag{4}$$

where $h_k$ represents the learning result of the three types of dimension information data, $W_k$ the three separate learning parameter matrices, $x_k$ the three types of dimension information data, $f$ the fusion function, and $\lambda_k$ the different fusion parameters;
S12: the feature vector of each track point in one piece of track information is calculated, and the feature vectors of all track points are aggregated into a global vector at the global layer, into an action vector group at the action layer and into an entity vector group at the entity layer, obtaining the track feature matrices of the different semantic layers. Specifically:
S121: calculating the feature vector $e_i$ of each track point to obtain the vector group of one track, $e = [e_1, e_2, \ldots, e_m]$, where m represents the number of track points in the track, e is the vector group of the track, and $e_i$ is the vector of the i-th track point;
S122: to match texts at different semantic levels, three independent weight parameter matrices $W^g$, $W^a$ and $W^o$ are applied to the track point feature vectors:

$$e_i^g = W^g e_i, \quad e_i^a = W^a e_i, \quad e_i^o = W^o e_i$$

where $W^g$, $W^a$ and $W^o$ are three different learnable weight parameter matrices and $e_i^g$, $e_i^a$ and $e_i^o$ are the track point feature vectors of the global layer, the action layer and the entity layer respectively;
S123: aggregating all track point feature vectors into a global vector $g$ at the global layer, into the action vector group $a = [a_1, \ldots, a_m]$ at the action layer, and into the entity vector group $o = [o_1, \ldots, o_m]$ at the entity layer, obtaining the track feature matrices of the different semantic layers.
In this embodiment, the track description text naturally contains a hierarchical structure, as shown in FIG. 3. The whole sentence describes the global event, the sentence is composed of a plurality of verbs, and each verb is composed of a corresponding subject and an object. The track description text can be quickly and comprehensively understood through the method which comprises the global description and the local analysis. S2 thus describes in particular how to generate such a hierarchical text representation containing both global descriptions and local analyses:
S21: first, for a sentence $C = [c_1, \ldots, c_N]$, where N represents the number of words in the sentence, the whole sentence C is taken as the root node of the graph. The sentence C is fed into an off-the-shelf semantic analysis tool to obtain the verbs, the noun phrases, and the semantic role of each noun phrase with respect to its corresponding verb.
S22: each verb is connected directly to the root node as an action node, and each noun phrase is connected to its corresponding action node as an entity node; because an action node and an entity node stand in a definite semantic relation, that relation is recorded on the edge connecting them;
S23: the action nodes and entity nodes of each sentence are embedded into vectors with a bidirectional LSTM model, and the word vectors of the words in the sentence are recorded as $W = [w_1, \ldots, w_N]$, computed as

$$\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}(E c_i, \overrightarrow{h_{i-1}}; \overrightarrow{b}), \qquad \overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}(E c_i, \overleftarrow{h_{i+1}}; \overleftarrow{b}), \qquad w_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$$

where $E$ is the embedding matrix of the word vectors, $\overrightarrow{b}$ and $\overleftarrow{b}$ are the learnable bias parameters of the forward and reverse LSTM networks, $\overrightarrow{h_i}$ and $\overleftarrow{h_i}$ are the word vectors of the word at sentence position i computed forward and in reverse, $\overrightarrow{h_{i-1}}$ and $\overleftarrow{h_{i+1}}$ are the forward state of position i-1 and the reverse state of position i+1, and $w_i$ is the final word vector representation;
S24: for the root node, all descriptions of the text are needed, so all word vectors are aggregated into one vector through an attention mechanism; the global event node vector $q$ is computed as

$$q = \sum_{j=1}^{N} \alpha_j w_j, \qquad \alpha_j = \mathrm{softmax}_j(U w_j)$$

where $U$ represents a learning parameter matrix, $w_j$ represents the word vector of a word with a connection relation to the root node in the semantic topological graph, and N represents the number of words in the whole semantic topological graph with such a connection relation;
S25: for the verb nodes and entity nodes, a max-pooling operation is applied to the corresponding LSTM-embedded word vectors to obtain the verb node vectors $v = [v_1, \ldots, v_{N_a}]$ and the entity node vectors $u = [u_1, \ldots, u_{N_e}]$, where $N_a$ and $N_e$ represent the numbers of verb nodes and entity nodes respectively;
s26: and inputting the global event node vector, the verb node vector and the entity node vector as initialized node vectors into the RGCN model, and updating the nodes based on the attention mechanism to obtain word feature vectors.
Specifically, the S26 is implemented as follows:
S261: the initialized node vectors are represented as $n_i^{(0)}$ and input into the RGCN model; the relation-aware input along an edge between nodes i and j is

$$\hat{n}_j = n_j^{(0)} + W_r r_{ij}$$

where $r_{ij}$ is the one-hot vector of the semantic relation between nodes i and j, $n_j^{(0)}$ is the initialized node vector, and $W_r$ is the relation transformation matrix;
S262: suppose the output of the i-th node at the l-th RGCN layer is $n_i^{(l)}$; after adopting the attention mechanism, the attention coefficients are computed as

$$\alpha_{ij} = \mathrm{softmax}_{j \in \mathcal{N}_i}\!\left(\frac{(P_1 n_i^{(l)})^{\top}(P_2 n_j^{(l)})}{\sqrt{D}}\right)$$

where $\mathcal{N}_i$ represents the neighbor nodes of node i, $P_1$ and $P_2$ are the attention mechanism parameters, D represents the number of nodes, and $\alpha_{ij}$ represents the attention coefficient between the i-th word and the j-th word;
according to the operation rule of the RGCN, the update rule of the nodes is

$$n_i^{(l+1)} = \sigma\Bigl(\sum_{j \in \mathcal{N}_i} \alpha_{ij} W^{(l+1)} \bigl(n_j^{(l)} + W_r r_{ij}\bigr)\Bigr)$$

where $W^{(l+1)}$ is the learnable parameter matrix of the (l+1)-th layer network;
assuming the RGCN has L layers, the node vectors output by the L-th layer are the word feature vectors, in which $q$ represents the global node vector, $v$ the set of action node vectors, and $u$ the set of entity node vectors.
The invention sets the transformation matrix $W_r$ as a shared parameter, i.e. all relations share one transformation matrix. This solves the problem that an ordinary R-GCN model must compute a relation matrix and a transformation matrix for every relation, which produces a large number of parameters to compute, seriously affects the computational efficiency of the model, and causes overfitting during training.
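The effect of the shared matrix on parameter count can be seen directly; the sizes below are illustrative, not taken from the patent.

```python
# Parameter count of the relation transformations: one matrix per relation
# (ordinary R-GCN) versus the single shared matrix used here.
d = 256          # node feature size (illustrative)
n_rel = 20       # number of distinct semantic relations (illustrative)

per_relation = n_rel * d * d   # ordinary R-GCN: one d x d matrix per relation
shared = d * d                 # this method: one matrix shared by all relations

assert shared * n_rel == per_relation
print(f"per-relation: {per_relation:,}  shared: {shared:,}  ratio: {n_rel}x")
```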
In this embodiment, the specific implementation process of S3 is as follows:
S31: at the global layer, because the track and the text are each encoded into a one-dimensional global vector, their similarity can be measured directly with a cosine function; the similarity between the track global vector $g$ and the global event node vector $q$ gives the global matching score

$$S_g = \cos(g, q);$$

at the action layer and the entity layer, the track and the text must be matched locally. For each text node vector $t_i^x$ and each track point vector $p_j^x$, with $x \in \{a, e\}$, first the pairwise local cross-modal matching score $s_{ij}^x = \cos(t_i^x, p_j^x)$ is computed. To show which track points a word is related to, the scores are normalized as

$$\beta_{ij}^x = \frac{\exp(\lambda s_{ij}^x)}{\sum_{k} \exp(\lambda s_{ik}^x)}$$

where $\beta_{ij}^x$ is in effect an attention parameter indicating whether word node i is related to track node j, and $\lambda$ represents a learnable neural network parameter. The matching score of word node i against the whole track is $s_i^x = \sum_j \beta_{ij}^x s_{ij}^x$; averaging over the word nodes gives the action matching score $S_a$ (x = a) and the entity matching score $S_e$ (x = e).
S32: the global matching score, the action matching score and the entity matching score are weighted and summed into the overall similarity score $S = w_g S_g + w_a S_a + w_e S_e$, where $w_g$, $w_a$ and $w_e$ are the fusion weights.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed in the embodiment corresponds to the method disclosed in the embodiment, so that the description is simple, and the relevant points can be referred to the description of the method part.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (9)
1. A multi-mode-based track text matching method is characterized by comprising the following steps:
s1: extracting track features and calculating track feature matrixes of different semantic layers;
s2: performing semantic annotation on a text, dividing words of the text into different semantic levels according to a semantic structure, constructing a semantic topological graph, aggregating nodes of the semantic topological graph based on an RGCN (reduced graphics context) model, and updating the nodes based on an attention mechanism to obtain word feature vectors;
s3: calculating the similarity of the track feature matrix and the word feature vector of the corresponding level, and performing weighted summation to obtain an overall similarity score;
s1 comprises the following steps:
s11: extracting three types of dimensional information data of the track data, and performing weighted fusion on the extracted three types of dimensional information data;
s12: calculating a feature vector for each track point in one piece of track information, aggregating the feature vectors of all track points into a global vector at the global layer, into an action vector at the action layer, and into an entity vector at the entity layer, to obtain track feature matrices of different semantic layers;
s2 specifically comprises the following steps:
s21: analyzing the relations among the verbs, the noun phrases, the verbs and the corresponding nouns in each sentence;
s22: taking the whole sentence as a root node, connecting each verb to the root node as an action node, connecting each noun phrase to its corresponding action node as an entity node, and taking the edges connecting action nodes and entity nodes as semantic relations, so as to form the different semantic hierarchies;
s23: embedding action nodes and entity nodes in each sentence into a vector based on a bidirectional LSTM model;
s24: aggregating the word vectors of all words into a global event node vector through an attention mechanism;
s25: obtaining word vectors after vector embedding of action nodes and entity nodes in sentences, and performing maximum pooling operation to correspondingly obtain verb node vectors and entity node vectors;
s26: and constructing a semantic topological graph based on the global event node vector, the verb node vector and the entity node vector, inputting the global event node vector, the verb node vector and the entity node vector into the RGCN model as initialized node vectors, and updating the nodes based on the attention mechanism to obtain a word feature vector.
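Steps S21–S26 describe building the semantic topological graph: a root node for the sentence, action nodes for verbs, entity nodes for noun phrases. A minimal sketch in plain Python follows; the sentence, the semantic-role annotations and the relation labels are hypothetical examples, not taken from the patent.

```python
# Sketch of the semantic topological graph of S22 (hypothetical sentence).
# Root = whole sentence, action nodes = verbs, entity nodes = noun phrases;
# edges carry the semantic relation between a verb and its arguments.
sentence = "the vehicle leaves the depot and crosses the bridge"
annotations = [  # (verb, [(noun phrase, relation), ...]) -- assumed SRL output
    ("leaves", [("the vehicle", "agent"), ("the depot", "object")]),
    ("crosses", [("the vehicle", "agent"), ("the bridge", "object")]),
]

graph = {"nodes": [{"id": 0, "type": "root", "text": sentence}], "edges": []}
for verb, args in annotations:
    a_id = len(graph["nodes"])
    graph["nodes"].append({"id": a_id, "type": "action", "text": verb})
    graph["edges"].append((0, a_id, "has_action"))   # root -> action
    for phrase, rel in args:
        e_id = len(graph["nodes"])
        graph["nodes"].append({"id": e_id, "type": "entity", "text": phrase})
        graph["edges"].append((a_id, e_id, rel))     # action -> entity
```

The node vectors of S23–S25 would then be attached to these node ids before the graph is fed to the RGCN.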
2. The multi-modality based track text matching method as claimed in claim 1, wherein S11 specifically includes:
the three types of dimension information comprise a track space dimension, a track state dimension and a track attribute dimension;
the calculation formula of three types of dimensional information data for extracting the track data is as follows:
e a =E 1 (X a )
e b =E 2 (X b )
e c =E 3 (X c )
the weighted fusion formula is:
e i =δ([e a *α a ,e b *α b ,e c *α c ])
wherein e x X ∈ (a, b, c) denotes a learning result for three types of dimensional information data, E x X ∈ (1, 2, 3) denotes three independent learning parameter matrices, X x X belongs to (a, b, c) represents three kinds of dimension information data, delta is a fusion function, and alpha is x And x ∈ (a, b, c) denotes different fusion parameters.
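The extraction and fusion step of claim 2 can be sketched as follows. The array sizes, the fusion weights α_x, and the choice of δ as a learned linear map followed by tanh are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
M, d = 5, 8                      # 5 trace points, 8-dim embeddings (assumed sizes)

# X_a, X_b, X_c: spatial, state and attribute dimension data of the track
X = {k: rng.normal(size=(M, 4)) for k in "abc"}
E = {k: rng.normal(size=(4, d)) for k in "abc"}   # three independent learned matrices
alpha = {"a": 0.5, "b": 0.3, "c": 0.2}            # fusion weights (assumed values)

# e_x = E_x(X_x); weighted concatenation, then the fusion function delta,
# taken here to be a linear map back to d dimensions followed by tanh
concat = np.concatenate([X[k] @ E[k] * alpha[k] for k in "abc"], axis=1)  # (M, 3d)
W_delta = rng.normal(size=(3 * d, d))
e = np.tanh(concat @ W_delta)    # per-point fused feature vectors e_i, shape (M, d)
```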
3. The multi-modality-based track text matching method as claimed in claim 2, wherein the S12 specifically includes:
s121: calculating the feature vector e_i, i ∈ (1, 2, ... M), of each track point in one piece of track information, giving for one track the vector group e = [e_1, e_2, ... e_M], where M represents the number of trace points in the track, e is the vector group of the track, and e_i is the vector of a track point;
s122: in order to match texts of different semantic levels, adopting three independent weight parameter matrices (denoted W_c, W_a and W_e) to process the feature vectors of the track points, obtaining three independently expressed track feature vector groups, with the calculation formulas:

v_{c,i} = W_c e_i

v_{a,i} = W_a e_i

v_{e,i} = W_e e_i

where W_c, W_a and W_e represent three different learnable weight parameter matrices, v_{c,i} represents the feature vector of a trace point at the global layer, v_{a,i} represents the feature vector of a trace point at the action layer, and v_{e,i} represents the feature vector of a trace point at the entity layer;
s123: aggregating the feature vectors of all track points into a global vector V_c at the global layer, into an action vector V_a = [v_{a,1}, v_{a,2}, ... v_{a,M}] at the action layer, and into an entity vector V_e = [v_{e,1}, v_{e,2}, ... v_{e,M}] at the entity layer, to obtain the track feature matrices of the different semantic layers.
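A sketch of S121–S123, assuming the three weight parameters act as linear projections of each track-point feature and that the global vector is obtained by mean pooling; the matrix names W_c, W_a, W_e and the pooling choice are assumptions, not stated in the claim.

```python
import numpy as np

rng = np.random.default_rng(1)
M, d = 5, 8
e = rng.normal(size=(M, d))            # trace-point features e_i from S11

# three independent learnable weight matrices, one per semantic level
W = {lvl: rng.normal(size=(d, d)) for lvl in ("c", "a", "e")}
v = {lvl: e @ W[lvl] for lvl in W}     # v_{x,i} = W_x e_i for each level

V_c = v["c"].mean(axis=0)              # global vector (mean pooling assumed)
V_a = v["a"]                           # action-layer matrix [v_a,1 ... v_a,M]
V_e = v["e"]                           # entity-layer matrix [v_e,1 ... v_e,M]
```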
4. The multi-modal-based track text matching method according to claim 3, wherein the word vectors of the words in a sentence form W = [w_1, w_2, ... w_N].
5. The multi-modal-based track text matching method according to claim 4, wherein the calculation formula of embedding the action nodes and the entity nodes into one vector is as follows:
in the formula, W C Is an embedded matrix of word vectors that is,is a learnable bias parameter for a forward LSTM network,Is a learnable bias parameter for the reverse LSTM network,a word vector representing the word in the forward computing sentence i,a word vector representing the words in the reverse-calculated sentence i,a word vector representing the word in the forward computation sentence i-1,word vector, w, representing the word in the inverse computation sentence i +1 i Representing the final word vector representation;
the global event node vector g_c is calculated by the following formula:
where W_e represents a learnable parameter matrix, w_i represents a word vector, w_j represents the word vector of a word connected to w_i in the semantic topological graph, and N represents the number of words connected to w_i in the whole semantic topological graph;
the verb node vectors are g_a = [g_{a,1}, g_{a,2}, ... g_{a,Na}] and the entity node vectors are g_e = [g_{e,1}, g_{e,2}, ... g_{e,Ne}], where Na and Ne represent the number of verb nodes and the number of entity nodes, respectively.
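The bidirectional embedding of claim 5 might look like the following sketch. A plain tanh recurrence stands in for the LSTM cell, and all sizes and parameters are random stand-ins; only the forward/backward recurrence pattern and the final concatenation follow the claim.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 6, 8                                 # 6 words, 8-dim states (assumed)
W_C = rng.normal(size=(d, d))               # word-embedding matrix
x = rng.normal(size=(N, d))                 # input word representations (assumed)

def run(direction, W_h, b):
    """One recurrent pass; a tanh cell stands in for the LSTM cell of S23."""
    h, out = np.zeros(d), np.zeros((N, d))
    order = range(N) if direction == "fwd" else range(N - 1, -1, -1)
    for i in order:
        h = np.tanh(W_C @ x[i] + W_h @ h + b)  # uses h_{i-1} (fwd) or h_{i+1} (bwd)
        out[i] = h
    return out

fwd = run("fwd", rng.normal(size=(d, d)), rng.normal(size=d))
bwd = run("bwd", rng.normal(size=(d, d)), rng.normal(size=d))
w = np.concatenate([fwd, bwd], axis=1)      # final word vectors w_i, shape (N, 2d)
```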
6. The multi-modal-based track text matching method according to claim 5, wherein S26 is specifically:
s261: representing the node vectors as g_i ∈ {g_c, g_a, g_e} and inputting them into the RGCN model, the calculation formula being as follows:
where r_{i,j} is the one-hot vector of the semantic relation between nodes i and j, g_i^0 denotes the initialized node vector, and W_r is the relation transformation matrix;
s262: supposing the output of the RGCN for the i-th node at layer l is h_i^l, the calculation process after adopting the attention mechanism is as follows:
where N_i represents the neighbour nodes of node i, two learnable attention-mechanism parameters are used, D represents the number of nodes, and α_{ij} represents the attention coefficient between the i-th word and the j-th word;
according to the operation rule of the RGCN, the update rule of the node is as follows:
where W_t^{l+1} represents the learnable parameter matrix of the layer-(l+1) network;
7. The multi-modality-based track text matching method as claimed in claim 6, wherein S3 specifically comprises:
s31: calculating, at the global layer, the similarity of the global vector and the global node through a cosine function to obtain a global matching score s_c;
respectively obtaining an action matching score and an entity matching score by locally matching the track and the text at the action layer and the entity layer;
s32: and carrying out weighted summation on the global matching score, the action matching score and the entity matching score to obtain an overall similarity score.
8. The multi-modal-based track text matching method according to claim 7, wherein in S31 the global matching score s_c is calculated by the formula:

s_c = cos(t_c, v_c)
the specific calculation process of the action matching score and the entity matching score is as follows:
wherein each text node vector t x,i ∈t x Where x ∈ { a, e },
vector v of each trace point x,j ∈V x Where x ∈ { a, e };
the matching scores are normalized, with the calculation formula:

α_{i,j} = exp(λ[s_{x,i,j}]_+) / Σ_j exp(λ[s_{x,i,j}]_+)

where [·]_+ = max(·, 0), α_{i,j} is an attention parameter indicating whether word node i is related to track node j, and λ represents a learnable neural network parameter;
the matching score of word node i against the whole track is calculated by the formula s_{x,i} = Σ_j α_{i,j} s_{x,i,j};
the local matching score of the text and the track is s_x = Σ_i s_{x,i};
when x is a, the local matching score is the action matching score s_a, and when x is e, the local matching score is the entity matching score s_e.
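The local matching of claim 8 can be sketched as follows, assuming a softmax over the zero-clipped cosine scores as the normalization; λ and all vectors here are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(4)
Nw, M, d = 3, 5, 8            # 3 word nodes, 5 trace points (assumed sizes)
t = rng.normal(size=(Nw, d))  # text node vectors t_{x,i}
V = rng.normal(size=(M, d))   # trace-point vectors v_{x,j}
lam = 4.0                     # learnable parameter lambda (assumed value)

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# pairwise scores, clipped at zero ([.]_+ = max(., 0)) and softmax-normalised
s = np.array([[max(cos(t[i], V[j]), 0.0) for j in range(M)] for i in range(Nw)])
alpha = np.exp(lam * s) / np.exp(lam * s).sum(axis=1, keepdims=True)

# matching score of word node i against the whole track, then the local score s_x
s_i = (alpha * s).sum(axis=1)
s_x = s_i.sum()
```

Running this once for x = a (action layer) and once for x = e (entity layer) yields the two local scores that enter the weighted sum of S32.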
9. The multi-modal-based track text matching method according to claim 8, wherein the overall similarity score s (v, t) is calculated by the formula:
s(v, t) = (s_c + s_a + s_e)/3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211155406.8A CN115248877B (en) | 2022-09-22 | 2022-09-22 | Multi-mode-based track text matching method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115248877A CN115248877A (en) | 2022-10-28 |
CN115248877B true CN115248877B (en) | 2023-01-17 |
Family
ID=83699220
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211155406.8A Active CN115248877B (en) | 2022-09-22 | 2022-09-22 | Multi-mode-based track text matching method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115248877B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103196430A (en) * | 2013-04-27 | 2013-07-10 | 清华大学 | Mapping navigation method and system based on flight path and visual information of unmanned aerial vehicle |
CN109033245A (en) * | 2018-07-05 | 2018-12-18 | 清华大学 | A kind of mobile robot visual-radar image cross-module state search method |
CN110647662A (en) * | 2019-08-03 | 2020-01-03 | 电子科技大学 | Multi-mode spatiotemporal data association method based on semantics |
CN113780003A (en) * | 2021-08-31 | 2021-12-10 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Cross-modal enhancement method for space-time data variable-division encoding and decoding |
CN114443864A (en) * | 2022-01-29 | 2022-05-06 | 北京百度网讯科技有限公司 | Cross-modal data matching method and device and computer program product |
Non-Patent Citations (3)
Title |
---|
Temporal Multi-modal Graph Transformer with Global-Local Alignment for Video-Text Retrieval;Zerun Feng;《IEEE Transactions on Circuits and Systems for Video Technology》;20211231;第1-16页 * |
A survey of research and applications of similarity measurement methods for time-series-based trajectory data; Pan Xiao et al.; Journal of Yanshan University; 20191130 (No. 06); pp. 65-79 *
Multi-modal spatio-temporal semantic trajectory prediction; Zhu Shaojie; China Masters' Theses Full-text Database (Electronic Journal), Information Science and Technology; 20220315; pp. I138-1807 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Yu et al. | Spatial pyramid-enhanced NetVLAD with weighted triplet loss for place recognition | |
Zhang et al. | A spatial attentive and temporal dilated (SATD) GCN for skeleton‐based action recognition | |
CN111488984B (en) | Method for training track prediction model and track prediction method | |
WO2023280065A1 (en) | Image reconstruction method and apparatus for cross-modal communication system | |
CN111488734A (en) | Emotional feature representation learning system and method based on global interaction and syntactic dependency | |
CN112560432B (en) | Text emotion analysis method based on graph attention network | |
CN112084331A (en) | Text processing method, text processing device, model training method, model training device, computer equipment and storage medium | |
CN110490254B (en) | Image semantic generation method based on double attention mechanism hierarchical network | |
CN114048350A (en) | Text-video retrieval method based on fine-grained cross-modal alignment model | |
CN111414461A (en) | Intelligent question-answering method and system fusing knowledge base and user modeling | |
CN113076465A (en) | Universal cross-modal retrieval model based on deep hash | |
Yue et al. | Action recognition based on RGB and skeleton data sets: A survey | |
CN113780003B (en) | Cross-modal enhancement method for space-time data variable-division encoding and decoding | |
CN112487822A (en) | Cross-modal retrieval method based on deep learning | |
CN114942998B (en) | Knowledge graph neighborhood structure sparse entity alignment method integrating multi-source data | |
CN115392248A (en) | Event extraction method based on context and drawing attention | |
Zhao | A systematic survey of remote sensing image captioning | |
CN113641854B (en) | Method and system for converting text into video | |
Zhao et al. | Convolutional network embedding of text-enhanced representation for knowledge graph completion | |
CN115248877B (en) | Multi-mode-based track text matching method | |
CN116883746A (en) | Graph node classification method based on partition pooling hypergraph neural network | |
CN113516118B (en) | Multi-mode cultural resource processing method for joint embedding of images and texts | |
Qiang et al. | Hybrid deep neural network-based cross-modal image and text retrieval method for large-scale data | |
Zhang et al. | HSIE: improving named entity disambiguation with hidden semantic information extractor | |
CN114254738A (en) | Double-layer evolvable dynamic graph convolution neural network model construction method and application |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||