CN115248877B - Multi-mode-based track text matching method - Google Patents


Info

Publication number
CN115248877B
CN115248877B (application CN202211155406.8A)
Authority
CN
China
Prior art keywords
vector
track
node
word
text
Prior art date
Legal status
Active
Application number
CN202211155406.8A
Other languages
Chinese (zh)
Other versions
CN115248877A (en)
Inventor
王立才
唐瀚超
罗琪彬
武广胜
Current Assignee
CETC 15 Research Institute
Original Assignee
CETC 15 Research Institute
Priority date
Filing date
Publication date
Application filed by CETC 15 Research Institute filed Critical CETC 15 Research Institute
Priority to CN202211155406.8A
Publication of CN115248877A
Application granted
Publication of CN115248877B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/732Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a multi-mode-based track text matching method, which comprises the following steps: extracting track features and calculating track feature matrices of different semantic layers; performing semantic annotation on a text, dividing the words of the text into different semantic levels according to the semantic structure, constructing a semantic topological graph, aggregating the nodes of the semantic topological graph based on an RGCN model, and updating the nodes based on an attention mechanism to obtain word feature vectors; and calculating the similarity of the track feature matrix and the word feature vectors of the corresponding level, and performing weighted summation to obtain an overall similarity score. Dividing the text into different semantic levels that are encoded independently retains the detail information of the text as well as the topological structure in the track and the text, improving the accuracy of matching retrieval; meanwhile, an attention mechanism and a parameter-sharing method are introduced, reducing the number of calculation parameters and improving calculation efficiency.

Description

Multi-mode-based track text matching method
Technical Field
The invention relates to the technical field of machine learning, in particular to a multi-mode-based track text matching method.
Background
Today, with the rapid development of information technology, multimodal data has become the main form of data resources. Multimodal data is multi-source and heterogeneous, yet internally semantically consistent in describing the same object: each modality describes a feature of the object from a particular angle. Research on multimodal learning methods is therefore of great value for enabling computing systems to handle multi-source heterogeneous mass data. In recent years, a number of highly effective multimodal deep learning algorithms have been applied to practical scenarios. For example, the video-frame and face-frame image vector retrieval used by Youku Video and the "Only Watch TA" function proposed by iQIYI could not be supported without multi-modal recognition technology, and the "snap-and-shop" functions of the major e-commerce platforms use information fusion across modalities such as images, text and high-level semantic attributes to realize high-precision search-by-image.
Existing cross-modal retrieval methods for video and text mainly fall into two categories:
The first is global encoding, which extracts video and text features with different algorithms, encodes the video and the corresponding text feature vectors each into a single holistic vector, and measures video-text similarity by the distance between the two vectors in a common space under a defined distance function. However, encoding the video and the corresponding text directly into holistic vectors can lose the detail information in both: a passage of text may contain several actions, each involving different entities, and simply embedding the whole text into one vector can drop these details, so that the correspondence between video and text is not well captured and the model performs poorly.
The second is local encoding, which encodes each frame of the video and each word of the text into a vector, computes the similarity of each word to each video frame, and sums these similarities into an overall similarity. However, local encoding greatly increases the number of model parameters to compute, making learning and match detection less efficient; moreover, these sequential vectors representing video and text ignore the topological structure in both, so that vector matching may fall into local matches while ignoring the global match.
In order to overcome these defects of the prior art, and aiming at the problem, urgent in current big data processing, of recommending a suitable natural language description for massive track data, the invention provides a multi-mode-based track text matching method.
Disclosure of Invention
In view of the above, the invention provides a multi-mode-based track text matching method, which divides a text into three semantic levels that are encoded independently, so that both the detail information of the text and the topological structure of the track and the text are retained, improving the accuracy of matching retrieval; meanwhile, an attention mechanism and a parameter-sharing method are introduced, reducing the number of calculation parameters and improving calculation efficiency.
In order to achieve the purpose, the invention adopts the following technical scheme:
a multi-modal-based track text matching method comprises the following steps:
S1: extracting track features and calculating track feature matrices of different semantic layers;
S2: performing semantic annotation on a text, dividing the words of the text into different semantic levels according to the semantic structure, constructing a semantic topological graph, aggregating the nodes of the semantic topological graph based on an RGCN model, and updating the nodes based on an attention mechanism to obtain word feature vectors;
S3: calculating the similarity of the track feature matrix and the word feature vectors of the corresponding level, and performing weighted summation to obtain an overall similarity score.
Preferably, S1 comprises:
S11: extracting three types of dimensional information data of the track data, and performing weighted fusion on the extracted three types of dimensional information data;
S12: calculating the feature vector of each track point in one piece of track information, aggregating the feature vectors of all track points into a global vector at the global layer, into an action vector at the action layer, and into an entity vector at the entity layer, obtaining the track feature matrices of different semantic layers.
Preferably, S11 specifically includes:
the three types of dimension information comprise a track space dimension, a track state dimension and a track attribute dimension;
the calculation formula of three types of dimensional information data for extracting the track data is as follows:
Figure 635701DEST_PATH_IMAGE001
Figure 868100DEST_PATH_IMAGE002
Figure 170905DEST_PATH_IMAGE003
the weighted fusion formula is:
Figure 918281DEST_PATH_IMAGE004
wherein
Figure 331945DEST_PATH_IMAGE005
Representing the learning result of three kinds of dimensional information data,
Figure 215587DEST_PATH_IMAGE006
three separate learning parameter matrices are shown,
Figure 892556DEST_PATH_IMAGE007
represents three types of dimensional information data,
Figure 861649DEST_PATH_IMAGE008
in order to be a function of fusion,
Figure 79004DEST_PATH_IMAGE009
representing different fusion parameters.
Preferably, S12 specifically includes:
S121: calculating the feature vector e_i, i ∈ (1, 2, ..., M), of each track point in one piece of track information, giving the vector group of one track e = [e_1, e_2, ..., e_M], where M represents the number of track points in a track, e is the vector group of a track, and e_i is the vector of a track point;
S122: for matching with texts of different semantic levels, three independent weight parameters W_c, W_a and W_e are adopted to process the feature vectors of the track points, obtaining three independently expressed track feature vector groups, with the calculation formulas:
v_c,i = W_c·e_i
v_a,i = W_a·e_i
v_e,i = W_e·e_i
wherein W_c, W_a and W_e represent three different learnable weight parameter matrices, v_c,i represents the feature vector of a track point at the global layer, v_a,i represents the feature vector of a track point at the action layer, and v_e,i represents the feature vector of a track point at the entity layer;
S123: aggregating the feature vectors of all track points into a global vector V_c at the global layer, aggregating all track point feature vectors into an action vector V_a = [v_a,1, v_a,2, ..., v_a,M] at the action layer, and aggregating the feature vectors of all track points into an entity vector V_e = [v_e,1, v_e,2, ..., v_e,M] at the entity layer, obtaining the track feature matrices of different semantic layers.
Preferably, S2 specifically includes:
S21: analyzing the verbs, the noun phrases, and the relations between the verbs and the corresponding nouns in each sentence;
S22: taking the whole sentence as the root node, connecting the verbs as action nodes to the root node, connecting the noun phrases as entity nodes to the corresponding action nodes, and taking the edges connecting the action nodes with the entity nodes as semantic relations, forming the different semantic hierarchies;
S23: embedding the action nodes and entity nodes in each sentence into vectors based on a bidirectional LSTM model, and recording the word vector of each word in the sentence as W = [w_1, w_2, ..., w_N];
S24: aggregating the word vectors of all words into a global event node vector through an attention mechanism;
S25: obtaining the word vectors after vector embedding of the action nodes and entity nodes in the sentences, and performing a maximum pooling operation to correspondingly obtain verb node vectors and entity node vectors;
S26: constructing a semantic topological graph based on the global event node vector, the verb node vectors and the entity node vectors, inputting them into the RGCN model as initialized node vectors, and updating the nodes based on the attention mechanism to obtain word feature vectors.
Preferably, the calculation formulas for embedding the action nodes and the entity nodes into one vector are:
h→_i = LSTM(W_C·c_i, h→_(i-1); θ→)
h←_i = LSTM(W_C·c_i, h←_(i+1); θ←)
w_i = (h→_i + h←_i) / 2
in the formulas, W_C is the embedding matrix of the word vectors, θ→ is a learnable bias parameter of the forward LSTM network, θ← is a learnable bias parameter of the reverse LSTM network, h→_i represents the hidden vector of the i-th word of the sentence computed in the forward direction, h←_i represents the hidden vector of the i-th word computed in the reverse direction, h→_(i-1) represents the forward hidden vector of the (i-1)-th word, h←_(i+1) represents the reverse hidden vector of the (i+1)-th word, and w_i represents the final word vector representation;
the global event node vector g_c is calculated as:
a_i = exp(W_e·w_i) / Σ_(j=1..N) exp(W_e·w_j)
g_c = Σ_(i=1..N) a_i·w_i
in the formulas, W_e represents a learning parameter matrix, w_i represents a word vector, w_j represents the word vector of a word connected with w_i in the semantic topological graph, and N represents the number of words connected with w_i in the whole semantic topological graph;
the verb node vectors are g_a = [g_a,1, g_a,2, ..., g_a,Na] and the entity node vectors are g_e = [g_e,1, g_e,2, ..., g_e,Ne], where Na and Ne represent the numbers of verb nodes and entity nodes, respectively.
Preferably, S26 is specifically:
S261: representing the node vectors as g_i ∈ {g_c, g_a, g_e} and inputting the node vectors into the RGCN model, with the calculation formula:
ĝ_i = (W_r·r_i,j) ⊙ g_i
wherein r_i,j is the one-hot vector of the semantic relation corresponding to nodes i and j, ĝ_i represents the initialized node vector, and W_r is a relation transformation matrix;
S262: supposing that the output of the RGCN for the i-th node at the l-th layer is g_i^l, the calculation process after adopting the attention mechanism is:
s_ij = (W_a^l·g_i^l)^T (W_b^l·g_j^l) / √D
α_ij = exp(s_ij) / Σ_(k∈N_i) exp(s_ik)
wherein N_i represents the neighbor nodes of node i, W_a^l and W_b^l are the attention mechanism parameters, D represents the number of nodes, and α_ij represents the attention coefficient between the i-th word and the j-th word;
according to the operation rule of the RGCN, the update rule of the nodes is:
g_i^(l+1) = σ( Σ_(j∈N_i) α_ij·W_t^(l+1)·ĝ_j^l )
in the formula, W_t^(l+1) is the learnable parameter matrix of the (l+1)-th layer network;
assuming that the number of layers of the RGCN is L, the node vectors output by the L-th layer are the word feature vectors, wherein g_c^L represents the global node vector, g_a^L represents the set of action node vectors, and g_e^L represents the set of entity node vectors.
Preferably, S3 specifically includes:
S31: calculating the similarity of the global vector and the global node through a cosine function at the global layer to obtain a global matching score s_c, and locally matching the track and the text at the action layer and the entity layer to obtain an action matching score and an entity matching score, respectively;
S32: carrying out weighted summation on the global matching score, the action matching score and the entity matching score to obtain an overall similarity score.
Preferably, in S31, the global matching score s_c is calculated as:
s_c = cos(t_c, v_c)
The specific calculation process of the action matching score and the entity matching score is as follows:
a pair of local cross-modal matching scores is calculated by:
s_ij^x = cos(t_x,i, v_x,j)
wherein each text node vector satisfies t_x,i ∈ t_x, where x ∈ {a, e}, and each track point vector satisfies v_x,j ∈ V_x, where x ∈ {a, e};
the matching scores are normalized with the calculation formula:
a_ij^x = softmax(λ [s_ij^x]_+)
wherein [·]_+ = max(·, 0), a_ij^x is an attention parameter indicating whether word node i has a relationship with track node j, and λ represents a learnable neural network parameter;
the matching score of word node i to the whole track is calculated as:
s_x,i = Σ_j a_ij^x·s_ij^x
and the local matching score of the text and the track is calculated as:
s_x = Σ_i s_x,i
When x is a, the local matching score is the action matching score s_a, and when x is e, the local matching score is the entity matching score s_e.
Preferably, the overall similarity score s(v, t) is calculated as:
s(v, t) = (s_c + s_a + s_e) / 3.
the invention has the following advantages:
(1) The method comprises the steps of dividing all words of a text into three different semantic levels by using a natural language processing technology of semantic annotation, constructing a semantic topological graph according to a semantic role relationship, fully extracting fine-grained information of the text, and simultaneously keeping a topological structure of the text, so that the learned text features are more accurate.
(2) Tracks and text are matched from both the global and the local dimension, and the matching scores are weighted and summed. The global score captures the matching degree between the track and the whole text, while the local scores capture the matching degree between track points and words, which improves matching accuracy and also makes track-text matching more interpretable.
(3) By adopting the idea of multi-mode representation learning and the method of mapping the track and the text to the same vector space, the multi-target interaction information contained in the text is fused on the basis of fully excavating the characteristics of the track, so that the accuracy of track anomaly detection can be effectively improved. Meanwhile, the tracks and the texts are matched from the global dimension and the local dimension, so that the matching accuracy is improved, and the interpretability is provided for the matching result.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a schematic flow chart of a multi-modal-based track text matching method provided by the invention.
Fig. 2 is a schematic diagram illustrating the track feature extraction provided by the present invention.
Fig. 3 is a schematic diagram illustrating text layering embedding provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The embodiment of the invention discloses a multi-mode-based track text matching method, which comprises the following steps:
S1: extracting track features and calculating track feature matrices of different semantic layers;
S2: performing semantic annotation on a text, dividing the words of the text into different semantic levels according to the semantic structure, constructing a semantic topological graph, aggregating the nodes of the semantic topological graph based on an RGCN (Relational Graph Convolutional Network) model, and updating the nodes based on an attention mechanism to obtain word feature vectors;
S3: calculating the similarity of the track feature matrix and the word feature vectors of the corresponding level, and performing weighted summation to obtain an overall similarity score.
In this embodiment, as shown in fig. 2, the specific implementation process of S1 is as follows:
S11: three types of dimensional information data of the track data are extracted. The three types of dimensional information comprise the track space dimension, the track state dimension and the track attribute dimension: time, longitude, latitude, height and the like are track space dimension data; course, speed and the like are track state dimension data; flight number, call sign, departure place, destination and the like are track attribute dimension data. The extracted three types of dimensional information data are weighted and fused, with the feature extraction formulas and the weighted fusion formula shown as formulas (1), (2), (3) and (4):
e_a = E_1(X_a)    (1)
e_b = E_2(X_b)    (2)
e_c = E_3(X_c)    (3)
e_i = δ([e_a·α_a, e_b·α_b, e_c·α_c])    (4)
wherein e_x, x ∈ (a, b, c), represents the learning results of the three kinds of dimensional information data, E_x, x ∈ (1, 2, 3), represents three independent learning parameter matrices, X_x, x ∈ (a, b, c), represents the three types of dimensional information data, δ is the fusion function, and α_x, x ∈ (a, b, c), represents different fusion parameters;
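By way of illustration, the following Python (PyTorch) sketch shows one possible reading of formulas (1) to (4). The class and parameter names are hypothetical, and the choices of linear layers for E_1 to E_3 and of a weighted concatenation followed by a linear map for the fusion function δ are assumptions; the embodiment does not fix them.

    import torch
    import torch.nn as nn

    class TrackPointFusion(nn.Module):
        # Sketch of formulas (1)-(4): one encoder per dimension type, then weighted fusion.
        def __init__(self, d_space, d_state, d_attr, d_model):
            super().__init__()
            self.E1 = nn.Linear(d_space, d_model)  # track space dimension (time, lon, lat, height)
            self.E2 = nn.Linear(d_state, d_model)  # track state dimension (course, speed)
            self.E3 = nn.Linear(d_attr, d_model)   # track attribute dimension (flight no., call sign, ...)
            self.alpha = nn.Parameter(torch.ones(3))      # fusion parameters alpha_a, alpha_b, alpha_c
            self.delta = nn.Linear(3 * d_model, d_model)  # assumed form of the fusion function delta

        def forward(self, x_a, x_b, x_c):
            e_a = self.E1(x_a)                            # formula (1)
            e_b = self.E2(x_b)                            # formula (2)
            e_c = self.E3(x_c)                            # formula (3)
            fused = torch.cat([self.alpha[0] * e_a,
                               self.alpha[1] * e_b,
                               self.alpha[2] * e_c], dim=-1)
            return self.delta(fused)                      # formula (4): e_i = delta([...])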
S12: the feature vector of each track point in one piece of track information is calculated; the feature vectors of all track points are aggregated into a global vector at the global layer, into an action vector at the action layer, and into an entity vector at the entity layer, obtaining the track feature matrices of the different semantic layers. The method specifically comprises the following steps:
S121: calculating the feature vector e_i, i ∈ (1, 2, ..., M), of each track point in one piece of track information, giving the vector group of one track e = [e_1, e_2, ..., e_M], where M represents the number of track points in a track, e is the vector group of a track, and e_i is the vector of a track point;
S122: for matching with texts of different semantic levels, three independent weight parameters W_c, W_a and W_e are used to process the feature vectors of the track points, with the calculation formulas:
v_c,i = W_c·e_i
v_a,i = W_a·e_i
v_e,i = W_e·e_i    (5)
wherein W_c, W_a and W_e represent three different learnable weight parameter matrices, v_c,i represents the feature vector of a track point at the global layer, v_a,i represents the feature vector of a track point at the action layer, and v_e,i represents the feature vector of a track point at the entity layer;
S123: aggregating the feature vectors of all track points into a global vector V_c at the global layer, aggregating all track point feature vectors into an action vector V_a = [v_a,1, v_a,2, ..., v_a,M] at the action layer, and aggregating the feature vectors of all track points into an entity vector V_e = [v_e,1, v_e,2, ..., v_e,M] at the entity layer, obtaining the track feature matrices of the different semantic layers.
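A minimal sketch of S122 and S123 under the same assumptions follows; the patent does not state how the point vectors are pooled into the single global vector V_c, so the mean pooling used here is an assumption.

    import torch.nn as nn

    class TrackLevelEncoder(nn.Module):
        # Sketch of S122/S123: project point vectors with W_c, W_a, W_e, then aggregate per layer.
        def __init__(self, d_model):
            super().__init__()
            self.W_c = nn.Linear(d_model, d_model, bias=False)  # global layer weights
            self.W_a = nn.Linear(d_model, d_model, bias=False)  # action layer weights
            self.W_e = nn.Linear(d_model, d_model, bias=False)  # entity layer weights

        def forward(self, e):                 # e: (M, d_model), point vectors of one track
            V_a = self.W_a(e)                 # action vector group [v_a,1 ... v_a,M]
            V_e = self.W_e(e)                 # entity vector group [v_e,1 ... v_e,M]
            V_c = self.W_c(e).mean(dim=0)     # assumed aggregation into one global vector
            return V_c, V_a, V_e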
In this embodiment, the track description text naturally contains a hierarchical structure, as shown in FIG. 3: the whole sentence describes the global event, the sentence is composed of a plurality of verbs, and each verb has a corresponding subject and object. A track description text can therefore be understood quickly and comprehensively by combining the global description with local analysis. S2 accordingly describes how to generate such a hierarchical text representation containing both the global description and the local analysis:
S21: first, for a sentence C = [c_1, c_2, ..., c_N], where N represents the number of words in the sentence, the whole sentence C is taken as the root node of the graph. The sentence C is input into an off-the-shelf semantic analysis tool to obtain the verbs, the noun phrases, and the semantic role of each noun phrase with respect to its corresponding verb.
S22: each verb is used as an action node to be directly connected with a root node, each noun phrase is used as an entity node to be connected with the corresponding action node, and because the action node and the entity node have a corresponding semantic relationship, the edges connected by the action node and the entity node are used for recording the corresponding semantic relationship;
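A sketch of the graph construction of S21 and S22 follows. It assumes the semantic roles have already been produced by an off-the-shelf semantic role labeling tool and are passed in as (verb, arguments) frames; the networkx representation and the role labels in the example are illustrative assumptions.

    import networkx as nx

    def build_semantic_graph(sentence, srl_frames):
        # srl_frames: list of (verb, [(noun_phrase, role), ...]) tuples from an SRL tool.
        g = nx.Graph()
        g.add_node("ROOT", kind="event", text=sentence)        # whole sentence as root node
        for k, (verb, args) in enumerate(srl_frames):
            v = f"action_{k}"
            g.add_node(v, kind="action", text=verb)
            g.add_edge("ROOT", v, relation="has_action")       # verb connected to the root
            for np_text, role in args:
                n = f"entity_{k}_{role}"
                g.add_node(n, kind="entity", text=np_text)
                g.add_edge(v, n, relation=role)                # edge records the semantic relation
        return g

    # Hypothetical example for "the aircraft departs Beijing and lands in Shanghai":
    # build_semantic_graph(sent, [("departs", [("the aircraft", "ARG0"), ("Beijing", "ARG1")]),
    #                             ("lands",   [("the aircraft", "ARG0"), ("in Shanghai", "ARGM-LOC")])])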
S23: the action nodes and entity nodes in each sentence are embedded into vectors based on a bidirectional LSTM model, and the word vector of each word in the sentence is recorded as W = [w_1, w_2, ..., w_N]. The calculation is as follows:
h→_i = LSTM(W_C·c_i, h→_(i-1); θ→)    (6)
h←_i = LSTM(W_C·c_i, h←_(i+1); θ←)    (7)
w_i = (h→_i + h←_i) / 2    (8)
in the formulas, W_C is the embedding matrix of the word vectors, θ→ is a learnable bias parameter of the forward LSTM network, θ← is a learnable bias parameter of the reverse LSTM network, h→_i represents the hidden vector of the i-th word computed in the forward direction, h←_i represents the hidden vector of the i-th word computed in the reverse direction, h→_(i-1) represents the forward hidden vector of the (i-1)-th word, h←_(i+1) represents the reverse hidden vector of the (i+1)-th word, and w_i represents the final word vector representation;
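A minimal sketch of the bidirectional LSTM embedding of S23, i.e. formulas (6) to (8) as reconstructed above, follows; averaging the two directions in formula (8) is taken from that reconstruction.

    import torch.nn as nn

    class WordEncoder(nn.Module):
        # Sketch of S23: bidirectional LSTM over sentence C, word vector = mean of both directions.
        def __init__(self, vocab_size, d_model):
            super().__init__()
            self.W_C = nn.Embedding(vocab_size, d_model)  # word embedding matrix W_C
            self.lstm = nn.LSTM(d_model, d_model, bidirectional=True, batch_first=True)

        def forward(self, token_ids):                     # token_ids: (1, N) indices of sentence C
            h, _ = self.lstm(self.W_C(token_ids))         # h: (1, N, 2 * d_model)
            h_fwd, h_bwd = h.chunk(2, dim=-1)             # forward / reverse hidden states
            return (h_fwd + h_bwd) / 2                    # W = [w_1 ... w_N], formula (8)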
S24: for the root node, because the full description of the text needs to be captured, all word vectors are aggregated into one vector through an attention mechanism; the global event node vector g_c is calculated as:
a_i = exp(W_e·w_i) / Σ_(j=1..N) exp(W_e·w_j)    (9)
g_c = Σ_(i=1..N) a_i·w_i    (10)
in the formulas, W_e represents a learning parameter matrix, w_i represents a word vector, w_j represents the word vector of a word connected with w_i in the semantic topological graph, and N represents the number of words connected with w_i in the whole semantic topological graph;
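A sketch of the attention aggregation of formulas (9) and (10) follows; mapping each word vector to a scalar score with a single projection W_e is an assumption consistent with the definitions above.

    import torch
    import torch.nn as nn

    class GlobalEventPooling(nn.Module):
        # Sketch of formulas (9)-(10): attention-pool word vectors into the global event node g_c.
        def __init__(self, d_model):
            super().__init__()
            self.W_e = nn.Linear(d_model, 1, bias=False)  # learning parameter matrix W_e

        def forward(self, w):                        # w: (N, d_model) word vectors
            a = torch.softmax(self.W_e(w), dim=0)    # formula (9): attention weights a_i
            return (a * w).sum(dim=0)                # formula (10): g_c = sum_i a_i * w_i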
S25: for the verb nodes and entity nodes, a maximum pooling operation is performed on the corresponding LSTM-embedded word vectors w_i, obtaining the verb node vectors g_a = [g_a,1, g_a,2, ..., g_a,Na] and the entity node vectors g_e = [g_e,1, g_e,2, ..., g_e,Ne], where Na and Ne represent the numbers of verb nodes and entity nodes, respectively;
S26: inputting the global event node vector, the verb node vectors and the entity node vectors into the RGCN model as initialized node vectors, and updating the nodes based on the attention mechanism to obtain the word feature vectors.
Specifically, the S26 is implemented as follows:
S261: the node vectors are represented as g_i ∈ {g_c, g_a, g_e} and input into the RGCN model, with the calculation formula:
ĝ_i = (W_r·r_i,j) ⊙ g_i    (11)
wherein r_i,j is the one-hot vector of the semantic relation corresponding to nodes i and j, ĝ_i represents the initialized node vector, and W_r is a relation transformation matrix;
S262: supposing that the output of the RGCN for the i-th node at the l-th layer is g_i^l, the calculation process after adopting the attention mechanism is:
s_ij = (W_a^l·g_i^l)^T (W_b^l·g_j^l) / √D    (12)
α_ij = exp(s_ij) / Σ_(k∈N_i) exp(s_ik)    (13)
wherein N_i represents the neighbor nodes of node i, W_a^l and W_b^l are the attention mechanism parameters, D represents the number of nodes, and α_ij represents the attention coefficient between the i-th word and the j-th word;
according to the operation rule of the RGCN, the update rule of the nodes is:
g_i^(l+1) = σ( Σ_(j∈N_i) α_ij·W_t^(l+1)·ĝ_j^l )    (14)
in the formula, W_t^(l+1) is the learnable parameter matrix of the (l+1)-th layer network;
assuming that the number of layers of the RGCN is L, the node vectors output by the L-th layer are the word feature vectors, wherein g_c^L represents the global node vector, g_a^L represents the set of action node vectors, and g_e^L represents the set of entity node vectors.
The invention sets the transformation matrix W_r as a shared parameter, i.e., all relations share one transformation matrix. This addresses the problem that, in the prior art, an R-GCN model needs to compute a relation matrix and a transformation matrix for each relation, which produces a large number of parameters to be calculated, seriously affects the computational efficiency of the model, and causes overfitting during training.
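A sketch of the shared-parameter relational graph attention of formulas (11) to (14) follows. The dense adjacency representation, the assumption that every node carries a self-loop (so that each softmax row is defined), and the ReLU nonlinearity are implementation choices, not taken from the patent.

    import torch
    import torch.nn as nn

    class SharedRelationGATLayer(nn.Module):
        # Sketch of formulas (11)-(14): one transformation matrix W_r shared by all relations
        # modulates the node vectors per edge type; messages are weighted by dot-product attention.
        def __init__(self, d_model, n_relations):
            super().__init__()
            self.W_r = nn.Linear(n_relations, d_model, bias=False)  # shared relation transform
            self.W_a = nn.Linear(d_model, d_model, bias=False)      # attention parameter W_a^l
            self.W_b = nn.Linear(d_model, d_model, bias=False)      # attention parameter W_b^l
            self.W_t = nn.Linear(d_model, d_model)                  # layer update matrix W_t^(l+1)

        def forward(self, g, rel_onehot, adj):
            # g: (D, d) node vectors; rel_onehot: (D, D, R) one-hot r_ij; adj: (D, D) 0/1 mask.
            g_tilde = self.W_r(rel_onehot) * g.unsqueeze(0)          # formula (11), entry [i, j]
            scores = self.W_a(g) @ self.W_b(g).T / g.size(0) ** 0.5  # formula (12), D = node count
            scores = scores.masked_fill(adj == 0, float("-inf"))
            alpha = torch.softmax(scores, dim=-1)                    # formula (13): coefficients
            msg = (alpha.unsqueeze(-1) * g_tilde).sum(dim=1)         # attention-weighted messages
            return torch.relu(self.W_t(msg))                         # formula (14): node update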
In this embodiment, the specific implementation process of S3 is as follows:
s31: at the global levelBecause the track and the text are coded into a one-dimensional global vector, the similarity of the two vectors can be measured by directly using a cosine function, and the similarity of the global vector and the global node is calculated through the cosine function to obtain a global matching score
Figure 889832DEST_PATH_IMAGE104
=
Figure 167230DEST_PATH_IMAGE105
At the action and entity level, matching of tracks and texts needs to be performed locally. For each text node vector
Figure 256409DEST_PATH_IMAGE073
Wherein x is
Figure 379086DEST_PATH_IMAGE074
{ a, e }, vector of each trace point
Figure 73372DEST_PATH_IMAGE075
Wherein x is
Figure 256092DEST_PATH_IMAGE074
{ a, e }; first, a pair of local cross-modal matching scores is calculated
Figure 566987DEST_PATH_IMAGE071
=
Figure 493355DEST_PATH_IMAGE072
. In order to show which track points in the word bottom and the track have relations, the score needs to be normalized, and the formula is as follows:
Figure 104465DEST_PATH_IMAGE076
=softmax(
Figure 458086DEST_PATH_IMAGE077
) (15)
wherein
Figure 256278DEST_PATH_IMAGE078
Figure 189599DEST_PATH_IMAGE076
In effect, an attention parameter, which indicates whether the word node i has a relationship with the track node j,
Figure 389636DEST_PATH_IMAGE079
representing a learnable neural network parameter.
Calculating the matching score of the word node i to the whole track
Figure 914158DEST_PATH_IMAGE080
=
Figure 465225DEST_PATH_IMAGE081
Get a local match score of text and track as
Figure 998975DEST_PATH_IMAGE082
=
Figure 53518DEST_PATH_IMAGE083
S32: weighting and summing the global matching score, the action matching score and the entity matching score to obtain an overall similarity score
Figure 748942DEST_PATH_IMAGE106
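Putting S31 and S32 together, a minimal Python sketch of the scoring follows; the function and argument names are hypothetical, and the equal-weight average matches the reconstructed overall score.

    import torch
    import torch.nn.functional as F

    def overall_similarity(t_c, V_c, t_a, V_a, t_e, V_e, lam):
        # t_c, V_c: (d,) global text node and global track vector.
        # t_a/t_e: (Na, d)/(Ne, d) text node vectors; V_a/V_e: (M, d) track point vectors.
        s_c = F.cosine_similarity(t_c, V_c, dim=0)                # global matching score s_c

        def local_score(t_x, V_x):
            s_ij = F.cosine_similarity(t_x.unsqueeze(1), V_x.unsqueeze(0), dim=-1)
            a_ij = torch.softmax(lam * s_ij.clamp(min=0), dim=1)  # formula (15): softmax(lambda [s]_+)
            s_xi = (a_ij * s_ij).sum(dim=1)                       # word node i vs. the whole track
            return s_xi.sum()                                     # s_x = sum_i s_x,i

        s_a = local_score(t_a, V_a)                               # action matching score
        s_e = local_score(t_e, V_e)                               # entity matching score
        return (s_c + s_a + s_e) / 3                              # overall similarity s(v, t)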
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed in the embodiment corresponds to the method disclosed in the embodiment, so that the description is simple, and the relevant points can be referred to the description of the method part.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. A multi-mode-based track text matching method is characterized by comprising the following steps:
S1: extracting track features and calculating track feature matrices of different semantic layers;
S2: performing semantic annotation on a text, dividing words of the text into different semantic levels according to a semantic structure, constructing a semantic topological graph, aggregating nodes of the semantic topological graph based on an RGCN (Relational Graph Convolutional Network) model, and updating the nodes based on an attention mechanism to obtain word feature vectors;
S3: calculating the similarity of the track feature matrix and the word feature vectors of the corresponding level, and performing weighted summation to obtain an overall similarity score;
S1 comprises the following steps:
S11: extracting three types of dimensional information data of the track data, and performing weighted fusion on the extracted three types of dimensional information data;
S12: calculating a feature vector of each track point in one piece of track information, aggregating the feature vectors of all track points into a global vector at the global layer, aggregating the feature vectors of all track points into an action vector at the action layer, aggregating the feature vectors of all track points into an entity vector at the entity layer, and obtaining track feature matrices of different semantic layers;
S2 specifically comprises the following steps:
S21: analyzing the verbs, the noun phrases, and the relations between the verbs and the corresponding nouns in each sentence;
S22: taking the whole sentence as a root node, connecting each verb as an action node to the root node, connecting each noun phrase as an entity node to the corresponding action node, and taking the edges connecting the action nodes with the entity nodes as semantic relations, forming the different semantic hierarchies;
S23: embedding the action nodes and entity nodes in each sentence into vectors based on a bidirectional LSTM model;
S24: aggregating the word vectors of all words into a global event node vector through an attention mechanism;
S25: obtaining the word vectors after vector embedding of the action nodes and entity nodes in the sentences, and performing a maximum pooling operation to correspondingly obtain verb node vectors and entity node vectors;
S26: constructing a semantic topological graph based on the global event node vector, the verb node vectors and the entity node vectors, inputting them into the RGCN model as initialized node vectors, and updating the nodes based on the attention mechanism to obtain the word feature vectors.
2. The multi-modality based track text matching method as claimed in claim 1, wherein S11 specifically includes:
the three types of dimension information comprise a track space dimension, a track state dimension and a track attribute dimension;
the calculation formulas for extracting the three types of dimensional information data of the track data are:
e_a = E_1(X_a)
e_b = E_2(X_b)
e_c = E_3(X_c)
the weighted fusion formula is:
e_i = δ([e_a·α_a, e_b·α_b, e_c·α_c])
wherein e_x, x ∈ (a, b, c), denotes the learning result for the three types of dimensional information data, E_x, x ∈ (1, 2, 3), denotes three independent learning parameter matrices, X_x, x ∈ (a, b, c), denotes the three types of dimensional information data, δ is the fusion function, and α_x, x ∈ (a, b, c), denotes different fusion parameters.
3. The multi-modality-based track text matching method as claimed in claim 2, wherein S12 specifically includes:
S121: calculating the feature vector e_i, i ∈ (1, 2, ..., M), of each track point in one piece of track information, the vector group of one track being e = [e_1, e_2, ..., e_M], where M represents the number of track points in a track, e is the vector group of a track, and e_i is the vector of a track point;
S122: for matching with texts of different semantic levels, three independent weight parameters W_c, W_a and W_e are adopted to process the feature vectors of the track points, obtaining three independently expressed track feature vector groups, with the calculation formulas:
v_c,i = W_c·e_i
v_a,i = W_a·e_i
v_e,i = W_e·e_i
wherein W_c, W_a and W_e represent three different learnable weight parameter matrices, v_c,i represents the feature vector of a track point at the global layer, v_a,i represents the feature vector of a track point at the action layer, and v_e,i represents the feature vector of a track point at the entity layer;
S123: aggregating the feature vectors of all track points into a global vector V_c at the global layer, aggregating all track point feature vectors into an action vector V_a = [v_a,1, v_a,2, ..., v_a,M] at the action layer, and aggregating the feature vectors of all track points into an entity vector V_e = [v_e,1, v_e,2, ..., v_e,M] at the entity layer, obtaining the track feature matrices of different semantic layers.
4. The multi-modal-based track text matching method according to claim 3, wherein the word vector of each word in the sentence is W = [w_1, w_2, ..., w_N].
5. The multi-modal-based track text matching method according to claim 4, wherein the calculation formulas for embedding the action nodes and the entity nodes into one vector are:
h→_i = LSTM(W_C·c_i, h→_(i-1); θ→)
h←_i = LSTM(W_C·c_i, h←_(i+1); θ←)
w_i = (h→_i + h←_i) / 2
in the formulas, W_C is the embedding matrix of the word vectors, θ→ is a learnable bias parameter of the forward LSTM network, θ← is a learnable bias parameter of the reverse LSTM network, h→_i represents the forward hidden vector of the i-th word, h←_i represents the reverse hidden vector of the i-th word, h→_(i-1) represents the forward hidden vector of the (i-1)-th word, h←_(i+1) represents the reverse hidden vector of the (i+1)-th word, and w_i represents the final word vector representation;
the global event node vector g_c is calculated as:
a_i = exp(W_e·w_i) / Σ_(j=1..N) exp(W_e·w_j)
g_c = Σ_(i=1..N) a_i·w_i
in the formulas, W_e represents a learning parameter matrix, w_i represents a word vector, w_j represents the word vector of a word connected with w_i in the semantic topological graph, and N represents the number of words connected with w_i in the whole semantic topological graph;
the verb node vectors are g_a = [g_a,1, g_a,2, ..., g_a,Na] and the entity node vectors are g_e = [g_e,1, g_e,2, ..., g_e,Ne], where Na and Ne represent the numbers of verb nodes and entity nodes, respectively.
6. The multi-modal-based track text matching method according to claim 5, wherein S26 is specifically:
S261: representing the node vectors as g_i ∈ {g_c, g_a, g_e} and inputting the node vectors into the RGCN model, with the calculation formula:
ĝ_i = (W_r·r_i,j) ⊙ g_i
wherein r_i,j is the one-hot vector of the semantic relation corresponding to nodes i and j, ĝ_i represents the initialized node vector, and W_r is a relation transformation matrix;
S262: supposing that the output of the RGCN for the i-th node at the l-th layer is g_i^l, the calculation process after adopting the attention mechanism is:
s_ij = (W_a^l·g_i^l)^T (W_b^l·g_j^l) / √D
α_ij = exp(s_ij) / Σ_(k∈N_i) exp(s_ik)
wherein N_i represents the neighbor nodes of node i, W_a^l and W_b^l are attention mechanism parameters, D represents the number of nodes, and α_ij represents the attention coefficient between the i-th word and the j-th word;
according to the operation rule of the RGCN, the update rule of the nodes is:
g_i^(l+1) = σ( Σ_(j∈N_i) α_ij·W_t^(l+1)·ĝ_j^l )
in the formula, W_t^(l+1) represents the learnable parameter matrix of the (l+1)-th layer network;
assuming that the number of layers of the RGCN is L, the node vectors output by the L-th layer are the word feature vectors, wherein g_c^L represents the global node vector, g_a^L represents the set of action node vectors, and g_e^L represents the set of entity node vectors.
7. The multi-modality-based track text matching method as claimed in claim 6, wherein S3 specifically comprises:
S31: calculating the similarity of the global vector and the global node through a cosine function at the global layer to obtain a global matching score s_c, and locally matching the track and the text at the action layer and the entity layer to obtain an action matching score and an entity matching score, respectively;
S32: carrying out weighted summation on the global matching score, the action matching score and the entity matching score to obtain an overall similarity score.
8. The multi-modal-based track text matching method as claimed in claim 7, wherein in S31 the global matching score s_c is calculated as:
s_c = cos(t_c, v_c)
the specific calculation process of the action matching score and the entity matching score is as follows:
a pair of local cross-modal matching scores is calculated by:
s_ij^x = cos(t_x,i, v_x,j)
wherein each text node vector satisfies t_x,i ∈ t_x, where x ∈ {a, e}, and each track point vector satisfies v_x,j ∈ V_x, where x ∈ {a, e};
the matching scores are normalized with the calculation formula:
a_ij^x = softmax(λ [s_ij^x]_+)
wherein [·]_+ = max(·, 0), a_ij^x is an attention parameter indicating whether word node i has a relationship with track node j, and λ represents a learnable neural network parameter;
the matching score of word node i to the whole track is calculated as:
s_x,i = Σ_j a_ij^x·s_ij^x
and the local matching score of the text and the track is calculated as s_x = Σ_i s_x,i;
when x is a, the local matching score is the action matching score s_a, and when x is e, the local matching score is the entity matching score s_e.
9. The multi-modal-based track text matching method according to claim 8, wherein the overall similarity score s(v, t) is calculated as:
s(v, t) = (s_c + s_a + s_e) / 3.
CN202211155406.8A 2022-09-22 2022-09-22 Multi-mode-based track text matching method Active CN115248877B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211155406.8A CN115248877B (en) 2022-09-22 2022-09-22 Multi-mode-based track text matching method

Publications (2)

Publication Number Publication Date
CN115248877A CN115248877A (en) 2022-10-28
CN115248877B true CN115248877B (en) 2023-01-17

Family

ID=83699220

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211155406.8A Active CN115248877B (en) 2022-09-22 2022-09-22 Multi-mode-based track text matching method

Country Status (1)

Country Link
CN (1) CN115248877B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103196430A (en) * 2013-04-27 2013-07-10 清华大学 Mapping navigation method and system based on flight path and visual information of unmanned aerial vehicle
CN109033245A (en) * 2018-07-05 2018-12-18 清华大学 A kind of mobile robot visual-radar image cross-module state search method
CN110647662A (en) * 2019-08-03 2020-01-03 电子科技大学 Multi-mode spatiotemporal data association method based on semantics
CN113780003A (en) * 2021-08-31 2021-12-10 西南电子技术研究所(中国电子科技集团公司第十研究所) Cross-modal enhancement method for space-time data variable-division encoding and decoding
CN114443864A (en) * 2022-01-29 2022-05-06 北京百度网讯科技有限公司 Cross-modal data matching method and device and computer program product

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Temporal Multi-modal Graph Transformer with Global-Local Alignment for Video-Text Retrieval; Zerun Feng; IEEE Transactions on Circuits and Systems for Video Technology; Dec. 31, 2021; pp. 1-16 *
A survey of research and applications of similarity measurement methods for time-series-based trajectory data; 潘晓 et al.; Journal of Yanshan University; No. 06, Nov. 2019; pp. 65-79 *
Multi-modal spatio-temporal semantic trajectory prediction; 朱少杰; China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology; Mar. 15, 2022; pp. I138-1807 *

Also Published As

Publication number Publication date
CN115248877A (en) 2022-10-28

Similar Documents

Publication Publication Date Title
Yu et al. Spatial pyramid-enhanced NetVLAD with weighted triplet loss for place recognition
Zhang et al. A spatial attentive and temporal dilated (SATD) GCN for skeleton‐based action recognition
CN111488984B (en) Method for training track prediction model and track prediction method
WO2023280065A1 (en) Image reconstruction method and apparatus for cross-modal communication system
CN111488734A (en) Emotional feature representation learning system and method based on global interaction and syntactic dependency
CN112560432B (en) Text emotion analysis method based on graph attention network
CN112084331A (en) Text processing method, text processing device, model training method, model training device, computer equipment and storage medium
CN110490254B (en) Image semantic generation method based on double attention mechanism hierarchical network
CN114048350A (en) Text-video retrieval method based on fine-grained cross-modal alignment model
CN111414461A (en) Intelligent question-answering method and system fusing knowledge base and user modeling
CN113076465A (en) Universal cross-modal retrieval model based on deep hash
Yue et al. Action recognition based on RGB and skeleton data sets: A survey
CN113780003B (en) Cross-modal enhancement method for space-time data variable-division encoding and decoding
CN112487822A (en) Cross-modal retrieval method based on deep learning
CN114942998B (en) Knowledge graph neighborhood structure sparse entity alignment method integrating multi-source data
CN115392248A (en) Event extraction method based on context and drawing attention
Zhao A systematic survey of remote sensing image captioning
CN113641854B (en) Method and system for converting text into video
Zhao et al. Convolutional network embedding of text-enhanced representation for knowledge graph completion
CN115248877B (en) Multi-mode-based track text matching method
CN116883746A (en) Graph node classification method based on partition pooling hypergraph neural network
CN113516118B (en) Multi-mode cultural resource processing method for joint embedding of images and texts
Qiang et al. Hybrid deep neural network-based cross-modal image and text retrieval method for large-scale data
Zhang et al. HSIE: improving named entity disambiguation with hidden semantic information extractor
CN114254738A (en) Double-layer evolvable dynamic graph convolution neural network model construction method and application

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant