CN115248877B

CN115248877B - Multi-mode-based track text matching method

Info

Publication number: CN115248877B
Application number: CN202211155406.8A
Authority: CN
Inventors: 王立才; 唐瀚超; 罗琪彬; 武广胜
Original assignee: CETC 15 Research Institute
Current assignee: CETC 15 Research Institute
Priority date: 2022-09-22
Filing date: 2022-09-22
Publication date: 2023-01-17
Anticipated expiration: 2042-09-22
Also published as: CN115248877A

Abstract

The invention discloses a multi-mode-based track text matching method, which comprises the following steps: extracting track features and calculating track feature matrixes of different semantic layers; performing semantic annotation on a text, dividing words of the text into different semantic levels according to a semantic structure, constructing a semantic topological graph, aggregating nodes of the semantic topological graph based on an RGCN model, and updating the nodes based on an attention mechanism to obtain word feature vectors; and calculating the similarity of the track feature matrix and the word feature vector of the corresponding level, and performing weighted summation to obtain an overall similarity score. The text is divided into different semantic levels to be independently coded, so that the detail information of the text is reserved, the track and the topological structure in the text are also reserved, and the accuracy of matching retrieval is improved; meanwhile, an attention mechanism and a parameter sharing method are introduced, so that the number of calculation parameters is reduced, and the calculation efficiency is improved.

Description

Multi-mode-based track text matching method

Technical Field

The invention relates to the technical field of machine learning, in particular to a multi-mode-based track text matching method.

Background

Today, as information technology has rapidly developed, multimodal data has become the main form of data resources in recent times. The multi-modal data has the characteristics of multi-source isomerism and internal semantic consistency in the description of the same object. The different modality forms each describe a feature of the object at a particular angle. The method has important value in the research of the multi-mode learning method and the capability of solving multi-source heterogeneous mass data by a computing mechanism. In recent years, a number of highly effective multimodal deep learning algorithms have been applied to practical scenes. For example, the image vector retrieval of the video frame and the face frame which are used by the Youkou video can not be supported by the multi-modal recognition technology due to the 'only see TA' function proposed by the Aichi art. The technologies of 'shooting and panning' and the like of each large e-commerce platform use information fusion under multiple modes such as images, texts, high-level semantic attributes and the like to realize the function of 'searching images with images' with high precision.

The existing cross-modal retrieval method for video texts mainly comprises two methods:

the first is a global coding method, which extracts video and text features through different algorithms, codes the video and corresponding text feature vectors into an integral vector, and calculates the distance between the two vectors in a common space according to a defined distance function to measure the similarity between the video and the text. However, the global coding method directly codes the video and the corresponding text into an integral vector, and this coding method may lose the detail information in the video and the text, and a section of text may contain a plurality of actions, each action corresponding to a different entity. The details may be lost by simply embedding the text into a vector, and the corresponding relationship between the video and the text cannot be better understood, so that the model has a poor effect.

The second is a local encoding method, each frame of the video and each word of the text are encoded into a vector, the corresponding similarity of each word and the video frame is calculated, and the similarity is summed to be the overall similarity. However, the local coding mode greatly increases the calculation parameters of the model, so that the learning and matching detection efficiency is lower; while these sequential vectors representing video and text ignore the topology in video and text, such that vector matches may be trapped in local matches and global matches ignored.

In order to overcome the defects of the prior art, the invention aims at the problems that the recommendation of proper natural language description from massive track data is urgent to solve in the next large data processing, and provides a track text matching method based on multiple modes.

Disclosure of Invention

In view of the above, the invention provides a multi-mode-based track text matching method, which divides a text into three semantic levels to perform independent coding respectively, so that not only is the detail information of the text kept, but also the track and the topological structure in the text are kept, and the accuracy of matching retrieval is improved; meanwhile, the attention mechanism and the parameter sharing method are introduced, so that the number of calculation parameters is reduced, and the calculation efficiency is improved.

In order to achieve the purpose, the invention adopts the following technical scheme:

a multi-modal-based track text matching method comprises the following steps:

s1: extracting track features and calculating track feature matrixes of different semantic layers;

s2: performing semantic annotation on a text, dividing words of the text into different semantic levels according to a semantic structure, constructing a semantic topological graph, aggregating nodes of the semantic topological graph based on an RGCN model, and updating the nodes based on an attention mechanism to obtain word feature vectors;

s3: and calculating the similarity of the track feature matrix and the word feature vector of the corresponding level, and performing weighted summation to obtain an overall similarity score.

Preferably, S1 comprises:

s11: extracting three types of dimensional information data of the track data, and performing weighted fusion on the extracted three types of dimensional information data;

s12: calculating the characteristic vector of each track point in one piece of track information, aggregating the characteristic vectors of all track points into a global vector on a global layer, aggregating the characteristic vectors of all track points into an action vector on an action layer, aggregating the characteristic vectors of all track points into an entity vector on a physical layer, and obtaining track characteristic matrixes of different semantic layers.

Preferably, S11 specifically includes:

the three types of dimension information comprise a track space dimension, a track state dimension and a track attribute dimension;

the calculation formula of three types of dimensional information data for extracting the track data is as follows:

the weighted fusion formula is:

wherein

Representing the learning result of three kinds of dimensional information data,

three separate learning parameter matrices are shown,

represents three types of dimensional information data,

in order to be a function of fusion,

representing different fusion parameters.

Preferably, S12 specifically includes:

s121: calculating the characteristic vector of each track point in one piece of track information

，

A vector group e of one track = [ ]

,...

]M represents the number of trace points in a trace, e is a vector group of a trace,

vector of the trace point;

s122: for matching with texts with different semantic levels, three independent weight parameters are adopted

，

And

processing the characteristic vectors of the track points to obtain three independently expressed track characteristic vector groups, wherein the calculation formula is as follows:

wherein

，

And

representing three different learnable weight parameter matrices,

representing the feature vectors of the trace points of the global layer,

the feature vector of the trace point of the action layer is represented,

representing the characteristic vector of the trace point of the physical layer;

s123: aggregating the feature vectors of all the track points into a global vector in a global layer

And aggregating all track point feature vectors into action vectors in the action layer

[

,...

]And aggregating the feature vectors of all the tracing points into a solid vector at a solid layer

[

,...

]And obtaining the trace feature matrixes of different semantic layers.

Preferably, S2 specifically includes:

s21: analyzing the relations among verbs, noun phrases, verbs and corresponding nouns in each sentence;

s22: taking the whole sentence as a root node, taking verbs as action nodes to be connected with the root node, taking noun phrases as entity nodes to be connected with the corresponding action nodes, and taking edges connecting the action nodes with the entity nodes as semantic relations to form different semantic hierarchies;

s23: embedding action nodes and entity nodes in each sentence into a vector based on a bidirectional LSTM model, and recording word vectors of each word in the sentenceW=[

,...

]；

S24: aggregating the word vectors of all words into a global event node vector through an attention mechanism;

s25: obtaining word vectors after vector embedding of action nodes and entity nodes in sentences, and performing maximum pooling operation to correspondingly obtain verb node vectors and entity node vectors;

s26: and constructing a semantic topological graph based on the global event node vector, the verb node vector and the entity node vector, inputting the global event node vector, the verb node vector and the entity node vector into the RGCN model as initialized node vectors, and updating the nodes based on the attention mechanism to obtain word feature vectors.

Preferably, the calculation formula of embedding the action node and the entity node into one vector is as follows:

in the formula (I), the compound is shown in the specification,

is an embedded matrix of word vectors that is,

is a learnable bias parameter for the forward LSTM network,

is a learnable bias parameter for an inverse LSTM network,

a word vector representing the word in the forward computing sentence i,

representing a word vector that calculates words in sentence i in reverse,

a word vector representing the words in the forward computed sentence i-1,

representing the word vector of the word in the inverse computation sentence i +1,

representing the final word vector representation;

global event node vector

The calculation formula is as follows:

in the formula (I), the compound is shown in the specification,

a matrix of learning parameters is represented that is,

a vector of words is represented that is,

represented in a semantic topological graph with

A word vector of words with connection relation, N represents the sum in the whole semantic topological graph

The number of words with connection relation;

verb node vector

Vector of physical nodes

Where Na and Ne represent the number of verb nodes and entity nodes, respectively.

Preferably, S26 is specifically:

s261: representing a node vector as

And inputting the node vector into the RGCN model, wherein the calculation formula is as follows:

wherein

Is a one-hot type vector of the semantic relationship corresponding to the nodes i and j,

representing the initialized node vector after the initialization of the node vector,

is a relational transformation matrix;

s262: suppose that the ith node is atlThe RGCN of a layer has an output of

The calculation process after the attention mechanism is adopted is as follows:

wherein

Represents a neighbor node of the node i,

and

is the attention mechanism parameter, D represents the number of nodes,

representing the attention coefficient between the ith word and the jth word;

according to the operation rule of the RGCN, the update rule of the node is as follows:

in the formula (I), the compound is shown in the specification,

is shown aslA learnable parameter matrix of the +1 layer network;

assuming that the number of layers of the RGCN is L, the node vector output by the L-th layer is the word feature vector, wherein

A vector of global nodes is represented that is,

a set of motion node vectors is represented,

representing a set of entity node vectors.

Preferably, S3 specifically includes:

s31: calculating the similarity of the global vector and the global node through a cosine function in the global layer to obtain a global matching score

；

Respectively obtaining an action matching score and an entity matching score by locally matching a track and a text in an action layer and an entity layer;

s32: and carrying out weighted summation on the global matching score, the action matching score and the entity matching score to obtain an overall similarity score.

Preferably, in S31, the global matching score

The calculation formula is as follows:

=

the specific calculation process of the action matching score and the entity matching score is as follows:

calculating a pair of local cross-modal matching scores by the following formula:

=

，

wherein each text node vector

Wherein x is

{ a, e }, vector of each trace point

Wherein x is

{a,e}；

And carrying out normalization processing on the matching score, wherein the calculation formula is as follows:

=softmax(

)

wherein

，

Is an attention parameter, which indicates whether the word node i has a relationship with the track node j,

representing a learnable neural network parameter;

calculating the matching score of the word node i to the whole track, wherein the calculation formula is as follows:

=

；

calculating a local matching score of the text and the track as

=

；

When x is a, the local matching score is the action matching score Sa, and when x is e, the local matching score is the entity matching score Se.

Preferably, the overall similarity score

The calculation formula is as follows:

。

the invention has the following advantages:

(1) The method comprises the steps of dividing all words of a text into three different semantic levels by using a natural language processing technology of semantic annotation, constructing a semantic topological graph according to a semantic role relationship, fully extracting fine-grained information of the text, and simultaneously keeping a topological structure of the text, so that the learned text features are more accurate.

(2) Matching tracks and text from both global and local dimensions, and weighting and summing matching scores. The global is responsible for the matching degree between the track and the whole text and the local is responsible for the matching degree between the track point and the word, so that the interpretability of track text matching is increased while the matching accuracy is improved.

(3) By adopting the idea of multi-mode representation learning and the method of mapping the track and the text to the same vector space, the multi-target interaction information contained in the text is fused on the basis of fully excavating the characteristics of the track, so that the accuracy of track anomaly detection can be effectively improved. Meanwhile, the tracks and the texts are matched from the global dimension and the local dimension, so that the matching accuracy is improved, and the interpretability is provided for the matching result.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

Fig. 1 is a schematic flow chart of a multi-modal-based track text matching method provided by the invention.

Fig. 2 is a schematic diagram illustrating the track feature extraction provided by the present invention.

Fig. 3 is a schematic diagram illustrating text layering embedding provided by the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.

The embodiment of the invention discloses a multi-mode-based track text matching method, which comprises the following steps of:

s2: performing semantic annotation on a text, dividing words of the text into different semantic levels according to a semantic structure, constructing a semantic topological graph, aggregating nodes of the semantic topological graph based on an RGCN (reduced graphics context) model, and updating the nodes based on an attention mechanism to obtain word feature vectors;

In this embodiment, as shown in fig. 2, the specific implementation process of S1 is as follows:

s11: three types of dimensional information data of the track data are extracted, the three types of dimensional information comprise track space dimensions, track state dimensions and track attribute dimensions, time, longitude, latitude, height and the like are track space dimensional data, course, speed and the like are track state dimensional data, flight numbers, call signs, departure places, destinations and the like are track attribute dimensional data, the extracted three types of dimensional information data are weighted and fused, and the characteristic extraction formula and the weighted fusion formula are shown as formulas (1), (2), (3) and (4):

（1）

（2）

（3）

（4）

wherein

Represents the learning result of the three kinds of dimensional information data,

three separate learning parameter matrices are shown,

represents three types of dimensional information data,

in order to be a function of fusion,

representing different fusion parameters;

s12: calculating the characteristic vector of each track point in one track information, aggregating the characteristic vectors of all track points into a global vector in a global layer, aggregating the characteristic vectors of all track points into an action vector in an action layer, aggregating the characteristic vectors of all track points into an entity vector in a physical layer, and obtaining track characteristic matrixes of different semantic layers. The method specifically comprises the following steps:

，

A vector group e of one track = [ ]

,...

]M represents the number of track points in a track, e is a vector group of a track,

vector of the trace point;

s122: for matching with texts with different semantic levels, three independent weight parameters are used

，

And

processing the characteristic vectors of the track points, wherein the calculation formula is as follows:

（5）

wherein

，

And

representing three different learnable weight parameter matrices,

representing the feature vectors of the trace points of the global layer,

the feature vector of the trace point of the action layer is represented,

[

,...

[

,...

]And obtaining the trace feature matrixes of different semantic layers.

In this embodiment, the track description text naturally contains a hierarchical structure, as shown in FIG. 3. The whole sentence describes the global event, the sentence is composed of a plurality of verbs, and each verb is composed of a corresponding subject and an object. The track description text can be quickly and comprehensively understood through the method which comprises the global description and the local analysis. S2 thus describes in particular how to generate such a hierarchical text representation containing both global descriptions and local analyses:

s21: first, for a sentence C = [, ]

,...

]And N represents the number of words in the sentence, and the whole sentence C is taken as a root node of the graph. Inputting the sentence C into the ready-made semantic analysis tool to obtain the semantic roles of the verbs, the noun phrases, and the noun phrase of each corresponding verb in the sentence.

S22: each verb is used as an action node to be directly connected with a root node, each noun phrase is used as an entity node to be connected with the corresponding action node, and because the action node and the entity node have a corresponding semantic relationship, the edges connected by the action node and the entity node are used for recording the corresponding semantic relationship;

,...

]The calculation results are as follows:

in the formula (I), the compound is shown in the specification,

is a word vectorThe embedded matrix of (a) is embedded,

is a learnable bias parameter for the forward LSTM network,

is a learnable bias parameter for an inverse LSTM network,

a word vector representing the word in the forward computing sentence i,

representing a word vector that calculates words in sentence i in reverse,

a word vector representing the word in the forward computation sentence i-1,

representing the final word vector representation;

s24: for the root node, because all descriptions of the text need to be obtained, all word vectors are aggregated into one vector through an attention mechanism, and at this time, a global event node vector needs to be calculated

The calculation process is as follows:

（9）

（10）

in the formula (I), the compound is shown in the specification,

a matrix of learning parameters is represented and,

a vector of words is represented that is,

represented in a semantic topological graph with

The number of words with connection relation;

s25: for verb nodes and entity nodes, corresponding word vectors are mapped on the basis of LSTM embedding

Performing maximum pooling operation to obtain verb node vector

And entity node vector

Wherein Na and Ne represent the number of verb nodes and entity nodes, respectively;

s26: and inputting the global event node vector, the verb node vector and the entity node vector as initialized node vectors into the RGCN model, and updating the nodes based on the attention mechanism to obtain word feature vectors.

Specifically, the S26 is implemented as follows:

s261: representing a node vector as

（11）

wherein

Is a one-hot type vector of semantic relation corresponding to the nodes i and j,

is a relational transformation matrix;

s262: suppose that the ith node is atlThe RGCN of a layer has an output of

The calculation process after the attention mechanism is adopted is as follows:

（12）

（13）

wherein

Representing the neighbor nodes of the node i,

and

is a parameter of the attention mechanism and,d represents the number of nodes and is,

representing an attention coefficient between the ith word and the jth word;

（14）

in the formula (I), the compound is shown in the specification,

is shown aslA learnable parameter matrix of the +1 layer network;

assuming that the number of layers of the RGCN is L, the node vector output by the L-th layer is a word feature vector;

wherein

A vector of global nodes is represented that is,

a set of motion node vectors is represented,

representing a set of entity node vectors.

The invention converts the matrix

The method is set as the shared parameter, namely all the relations share the conversion matrix, and solves the problems that in the prior art, an R-GCN model needs to calculate a relation matrix and a conversion matrix for each relation, so that a large number of parameters needing to be calculated are generated, the calculation efficiency of the model is seriously influenced, and overfitting in the training process is caused.

In this embodiment, the specific implementation process of S3 is as follows:

s31: at the global levelBecause the track and the text are coded into a one-dimensional global vector, the similarity of the two vectors can be measured by directly using a cosine function, and the similarity of the global vector and the global node is calculated through the cosine function to obtain a global matching score

=

；

At the action and entity level, matching of tracks and texts needs to be performed locally. For each text node vector

Wherein x is

{ a, e }, vector of each trace point

Wherein x is

{ a, e }; first, a pair of local cross-modal matching scores is calculated

=

. In order to show which track points in the word bottom and the track have relations, the score needs to be normalized, and the formula is as follows:

=softmax(

) （15）

wherein

。

In effect, an attention parameter, which indicates whether the word node i has a relationship with the track node j,

representing a learnable neural network parameter.

Calculating the matching score of the word node i to the whole track

=

。

Get a local match score of text and track as

=

。

S32: weighting and summing the global matching score, the action matching score and the entity matching score to obtain an overall similarity score

。

In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed in the embodiment corresponds to the method disclosed in the embodiment, so that the description is simple, and the relevant points can be referred to the description of the method part.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A multi-mode-based track text matching method is characterized by comprising the following steps:

s3: calculating the similarity of the track feature matrix and the word feature vector of the corresponding level, and performing weighted summation to obtain an overall similarity score;

s1 comprises the following steps:

s12: calculating a feature vector of each track point in one track information, aggregating the feature vectors of all track points into a global vector on a global layer, aggregating the feature vectors of all track points into an action vector on an action layer, aggregating the feature vectors of all track points into an entity vector on a physical layer, and obtaining track feature matrixes of different semantic layers;

s2 specifically comprises the following steps:

s21: analyzing the relations among the verbs, the noun phrases, the verbs and the corresponding nouns in each sentence;

s22: taking the whole sentence as a root node, taking a verb as an action node to be connected with the root node, taking a noun phrase as an entity node to be connected with the corresponding action node, and taking the edge connecting the action node and the entity node as a semantic relation to form a different semantic hierarchy;

s23: embedding action nodes and entity nodes in each sentence into a vector based on a bidirectional LSTM model;

s26: and constructing a semantic topological graph based on the global event node vector, the verb node vector and the entity node vector, inputting the global event node vector, the verb node vector and the entity node vector into the RGCN model as initialized node vectors, and updating the nodes based on the attention mechanism to obtain a word feature vector.

2. The multi-modality based track text matching method as claimed in claim 1, wherein S11 specifically includes:

e ^a ＝E ₁ (X ^a )

e ^b ＝E ₂ (X ^b )

e ^c ＝E ₃ (X ^c )

the weighted fusion formula is:

e _i ＝δ([e ^a *α _a ，e ^b *α _b ，e ^c *α _c ])

wherein e ^x X ∈ (a, b, c) denotes a learning result for three types of dimensional information data, E _x X ∈ (1, 2, 3) denotes three independent learning parameter matrices, X ^x X belongs to (a, b, c) represents three kinds of dimension information data, delta is a fusion function, and alpha is _x And x ∈ (a, b, c) denotes different fusion parameters.

3. The multi-modality-based track text matching method as claimed in claim 2, wherein the S12 specifically includes:

s121: calculating the characteristic vector e of each track point in one piece of track information _i I ∈ (1, 2,. M), vector set e = [ e ] for one track ₁ ，e ₂ ,...e _M ]M represents the number of trace points in a trace, e is a vector group of a trace, e _i The vector of the track point is obtained;

And

wherein

And

representing three different learnable weight parametersNumber matrix, v _c，i Representing feature vectors of trace points of the global layer, v _a，i Feature vector representing trace point of action layer, v _e，i Representing the characteristic vector of the trace point of the physical layer;

s123: aggregating the feature vectors of all track points into a global vector V in a global layer _c Aggregating all track point feature vectors into an action vector V in the action layer _a ＝[v _a，1 ，v _a，2 ,...v _a，M ]And aggregating the feature vectors of all track points into a solid vector V at a solid layer _e ＝[v _e，1 ，v _e，2 ,...v _e，M ]And obtaining the trace feature matrixes of different semantic layers.

4. The multi-modal-based track text matching method according to claim 3, wherein a word vector W = [ W ] of each word in the sentence ₁ ，w ₂ ,...w _N ]。

5. The multi-modal-based track text matching method according to claim 4, wherein the calculation formula of embedding the action nodes and the entity nodes into one vector is as follows:

in the formula, W _C Is an embedded matrix of word vectors that is,

is a learnable bias parameter for a forward LSTM network，

Is a learnable bias parameter for the reverse LSTM network,

a word vector representing the word in the forward computing sentence i,

a word vector representing the words in the reverse-calculated sentence i,

a word vector representing the word in the forward computation sentence i-1,

word vector, w, representing the word in the inverse computation sentence i +1 _i Representing the final word vector representation;

global event node vector g _c The calculation formula is as follows:

in the formula, W _e Representing a learning parameter matrix, w _i Representing a word vector, w _j Represented in a semantic topological graph with w _i A word vector of the words with connection relation, N represents the sum w in the whole semantic topological graph _i The number of words with connection relation;

verb node vector g _a ＝[g _a，1 ，g _a，2 ，...g _a，Na ]Vector of physical nodes g _e ＝[g _e，1 ，g _e，2 ，...g _e，Ne ]Which isMiddle Na and Ne represent the number of verb nodes and entity nodes, respectively.

6. The multi-modal-based track text matching method according to claim 5, wherein S26 is specifically:

s261: representing the node vector as g _i ∈{g _c ，g _a ，g _e And the node vector is input into an RGCN model, and the calculation formula is as follows:

wherein r is _i，j Is a one-hot type vector of the semantic relationship corresponding to the nodes i and j,

representing the initialized node vector, W _r Is a relational transformation matrix;

s262: suppose the output of the RGCN of the ith node at the l layer is

The calculation process after the attention mechanism is adopted is as follows:

wherein N is _i Representing the neighbor nodes of the node i,

and

is an attention mechanism parameter, D represents the number of nodes, α _ij Representing the attention coefficient between the ith word and the jth word;

in the formula, W _t ^l+1 A learnable parameter matrix representing a layer l +1 network;

A vector of global nodes is represented that is,

a set of motion node vectors is represented,

representing a set of entity node vectors.

7. The multi-modality-based track text matching method as claimed in claim 6, wherein S3 specifically comprises:

s31: calculating the similarity of the global vector and the global node through a cosine function in the global layer to obtain a global matching score s _c ；

Respectively obtaining an action matching score and an entity matching score by matching a track and a text in a local part at an action layer and an entity layer;

8. The method as claimed in claim 7, wherein the method comprises a step of matching the trajectory text based on multiple modesIn S31, the global matching score S _c The calculation formula is as follows:

s _c ＝cos(t _c ，v _c )

wherein each text node vector t _x，i ∈t _x Where x ∈ { a, e },

vector v of each trace point _x，j ∈V _x Where x ∈ { a, e };

wherein [. ]] ₊ ＝max(·，0)，

Is an attention parameter that indicates whether the word node i has a relationship with the track node j; λ represents a learnable neural network parameter;

calculating a local matching score s of the text and the track _x ＝∑ _i s _x，i ；

9. The multi-modal-based track text matching method according to claim 8, wherein the overall similarity score s (v, t) is calculated by the formula:

s(v，t)＝(s _c ，s _a ，s _e )/3。