CN116680343A - Link prediction method based on entity and relation expression fusing multi-mode information - Google Patents

Link prediction method based on entity and relation expression fusing multi-mode information

Info

Publication number
CN116680343A
CN116680343A (application CN202310641906.0A)
Authority
CN
China
Prior art keywords
text
representation
data
entity
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310641906.0A
Other languages
Chinese (zh)
Inventor
田紫暄
金福生
徐源
袁野
王国仁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202310641906.0A priority Critical patent/CN116680343A/en
Publication of CN116680343A publication Critical patent/CN116680343A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/288Entity relationship models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0499Feedforward networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of knowledge graph reasoning, and in particular to a link prediction method based on entity and relation representations that fuse multi-modal information. The method comprises the following steps: collecting image data, text data and triplet data related to the subject of the knowledge graph to be constructed; performing knowledge extraction and entity alignment on the preprocessed triplet data; extracting features of the image data to generate a visual representation; extracting features of the text data and the triplet data to generate a text representation; taking the generated visual representation, text representation and triplet data together as input, training a fusion module, and learning entity and relation vector representations containing multi-modal information; decoding the feature representations learned by the fusion module through a decoding part, performing link prediction, and outputting the probability that a triplet is predicted to be positive. The invention can improve the accuracy of the link prediction task and the interpretability of multi-modal knowledge representation learning.

Description

Link prediction method based on entity and relation expression fusing multi-mode information
Technical Field
The invention relates to the technical field of knowledge graph reasoning, and in particular to a link prediction method based on entity and relation representations fusing multi-modal information.
Background
Traditional manual construction of multi-modal knowledge graphs cannot cover the knowledge of all modalities well; noise exists between different modalities, the graphs are sparse, and erroneous triplets may exist, so research on multi-modal knowledge reasoning techniques is needed. Reasoning techniques oriented to multi-modal knowledge graphs can assist in inferring new facts, relations, axioms and rules, and, more importantly, predict the missing parts of triplets. Link prediction is an important task in knowledge reasoning; its goal is to predict missing entities or potential relations in a knowledge graph, thereby enriching and perfecting the knowledge graph.
With the development of knowledge-graph reasoning, many methods have emerged in recent years, including reasoning based on graph structure and rules, reasoning based on knowledge-graph representation learning, reasoning based on neural networks, and hybrid reasoning. These reasoning methods each have advantages and disadvantages. The construction of a multi-modal knowledge graph is a very complex process: entities, attributes and relations need to be extracted from a large amount of structured, semi-structured and unstructured data, and link prediction based on entity and relation representations that fuse multi-modal information can be regarded as a sub-problem of knowledge reasoning over multi-modal knowledge graphs. At present, knowledge reasoning over multi-modal knowledge graphs mainly relies on multi-modal knowledge representation learning, but such methods have the following shortcomings:
(1) After multi-modal data are introduced, quality problems related to the multi-modal data inevitably arise; existing methods do not consider the complexity of the data of each modality or the noise caused by information interaction between modalities.
(2) Existing methods focus on how to obtain more auxiliary information from the two modalities of images and text while ignoring the structural information in the knowledge graph, so the reasoning process lacks a certain degree of interpretability.
(3) Since the relationship between entities and images is mostly one-to-many, i.e. one entity corresponds to a plurality of different images associated with it, the prior art does not consider how to make better use of the information in these images.
Disclosure of Invention
In view of the above, the present invention provides a link prediction method based on entity and relation representations fusing multi-modal information, which can improve the accuracy of the link prediction task and the interpretability of multi-modal knowledge representation learning.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a link prediction method based on entity and relation expression fusing multi-mode information comprises the following steps:
collecting multi-mode data related to a knowledge graph subject to be constructed, and preprocessing; the multimodal data includes image data, text data, and triplet data;
performing entity identification, relation extraction and attribute extraction on the preprocessed triplet data, and aligning image data and text data related to the entities in the triplet;
the method comprises the steps of carrying out single-mode feature extraction on image data through a vision module, and learning key features related to entities in the image data to generate a vision representation;
the text module is used for carrying out single-mode feature extraction on the text data and the triplet data to generate text representation capable of reflecting semantic information;
the generated visual representation, text representation and triplet data are taken as input together, a fusion module is trained, and entity and relation vector representations containing multi-mode information are learned;
and decoding the feature representation learned by the fusion module through a decoding part, carrying out link prediction, and outputting the probability of predicting as a positive triplet.
Further, the preprocessing includes: performing data cleaning, data conversion and data integration operations on the image data, text data and triplet data respectively by using open-source tools.
Further, after the image data and text data related to the entities in the triplets have been aligned, the triplets together with the image data and text data corresponding to their entities are taken as a multi-modal total data set, and the multi-modal total data set is randomly divided into a training set, a validation set and a test set.
Further, the visual module comprises an input, a preference module and a visual encoder; the process of extracting features from the image data through the visual module comprises the following steps:
taking a plurality of image data corresponding to the same entity as input, sending them into the preference module to obtain the optimal image data, dividing the optimal image data into small-size patches as the input of the visual encoder, and outputting the visual representation through the visual encoder.
Further, in the preference module, irrelevant images and low-quality images are screened out through two steps, similarity calculation and sharpness evaluation, and the relatively optimal image is retained as the subsequent input; specifically:
performing similarity calculation on the plurality of images corresponding to the same entity using a perceptual hash algorithm, obtaining the similarity between images by calculating the Hamming distance, and screening out images with excessively high similarity as well as irrelevant images;
performing sharpness evaluation using a sum-of-absolute-gray-level-differences function: differencing adjacent pixels in the horizontal and vertical directions of the image, taking absolute values and accumulating them, using the accumulated value as a measure of image sharpness, selecting the image with the best sharpness, and dividing it into small-size patches as the input vectors of the visual encoder;
the visual encoder adopts the encoder structure of the Transformer architecture, and the specific encoding process is as follows:
the input vector passes through a multi-head attention layer with residual connection and layer normalization, and then through a feed-forward neural network with residual connection and layer normalization, so that the visual representation is obtained.
Further, if an entity has only one piece of corresponding image data, that image data is used directly as the input of the visual encoder;
if an entity has no corresponding image data, the input of the visual encoder is filled entirely with 0s.
Further, the text feature extraction process performed by the text module comprises the following steps:
dividing the text sequence in the text data into a plurality of sentences, adding a [CLS] token at the beginning of the whole text sequence, and adding [SEP] tokens between two sentences and at the end of the whole sequence;
splicing the triplet data and converting it into a text sequence through [CLS] and [SEP] tokens;
feeding the text embedding, the position encoding and the token-type encoding together as the input vector into the text encoder;
the text encoder adopts the encoder structure of the Transformer architecture: the input vector passes through a multi-head attention layer with residual connection and layer normalization, and then through a feed-forward neural network with residual connection and layer normalization, so that the text representation is obtained.
Further, the fusion module consists of a visual fusion encoder, a central encoder and a text auxiliary encoder;
the text auxiliary encoder adopts the Transformer encoder structure: the input text representation passes through a multi-head attention layer with residual connection and layer normalization, and then through a feed-forward neural network with residual connection and layer normalization, and the feature vector of the text is output;
the visual fusion encoder adopts the Transformer encoder structure: the input image representation first passes through a multi-head attention layer with residual connection and layer normalization, then through a feed-forward neural network with residual connection and layer normalization, and finally the feature vector of the image is output;
the central encoder adopts the encoder structure of the Transformer architecture: the text feature vector output by the text auxiliary encoder and the image feature vector output by the visual fusion encoder pass through a multi-modal attention layer with residual connection and layer normalization, and then through a feed-forward neural network with residual connection and layer normalization, and the entity and relation vector representations are output.
Further, the multi-modal attention layer operates as follows:
calculating the text modality attention value;
calculating the visual modality attention value;
for each input triplet representation, first obtaining the initial representation of the triplet structure through a linear transformation matrix, then obtaining the triplet attention value through a LeakyReLU nonlinear layer followed by a Softmax layer;
taking the text modality attention value, the visual modality attention value and the triplet attention value respectively as the weights of the text feature vector, the visual feature vector and the triplet structure feature vector, and performing weighted summation and averaging to obtain the multi-modal representation of the entity;
performing a weighted sum of the multi-modal representation of the entity and the original feature representation of the entity to obtain the final vector representation of the entity;
for the multi-modal representation of the relation between entities, taking the triplet structure information as the multi-modal representation, and performing a weighted sum of the original representation of the relation and the multi-modal representation to obtain the final vector representation of the relation;
taking the vector representations of the entities and relations as the output of the fusion module.
Further, the decoding part uses the decoder of the Transformer network architecture, and the decoding process is as follows:
inputting the target output sequence and performing masked multi-head attention with residual connection and layer normalization;
taking the output vector of the previous layer together with the output of the fusion module as the input vector, and performing multi-head attention with residual connection and layer normalization;
passing the output vector of the previous layer through a feed-forward neural network with residual connection and layer normalization;
passing the output vector of the previous layer through 1 fully connected layer and 1 softmax layer to obtain the final probability output as the output of the decoding part.
Compared with the prior art, the invention discloses a link prediction method based on entity and relation representations fusing multi-modal information. During model training, the idea of translation models is introduced to constrain the distances among the head entity, the relation and the tail entity, and the accuracy of link prediction is improved by optimizing the multi-modal knowledge representation; the interpretability of multi-modal knowledge representation learning is thereby improved, and in practical applications the method exhibits a certain robustness while guaranteeing the final effect.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic structural diagram provided by the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, the embodiment of the invention discloses a link prediction method based on entity and relationship expression fusing multi-modal information, which comprises the following steps:
S1, collecting multi-modal data related to the subject of the knowledge graph to be constructed, and preprocessing it; the multi-modal data include image data, text data and triplet data;
S2, performing entity recognition, relation extraction and attribute extraction on the preprocessed triplet data, and aligning the image data and text data related to the entities in the triplets;
S3, performing single-modality feature extraction on the image data through a visual module, and learning key features related to the entities in the image data to generate a visual representation;
S4, performing single-modality feature extraction on the text data and the triplet data through a text module to generate a text representation that reflects semantic information;
S5, taking the generated visual representation, text representation and triplet data together as input, training the fusion module, and learning entity and relation vector representations containing multi-modal information;
S6, decoding the feature representations learned by the fusion module through a decoding part, performing link prediction, and outputting the probability that a triplet is predicted to be positive.
The steps described above are further described below.
In S1, a large amount of multi-modal data related to a certain news event is collected, including text data, triplet data and image data. Multi-modal data relating to the desired news event can be found on public websites and in the GDELT database, including but not limited to news image data, news headline data, image description data, CSV data (tabular data), and the like.
After the large amount of multi-modal data has been collected, it is preprocessed: open-source tools are used to perform data cleaning, data conversion and data integration on the image data, text data and triplet data respectively, obtaining high-quality multi-modal data.
Specifically: the image data are preprocessed by directly compressing the original images to the network input size 224×224 (the original aspect ratio is not preserved); invalid information in the text data is removed using the re.sub() method in Python; invalid fields and exactly duplicated records in the structured data are deleted; special symbols and invalid characters in the text data and triplet data are replaced or deleted using regular expressions and string replacement; category information that exists in numerical form is converted into the corresponding text form; and finally the preprocessed data are integrated into a csv file.
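A minimal sketch of this preprocessing step, assuming Pillow for image resizing and Python's re and csv modules; the file paths, field names and cleaning patterns below are illustrative placeholders rather than the exact rules used by the invention.

```python
import csv
import re
from PIL import Image

def preprocess_image(src_path: str, dst_path: str, size=(224, 224)) -> None:
    """Resize an image directly to the 224x224 network input size (aspect ratio is not preserved)."""
    Image.open(src_path).convert("RGB").resize(size).save(dst_path)

def clean_text(text: str) -> str:
    """Remove invalid information and special symbols with regular expressions (illustrative patterns)."""
    text = re.sub(r"<[^>]+>", " ", text)              # strip HTML-like remnants
    text = re.sub(r"[^\w\s.,;:!?()\-]", " ", text)    # drop special symbols and invalid characters
    return re.sub(r"\s+", " ", text).strip()

def integrate(rows, out_csv="multimodal_dataset.csv") -> None:
    """Integrate cleaned (head, relation, tail, text, image_path) records into one csv file."""
    seen = set()
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["head", "relation", "tail", "text", "image_path"])
        for head, relation, tail, text, image_path in rows:
            if (head, relation, tail) in seen:        # drop completely consistent repeated records
                continue
            seen.add((head, relation, tail))
            writer.writerow([head, relation, tail, clean_text(text), image_path])
```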
In a specific embodiment, in S2, the DeepKE tool may be used to perform named entity recognition, relation extraction and attribute extraction on the triplet structure data in the preprocessed high-quality news-event multi-modal dataset, and to align the related image data and text data.
After the image data and text data related to the entities in the triplets have been aligned, the triplets together with the image data and text data corresponding to their entities are taken as the multi-modal total data set, which is randomly divided into a training set, a validation set and a test set accounting for 80%, 10% and 10% of the total data set respectively.
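A minimal sketch of the random 80/10/10 split, assuming the records of the multi-modal total data set are held in a Python list; the seed value is an arbitrary choice.

```python
import random

def split_dataset(records, seed: int = 42):
    """Randomly split the multi-modal total data set into 80% training / 10% validation / 10% test."""
    records = list(records)
    random.Random(seed).shuffle(records)
    n_train = int(0.8 * len(records))
    n_val = int(0.1 * len(records))
    return records[:n_train], records[n_train:n_train + n_val], records[n_train + n_val:]
```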
In a specific embodiment, in S3, the visual module includes an input, a preference module and a visual encoder; the process of extracting features from the image data through the visual module is as follows:
the plurality of image data corresponding to the same entity are taken as input and sent into the preference module to obtain the optimal image data; the optimal image data are divided into small-size patches as the input of the visual encoder, and the visual representation is output through the visual encoder.
Specifically, because each entity corresponds to more than one image and the image quality varies, the preference module screens out irrelevant images and low-quality images through two steps, similarity calculation and sharpness evaluation, and retains the relatively optimal image as the subsequent input, which reduces the noise introduced by low-quality image data; the two screening steps are described below and sketched in code after this description:
First, a perceptual hash algorithm is used to calculate the similarity between the plurality of images corresponding to the same entity. Each image is converted from the pixel domain to the frequency domain by the discrete cosine transform. Because the human eye is insensitive to high-frequency detail information and image content can generally be recognized from the low-frequency information alone, only the 8×8 low-frequency region in the upper-left corner is selected, giving 64 values. The discrete cosine transform value of each position is computed, the mean of all 64 values is taken, and each discrete cosine transform value is compared with the mean to obtain a 64-bit hash of 0s and 1s, where values greater than the mean are set to 1 and values smaller than the mean are set to 0. The similarity between two images is then obtained by calculating the Hamming distance, i.e. the number of positions at which the corresponding characters of the two hash values differ: if the proportion of identical bits is greater than 0.92 the two images are considered similar, and if it is smaller than 0.84 the two images are considered unrelated, so images with excessively high similarity and completely irrelevant images are screened out.
The discrete cosine transform value is computed as:
F(u,v) = c(u)·c(v)·Σ_{i=0}^{N-1} Σ_{j=0}^{N-1} f(i,j)·cos[(2i+1)uπ/(2N)]·cos[(2j+1)vπ/(2N)]
where N is the total number of elements of the one-dimensional data (N = 8 in this example), f(i,j) is the element of the input data at position (i,j), the coefficients c(u) and c(v) (equal to √(1/N) when u or v is 0 and √(2/N) otherwise) make the discrete cosine transform matrix orthogonal, and u and v denote coordinates in the frequency domain.
Sharpness evaluation is then performed using a sum-of-absolute-gray-level-differences function: adjacent pixels in the horizontal and vertical directions of the image are differenced, the absolute values are accumulated, and the accumulated value is taken as a measure of the image's sharpness; the image with the best sharpness is selected and divided into small-size patches (e.g. patches of size 14×14) as the input vectors of the visual encoder;
if an entity has only one piece of corresponding image data, that image data is used directly as the input of the visual encoder;
if an entity has no corresponding image data, the input of the visual encoder is filled entirely with 0s.
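A sketch of the two screening steps of the preference module (perceptual-hash similarity and sum-of-absolute-gray-level-differences sharpness) together with the single-image and zero-image cases, assuming NumPy, Pillow and SciPy; the 0.92/0.84 thresholds follow the text, while the function names and the resize to 32×32 before the DCT are illustrative assumptions.

```python
import numpy as np
from PIL import Image
from scipy.fftpack import dct

def phash(path: str, hash_size: int = 8) -> np.ndarray:
    """64-bit perceptual hash: grayscale -> 2D DCT -> keep the 8x8 low-frequency corner -> threshold at the mean."""
    gray = np.asarray(Image.open(path).convert("L").resize((32, 32)), dtype=np.float64)
    freq = dct(dct(gray, axis=0, norm="ortho"), axis=1, norm="ortho")
    low = freq[:hash_size, :hash_size]                # 8x8 low-frequency region, 64 values
    return (low > low.mean()).flatten()               # 1 above the mean, 0 below

def same_bit_ratio(h1: np.ndarray, h2: np.ndarray) -> float:
    """Proportion of identical hash bits, i.e. 1 - Hamming distance / 64."""
    return float(np.mean(h1 == h2))

def sharpness(path: str) -> float:
    """Sum of absolute gray-level differences of horizontally and vertically adjacent pixels."""
    g = np.asarray(Image.open(path).convert("L"), dtype=np.float64)
    return float(np.abs(np.diff(g, axis=1)).sum() + np.abs(np.diff(g, axis=0)).sum())

def select_image(paths, hi: float = 0.92, lo: float = 0.84):
    """Screen out near-duplicate and unrelated images, then keep the sharpest remaining image."""
    if not paths:
        return None                                   # caller fills the visual-encoder input with 0s
    if len(paths) == 1:
        return paths[0]                               # a single image is used directly
    kept, hashes = [], []
    for p in paths:
        h = phash(p)
        if hashes:
            best = max(same_bit_ratio(h, k) for k in hashes)
            if best > hi or best < lo:                # too similar to, or unrelated to, the kept images
                continue
        kept.append(p)
        hashes.append(h)
    return max(kept, key=sharpness)
```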
The visual encoder adopts the encoder structure of the Transformer architecture, and the specific encoding process is as follows:
in the visual encoder, the input vector passes through a multi-head attention layer with residual connection and layer normalization, and the output vector is then obtained through a feed-forward neural network with residual connection and layer normalization; the visual representation is finally obtained after passing through 6 identical visual encoder blocks.
Residual connection and layer normalization: residual connection means adding the input and the output of the network, i.e. the output of the network becomes F(x) + x. When the network structure is deep, the gradient easily vanishes when the network back-propagates to update its parameters, but if x is added to the output of each layer, the derivative with respect to x is 1, which is equivalent to adding a constant term of 1 to the derivative of each layer and effectively alleviates the vanishing-gradient problem. Layer normalization means averaging the features of each token separately and normalizing the output to a standard normal distribution, so that the input to the next layer remains relatively stable.
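The encoder block described here (multi-head attention, residual connection, layer normalization, then a feed-forward network with another residual connection and layer normalization) can be sketched as below; PyTorch, the hidden sizes and the post-norm ordering are assumptions for illustration rather than the exact configuration of the invention.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Post-norm Transformer encoder block: MHA + residual + LayerNorm, then FFN + residual + LayerNorm."""
    def __init__(self, d_model: int = 768, n_heads: int = 12, d_ff: int = 3072, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)          # multi-head self-attention
        x = self.norm1(x + attn_out)              # residual connection + layer normalization
        return self.norm2(x + self.ffn(x))        # feed-forward + residual + layer normalization

# The visual (or text) encoder stacks 6 identical blocks, as described above.
encoder = nn.Sequential(*[EncoderBlock() for _ in range(6)])
patches = torch.randn(1, 256, 768)                # e.g. one image split into 256 patch embeddings
visual_representation = encoder(patches)
```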
The visual encoder is trained by adopting the cross entropy loss function as an objective function, and the parameters of the visual encoder are continuously adjusted and optimized in the training process until the objective function converges.
In one embodiment, in S4, the text feature extraction process performed by the text module is:
Following the BERT approach, the text sequence in the text data is divided into a plurality of sentences, a [CLS] token is added at the beginning of the whole text sequence, and [SEP] tokens are added between two sentences and at the end of the whole sequence;
for the triplet data, the head entity, relation and tail entity are sequentially spliced and converted into a text sequence using [CLS] and [SEP] tokens, i.e. a [SEP] token is added between the head entity, the relation and the tail entity; the token-type (segment) encoding is extended from the original scheme, in which all tokens of the first sentence (including [CLS] and the following [SEP]) take the value 0 and all tokens of the second sentence (including the following [SEP]) take the value 1, to: all tokens of the first segment take the value 0, all tokens of the second segment take the value 1, and all tokens of the third segment take the value 0 (this sequence construction is sketched in code after the text-encoder description below).
The text is embedded, the position code and the mark code are used as input vectors together and are sent into a text encoder;
the text encoder adopts the encoder structure of the Transformer architecture: in each encoder block, the input vector passes through a multi-head attention layer with residual connection and layer normalization, and then through a feed-forward neural network with residual connection and layer normalization; the text representation is finally obtained after 6 identical text encoder blocks.
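A sketch of the sequence construction referred to above, showing how a triplet might be spliced into a BERT-style input with the three-segment token-type encoding; the whitespace tokenizer and the example triplet are simplifying assumptions.

```python
def triplet_to_inputs(head: str, relation: str, tail: str, tokenize=str.split):
    """Build [CLS] head [SEP] relation [SEP] tail [SEP] with token-type (segment) values 0 / 1 / 0."""
    tokens, segments = ["[CLS]"], [0]
    for seg_id, part in zip((0, 1, 0), (head, relation, tail)):
        part_tokens = tokenize(part) + ["[SEP]"]
        tokens += part_tokens
        segments += [seg_id] * len(part_tokens)
    positions = list(range(len(tokens)))           # indices used for the position encoding
    return tokens, segments, positions

tokens, segments, positions = triplet_to_inputs("Barack Obama", "president of", "United States")
# tokens   : ['[CLS]', 'Barack', 'Obama', '[SEP]', 'president', 'of', '[SEP]', 'United', 'States', '[SEP]']
# segments : [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
```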
The text encoder is trained by adopting the cross entropy loss function as an objective function, and parameters of the text encoder are continuously adjusted and optimized in the training process until the objective function converges.
In one embodiment, in S5,
the fusion module consists of a visual fusion encoder, a central encoder and a text auxiliary encoder. S3 and S4 perform coarse-grained feature extraction; the obtained representations of the image, the text and the triplets are used as input to train the fusion module, and low-dimensional vector representations of entities and relations are produced as output. The visual fusion encoder and the text auxiliary encoder in the fusion module further extract fused features from the coarse-grained features extracted in the previous steps.
The input text representation passes through the text auxiliary encoder, which adopts the Transformer encoder structure: it first goes through a multi-head attention layer with residual connection and layer normalization, and then through a feed-forward neural network with residual connection and layer normalization, outputting the feature vector of the text. The text auxiliary attention takes the standard scaled dot-product form:
Attn_t = softmax((x_t·W_t^Q)(x_t·W_t^K)^T / √d_k)·(x_t·W_t^V)
where x_t is the input text representation vector, W_t^Q, W_t^K and W_t^V are weight matrices, and d_k is the dimension of the key vectors.
The input image representation passes through the visual fusion encoder, which adopts the Transformer encoder structure: it first goes through a multi-head attention layer with residual connection and layer normalization, where, when the multi-head attention is computed, the PGI method of the MKGformer model is adopted and the matrices K and V of the standard attention in the text auxiliary encoder are passed into the visual fusion encoder; it then goes through a feed-forward neural network with residual connection and layer normalization and finally outputs the feature vector of the image. The visual attention is:
Attn_v = softmax((x_v·W_v^Q)·K_t^T / √d_k)·V_t
where x_v is the input image representation vector, W_v^Q is a weight matrix, and K_t and V_t are the key and value matrices passed from the text auxiliary encoder.
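A sketch of how the K and V matrices produced in the text auxiliary encoder could be passed into the visual fusion encoder's attention, following the PGI idea mentioned above; the single-head formulation and the projection shapes are simplifying assumptions rather than the exact MKGformer implementation.

```python
import math
import torch
import torch.nn as nn

class VisualFusionAttention(nn.Module):
    """Visual attention whose keys and values come from the text auxiliary encoder (PGI-style sharing)."""
    def __init__(self, d_model: int = 768):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)     # query projection on the visual side

    def forward(self, x_v: torch.Tensor, k_text: torch.Tensor, v_text: torch.Tensor) -> torch.Tensor:
        q = self.w_q(x_v)                                            # (batch, n_patches, d_model)
        scores = q @ k_text.transpose(-2, -1) / math.sqrt(q.size(-1))
        return torch.softmax(scores, dim=-1) @ v_text                # attend over the text keys/values
```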
The central encoder also uses the encoder structure of the Transformer architecture: it first performs multi-modal attention with residual connection and layer normalization, then applies a feed-forward neural network with residual connection and layer normalization. It is trained with the max-margin loss function as the objective function, the encoder parameters are continuously adjusted and optimized during training until the objective function converges, and the representation vectors of entities and relations are output.
The max-margin loss function learns higher-quality representations by maximizing the margin between positive and negative samples. The basic idea is: for a triplet <h, l, t> in the knowledge graph, training drives its score to be high and drives the scores of triplets outside the knowledge graph to be low, maximizing the margin between them so as to optimize the knowledge representation. Specifically:
L = Σ_{(h,l,t)∈S} Σ_{(h′,l,t′)∈S′_{h,l,t}} [γ + d(h + l, t) - d(h′ + l, t′)]_+
S′_{h,l,t} = {(h′, l, t) | h′ ∈ E} ∪ {(h, l, t′) | t′ ∈ E}
where S is the set of correct triplets, S′ is the set of incorrect triplets obtained by replacing h or t with another entity from the entity set E, d(·,·) is the distance between the translated head-entity vector and the tail-entity vector, γ is the margin hyper-parameter, and [x]_+ is the positive-part function, i.e. [x]_+ = x when x > 0 and [x]_+ = 0 when x ≤ 0.
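A sketch of this max-margin objective with corrupted triplets, assuming PyTorch; the L2 distance and the way negatives are supplied are assumptions consistent with the translation-model idea described here.

```python
import torch

def margin_loss(h, l, t, h_neg, t_neg, gamma: float = 1.0) -> torch.Tensor:
    """[gamma + d(h + l, t) - d(h' + l, t')]_+ averaged over a batch of embeddings."""
    d_pos = torch.norm(h + l - t, p=2, dim=-1)            # distance for triplets inside the graph
    d_neg = torch.norm(h_neg + l - t_neg, p=2, dim=-1)    # distance for corrupted triplets
    return torch.clamp(gamma + d_pos - d_neg, min=0).mean()

# usage: corrupt either the head or the tail with a randomly chosen entity embedding, e.g.
# loss = margin_loss(E[h_idx], R[l_idx], E[t_idx], E[random_entity_idx], E[t_idx])
```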
The multi-modal attention layer operates as follows:
The text modality attention value head_t is computed, taking the output of the text auxiliary encoder as input:
head_t = softmax(Q·K^T / √d_k)·V
where Q, K and V are the matrices corresponding to the query, key and value vectors, obtained from the input with the weight matrices W^Q, W^K and W^V.
The visual modality attention value head_v is computed, taking the output of the visual fusion encoder as input:
head_v = softmax(Q·K^T / √d_k)·V
where Q, K and V are the matrices corresponding to the query, key and value vectors, obtained from the input with the weight matrices W^Q, W^K and W^V.
For each input triplet representation <t_headentity_i, t_relation_j, t_tailentity_k>, the initial representation head⁰_ijk of the triplet structure is first obtained through a linear transformation, head¹_ijk is obtained after a LeakyReLU nonlinear layer, and the triplet attention value head_ijk is obtained through a Softmax layer, where W_1 and W_2 are the linear transformation matrices used.
The text modality attention value, the visual modality attention value and the triplet attention value are used respectively as the weights of the text feature vector, the visual feature vector and the triplet structure feature vector, and the weighted sum is averaged to obtain the multi-modal representation E_Multi of the entity:
E_Multi = (head_t·x_t + head_v·x_v + head_ijk·t_triple) / M
The multi-modal representation of the entity and the original feature representation of the entity are combined in a weighted sum to obtain the final vector representation E of the entity:
E = α·t_entity + (1 - α)·E_Multi, 0 < α < 1
For the multi-modal representation of the relation between entities, the triplet structure information is taken as the multi-modal representation R_Multi, and the original representation of the relation and the multi-modal representation are combined in a weighted sum to obtain the final vector representation R of the relation:
R = β·t_relation + (1 - β)·R_Multi, 0 < β < 1
where M is the number of modalities, which takes the value 3; x_v is the visual representation vector; x_t is the text representation vector; t_triple is the triplet structure feature vector; t_entity is the vector representation of the entity; t_relation is the vector representation of the relation; α and β are weight parameters that are randomly initialized at the start of training, flexibly allocate the weight of the multi-modal information in the entity and relation representations, and reach their optimum during training.
The vector representation of the entities and relationships is taken as the output of the fusion module.
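A sketch of the weighted fusion of the three modality features and the final entity/relation representations, assuming PyTorch; treating head_t, head_v and head_ijk as per-entity scalar weights is one interpretation of the description above, not the only possible one.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """E = alpha * t_entity + (1 - alpha) * E_Multi; R = beta * t_relation + (1 - beta) * R_Multi."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.rand(1))   # randomly initialized, optimized during training
        self.beta = nn.Parameter(torch.rand(1))

    def forward(self, t_entity, t_relation, x_t, x_v, t_struct, head_t, head_v, head_ijk):
        m = 3                                                       # number of modalities
        e_multi = (head_t * x_t + head_v * x_v + head_ijk * t_struct) / m
        alpha = torch.sigmoid(self.alpha)                           # keep 0 < alpha < 1
        beta = torch.sigmoid(self.beta)
        entity = alpha * t_entity + (1 - alpha) * e_multi           # final entity representation E
        relation = beta * t_relation + (1 - beta) * t_struct        # final relation representation R
        return entity, relation
```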
In one embodiment, the decoding part uses the decoder of the Transformer network architecture; the decoding process consists of the following steps, after which an illustrative sketch of such a decoder is given:
inputting the target output sequence and performing masked multi-head attention with residual connection and layer normalization;
taking the output vector of the previous layer together with the output of the fusion module as the input vector, and performing multi-head attention with residual connection and layer normalization;
passing the output vector of the previous layer through a feed-forward neural network with residual connection and layer normalization;
passing the output vector of the previous layer through 1 fully connected layer and 1 softmax layer to obtain the final probability output as the output of the decoding part.
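A sketch of such a decoder, assuming PyTorch; the masked self-attention, the cross-attention over the fusion-module output, the feed-forward network and the final fully connected + softmax layers follow the steps above, while the dimensions and the two-class output are assumptions.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 12, d_ff: int = 3072):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, tgt, memory, tgt_mask=None):
        x, _ = self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask)    # masked multi-head self-attention
        tgt = self.norm1(tgt + x)                                   # residual + layer normalization
        x, _ = self.cross_attn(tgt, memory, memory)                 # attend over the fusion-module output
        tgt = self.norm2(tgt + x)
        return self.norm3(tgt + self.ffn(tgt))                      # feed-forward + residual + layer norm

class LinkPredictionHead(nn.Module):
    """Fully connected layer + softmax giving the probability that the triplet is positive."""
    def __init__(self, d_model: int = 768, n_classes: int = 2):
        super().__init__()
        self.block = DecoderBlock(d_model)
        self.fc = nn.Linear(d_model, n_classes)

    def forward(self, tgt, fusion_output):
        h = self.block(tgt, fusion_output)
        return torch.softmax(self.fc(h[:, 0]), dim=-1)              # probability of a positive triplet
```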
The embodiment of the invention also performs the following verification on the method.
The hardware used by the invention is: CPU: Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz; GPU: GeForce RTX 3090 Ti with 24 GB of video memory; memory: 128576 MB. The software is: operating system: Linux Ubuntu 64-bit, CUDA (11.1), cuDNN (8.0), Python (3.6), Python (3.8). Hits@10, mean reciprocal rank (MRR) and mean rank (MR) are used as the evaluation metrics for link prediction based on the entity and relation representations fusing multi-modal information. Hits@10 is the average proportion of correct triplets ranked among the top 10 predictions; the larger this metric, the better. The mean reciprocal rank (MRR) is the average of the reciprocal ranks of the correct triplets; the larger, the better. The mean rank (MR) is the average rank of the correct triplets; the smaller, the better.
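A sketch of how the three reported metrics are typically computed from the rank of the correct triplet among all scored candidates; the 1-based, unfiltered ranking convention is an assumption.

```python
def link_prediction_metrics(ranks, k: int = 10):
    """ranks: 1-based rank of the correct triplet among all candidates, one value per test query."""
    n = len(ranks)
    mr = sum(ranks) / n                                  # mean rank (MR), smaller is better
    mrr = sum(1.0 / r for r in ranks) / n                # mean reciprocal rank (MRR), larger is better
    hits_at_k = sum(1 for r in ranks if r <= k) / n      # Hits@10, larger is better
    return {"MR": mr, "MRR": mrr, f"Hits@{k}": hits_at_k}

print(link_prediction_metrics([1, 3, 25, 7, 2]))
# {'MR': 7.6, 'MRR': 0.4032..., 'Hits@10': 0.8}
```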
The prediction method of the present invention and the existing methods were tested on the same set of test data; the test results of each method are shown in Table 1 below.
Table 1 comparison of test results of various methods
Compared with the existing methods, and with the overall difference in final effect not being large, the method of the invention is superior on every evaluation metric of link prediction, so the method of the invention can improve the accuracy of the link prediction task. Compared with some existing methods, the central encoder adopts the max-margin loss function, which minimizes the distance between the head-entity vector plus the relation vector and the tail-entity vector for triplets in the knowledge graph and maximizes this distance for triplets outside the knowledge graph, thereby constraining the vector representations of the head entity, the tail entity and the relation between them; this conforms to the translation-model idea, i.e. the relation is regarded as a translation from the head entity to the tail entity, so the interpretability of multi-modal knowledge representation learning can be improved. Compared with some existing methods, when there are too many images the method of the invention screens them through the preference module, and when image data are missing the subsequent steps are performed by filling with 0s, so the method can well mitigate the degradation of the model caused by noise in the image data, and has a certain robustness while guaranteeing the final effect.
In this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and identical or similar parts of the embodiments may be referred to one another. Since the device disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively brief, and for the relevant points reference may be made to the description of the method.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A link prediction method based on entity and relation expression fusing multi-mode information is characterized by comprising the following steps:
collecting multi-mode data related to a knowledge graph subject to be constructed, and preprocessing; the multimodal data includes image data, text data, and triplet data;
performing entity identification, relation extraction and attribute extraction on the preprocessed triplet data, and aligning image data and text data related to the entities in the triplet;
the method comprises the steps of carrying out single-mode feature extraction on image data through a vision module, and learning key features related to entities in the image data to generate a vision representation;
the text module is used for carrying out single-mode feature extraction on the text data and the triplet data to generate text representation capable of reflecting semantic information;
the generated visual representation, text representation and triplet data are taken as input together, a fusion module is trained, and entity and relation vector representations containing multi-mode information are learned;
and decoding the feature representation learned by the fusion module through a decoding part, carrying out link prediction, and outputting the probability of predicting as a positive triplet.
2. The method of link prediction based on an entity and relationship representation fusing multimodal information as claimed in claim 1, wherein the preprocessing comprises: and performing data cleaning, data conversion and data integration operation on the image data, the text data and the triplet data respectively by using an open source tool.
3. The method for link prediction based on fusion of entity and relational expression of multimodal information according to claim 1, wherein after the image data and text data related to the entity in the triplet are aligned therewith, the triplet and the image data and text data corresponding to the entity therein are taken together as a multimodal total data set, and the multimodal total data set is randomly divided into a training set, a verification set and a test set.
4. The method of claim 1, wherein the visual module comprises an input, a preference module, and a visual encoder; the process of extracting the characteristics of the image data through the vision module comprises the following steps:
and taking a plurality of image data corresponding to the same entity as input, sending the input data into the preferential module to obtain optimal image data, dividing the optimal image data into small-size blocks as input of the visual encoder, and outputting visual representation through the visual encoder.
5. The method for predicting a link based on an entity and a relationship expression fusing multimodal information as set forth in claim 4, wherein in the preference module, irrelevant images and low-quality images are screened out through two steps of similarity calculation and sharpness evaluation, and a relatively optimal image is reserved as a subsequent input; the method specifically comprises the following steps:
performing similarity calculation on a plurality of images corresponding to the same entity by adopting a perceptual hash algorithm, obtaining the similarity between the images by calculating the Hamming distance, and screening out images with over-high similarity and irrelevant images;
performing definition evaluation by adopting a gray level difference absolute value summation function, performing difference on adjacent pixels in the horizontal and vertical directions of an image, accumulating after taking the absolute value, taking the accumulated value as the representation of the definition of the image, screening out an image with optimal definition, and dividing the image into small-size blocks as input vectors of the visual encoder;
the visual encoder adopts an encoder structure in a Transformer architecture, and the specific encoding process is as follows:
the input vector is first subjected to multi-head attention layer and residual connection and layer normalization operation, and then subjected to feedforward neural network and residual connection and layer normalization operation, so that visual representation is obtained.
6. The method for predicting the link based on the entity and the relational expression fusing multimodal information as recited in claim 4, wherein if the same entity has only one piece of corresponding image data, the image data is directly used as the input of the visual encoder;
if the same entity contains zero corresponding image data, the inputs of the visual encoder are all filled with 0 s.
7. The method for predicting links based on the fusion of entity and relationship representations of multimodal information according to claim 1, wherein the text feature extraction by the text module is as follows:
dividing a text sequence in text data into a plurality of sentences, adding a [ CLS ] mark at the beginning of the whole text sequence, and adding a [ SEP ] mark between two sentences and at the end of the whole sequence;
splicing and converting the triplet data into a text sequence through a [ CLS ] mark and a [ SEP ] mark;
the text is embedded, the position code and the mark code are used as input vectors together and are sent into a text encoder;
the text encoder adopts an encoder structure in a Transformer architecture, input vectors are subjected to multi-head attention layer and residual connection and layer normalization operation, and then subjected to feedforward neural network and residual connection and layer normalization operation, so that text representation is obtained.
8. The method for link prediction based on an entity and relationship representation fusing multimodal information as claimed in claim 1, wherein: the fusion module consists of a visual fusion encoder, a central encoder and a text auxiliary encoder;
the text auxiliary encoder adopts an encoder structure of a Transformer, the input text representation is subjected to multi-head attention layer and residual connection and layer normalization operation, and then the feature vector of the text is output through feedforward neural network and residual connection and layer normalization operation;
the visual fusion encoder adopts an encoder structure of a Transformer, an input image representation is first subjected to multi-head attention layer and residual connection and layer normalization operation, then subjected to feedforward neural network and residual connection and layer normalization operation, and finally a feature vector of the image is output;
the central encoder adopts an encoder structure in a Transformer architecture, a text feature vector output by the text auxiliary encoder and an image feature vector output by the visual fusion encoder are subjected to multi-modal attention layer and residual connection and layer normalization operation, and then subjected to feedforward neural network and residual connection and layer normalization operation, and entity and relation vector representations are output.
9. The method for link prediction based on an entity and relationship representation fusing multimodal information as claimed in claim 1, wherein: the operation process of the multi-mode attention layer is as follows:
calculating a text modal attention value;
calculating a visual modality attention value;
for each input triple representation, firstly obtaining an initial representation of a triple structure through a linear transformation matrix, and obtaining a triple attention value through a LeakyRelu nonlinear layer and then a Softmax layer;
respectively taking the text modal attention value, the visual modal attention value and the triplet attention value as weights of text feature vectors, visual feature vectors and triplet structure feature vectors, and carrying out weighted summation and averaging to obtain multi-modal representation of the entity;
the multi-modal representation of the entity and the original characteristic representation of the entity are weighted and summed to obtain the final vector representation of the entity;
for multi-modal representation of the relation between the entities, triplet structure information is taken as the multi-modal representation, and the original representation of the relation and the multi-modal representation are subjected to weighted summation to obtain the vector representation of the final relation;
the vector representation of the entities and relationships is taken as the output of the fusion module.
10. The method for link prediction based on an entity and relationship representation fusing multimodal information as set forth in claim 1, wherein the decoding section uses a decoder in a Transformer network architecture, the decoding process being:
inputting a target output sequence, and carrying out layer normalization operation through a mask multi-head attention layer and residual connection;
the output vector of the upper layer is taken as an input vector together with the output of the fusion module, and the multi-head attention layer and residual error connection and layer normalization operation are carried out;
carrying out layer normalization operation on the output vector of the upper layer through a feedforward neural network and residual connection;
and the output vector of the upper layer passes through 1 full-connection layer and 1 softmax layer, and a final probability output result is obtained as the output of the decoding part.
CN202310641906.0A 2023-06-01 2023-06-01 Link prediction method based on entity and relation expression fusing multi-mode information Pending CN116680343A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310641906.0A CN116680343A (en) 2023-06-01 2023-06-01 Link prediction method based on entity and relation expression fusing multi-mode information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310641906.0A CN116680343A (en) 2023-06-01 2023-06-01 Link prediction method based on entity and relation expression fusing multi-mode information

Publications (1)

Publication Number Publication Date
CN116680343A true CN116680343A (en) 2023-09-01

Family

ID=87781796

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310641906.0A Pending CN116680343A (en) 2023-06-01 2023-06-01 Link prediction method based on entity and relation expression fusing multi-mode information

Country Status (1)

Country Link
CN (1) CN116680343A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117151220A (en) * 2023-10-27 2023-12-01 北京长河数智科技有限责任公司 Industry knowledge base system and method based on entity link and relation extraction
CN117151220B (en) * 2023-10-27 2024-02-02 北京长河数智科技有限责任公司 Entity link and relationship based extraction industry knowledge base system and method
CN117726721A (en) * 2024-02-08 2024-03-19 湖南君安科技有限公司 Image generation method, device and medium based on theme drive and multi-mode fusion
CN117726721B (en) * 2024-02-08 2024-04-30 湖南君安科技有限公司 Image generation method, device and medium based on theme drive and multi-mode fusion
CN118114124A (en) * 2024-04-26 2024-05-31 武汉大学 Text-guided controllable portrait generation method, system and equipment based on diffusion model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination