CN111563192A - Entity alignment method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN111563192A
CN111563192A (application CN202010349062.9A; granted as CN111563192B)
Authority
CN
China
Prior art keywords
entity
vector
graph
matrix
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010349062.9A
Other languages
Chinese (zh)
Other versions
CN111563192B (en)
Inventor
张涵宇
李雪莲
程序
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010349062.9A priority Critical patent/CN111563192B/en
Publication of CN111563192A publication Critical patent/CN111563192A/en
Application granted granted Critical
Publication of CN111563192B publication Critical patent/CN111563192B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)
  • Navigation (AREA)

Abstract

The application relates to the field of artificial intelligence, and discloses an entity alignment method and apparatus, an electronic device, and a storage medium. Entity alignment is performed using machine learning so as to improve the accuracy with which a feature vector describes an entity, and thereby the accuracy of entity alignment. The method includes the following steps: for each entity in any knowledge graph, obtaining a feature vector of the entity based on the embedded vector of the entity and the embedded vectors of the entities associated with it in that knowledge graph, where the embedded vector of an entity is obtained from its attribute information; and determining identical entities in any two knowledge graphs based on the feature vector of each entity.

Description

Entity alignment method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to an entity alignment method and apparatus, an electronic device, and a storage medium.
Background
With the development of the internet, the volume of data information has grown explosively, and the data is redundant and large in scale. To address these problems, knowledge graph technology, which aims to describe real-world entities and the relationships among them, was developed. Most conventional entity alignment methods project the attribute information of two entities into the same space, obtain the two entities' feature vectors in that space, and decide whether the two entities are the same based on the distance between the two feature vectors: the smaller the distance, the greater the probability that the two entities are the same. However, conventional entity alignment methods ignore the associations that exist between entities, which reduces the accuracy of entity alignment.
Disclosure of Invention
The embodiments of the application provide an entity alignment method and apparatus, an electronic device, and a storage medium, so as to improve the accuracy with which a feature vector describes an entity and thereby improve the accuracy of entity alignment.
In one aspect, an embodiment of the present application provides an entity alignment method, including:
for each entity in any knowledge graph, obtaining a feature vector of the entity based on the embedded vector of the entity and the embedded vectors of the entities associated with it in that knowledge graph, where the embedded vector of an entity is obtained based on its attribute information;
and determining identical entities in any two knowledge graphs based on the feature vector of each entity.
In one aspect, an embodiment of the present application provides an entity alignment apparatus, including:
a feature vector acquisition module, configured to obtain, for each entity in any knowledge graph, a feature vector of the entity based on the embedded vector of the entity and the embedded vectors of the entities associated with it in that knowledge graph, where the embedded vector of an entity is obtained based on its attribute information;
and an alignment module, configured to determine identical entities in any two knowledge graphs based on the feature vector of each entity.
In one aspect, an embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of any one of the methods when executing the computer program.
In one aspect, an embodiment of the present application provides a computer-readable storage medium having stored thereon computer program instructions, which, when executed by a processor, implement the steps of any of the above-described methods.
In one aspect, an embodiment of the present application provides a computer program product comprising a computer program stored on a computer-readable storage medium, the computer program comprising program instructions that, when executed by a processor, implement the steps of any of the methods described above.
The entity alignment method and apparatus, electronic device, and storage medium provided by the embodiments of the application obtain the feature vector of an entity from the entity's own embedded vector and the embedded vectors of the associated entities around it, and determine identical entities based on the obtained feature vectors. The feature vector of an entity therefore contains not only the entity's own attribute information but also the attribute information of its associated entities. This enriches the information represented in the feature vector, improves the accuracy with which the feature vector describes the entity, and, since entity alignment is performed on these feature vectors, helps improve the accuracy of entity alignment and, in turn, knowledge graph fusion. Furthermore, the embedded vector of an entity can be obtained from both its attribute information and its picture information, so that the extracted embedded vector covers multiple aspects of the entity, further improving the accuracy of entity alignment.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the embodiments are briefly described below. Obviously, the drawings described below cover only some embodiments of the present application, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic view of an application scenario of an entity alignment method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of an entity alignment method according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating a process of obtaining an embedded vector of an entity according to an embodiment of the present application;
FIG. 4 is a graph of data related to a game A crawled from a game website B, according to an embodiment of the present application;
fig. 5 is a schematic diagram of a network structure of DenseNet used in an embodiment of the present application;
FIG. 6 is a schematic diagram of an undirected graph and an adjacency matrix corresponding to a knowledge-graph provided in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a propagation layer including a two-layer graph convolutional neural network and a two-layer residual network according to an embodiment of the present application;
FIG. 8 is a schematic flow chart of training a propagation layer according to an embodiment of the present application;
FIG. 9 is comparative test data of the entity alignment method provided in the embodiments of the present application and several common entity alignment methods;
FIG. 10A is a schematic interface diagram of a game website after merging a plurality of data sources according to an embodiment of the present application;
FIG. 10B is a schematic interface diagram of a detailed interface corresponding to a game according to an embodiment of the present disclosure;
FIG. 10C is a schematic diagram of an interface of a game platform in an applet after merging a plurality of data sources according to an embodiment of the present application;
FIG. 10D is a schematic diagram of an interface for searching a game through an applet according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of an entity alignment apparatus according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
For convenience of understanding, terms referred to in the embodiments of the present application are explained below:
artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, spanning both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.
Knowledge Graph (Knowledge Graph): aims to describe entity concepts of all kinds and their interrelationships. A knowledge graph is generally built from "entity-relation-entity" triples, with each entity possessing its corresponding attributes. Large-scale knowledge graphs often contain hundreds of millions of entities, hundreds of billions of attributes, and billions of relationships, mined from large bodies of structured and unstructured data. With a purpose-built knowledge graph and natural language understanding technology constructed on top of it, a machine can fully exercise its capacity for reasoning and judgment, answer questions relatively accurately, and extend the reach of its intelligence.
Entity (entity): things that exist objectively and can be distinguished from each other, including concrete people and things as well as abstract concepts or relationships. In a knowledge graph, a node represents an entity and the edges between nodes represent the associations between entities; the entities in a knowledge graph may be people, animals, plants, books, movies, events, and so on.
Entity Alignment (Entity Alignment): also called Entity Matching (Entity Matching), an important step in knowledge graph construction. It means finding, for each entity across different knowledge graphs, the entities that refer to the same real-world entity, i.e., finding identical entities in different knowledge graphs. A common approach is to use the attribute information of entities to decide whether entities in different knowledge graphs can be aligned, i.e., whether they are the same entity.
Graph Convolutional Network (GCN): a neural network that performs convolution operations on graphs. Its core idea is as follows: first, each node transforms its own feature information and sends it to its neighbor nodes, extracting and transforming the node's features; second, each node aggregates the feature information of its neighbors, fusing the node's local structural information; third, the aggregated information is passed through a nonlinear transformation, increasing the expressive power of the model. A graph convolutional network can thus fully fuse the feature information of related nodes and increase the expressive power of the model.
Any number of elements in the drawings are by way of example and not by way of limitation, and any nomenclature is used solely for differentiation and not by way of limitation.
In practice, most conventional entity alignment methods project the attribute information of two entities into the same space, obtain the two entities' feature vectors in that space, and decide whether the two entities are the same based on the distance between the two feature vectors; the smaller the distance, the greater the probability that the two entities are the same. However, conventional entity alignment methods ignore the associations that exist between entities, which reduces the accuracy of entity alignment.
Therefore, the inventors of the present application considered obtaining the feature vector of an entity from the entity's own embedded vector and the embedded vectors of at least one associated entity around it, and determining identical entities from the resulting feature vectors, where an embedded vector is data characterizing the entity's own attribute features, obtained from the entity's attribute information. The feature vector of an entity then contains both the entity's own attribute information and the attribute information of its associated entities, which enriches the information represented in the feature vector, improves the accuracy with which the feature vector describes the entity, and helps improve the accuracy of entity alignment. Specifically, the entity alignment method provided by the embodiments of the present application includes: for each entity in any knowledge graph, obtaining a feature vector of the entity based on the embedded vector of the entity and the embedded vectors of the entities associated with it in that knowledge graph; and determining identical entities in any two knowledge graphs based on the feature vector of each entity. Furthermore, the embedded vector of an entity can be obtained from both its attribute information and its picture information, so that the extracted embedded vector covers multiple aspects of the entity, further improving the accuracy of entity alignment.
In addition, considering the rich relation data describing the associations between entities in a knowledge graph, and the huge volume of knowledge graph data, the feature vectors of the entities can be obtained quickly and in batches with a graph convolutional network, improving the processing efficiency of entity alignment. At the same time, a graph convolutional network fuses the attribute information of an entity with that of its surrounding associated entities well, which assists entity alignment.
Having introduced the design concept of the embodiments of the present application, application scenarios to which the technical solution can be applied are briefly described below. It should be noted that the scenarios described below are only used to illustrate the embodiments of the present application and are not limiting. In specific implementation, the technical solution provided by the embodiments can be applied flexibly according to actual needs.
Fig. 1 is a schematic view of an application scenario of the entity alignment method according to an embodiment of the present application. The scenario includes a processing device 101 and a database 102, connected through a wireless or wired communication network. The processing device 101 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, big data, and artificial intelligence platforms. The database 102 may be a cloud-server-based database storing a large number of knowledge graphs to be processed. The processing device 101 obtains knowledge graphs to be processed from the database 102; for each entity in any knowledge graph, it obtains the entity's feature vector based on the entity's embedded vector and the embedded vectors of the entities associated with it in that graph, and then determines the identical entities in any two knowledge graphs, i.e., the entity alignment result, based on the feature vector of each entity. Further, the processing device 101 may merge at least two knowledge graphs into one based on the entity alignment result and store the merged knowledge graph in the database 102, so that services such as intelligent search and intelligent question answering can be implemented on the merged knowledge graph.
Of course, the method provided in the embodiment of the present application is not limited to be used in the application scenario shown in fig. 1, and may also be used in other possible application scenarios, and the embodiment of the present application is not limited. The functions that can be implemented by each device in the application scenario shown in fig. 1 will be described in the following method embodiments, and will not be described in detail herein.
To further illustrate the technical solutions provided by the embodiments of the present application, a detailed description is given below with reference to the accompanying drawings. Although the embodiments of the present application present the method operation steps shown in the following embodiments or figures, the method may include more or fewer steps based on conventional or non-inventive labor. For steps with no necessary logical causal relationship, the execution order is not limited to that provided by the embodiments of the present application.
The following describes the technical solution provided in the embodiment of the present application with reference to the application scenario shown in fig. 1.
Referring to fig. 2, an embodiment of the present application provides an entity alignment method, which can be applied to the processing device 101 shown in fig. 1, and specifically includes the following steps:
s201, aiming at each entity in any knowledge graph, obtaining a feature vector of the entity based on the embedded vector of the entity and the embedded vector of the entity having an association relation with the entity in any knowledge graph.
The embedded vector of an entity is obtained based on its attribute information and is used to represent the entity's attribute features. Specifically, the attribute information of each entity in the knowledge graph can be projected into the same vector space to obtain each entity's embedded vector.
In particular, the processing device may obtain at least two knowledge graphs from the database, each of which includes the attribute information of its entities and the association relationships between them. The processing device determines, from the association relationships in a knowledge graph, the entities associated with each entity, and obtains each entity's feature vector from its own embedded vector and the embedded vectors of its associated entities.
The feature vector of the entity obtained in step S201 includes both the attribute information of the entity itself and the attribute information of the entity associated with the entity.
S202, determining the same entity in any two knowledge graphs based on the feature vector of each entity.
In specific implementation, the distance between the feature vector of any entity E1 in knowledge graph KG1 and the feature vectors of all entities in knowledge graph KG2 may be calculated. If the minimum distance is less than a distance threshold, the two entities corresponding to that minimum distance are determined to be the same entity; if the minimum distance is not less than the threshold, it is determined that KG2 contains no entity identical to entity E1 in KG1. The distance between two feature vectors can be calculated with algorithms such as Euclidean distance, cosine similarity, or Manhattan distance, and the distance threshold can be set for the actual application scenario; the embodiments of the present application are not limited in this respect.
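The nearest-neighbor matching described above can be sketched as follows; the function name, the Euclidean metric, and the threshold value are illustrative choices, not fixed by the patent:

```python
import numpy as np

def align_entities(feats_kg1, feats_kg2, threshold=0.5):
    """For each entity in KG1, find the closest entity in KG2 by Euclidean
    distance; pairs closer than `threshold` are treated as the same entity
    (smaller distance = more likely the same real-world entity)."""
    matches = {}
    for i, f1 in enumerate(feats_kg1):
        dists = np.linalg.norm(feats_kg2 - f1, axis=1)  # distance to every KG2 entity
        j = int(np.argmin(dists))                       # nearest candidate in KG2
        if dists[j] < threshold:
            matches[i] = j                              # aligned pair (KG1 index -> KG2 index)
    return matches
```

For cosine similarity the comparison would be inverted (larger similarity means a closer pair); the sketch keeps the distance formulation used in the text.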
In specific implementation, the identical entities in two knowledge graphs are determined based on the feature vectors of the entities in the two graphs, and each pair of identical entities is fused into one entity, so that the two knowledge graphs are fused into one. Further, more knowledge graphs can be fused into one in the same manner.
The entity alignment method provided by the embodiments of the application obtains the feature vector of an entity from the entity's own embedded vector and the embedded vectors of the associated entities around it, and determines identical entities based on the obtained feature vectors. The feature vector of an entity therefore contains not only the entity's own attribute information but also the attribute information of its associated entities, which enriches the information represented in the feature vector, improves the accuracy with which the feature vector describes the entity, and, since entity alignment is performed on these feature vectors, helps improve the accuracy of entity alignment and thus of knowledge graph fusion.
Further, referring to fig. 3, the embedded vectors of the respective entities may be obtained by:
s301, acquiring attribute information of the entity from a knowledge graph to which the entity belongs, wherein the attribute information comprises text information describing the attribute of the entity and picture information of the entity.
When a knowledge graph is constructed from crawled data, the crawled data can be divided into text information and picture information. Referring to fig. 4, the data crawled for game A from game website B includes descriptive text for game A, such as the game name, developer, copyright, and a brief description; this text is cleaned to obtain the text information of the entity "game A". The crawled data also includes pictures from game A, which serve as the picture information.
S302, extracting text characteristic vectors from the text information of the entity.
Specifically, each character in the text information of the entity may be converted into a corresponding initial vector, and based on the initial vector of each character, a text feature vector including context information is obtained.
For example, the Chinese text "游戏工作室" ("game studio") contains five characters, and each character is initialized as a 20-dimensional vector; for instance, the initial vector of the first character might be [1, 0.8, …, 0.75] and that of the second [0.5, 0.12, …, 0.65]. The purpose of vector initialization is to map each character into the same vector space, where each character corresponds to a unique vector. A text feature vector can then be obtained from the characters' initial vectors; specifically, the initial vectors of all characters in the text information can be input into a Bag-of-Words model (BoW), a Long Short-Term Memory network (LSTM), or an N-gram model to obtain a text feature vector containing context information.
An N-gram model can mine the contextual relationships in text information well, so deeper contextual features can be extracted with it, improving the accuracy of the text feature vector; moreover, an N-gram model can be used without training. Specifically, the text feature vector may be obtained with the following N-gram formula:
f_a(a) = (1/t) Σ_{i=1}^{t} ( Σ_{j=i}^{i+N-1} c_j ) / N

where a denotes the text information, f_a(a) the text feature vector, N the window size used by the N-gram model (the initial vectors of N consecutive characters participate in each window), c_j the initial vector of the j-th character in the text information, and t the number of characters in the text information. In practical application, c_j = 0 for j > t; that is, windows extending past the end of the text are zero-padded.
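A minimal sketch of one common N-gram composition consistent with the symbols above (a sliding window of N character vectors, averaged per window and then over all windows; the exact form used by the patent is an assumption):

```python
import numpy as np

def ngram_text_feature(char_vectors, N=2):
    """Compose per-character initial vectors into one text feature vector.
    char_vectors: (t, dim) array, one row per character.
    Windows past the end of the text are zero-padded (c_j = 0 for j > t)."""
    t, dim = char_vectors.shape
    padded = np.vstack([char_vectors, np.zeros((N - 1, dim))])  # zero padding
    windows = [padded[i:i + N].mean(axis=0) for i in range(t)]  # average each N-window
    return np.mean(windows, axis=0)                             # average over windows
```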
S303, extracting image characteristic vectors from the picture information of the entity.
In specific implementation, the image feature vector can be extracted from the entity's picture information with a trained image feature extraction model. For example, a trained DenseNet (Densely Connected Convolutional Network) can be used to extract the image feature vector, so that no additional training of the image feature extraction model is required. The network structure of DenseNet, shown in fig. 5, includes a plurality of convolution layers, a plurality of DenseBlock layers, a plurality of pooling layers, and a linear output layer. The picture information is input into the DenseNet, and after processing by these network layers, the 1024-dimensional vector output by the last pooling layer is used as the image feature vector of the picture; for example, the image feature vector may be [1.013232, 1.03242, 0.434343, …, 0.3823243].
Of course, in practical application, other Convolutional Neural Networks (CNNs) may also be used as the image feature extraction model, and the embodiment of the present application is not limited.
In specific implementation, steps S302 and S303 may be executed in either order.
And S304, obtaining an embedded vector of the entity based on the text feature vector and the image feature vector.
In specific implementation, the entity's text feature vector and image feature vector can be directly concatenated, and the concatenated vector used as the entity's embedded vector. For example, if entity A has the text feature vector [w1, w2, …, wn] and the image feature vector [v1, v2, …, vm], the two are concatenated into the (n+m)-dimensional vector [w1, w2, …, wn, v1, v2, …, vm], which serves as the embedded vector of entity A.
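The concatenation step is straightforward; a one-line sketch (the function name is illustrative):

```python
import numpy as np

def entity_embedding(text_vec, image_vec):
    """Join the n-dim text feature and m-dim image feature into one
    (n+m)-dim embedded vector for the entity."""
    return np.concatenate([text_vec, image_vec])
```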
In the embodiment shown in fig. 3, the embedded vector of an entity is obtained from both its text information and its picture information, which enriches the entity's attribute information, so that the extracted embedded vector covers multiple aspects of the entity and the accuracy of entity alignment is further improved.
In practical applications, the embedded vector of the entity may also be obtained based on only the text information of the entity, that is, the text feature vector extracted from the text information of the entity is directly used as the embedded vector of the entity.
On the basis of any one of the above embodiments, in consideration of the relationship data describing the association relationship between the entities in the knowledge graph and the huge data amount of the knowledge graph, the feature vectors of the entities can be extracted based on the graph convolution neural network. Specifically, the propagation layer for extracting the feature vectors of the entities comprises at least one layer of graph convolution neural network, an embedded vector matrix and an adjacent matrix of the knowledge graph are input into the trained propagation layer, the feature vectors of the entities in the knowledge graph are obtained, the embedded vector matrix of the knowledge graph comprises the embedded vectors of the entities in the knowledge graph, and the adjacent matrix of the knowledge graph is used for representing the association relation between the entities in the knowledge graph.
In specific implementation, for a knowledge graph KGi, all relation triples thereof are traversed, each entity in the knowledge graph KGi is used as a node, and an edge is added between every two entities having a relationship, so as to obtain an undirected graph Gi corresponding to the knowledge graph KGi; the adjacency matrix Ai is then obtained from the undirected graph Gi. Referring to fig. 6, the undirected graph corresponds to a knowledge graph in which there are association relationships between entity A and entity B, entity A and entity D, entity E and entity B, entity E and entity D, and entity E and entity C. If there is an association relationship between two entities, the corresponding value in the adjacency matrix is 1; if there is no association relationship between two entities, the corresponding value in the adjacency matrix is 0.
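A minimal sketch of building such an adjacency matrix from relation triples, using the fig. 6 entities (the function and variable names are assumptions for illustration):

```python
import numpy as np

def adjacency_from_triples(entities, triples):
    """Build a symmetric 0/1 adjacency matrix: an edge is added between
    the head and tail entities of every relation triple."""
    index = {e: i for i, e in enumerate(entities)}
    A = np.zeros((len(entities), len(entities)), dtype=int)
    for head, _relation, tail in triples:
        i, j = index[head], index[tail]
        A[i, j] = A[j, i] = 1  # undirected graph, so set both directions
    return A

# The entity names and association relationships mirror the fig. 6 example.
entities = ["A", "B", "C", "D", "E"]
triples = [("A", "r1", "B"), ("A", "r2", "D"),
           ("E", "r3", "B"), ("E", "r4", "D"), ("E", "r5", "C")]
A = adjacency_from_triples(entities, triples)
print(A)
```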
The input of each layer of graph convolutional neural network in the propagation layer comprises an input matrix H(l) and the adjacency matrix A, and the output matrix of the graph convolutional neural network can be represented as H(l+1) = f(H(l), A). Each layer of graph convolutional neural network in the propagation layer obtains an output matrix based on the received input matrix and the adjacency matrix, and the output matrix of each layer comprises an intermediate feature vector of each entity. In the output matrix H(l+1), the intermediate feature vector of any entity is obtained based on the vector corresponding to that entity in the input matrix H(l) and the vectors, in the input matrix, of the at least one entity characterized in the adjacency matrix A as having an association relationship with that entity. The input matrix H(1) received by the first-layer graph convolutional neural network is the embedded vector matrix of the knowledge graph, and the input matrix received by each subsequent graph convolutional neural network is the output matrix of the previous-layer graph convolutional neural network; for example, the input matrix of the second-layer graph convolutional neural network is the output matrix H(2) = f(H(1), A) of the first layer, and the input matrix of the third-layer graph convolutional neural network is the output matrix H(3) = f(H(2), A) of the second layer.
And obtaining the characteristic vector of each entity in the knowledge graph input into the propagation layer based on the output matrix of the last graph convolution neural network in the propagation layer.
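The propagation function f(H(l), A) is not spelled out here; a common instantiation (an assumption in this sketch, following the widely used graph-convolution formulation with self-loops and symmetric normalization) is:

```python
import numpy as np

def gcn_layer(H, A, W):
    """One graph-convolution step: each entity's intermediate feature
    vector mixes its own input vector with those of its neighbors in A."""
    A_hat = A + np.eye(A.shape[0])        # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(d ** -0.5)       # symmetric degree normalization
    # ReLU of the normalized neighborhood aggregation times the weights W
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0)

rng = np.random.default_rng(0)
A = np.array([[0, 1], [1, 0]])            # two associated entities
H1 = rng.standard_normal((2, 4))          # embedded vector matrix H(1)
W1 = rng.standard_normal((4, 4))          # layer weights (learned in training)
H2 = gcn_layer(H1, A, W1)                 # output matrix H(2) = f(H(1), A)
print(H2.shape)  # (2, 4)
```

Stacking such layers, with each layer's output fed to the next, reproduces the propagation described above.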
Extracting the feature vectors of the entities based on the graph convolutional neural network allows the feature vectors of all the entities in the knowledge graph to be obtained quickly in batches, which improves the processing efficiency of entity alignment. Meanwhile, the graph convolutional neural network fuses the information of each entity with that of its surrounding associated entities, which further assists entity alignment.
Further, a layer of residual network may be connected after each layer of the graph convolution neural network in the propagation layer. Based on the above, in the propagation layer, any residual network except the last layer of residual network fuses the input matrix and the output matrix of the convolutional neural network of the previous layer of graph to obtain the input matrix of the convolutional neural network of the next layer of graph; in the propagation layer, the last layer of residual error network fuses the input matrix and the output matrix of the convolutional neural network of the last layer of graph to obtain the characteristic vector of each entity in any knowledge graph.
Specifically, the residual network may directly add the corresponding input matrix and output matrix to obtain a fusion result, i.e., the vectors corresponding to the same entity in the input matrix and the output matrix are added to obtain the fusion result. For example, if the input matrix is [u1, u2, u3] and the output matrix is [u'1, u'2, u'3], then the fused matrix is [u1+u'1, u2+u'2, u3+u'3]. Alternatively, the residual network may weight the input matrix and the output matrix to obtain the fusion result; for example, if the input matrix is [u1, u2, u3] and the output matrix is [u'1, u'2, u'3], then the fused matrix is [αu1+βu'1, αu2+βu'2, αu3+βu'3]. The weight values α and β may be determined according to application requirements. Generally, to ensure that the feature vector of each entity learned by the propagation layer retains more of the entity's own attribute information, α may be greater than β, for example, α = 1, β = 0.5.
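The weighted fusion above can be sketched as follows (α = 1, β = 0.5 follow the example values in the text; the function name is an assumption):

```python
import numpy as np

def residual_fuse(H_in, H_out, alpha=1.0, beta=0.5):
    """Weighted fusion of a graph-convolution layer's input and output
    matrices; alpha > beta keeps more of each entity's own attributes."""
    return alpha * H_in + beta * H_out

H_in = np.array([[1.0, 2.0], [3.0, 4.0]])   # layer input matrix
H_out = np.array([[0.2, 0.4], [0.6, 0.8]])  # layer output matrix
fused = residual_fuse(H_in, H_out)
print(fused)  # [[1.1 2.2] [3.3 4.4]]
```

Setting alpha = beta = 1 recovers the direct-addition variant described first.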
Referring to fig. 7, a schematic diagram of the structure of a propagation layer including a two-layer graph convolutional neural network and a two-layer residual network is shown. Given an embedded vector matrix H(1) of the knowledge graph and an adjacency matrix A, the input of the first-layer graph convolutional neural network is X(1) = H(1) together with the adjacency matrix A; the first-layer graph convolutional neural network obtains an output matrix H(2) = f(H(1), A) based on H(1) and the adjacency matrix A, and the first-layer residual network adds the input matrix H(1) and the output matrix H(2) of the first-layer graph convolutional neural network to obtain a matrix X(2), which is input into the second-layer graph convolutional neural network. The input of the second-layer graph convolutional neural network is X(2) and the adjacency matrix A; it obtains an output matrix H(3) = f(X(2), A) based on X(2) and the adjacency matrix A, and the second-layer residual network adds the input matrix X(2) and the output matrix H(3) of the second-layer graph convolutional neural network to obtain a matrix X(3), which comprises the feature vectors of all the entities in the knowledge graph.
By adding a residual error network behind each graph convolution neural network, each entity can learn the attribute information of the peripheral associated entities after passing through each graph convolution neural network, and meanwhile, the attribute information of the entity cannot be changed too much, so that the characteristic vector of the entity well retains the attribute characteristics of the entity and also contains the attribute characteristics of other peripheral associated entities.
In particular, referring to fig. 8, the propagation layer may be trained by:
s801, respectively inputting the embedded vector matrix and the adjacent matrix corresponding to the two knowledge graphs in the training sample into the propagation layer, and obtaining the characteristic vector of each entity in the two knowledge graphs.
In specific implementation, each training sample comprises two knowledge graphs and labeling information aiming at the two knowledge graphs, wherein the labeling information comprises the same entity in the two knowledge graphs. According to the method provided by the embodiment of the application, the embedded vector matrix and the adjacent matrix corresponding to each knowledge graph are determined, and are not described again.
S802, determining the same entities in the two knowledge maps according to the labeling information of the training samples, taking the feature vectors of the two same entities as positive samples, and taking the feature vectors of the two entities which are not the same entities as negative samples.
In specific implementation, the entities belonging to positive samples are first removed from the two knowledge graphs of the training sample; then an entity is randomly extracted from each of the two knowledge graphs from which the positive samples have been removed, and the feature vectors of the two extracted entities are combined into a negative sample. Alternatively, assuming that the two knowledge graphs of the training sample are KG1 and KG2, respectively: the entities belonging to positive samples may be removed from KG1, one entity randomly extracted from the remaining entities of KG1, one entity then randomly extracted from KG2, and the feature vectors of the two extracted entities combined into one negative sample; or the entities belonging to positive samples may be removed from KG2, one entity randomly extracted from the remaining entities of KG2, one entity then randomly extracted from KG1, and the feature vectors of the two extracted entities combined into one negative sample.
In practical application, a plurality of negative samples may be randomly extracted, specifically, the number of the negative samples is positively correlated with the number of the positive samples, for example, the number of the negative samples may be 2 to 5 times of the number of the positive samples, and the specific multiple may be determined according to a practical application scenario, which is not limited in the embodiment of the present application.
For example, a training sample includes knowledge graph KG1 and knowledge graph KG2, and the corresponding annotation information includes: the entity E11 in KG1 and the entity E21 in KG2 are the same entity, and the entity E14 in KG1 and the entity E22 in KG2 are the same entity; KG1 includes 9 entities E11-E19, and KG2 includes 8 entities E21-E28. According to the annotation information, the feature vector of E11 and the feature vector of E21 are taken as one positive sample, and the feature vector of E14 and the feature vector of E22 are taken as another positive sample. Then, negative samples are randomly drawn from KG1 and KG2 from which the positive samples have been removed; for example, the feature vector of E11 and the feature vector of E22 may form a negative sample, the feature vector of E11 and the feature vector of E23 may form a negative sample, the feature vector of E12 and the feature vector of E22 may form a negative sample, and so on.
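One reading of the sampling scheme described in this section can be sketched as follows (the stricter variant that excludes all positive-sample entities from both graphs; the function name, ratio default and seed are assumptions):

```python
import random

def sample_negatives(kg1_entities, kg2_entities, positive_pairs,
                     ratio=2, seed=0):
    """Randomly pair entities across the two graphs after removing the
    entities that appear in positive pairs; the number of negatives is
    positively correlated with the number of positives (ratio times)."""
    rng = random.Random(seed)
    pos1 = {p for p, _ in positive_pairs}
    pos2 = {q for _, q in positive_pairs}
    pool1 = [e for e in kg1_entities if e not in pos1]
    pool2 = [e for e in kg2_entities if e not in pos2]
    return [(rng.choice(pool1), rng.choice(pool2))
            for _ in range(ratio * len(positive_pairs))]

kg1 = [f"E1{i}" for i in range(1, 10)]   # E11..E19
kg2 = [f"E2{i}" for i in range(1, 9)]    # E21..E28
positives = [("E11", "E21"), ("E14", "E22")]
negatives = sample_negatives(kg1, kg2, positives)
print(len(negatives))  # 4, i.e. 2x the number of positives
```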
And S803, obtaining a loss function value based on the obtained positive sample and negative sample.
In specific implementation, the loss function value can be calculated by using the following loss function:
$$L = \sum_{(p,q)\in C} d(p,q) - \sum_{(p',q')\in C'} d(p',q')$$
where C is a set of positive samples, C ' is a set of negative samples, (p, q) represents a positive sample, (p ', q ') represents a negative sample, d (p, q) represents a distance value of two feature vectors p and q in the positive sample, and d (p ', q ') represents a distance value of two feature vectors p ' and q ' in the negative sample.
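A hedged sketch of computing such a loss value, consistent with the definitions of C, C' and d given here (taking the loss as the sum of positive-pair distances minus the sum of negative-pair distances; Euclidean distance is an assumed choice for d):

```python
import numpy as np

def distance(p, q):
    """Euclidean distance between two feature vectors (an assumed choice)."""
    return float(np.linalg.norm(np.asarray(p) - np.asarray(q)))

def alignment_loss(positives, negatives):
    """Sum of distances over positive pairs minus the sum over negative
    pairs: minimizing it pulls positives together, pushes negatives apart."""
    return (sum(distance(p, q) for p, q in positives)
            - sum(distance(p, q) for p, q in negatives))

positives = [([0.0, 0.0], [0.0, 1.0])]   # distance 1
negatives = [([0.0, 0.0], [3.0, 4.0])]   # distance 5
loss = alignment_loss(positives, negatives)
print(loss)  # -4.0
```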
S804, updating the weight parameter of each layer of graph convolution neural network in the propagation layer based on the loss function value.
In specific implementation, the weight parameters of the graph convolutional neural network can be updated by adopting common algorithms such as a gradient descent method and the like, so that the distance value of the two eigenvectors in the positive sample is smaller and smaller, and the distance value of the two eigenvectors in the negative sample is larger and larger, and the specific process is not repeated.
And S805, judging whether an ending condition is met, if so, ending the training, otherwise, returning to the step S801, and continuing to train the propagation layer based on the training sample.
The ending condition may be that the loss function value is smaller than a preset value, or that the number of iterations exceeds a preset number. Before the ending condition is met, the iterative process of steps S801-S805 is executed in a loop, and the graph convolutional neural network is trained through continuous iteration. When returning to step S801, training may be continued based on the training samples used before, or new training samples may be used instead.
Referring to fig. 9, comparative test data of the entity alignment method provided in the embodiment of the present application and several existing entity alignment methods are shown. The existing entity alignment methods comprise AttrE and Multi-view, as well as a method performing entity alignment using N-grams alone. Based on the entity alignment method provided by the embodiment of the application, multiple modes such as GCN+BOW, GCN+LSTM, GCN+N-gram and GCN+N-gram+Image were tested. GCN+BOW refers to extracting the feature vector of an entity based on GCN and extracting the text feature vector of the entity based on BOW; GCN+LSTM refers to extracting the feature vector of an entity based on GCN and extracting the text feature vector of the entity based on LSTM; GCN+N-gram refers to extracting the feature vector of an entity based on GCN and extracting the text feature vector of the entity based on N-gram. In these three modes, the attribute information of the entity only contains text information, so the extracted text feature vector is directly used as the embedded vector of the entity. GCN+N-gram+Image refers to extracting the text feature vector of an entity based on N-gram, extracting the image feature vector of the entity based on DenseNet, and extracting the feature vector of the entity based on GCN.
For any entity alignment method listed in fig. 9, multiple groups of test data are tested to obtain the test results of each group; each group of test data includes two knowledge graphs, with the same entities in the two knowledge graphs labeled. Taking the GCN+N-gram+Image method as an example, the test process for one group of data is as follows. First, the feature vector of each entity in each knowledge graph in the group of test data is acquired based on the GCN+N-gram+Image method. Then, for an entity E1 in a knowledge graph KG1 that is labeled as having a counterpart, the distance values between the feature vector of entity E1 and the feature vectors of all the entities in the other knowledge graph KG2 are calculated respectively. Next, all the entities in the knowledge graph KG2 are sorted in ascending order of distance value, and the rank of the entity E2 in the knowledge graph KG2 is determined, where the entity E1 in KG1 and the entity E2 in KG2 are labeled in advance as the same entity. If entity E2 ranks within the top 1%, the correct count N1 of the GCN+N-gram+Image method on hit@1 is incremented by 1; if entity E2 ranks within the top 10%, the correct count N10 on hit@10 is incremented by 1; and if entity E2 ranks within the top 50%, the correct count N50 on hit@50 is incremented by 1. After M groups of test data are tested in this manner, the accuracy of the GCN+N-gram+Image method on hit@1 is N1/M, the accuracy on hit@10 is N10/M, and the accuracy on hit@50 is N50/M. Referring to fig. 9, the accuracy of the GCN+N-gram+Image method at hit@1 is 84.60%, indicating a probability of 84.60% that the same entity falls in the top 1% when aligning entities based on the GCN+N-gram+Image method; the accuracy at hit@10 is 90.82%, indicating that the probability of the same entity falling in the top 10% is 90.82%; and the accuracy at hit@50 is 95.26%, indicating that the probability of the same entity falling in the top 50% is 95.26%.
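The hit@k check described above can be sketched for a single query entity as follows (the function name and the rounding rule for the top-k% cutoff are assumptions for illustration):

```python
def hit_at_k_percent(distances_to_kg2, true_entity, k_percent):
    """distances_to_kg2: {KG2 entity: distance to the KG1 query entity}.
    Returns True when the labeled counterpart ranks within the top
    k_percent of KG2 entities sorted by ascending distance."""
    ranked = sorted(distances_to_kg2, key=distances_to_kg2.get)
    cutoff = max(1, round(len(ranked) * k_percent / 100))
    return true_entity in ranked[:cutoff]

# Query entity E1 against a 10-entity KG2; E2 is the labeled match.
dists = {f"X{i}": float(i) for i in range(1, 10)}
dists["E2"] = 0.5                         # E2 is the closest entity
print(hit_at_k_percent(dists, "E2", 10))  # True
```

Counting such hits over all labeled entities and dividing by the number of groups M yields the accuracies reported in fig. 9.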
Based on the data in FIG. 9, the GCN + N-gram is much more effective than the N-gram alone; in the three methods of GCN + BOW, GCN + LSTM and GCN + N-gram, the GCN + N-gram has the best effect, and the N-gram is beneficial to improving the extraction accuracy of the text feature vector; the GCN + N-gram + Image increases picture information on the basis of the GCN + N-gram, and further improves the entity alignment accuracy, especially on hit @10 and hit @ 50.
The process of aligning the entities of the game is described below by taking the knowledge graph of the game field as an example.
First, a game knowledge graph corresponding to each data source is obtained. The specific process comprises the following steps: relevant data of games is crawled from various data sources such as game libraries, game distribution platforms, Baidu Encyclopedia and Wikipedia, and the relevant data is cleaned to obtain the attribute information of each game in the different data sources; a corresponding game knowledge graph is obtained based on the data collected from each data source, with each game in a data source serving as an entity in the corresponding game knowledge graph; and the adjacency matrix of the game knowledge graph corresponding to a data source is obtained according to the association relationships among the games in that data source. Taking fig. 4 as an example, it shows the relevant data of game A crawled from game website B. The relevant data includes descriptive text for game A, such as the game name, game developer, game copyright and game introduction; this text is cleaned to obtain the text information of the entity "game A". The relevant data also includes pictures in game A, which are used as the picture information, so that the attribute information of game A in the data source of game website B is obtained. In practical application, the relevant data of a game is not limited to the information shown in fig. 4 and may also include the game version, game type, systems supported by the game, release year, and the like. By this method, game knowledge graphs from different data sources can be obtained, and the game knowledge graph corresponding to each data source comprises the attribute information of the entities and an adjacency matrix characterizing the associations between the entities.
Then, for each entity in any game knowledge graph, a feature vector of the entity is obtained based on the embedded vector of the entity and the embedded vectors of the entities having an association relationship with the entity in the game knowledge graph, where the embedded vector of the entity is obtained based on the attribute information of the entity in the game knowledge graph to which it belongs. The same entities in any two game knowledge graphs are then determined based on the feature vectors of the entities. Specifically, the distance values between the feature vector of any entity E1 in the game knowledge graph KG1 and the feature vectors of all the entities in the game knowledge graph KG2 may be calculated; if the minimum distance value is less than a distance threshold, it is determined that the two entities corresponding to the minimum distance value are the same entity, and if the minimum distance value is not less than the distance threshold, it is determined that no entity identical to entity E1 of the game knowledge graph KG1 exists in the game knowledge graph KG2. For a more specific implementation, reference may be made to the entity alignment manner provided in the present application, which is not described again.
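A minimal sketch of this nearest-neighbor-with-threshold decision, assuming the reading that a smaller distance means more similar feature vectors (entity names, threshold value and Euclidean distance are assumptions for illustration):

```python
import numpy as np

def align_entity(e1_vec, kg2_vectors, threshold):
    """Return the KG2 entity closest to e1_vec, or None when even the
    nearest one is farther than the distance threshold (i.e. no
    identical entity is considered to exist in KG2)."""
    best, best_d = None, float("inf")
    for name, vec in kg2_vectors.items():
        d = float(np.linalg.norm(np.asarray(e1_vec) - np.asarray(vec)))
        if d < best_d:
            best, best_d = name, d
    return best if best_d <= threshold else None

kg2 = {"game_x": [1.0, 0.0], "game_y": [0.0, 5.0]}
print(align_entity([0.9, 0.1], kg2, threshold=0.5))    # game_x
print(align_entity([10.0, 10.0], kg2, threshold=0.5))  # None
```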
Finally, the same entities are fused into one entity, so that a plurality of game knowledge graphs are fused into one game knowledge graph. In practical application, the fused game knowledge graph merges the attribute information of the same entity from the different knowledge graphs, effectively integrating the attribute information and thereby facilitating user queries.
Referring to fig. 10A, a schematic interface diagram of a game website after merging a plurality of data sources is shown. After the user logs in to the game knowledge graph website, a search keyword, such as a game name, can be input in the search box, and the server of the game website searches for related games from the fused knowledge graph based on the search keyword input by the user and feeds the results back to the user. The user can also set search conditions, such as the game release platform, the game platform and the release year; the server of the game website searches for games meeting the search conditions from the fused knowledge graph based on the conditions set by the user and feeds the results back to the user. The user clicks a game in the game search result column to obtain the attribute information of the game, which is derived from the fused game knowledge graph. Referring to fig. 10A, the user clicks the icon corresponding to game A in the game search result column to open the detail interface corresponding to game A, as shown in fig. 10B; in this detail interface the user can obtain the attribute information of game A, including the game name, game type, game developer, game profile and the like, and can download the application program corresponding to game A by clicking the game download control.
The knowledge graph of the game website carries out entity alignment processing on the games in each data source, namely the same games of different data sources are fused into one entity, and integration of the games of different sources is realized.
Referring to fig. 10C, it is a schematic diagram of an interface of the game platform in the applet after merging multiple data sources. The applet background can sort the games based on the attribute information of the games in the fused game knowledge graph to obtain various ranking lists; for example, ranking according to the daily average usage, the download volume, the score and other information of the games generates lists such as a popularity list, a potential list and an anticipation list for the user to view. Referring to fig. 10D, the user may input a search keyword, such as a game name or a game type, in the search box of the applet; the applet background searches for related games from the fused knowledge graph based on the search keyword input by the user and displays the search results in the applet. The execution manner of the applet background is similar to the background execution manner of the game website shown in fig. 10A, and is not described again.
As shown in fig. 11, based on the same inventive concept as the entity alignment method, the embodiment of the present application further provides an entity alignment apparatus 110, which includes a feature vector obtaining module 1101 and an alignment module 1102.
A feature vector obtaining module 1101, configured to obtain, for each entity in any one of the knowledge maps, a feature vector of the entity based on an embedded vector of the entity and an embedded vector of an entity having an association relationship with the entity in any one of the knowledge maps, where the embedded vector of the entity is obtained based on attribute information of the entity;
an alignment module 1102, configured to determine the same entity in any two knowledge graphs based on the feature vectors of the respective entities.
Optionally, the entity aligning apparatus 110 further includes an embedded vector obtaining module 1103, configured to obtain an embedded vector of the entity by:
acquiring attribute information of an entity from a knowledge graph to which the entity belongs, wherein the attribute information comprises text information describing attributes of the entity and picture information of the entity;
extracting text characteristic vectors from text information of an entity;
extracting image characteristic vectors from the picture information of the entity;
based on the text feature vector and the image feature vector, an embedded vector of the entity is obtained.
Optionally, the embedded vector obtaining module 1103 is specifically configured to:
respectively converting each character in the text information of the entity into corresponding initial vectors;
and obtaining a text feature vector containing context relation information based on the initial vector of each character.
Optionally, the feature vector obtaining module 1101 is specifically configured to:
inputting the embedded vector matrix and the adjacent matrix of any knowledge graph into a trained propagation layer, wherein the propagation layer comprises at least one layer of graph convolution neural network, the embedded vector matrix of the knowledge graph comprises embedded vectors of all entities in the knowledge graph, and the adjacent matrix of the knowledge graph is used for representing the incidence relation among all the entities in the knowledge graph;
each layer of graph convolution neural network obtains an output matrix and outputs the output matrix based on a received input matrix and an adjacent matrix, wherein the input matrix received by the first layer of graph convolution neural network is an embedded vector matrix, the input matrix received by other graph convolution neural networks is an output matrix of the previous layer of graph convolution neural network, the output matrix of each layer of graph convolution neural network comprises intermediate characteristic vectors of all entities, and the intermediate characteristic vector of any entity in each output matrix is obtained based on a vector corresponding to any entity in the input matrix and a vector corresponding to an entity which is characterized in the adjacent matrix and has an incidence relation with any entity in the input matrix;
and obtaining the characteristic vector of each entity in any knowledge graph based on the output matrix of the last layer of graph convolution neural network.
Optionally, a layer of residual network is connected after each layer of graph convolution neural network.
Accordingly, the feature vector acquisition module 1101 is further configured to:
any residual network except the last layer of residual network fuses the input matrix and the output matrix of the convolutional neural network of the previous layer of graph to obtain the input matrix of the convolutional neural network of the next layer of graph;
obtaining the feature vector of each entity in any knowledge graph based on the output matrix of the last graph convolution neural network, which specifically comprises the following steps:
and the last-layer residual network fuses the input matrix and the output matrix of the last-layer graph convolutional neural network to obtain the feature vector of each entity in any knowledge graph.
Optionally, the propagation layer is trained by:
respectively inputting the embedded vector matrix and the adjacent matrix corresponding to the two knowledge maps in the training sample into a propagation layer to obtain the characteristic vector of each entity in the two knowledge maps;
determining the same entities in the two knowledge maps according to the labeling information of the training samples, taking the feature vectors of the two same entities as positive samples, and taking the feature vectors of the two entities which are not the same entities as negative samples;
obtaining a loss function value based on the obtained positive and negative samples;
and updating the weight parameters of the convolution neural network of each layer of the graph in the propagation layer based on the loss function value.
The entity alignment device and the entity alignment method provided by the embodiment of the application adopt the same inventive concept, can obtain the same beneficial effects, and are not repeated herein.
Based on the same inventive concept as the entity alignment method, the embodiment of the present application further provides an electronic device, which may specifically be the processing device 101 shown in fig. 1. As shown in fig. 12, the electronic device 120 may include a processor 1201 and a memory 1202.
The Processor 1201 may be a general-purpose Processor, such as a Central Processing Unit (CPU), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, or a discrete hardware component, which may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present Application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in a processor.
Memory 1202, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The Memory may include at least one type of storage medium, and may include, for example, a flash Memory, a hard disk, a multimedia card, a card-type Memory, a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Programmable Read Only Memory (PROM), a Read Only Memory (ROM), a charged Erasable Programmable Read Only Memory (EEPROM), a magnetic Memory, a magnetic disk, an optical disk, and so on. The memory is any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to such. The memory 1202 in the embodiments of the present application may also be circuitry or any other device capable of performing a storage function for storing program instructions and/or data.
Embodiments of the present application provide a computer-readable storage medium for storing computer program instructions for the electronic device, which includes a program for executing the entity alignment method.
The computer storage media may be any available media or data storage device that can be accessed by a computer, including but not limited to magnetic memory (e.g., floppy disks, hard disks, magnetic tape, magneto-optical disks (MOs), etc.), optical memory (e.g., CDs, DVDs, BDs, HVDs, etc.), and semiconductor memory (e.g., ROMs, EPROMs, EEPROMs, non-volatile memory (NAND FLASH), Solid State Disks (SSDs)), etc.
The above embodiments are only used to describe the technical solutions of the present application in detail, but the above embodiments are only used to help understanding the method of the embodiments of the present application, and should not be construed as limiting the embodiments of the present application. Modifications and substitutions that may be readily apparent to those skilled in the art are intended to be included within the scope of the embodiments of the present application.

Claims (14)

1. A method of entity alignment, comprising:
for each entity in any knowledge graph, obtaining a feature vector of the entity based on an embedding vector of the entity and the embedding vectors of the entities associated with the entity in the knowledge graph, wherein the embedding vector of an entity is obtained based on attribute information of the entity;
and determining identical entities in any two knowledge graphs based on the feature vector of each entity.
2. The method of claim 1, wherein the embedding vector of the entity is obtained by:
acquiring attribute information of the entity from the knowledge graph to which the entity belongs, wherein the attribute information comprises text information describing attributes of the entity and picture information of the entity;
extracting a text feature vector from the text information of the entity;
extracting an image feature vector from the picture information of the entity;
and obtaining the embedding vector of the entity based on the text feature vector and the image feature vector.
3. The method according to claim 2, wherein extracting the text feature vector from the text information of the entity specifically comprises:
converting each character in the text information of the entity into a corresponding initial vector;
and obtaining a text feature vector containing contextual relationship information based on the initial vectors of the characters.
4. The method according to any one of claims 1 to 3, wherein obtaining, for each entity in any knowledge graph, the feature vector of the entity based on the embedding vector of the entity and the embedding vectors of the entities associated with the entity in the knowledge graph specifically comprises:
inputting the embedding vector matrix and the adjacency matrix of the knowledge graph into a trained propagation layer, wherein the propagation layer comprises at least one graph convolutional neural network layer, the embedding vector matrix of the knowledge graph comprises the embedding vectors of all entities in the knowledge graph, and the adjacency matrix of the knowledge graph represents the association relations among the entities in the knowledge graph;
obtaining, by each graph convolutional neural network layer, an output matrix based on the received input matrix and the adjacency matrix, and outputting the output matrix, wherein the input matrix received by the first graph convolutional neural network layer is the embedding vector matrix, the input matrix received by each subsequent graph convolutional neural network layer is the output matrix of the previous layer, and the output matrix of each graph convolutional neural network layer comprises intermediate feature vectors of the entities;
and obtaining the feature vector of each entity in the knowledge graph based on the output matrix of the last graph convolutional neural network layer.
5. The method of claim 4, wherein each graph convolutional neural network layer is followed by a residual network layer;
the method further comprises:
fusing, by any residual network except the last residual network, the input matrix and the output matrix of the preceding graph convolutional neural network layer to obtain the input matrix of the next graph convolutional neural network layer;
and obtaining the feature vector of each entity in the knowledge graph based on the output matrix of the last graph convolutional neural network layer specifically comprises:
fusing, by the last residual network, the input matrix and the output matrix of the last graph convolutional neural network layer to obtain the feature vector of each entity in the knowledge graph.
6. The method of claim 4, wherein the propagation layer is trained by:
respectively inputting the embedding vector matrices and the adjacency matrices corresponding to two knowledge graphs in a training sample into the propagation layer, to obtain the feature vector of each entity in the two knowledge graphs;
determining identical entities in the two knowledge graphs according to labeling information of the training sample, taking the feature vectors of two identical entities as a positive sample, and taking the feature vectors of two entities that are not identical as a negative sample;
obtaining a loss function value based on the obtained positive and negative samples;
and updating the weight parameters of each graph convolutional neural network layer in the propagation layer based on the loss function value.
7. An entity alignment apparatus, comprising:
a feature vector acquisition module, configured to obtain, for each entity in any knowledge graph, a feature vector of the entity based on an embedding vector of the entity and the embedding vectors of the entities associated with the entity in the knowledge graph, wherein the embedding vector of an entity is obtained based on attribute information of the entity;
and an alignment module, configured to determine identical entities in any two knowledge graphs based on the feature vector of each entity.
8. The apparatus of claim 7, further comprising an embedding vector acquisition module configured to obtain the embedding vector of the entity by:
acquiring attribute information of the entity from the knowledge graph to which the entity belongs, wherein the attribute information comprises text information describing attributes of the entity and picture information of the entity;
extracting a text feature vector from the text information of the entity;
extracting an image feature vector from the picture information of the entity;
and obtaining the embedding vector of the entity based on the text feature vector and the image feature vector.
9. The apparatus of claim 8, wherein the embedding vector acquisition module is specifically configured to:
convert each character in the text information of the entity into a corresponding initial vector;
and obtain a text feature vector containing contextual relationship information based on the initial vectors of the characters.
10. The apparatus according to any one of claims 7 to 9, wherein the feature vector acquisition module is specifically configured to:
input the embedding vector matrix and the adjacency matrix of the knowledge graph into a trained propagation layer, wherein the propagation layer comprises at least one graph convolutional neural network layer, the embedding vector matrix of the knowledge graph comprises the embedding vectors of all entities in the knowledge graph, and the adjacency matrix of the knowledge graph represents the association relations among the entities in the knowledge graph;
obtain, by each graph convolutional neural network layer, an output matrix based on the received input matrix and the adjacency matrix, and output the output matrix, wherein the input matrix received by the first graph convolutional neural network layer is the embedding vector matrix, the input matrix received by each subsequent graph convolutional neural network layer is the output matrix of the previous layer, and the output matrix of each graph convolutional neural network layer comprises intermediate feature vectors of the entities;
and obtain the feature vector of each entity in the knowledge graph based on the output matrix of the last graph convolutional neural network layer.
11. The apparatus of claim 10, wherein each graph convolutional neural network layer is followed by a residual network layer;
the feature vector acquisition module is further configured to:
fuse, by any residual network except the last residual network, the input matrix and the output matrix of the preceding graph convolutional neural network layer to obtain the input matrix of the next graph convolutional neural network layer;
wherein obtaining the feature vector of each entity in the knowledge graph based on the output matrix of the last graph convolutional neural network layer specifically comprises:
fusing, by the last residual network, the input matrix and the output matrix of the last graph convolutional neural network layer to obtain the feature vector of each entity in the knowledge graph.
12. The apparatus of claim 10, wherein the propagation layer is trained by:
respectively inputting the embedding vector matrices and the adjacency matrices corresponding to two knowledge graphs in a training sample into the propagation layer, to obtain the feature vector of each entity in the two knowledge graphs;
determining identical entities in the two knowledge graphs according to labeling information of the training sample, taking the feature vectors of two identical entities as a positive sample, and taking the feature vectors of two entities that are not identical as a negative sample;
obtaining a loss function value based on the obtained positive and negative samples;
and updating the weight parameters of each graph convolutional neural network layer in the propagation layer based on the loss function value.
13. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 6.
14. A computer-readable storage medium having computer program instructions stored thereon, which, when executed by a processor, implement the steps of the method of any one of claims 1 to 6.
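By way of non-limiting illustration (not part of the claims), the embedding-vector construction described in claims 2 and 3 — converting each character of the entity's text information into an initial vector, obtaining a text feature vector from those initial vectors, and fusing it with an image feature vector — can be sketched as follows. All function names, dimensions, and the mean-pooling and concatenation choices are assumptions for demonstration; the claims do not prescribe a particular text encoder or fusion rule (in practice a context-aware encoder such as a BiLSTM and a learned fusion would be used):

```python
import numpy as np

def char_vectors(text, dim=8, seed=0):
    """Map each character of the entity's text to an initial vector.
    The random lookup table stands in for a learned character embedding table."""
    rng = np.random.default_rng(seed)
    table = {}
    return np.stack([table.setdefault(c, rng.normal(size=dim)) for c in text])

def text_feature(initial_vectors):
    """Pool the per-character initial vectors into one text feature vector.
    Mean pooling stands in for a context-aware encoder."""
    return initial_vectors.mean(axis=0)

def entity_embedding(text_vec, image_vec):
    """Fuse the text and image feature vectors into the entity embedding,
    here by simple concatenation (an assumed fusion rule)."""
    return np.concatenate([text_vec, image_vec])

chars = char_vectors("entity description")        # (num_chars, 8)
emb = entity_embedding(text_feature(chars), np.ones(4))  # text dim 8 + image dim 4
```

The resulting `emb` plays the role of the per-entity embedding vector that is later stacked row-wise into the embedding vector matrix fed to the propagation layer.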
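As a further non-limiting illustration, the propagation layer of claims 4 and 5 — stacked graph convolutional layers over an adjacency matrix, with a residual network after each layer fusing that layer's input and output — could look like the minimal sketch below. The activation (ReLU), the symmetric adjacency normalization with self-loops, the additive residual fusion, and the square weight matrices (so that input and output dimensions match for the residual step) are all assumed choices, not the patented implementation:

```python
import numpy as np

def normalize_adj(A):
    """Symmetrically normalize the adjacency matrix after adding self-loops."""
    A = A + np.eye(A.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    return d_inv_sqrt @ A @ d_inv_sqrt

def gcn_layer(H, A_hat, W):
    """One graph-convolution layer: aggregate associated entities' vectors
    via the normalized adjacency matrix, transform, apply a nonlinearity."""
    return np.maximum(A_hat @ H @ W, 0.0)  # ReLU

def propagate(X, A, weights):
    """Stack GCN layers; after each layer a residual step fuses the layer's
    input and output (additively here) before the next layer.
    Rows of the returned matrix are the entities' feature vectors."""
    A_hat = normalize_adj(A)
    H = X  # first layer receives the embedding vector matrix
    for W in weights:
        H = H + gcn_layer(H, A_hat, W)  # residual fusion of input and output
    return H
```

Each row of `propagate(X, A, weights)` corresponds to one entity's feature vector, which is then compared across the two knowledge graphs to find identical entities.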
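The training procedure of claim 6 needs a loss computed from positive samples (feature-vector pairs of identical entities) and negative samples (pairs of non-identical entities). A margin-based hinge loss over Euclidean distances, sketched below, is one common choice for entity alignment; the specific distance, margin value, and loss form are assumptions here, and the weight update itself (backpropagation through the propagation layer) is omitted:

```python
import numpy as np

def alignment_loss(pos_pairs, neg_pairs, margin=1.0):
    """Hinge loss: pull feature vectors of identical entities together,
    push feature vectors of non-identical pairs at least `margin` apart.

    pos_pairs / neg_pairs: iterables of (vector_a, vector_b) tuples drawn
    from the two knowledge graphs' propagation-layer outputs."""
    def dist(a, b):
        return np.linalg.norm(a - b)
    pos = sum(dist(a, b) for a, b in pos_pairs)               # shrink these
    neg = sum(max(0.0, margin - dist(a, b)) for a, b in neg_pairs)  # grow these
    return pos + neg
```

A loss of zero means every positive pair coincides and every negative pair is already separated by at least the margin; otherwise the value would drive the update of the graph convolutional layers' weight parameters.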
CN202010349062.9A 2020-04-28 2020-04-28 Entity alignment method, device, electronic equipment and storage medium Active CN111563192B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010349062.9A CN111563192B (en) 2020-04-28 2020-04-28 Entity alignment method, device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN111563192A true CN111563192A (en) 2020-08-21
CN111563192B CN111563192B (en) 2023-05-30

Family

ID=72071737

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010349062.9A Active CN111563192B (en) 2020-04-28 2020-04-28 Entity alignment method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111563192B (en)


Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017080220A1 (en) * 2015-11-12 2017-05-18 百度在线网络技术(北京)有限公司 Knowledge data processing method and apparatus
CN108389614A (en) * 2018-03-02 2018-08-10 西安交通大学 The method for building medical image collection of illustrative plates based on image segmentation and convolutional neural networks
CN108416397A (en) * 2018-03-30 2018-08-17 华南理工大学 A kind of Image emotional semantic classification method based on ResNet-GCN networks
CN108804544A (en) * 2018-05-17 2018-11-13 深圳市小蛙数据科技有限公司 Internet video display multi-source data fusion method and device
CN109710923A (en) * 2018-12-06 2019-05-03 浙江大学 Based on across the entity language matching process across media information
CN109726294A (en) * 2018-12-04 2019-05-07 北京奇艺世纪科技有限公司 A kind of App entity alignment schemes, device and electronic equipment
CN109783582A (en) * 2018-12-04 2019-05-21 平安科技(深圳)有限公司 A kind of knowledge base alignment schemes, device, computer equipment and storage medium
CN110472065A (en) * 2019-07-25 2019-11-19 电子科技大学 Across linguistry map entity alignment schemes based on the twin network of GCN
CN110502738A (en) * 2018-05-18 2019-11-26 阿里巴巴集团控股有限公司 Chinese name entity recognition method, device, equipment and inquiry system
CN110532353A (en) * 2019-08-27 2019-12-03 海南阿凡题科技有限公司 Text entities matching process, system, device based on deep learning
CN110941722A (en) * 2019-10-12 2020-03-31 中国人民解放军国防科技大学 Knowledge graph fusion method based on entity alignment
CN110955780A (en) * 2019-10-12 2020-04-03 中国人民解放军国防科技大学 Entity alignment method for knowledge graph
CN111046186A (en) * 2019-10-30 2020-04-21 平安科技(深圳)有限公司 Entity alignment method, device and equipment of knowledge graph and storage medium


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
XIN MAO et al.: "MRAEA: An Efficient and Robust Entity Alignment Approach for Cross-lingual Knowledge Graph", WSDM '20: Proceedings of the 13th International Conference on Web Search and Data Mining *
MENG Pengbo: "A survey of entity alignment research based on graph neural networks", Modern Computer *
GUAN Saiping et al.: "Research progress on knowledge reasoning for knowledge graphs", Journal of Software *
XU Zenglin et al.: "A survey of knowledge graph techniques", Journal of University of Electronic Science and Technology of China *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111930964A (en) * 2020-09-17 2020-11-13 腾讯科技(深圳)有限公司 Content processing method, device, equipment and storage medium
CN112215500B (en) * 2020-10-15 2022-06-28 支付宝(杭州)信息技术有限公司 Account relation identification method and device
CN112215500A (en) * 2020-10-15 2021-01-12 支付宝(杭州)信息技术有限公司 Account relation identification method and device
CN112199957A (en) * 2020-11-03 2021-01-08 中国人民解放军战略支援部队信息工程大学 Character entity alignment method and system based on attribute and relation information joint embedding
CN112199957B (en) * 2020-11-03 2023-12-08 中国人民解放军战略支援部队信息工程大学 Character entity alignment method and system based on attribute and relationship information joint embedding
CN112445876A (en) * 2020-11-25 2021-03-05 中国科学院自动化研究所 Entity alignment method and system fusing structure, attribute and relationship information
CN112445876B (en) * 2020-11-25 2023-12-26 中国科学院自动化研究所 Entity alignment method and system for fusing structure, attribute and relationship information
CN112559727A (en) * 2020-12-15 2021-03-26 北京百度网讯科技有限公司 Method, apparatus, device, storage medium, and program for outputting information
CN112559727B (en) * 2020-12-15 2024-04-09 北京百度网讯科技有限公司 Method, apparatus, device, storage medium, and program for outputting information
CN113032580A (en) * 2021-03-29 2021-06-25 浙江星汉信息技术股份有限公司 Associated file recommendation method and system and electronic equipment
CN113032580B (en) * 2021-03-29 2023-07-25 浙江星汉信息技术股份有限公司 Associated file recommendation method and system and electronic equipment
CN116610820A (en) * 2023-07-21 2023-08-18 智慧眼科技股份有限公司 Knowledge graph entity alignment method, device, equipment and storage medium
CN116610820B (en) * 2023-07-21 2023-10-20 智慧眼科技股份有限公司 Knowledge graph entity alignment method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN111563192B (en) 2023-05-30

Similar Documents

Publication Publication Date Title
CN111563192B (en) Entity alignment method, device, electronic equipment and storage medium
CN109783817B (en) Text semantic similarity calculation model based on deep reinforcement learning
CN109816032B (en) Unbiased mapping zero sample classification method and device based on generative countermeasure network
CN112819023A (en) Sample set acquisition method and device, computer equipment and storage medium
CN114117153B (en) Online cross-modal retrieval method and system based on similarity relearning
CN113298197B (en) Data clustering method, device, equipment and readable storage medium
CN112464100B (en) Information recommendation model training method, information recommendation method, device and equipment
CN113761220A (en) Information acquisition method, device, equipment and storage medium
CN111858962B (en) Data processing method, device and computer readable storage medium
CN113239143B (en) Power transmission and transformation equipment fault processing method and system fusing power grid fault case base
CN112463914B (en) Entity linking method, device and storage medium for internet service
CN114282528A (en) Keyword extraction method, device, equipment and storage medium
CN106844338B (en) method for detecting entity column of network table based on dependency relationship between attributes
CN110020214B (en) Knowledge-fused social network streaming event detection system
CN114780863B (en) Project recommendation method and device based on artificial intelligence, computer equipment and medium
CN114461813A (en) Data pushing method, system and storage medium based on knowledge graph
CN115168609A (en) Text matching method and device, computer equipment and storage medium
CN115131058A (en) Account identification method, device, equipment and storage medium
JP2022111020A (en) Transfer learning method of deep learning model based on document similarity learning and computer device
CN113705071A (en) Equipment identification method, device, equipment and storage medium
CN115700550A (en) Label classification model training and object screening method, device and storage medium
CN112861474A (en) Information labeling method, device, equipment and computer readable storage medium
CN115526177A (en) Training of object association models
CN116860952B (en) RPA intelligent response processing method and system based on artificial intelligence
CN110162356B (en) Page fusion method and device, storage medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code; Ref country code: HK; Ref legal event code: DE; Ref document number: 40027424; Country of ref document: HK
GR01 Patent grant