CN111563192B - Entity alignment method, device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN111563192B
CN111563192B (application number CN202010349062.9A)
Authority
CN
China
Prior art keywords: entity, entities, graph, layer, matrix
Prior art date
Legal status: Active
Application number
CN202010349062.9A
Other languages
Chinese (zh)
Other versions
CN111563192A (en)
Inventor
张涵宇
李雪莲
程序
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010349062.9A priority Critical patent/CN111563192B/en
Publication of CN111563192A publication Critical patent/CN111563192A/en
Application granted granted Critical
Publication of CN111563192B publication Critical patent/CN111563192B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Abstract

The application relates to the field of artificial intelligence and discloses an entity alignment method, an entity alignment apparatus, an electronic device and a storage medium. Entity alignment is performed using machine learning techniques in artificial intelligence so as to improve the accuracy with which feature vectors describe entities, and thereby improve the accuracy of entity alignment. The method comprises the following steps: for each entity in any knowledge graph, obtaining a feature vector of the entity based on the embedded vector of the entity and the embedded vectors of the entities that have an association relationship with the entity in that knowledge graph, wherein the embedded vector of an entity is obtained based on the attribute information of the entity; and determining the same entity in any two knowledge graphs based on the feature vector of each entity.

Description

Entity alignment method, device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a method and apparatus for entity alignment, an electronic device, and a storage medium.
Background
With the development of the Internet, the amount of data has grown explosively and become redundant and enormous in scale. To address these problems, knowledge graph technology has been developed to describe entities in the real world and the relationships among them. Most conventional entity alignment methods project the attribute information of two entities into the same space to obtain feature vectors of the two entities in that space, and determine whether the two entities are the same entity based on the distance value between the two feature vectors; the smaller the distance value, the greater the probability that the two entities are the same entity. However, conventional entity alignment methods ignore the association properties that exist between entities, which reduces the accuracy of entity alignment.
Disclosure of Invention
The embodiments of the application provide an entity alignment method and apparatus, an electronic device and a storage medium, so as to improve the accuracy with which feature vectors describe entities and thereby improve the accuracy of entity alignment.
In one aspect, an embodiment of the present application provides an entity alignment method, including:
for each entity in any knowledge graph, obtaining a feature vector of the entity based on the embedded vector of the entity and the embedded vectors of the entities having an association relationship with the entity in that knowledge graph, wherein the embedded vector of an entity is obtained based on attribute information of the entity;
and determining the same entity in any two knowledge graphs based on the feature vector of each entity.
In one aspect, an embodiment of the present application provides an entity alignment apparatus, including:
a feature vector acquisition module, configured to obtain, for each entity in any knowledge graph, a feature vector of the entity based on the embedded vector of the entity and the embedded vectors of the entities having an association relationship with the entity in that knowledge graph, wherein the embedded vector of an entity is obtained based on attribute information of the entity;
and an alignment module, configured to determine the same entity in any two knowledge graphs based on the feature vectors of the entities.
In one aspect, an embodiment of the present application provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of any of the methods described above when executing the computer program.
In one aspect, an embodiment of the present application provides a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of any of the methods described above.
In one aspect, an embodiment of the present application provides a computer program product comprising a computer program stored on a computer readable storage medium, the computer program comprising program instructions which when executed by a processor implement the steps of any of the methods described above.
According to the entity alignment method and apparatus, electronic device and storage medium provided by the embodiments of the application, the feature vector of an entity can be obtained based on the embedded vector of the entity and the embedded vectors of the surrounding entities that have an association relationship with it, and the same entity is determined based on the obtained feature vectors. The feature vector of an entity therefore contains not only the attribute information of the entity itself but also the attribute information of the entities associated with it, which enriches the information represented in the feature vector, improves the accuracy with which the feature vector describes the entity and, because entity alignment is performed based on these feature vectors, improves the accuracy of entity alignment and enables better knowledge graph fusion. Further, the embedded vector of an entity can be obtained based on both the attribute information and the picture information of the entity, so that the extracted embedded vector contains information about multiple aspects of the entity, further improving entity alignment accuracy.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the present application, and a person skilled in the art may obtain other drawings from them without inventive effort.
Fig. 1 is an application scenario schematic diagram of an entity alignment method provided in an embodiment of the present application;
FIG. 2 is a flow chart of an entity alignment method according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of obtaining an embedded vector of an entity according to an embodiment of the present application;
FIG. 4 is a diagram of data related to game A crawled from game website B according to one embodiment of the present application;
FIG. 5 is a diagram of a network architecture of a DenseNet used in an embodiment of the present application;
fig. 6 is a schematic diagram of an undirected graph and an adjacency matrix corresponding to a knowledge graph according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a propagation layer including a two-layer graph convolutional neural network and a two-layer residual network according to an embodiment of the present application;
FIG. 8 is a schematic flow chart of a training propagation layer according to an embodiment of the present application;
FIG. 9 is a diagram of comparative test data of an entity alignment method and several common entity alignment methods provided in embodiments of the present application;
FIG. 10A is a schematic diagram of an interface of a game website with multiple data sources integrated according to an embodiment of the present disclosure;
FIG. 10B is an interface schematic diagram of a game details interface according to an embodiment of the present disclosure;
FIG. 10C is a schematic diagram of an interface of a game platform in an applet after merging multiple data sources according to an embodiment of the present application;
FIG. 10D is a schematic diagram of an interface for searching for a game through an applet according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of an entity alignment apparatus according to an embodiment of the present disclosure;
fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application.
For ease of understanding, the terms referred to in the embodiments of the present application are explained below:
Artificial Intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, enabling machines to have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics and the like. Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
Knowledge Graph: intended to describe entity concepts and their interrelationships, typically organized as "entity-relationship-entity" triples, where each entity possesses its corresponding "attributes". Large-scale knowledge graphs often contain hundreds of millions of entities, hundreds of billions of attributes, and billions of relationships, mined from a large amount of structured and unstructured data. Based on a domain-specific knowledge graph and natural language understanding technology built on it, a machine can fully exploit its reasoning and judging capabilities, answer questions relatively accurately and extend the scope of its intelligence.
Entity: something that exists objectively and can be distinguished from other things, including concrete persons, things, abstract concepts or associations. In a knowledge graph, a node represents an entity and an edge between nodes represents an association relationship between entities; entities in a knowledge graph may be people, animals, plants, books, movies, events, and the like.
Entity Alignment: also called Entity Matching, an important step in knowledge graph construction. It means finding, for entities in different knowledge graphs, those that refer to the same entity in the real world, i.e. identifying the same entity across different knowledge graphs. A common entity alignment approach is to use the attribute information of entities to determine whether entities in different knowledge graphs can be aligned, i.e. whether they are the same entity.
Graph Convolutional Network (GCN): a neural network capable of performing graph convolution operations. Its core idea is as follows: first, each node transforms its own feature information and sends it to its neighbor nodes, extracting and transforming the node's feature information; second, each node gathers the feature information of its neighbor nodes, fusing the node's local structure information; third, the gathered information undergoes a non-linear transformation, which increases the expressive power of the model. A graph convolutional network can fully fuse the feature information of related nodes and improves the expressive power of the model.
Any number of elements in the figures are for illustration and not limitation, and any naming is used for distinction only and not for any limiting sense.
In specific practice, most conventional entity alignment methods project the attribute information of two entities into the same space to obtain feature vectors of the two entities in that space, and determine whether the two entities are the same entity based on the distance value between the two feature vectors; the smaller the distance value, the greater the probability that the two entities are the same entity. However, conventional entity alignment methods ignore the association properties that exist between entities, which reduces the accuracy of entity alignment.
Therefore, the inventors of the application considered that the feature vector of an entity can be obtained based on the embedded vector of the entity and the embedded vectors of at least one surrounding entity that has an association relationship with it, and that the same entity can be determined based on the feature vectors so obtained, where an embedded vector is data, obtained from the attribute information of an entity, that characterizes the attribute features of that entity. The feature vector of an entity thus contains both the attribute information of the entity itself and the attribute information of the entities associated with it, which enriches the information represented in the feature vector, improves the accuracy with which the feature vector describes the entity, and helps improve the accuracy of entity alignment. Specifically, the entity alignment method provided by the embodiments of the application includes: for each entity in any knowledge graph, obtaining a feature vector of the entity based on the embedded vector of the entity and the embedded vectors of the entities having an association relationship with the entity in that knowledge graph; and determining the same entity in any two knowledge graphs based on the feature vector of each entity. Further, the embedded vector of an entity can be obtained based on both the attribute information and the picture information of the entity, so that the extracted embedded vector contains information about multiple aspects of the entity, further improving entity alignment accuracy. In addition, considering the abundant relational data describing association relationships between entities in a knowledge graph and the huge data volume of a knowledge graph, the feature vectors of the entities in a knowledge graph can be obtained quickly and in batches based on a graph convolutional network, which improves the processing efficiency of entity alignment; at the same time, a graph convolutional network can well fuse the attribute information of an entity and of its surrounding associated entities, which assists entity alignment.
After the design concept of the embodiment of the present application is introduced, some simple descriptions are made below for application scenarios applicable to the technical solution of the embodiment of the present application, and it should be noted that the application scenarios described below are only used to illustrate the embodiment of the present application and are not limiting. In specific implementation, the technical scheme provided by the embodiment of the application can be flexibly applied according to actual needs.
Referring to fig. 1, an application scenario of the entity alignment method provided in an embodiment of the present application is shown. The application scenario comprises a processing device 101 and a database 102, connected by a wireless or wired communication network. The processing device 101 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, big data and artificial intelligence platforms. Database 102 may be a database implemented on a cloud server, and a large number of knowledge graphs to be processed are stored in database 102. The processing device 101 obtains knowledge graphs to be processed from the database 102; for each entity in any knowledge graph, it obtains the feature vector of the entity based on the embedded vector of the entity and the embedded vectors of the entities having an association relationship with the entity in that knowledge graph, and determines the same entity in any two knowledge graphs, i.e. the entity alignment result, based on the feature vector of each entity. Further, the processing device 101 may fuse at least two knowledge graphs into one knowledge graph based on the entity alignment result, and store the fused knowledge graph in the database 102, so that services such as intelligent search and intelligent question answering can be implemented based on the fused knowledge graph.
Of course, the method provided in the embodiment of the present application is not limited to the application scenario shown in fig. 1, but may be used in other possible application scenarios, and the embodiment of the present application is not limited. The functions that can be implemented by each device in the application scenario shown in fig. 1 will be described together in the following method embodiments, which are not described in detail herein.
In order to further explain the technical solutions provided in the embodiments of the present application, details are described below with reference to the accompanying drawings and specific embodiments. Although the embodiments of the present application present the method operation steps shown in the following embodiments or figures, the method may include more or fewer operation steps based on routine or non-inventive work. For steps between which there is logically no necessary causal relationship, the execution order is not limited to that provided by the embodiments of the present application.
The technical solution provided in the embodiment of the present application is described below with reference to the application scenario shown in fig. 1.
Referring to fig. 2, an embodiment of the present application provides an entity alignment method, which may be applied to the processing device 101 shown in fig. 1, and specifically includes the following steps:
s201, aiming at each entity in any knowledge graph, obtaining a feature vector of the entity based on the embedded vector of the entity and the embedded vector of the entity with an association relation with the entity in any knowledge graph.
The embedded vector of the entity is obtained based on attribute information of the entity, and the embedded vector is used for representing attribute characteristics of the entity. Specifically, attribute information of each entity in the knowledge graph can be projected into the same vector space, so as to obtain an embedded vector of each entity.
In a specific implementation, the processing device may obtain at least two knowledge graphs from the database, where each knowledge graph includes the attribute information of each entity and the association relationships between entities. For each entity, the processing device determines the entities associated with it based on the association relationships in the knowledge graph, and obtains the feature vector of the entity based on the embedded vector of the entity and the embedded vectors of the entities associated with it.
The feature vector of the entity obtained in step S201 includes both attribute information of the entity itself and attribute information of the entity associated with the entity.
S202, determining the same entity in any two knowledge graphs based on the feature vectors of the entities.
In implementation, the distance values between the feature vector of any entity E1 in knowledge graph KG1 and the feature vectors of all entities in knowledge graph KG2 can be calculated. If the smallest distance value is below the distance threshold, the two entities corresponding to that smallest distance value are determined to be the same entity; otherwise, it is determined that no entity in knowledge graph KG2 is the same as entity E1 in knowledge graph KG1. The distance value between the feature vectors of two entities can be calculated with algorithms such as Euclidean distance, cosine similarity or Manhattan distance, and the distance threshold can be set according to the actual application scenario; the embodiments of the application do not limit this.
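As an illustration of the matching procedure in step S202, the following sketch pairs each entity of KG1 with its nearest entity in KG2 by feature-vector distance and accepts the match only when the distance is below a threshold. The use of cosine distance, the function names and the threshold value are illustrative assumptions rather than requirements of the embodiment.

```python
# Hedged sketch of step S202: nearest-neighbour matching by feature-vector distance.
import numpy as np

DIST_THRESHOLD = 0.3  # assumed value; tune for the actual application scenario


def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))


def align_entities(feats_kg1: dict, feats_kg2: dict) -> dict:
    """Map each entity id in KG1 to its matched entity id in KG2 (or None)."""
    matches = {}
    for e1, f1 in feats_kg1.items():
        best_e2, best_d = None, float("inf")
        for e2, f2 in feats_kg2.items():
            d = cosine_distance(f1, f2)
            if d < best_d:
                best_e2, best_d = e2, d
        # accept the closest KG2 entity only if it is close enough
        matches[e1] = best_e2 if best_d < DIST_THRESHOLD else None
    return matches
```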
In a specific implementation, the same entities in the two knowledge graphs are determined based on the feature vectors of the entities in the two knowledge graphs, and each pair of same entities is fused into one entity, so that the two knowledge graphs are fused into one knowledge graph. Further, more knowledge graphs can be fused into one knowledge graph in the same way.
According to the entity alignment method, the feature vector of an entity can be obtained based on the embedded vector of the entity and the embedded vectors of the surrounding entities that have an association relationship with it, and the same entity is determined based on the obtained feature vectors. The feature vector of an entity therefore contains not only the attribute information of the entity itself but also the attribute information of the entities associated with it, which enriches the information represented in the feature vector, improves the accuracy with which the feature vector describes the entity and, because entity alignment is performed based on these feature vectors, improves the accuracy of entity alignment and enables better knowledge graph fusion.
Further, referring to fig. 3, the embedded vectors of the respective entities may be obtained by:
s301, acquiring attribute information of an entity from a knowledge graph to which the entity belongs, wherein the attribute information comprises text information describing the attribute of the entity and picture information of the entity.
When the knowledge graph is constructed from crawled data, the crawled data can be divided into text information and picture information. Referring to fig. 4, taking the data about game A crawled from game website B as an example, the data includes descriptive text about game A, such as the game name, game developer, game copyright holder and game introduction; after cleaning, this text becomes the text information of the entity "game A". The data also includes pictures from game A, which are used as the picture information.
S302, extracting text feature vectors from the text information of the entity.
Specifically, each character in the text information of the entity can be respectively converted into a corresponding initial vector, and a text feature vector containing the context information can be obtained based on the initial vector of each character.
For example, suppose the text information is "game studio", which consists of five characters in the original Chinese. Each of the five characters is initialized as a 20-dimensional vector; for example, the initial vector corresponding to the first character is [1, 0.8, …, 0.75] and the initial vector corresponding to the second character is [0.5, 0.12, …, 0.65]. Vector initialization maps each character into the same vector space, where each character corresponds to a unique vector. A text feature vector can then be obtained based on the initial vectors of the characters; specifically, the initial vectors of all the characters in the text information can be input into a Bag-of-Words model (BoW), a Long Short-Term Memory network (LSTM) or an N-gram model to obtain a text feature vector containing context information.
The N-gram model can mine the contextual relationships in the text information well, and deeper contextual features can be extracted based on it, which improves the accuracy of the text feature vector; moreover, the N-gram model can be used without training. Specifically, the text feature vector may be obtained with the following N-gram formula:
f_a(a) = \sum_{n=1}^{N} \frac{\sum_{i=1}^{t} \sum_{j=i}^{n} c_j}{t - i + 1}
where a represents the text information, f_a(a) represents the text feature vector, N is the window size used by the N-gram model (i.e. the initial vectors of the first N characters in the text information participate in the operation), c_j is the initial vector of the j-th character in the text information, and t is the number of characters contained in the text information. In practical applications, if the number of characters contained in the text information is less than N, then c_j = 0 when j is greater than or equal to N.
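The compositional formula above is rendered as an image in the original publication; the sketch below shows one plausible reading of it, in which the character vectors inside each window of size n (n = 1, …, N) are summed and length-normalized, and the results are accumulated over all window sizes. The 20-dimensional random character vectors, the window size N = 3 and the function names are illustrative assumptions.

```python
# Minimal sketch of an N-gram compositional text feature (assumed formulation).
import numpy as np


def char_vectors(text: str, dim: int = 20, seed: int = 0) -> np.ndarray:
    """Map each character to a fixed 'initial vector' (random here for illustration)."""
    rng = np.random.default_rng(seed)
    vocab, vecs = {}, []
    for ch in text:
        if ch not in vocab:
            vocab[ch] = rng.standard_normal(dim)
        vecs.append(vocab[ch])
    return np.stack(vecs)  # shape (t, dim)


def ngram_text_feature(c: np.ndarray, N: int = 3) -> np.ndarray:
    t, dim = c.shape
    f = np.zeros(dim)
    for n in range(1, N + 1):
        for i in range(t):
            window = c[i:i + n]               # zero-padding past the end is implicit
            f += window.sum(axis=0) / (t - i)  # length-normalized window sum
    return f


text_feat = ngram_text_feature(char_vectors("game studio"))
```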
S303, extracting an image feature vector from the image information of the entity.
In practice, the image feature vector may be extracted from the picture information of the entity with a trained image feature extraction model. For example, a trained DenseNet (Densely Connected Convolutional Network) can be used to extract the image feature vector from the entity's picture information, which removes the need to train an image feature extraction model separately. As shown in fig. 5, the DenseNet consists of several convolution layers, several dense blocks (DenseBlock), several pooling layers and a linear output layer. The picture information is input into the DenseNet and, after processing by these network layers, the last pooling layer outputs a 1024-dimensional vector as the image feature vector of the picture information; for example, the image feature vector may be [1.013232, 1.03242, 0.434343, …, 0.3823243].
Of course, in practical application, other convolutional neural networks (Convolutional Neural Networks, CNN) may be used as the image feature extraction model, which is not limited in the embodiment of the present application.
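As a concrete illustration of step S303, the following sketch extracts a 1024-dimensional image feature with a pretrained DenseNet-121 from torchvision (a recent version is assumed). The embodiment only requires "a trained DenseNet", so the specific weights, preprocessing and pooling shown here are assumptions.

```python
# Hedged sketch: 1024-dim image feature from a pretrained torchvision DenseNet-121.
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

densenet = models.densenet121(weights=models.DenseNet121_Weights.DEFAULT)
densenet.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])


def image_feature(path: str) -> torch.Tensor:
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        fmap = densenet.features(img)                     # (1, 1024, 7, 7) feature map
        pooled = F.adaptive_avg_pool2d(F.relu(fmap), 1)   # global average pooling
    return pooled.flatten()                               # 1024-dim image feature vector
```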
In a specific implementation, steps S302 and S303 may be executed in any order.
S304, obtaining an embedded vector of the entity based on the text feature vector and the image feature vector.
In implementation, the text feature vector and the image feature vector of the entity can be concatenated directly, and the concatenated vector is used as the embedded vector of the entity. For example, if the text feature vector of entity A is [w_1, w_2, …, w_n] and the image feature vector of entity A is [v_1, v_2, …, v_m], the two vectors are concatenated into an (n+m)-dimensional vector [w_1, w_2, …, w_n, v_1, v_2, …, v_m], which is used as the embedded vector of entity A.
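A minimal sketch of step S304, assuming the direct concatenation described above; the function name is illustrative.

```python
# Form an entity's embedded vector by concatenating its text and image feature vectors.
import numpy as np


def entity_embedding(text_feat: np.ndarray, image_feat: np.ndarray) -> np.ndarray:
    return np.concatenate([text_feat, image_feat])  # (n + m)-dimensional embedded vector
```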
In the embodiment shown in fig. 3, the embedded vector of the entity is obtained based on the text information and the picture information of the entity, so that the attribute information of the entity is enriched, the extracted embedded vector contains information of multiple aspects of the entity, and the entity alignment accuracy is further improved.
In practical application, the embedded vector of the entity can be obtained only based on the text information of the entity, namely, the text feature vector extracted from the text information of the entity is directly used as the embedded vector of the entity.
Based on any of the above embodiments, considering the abundant relational data describing association relationships between entities in a knowledge graph and the huge data volume of a knowledge graph, the feature vectors of entities can be extracted with a graph convolutional network. Specifically, the propagation layer used to extract the feature vectors of entities comprises at least one graph convolutional layer; the embedded vector matrix and the adjacency matrix of a knowledge graph are input into the trained propagation layer to obtain the feature vectors of the entities in the knowledge graph, where the embedded vector matrix of the knowledge graph comprises the embedded vectors of the entities in the knowledge graph and the adjacency matrix of the knowledge graph characterizes the association relationships between the entities in the knowledge graph.
In a specific implementation, for a knowledge graph KG_i, all relation triples in KG_i are traversed: each entity in KG_i is taken as a node, and an edge is added between any two entities that have a relation, so that the undirected graph G_i corresponding to KG_i is obtained; the adjacency matrix A_i is then obtained from the undirected graph G_i. Referring to fig. 6, which shows the undirected graph corresponding to a knowledge graph, association relationships exist between entity A and entity B, between entity A and entity D, between entity E and entity B, between entity E and entity D, and between entity E and entity C. If an association relationship exists between two entities, the corresponding value in the adjacency matrix is 1; if no association relationship exists between two entities, the corresponding value is 0.
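The following sketch builds the adjacency matrix of a knowledge graph from its relation triples in the way just described, using the five entities of the FIG. 6 example. Triples are assumed to be (head entity, relation, tail entity) tuples, and the relation label itself is not used.

```python
# Build the undirected graph / adjacency matrix A_i of a knowledge graph from its triples.
import numpy as np


def build_adjacency(entities: list, triples: list) -> np.ndarray:
    index = {e: k for k, e in enumerate(entities)}
    A = np.zeros((len(entities), len(entities)))
    for head, _relation, tail in triples:
        i, j = index[head], index[tail]
        A[i, j] = A[j, i] = 1.0  # undirected edge between two related entities
    return A


entities = ["A", "B", "C", "D", "E"]
triples = [("A", "r", "B"), ("A", "r", "D"), ("E", "r", "B"), ("E", "r", "D"), ("E", "r", "C")]
A = build_adjacency(entities, triples)
```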
The inputs of each graph convolutional layer in the propagation layer include an input matrix H^(l) and the adjacency matrix A, and the output matrix of that layer can be represented as H^(l+1) = f(H^(l), A). Each graph convolutional layer in the propagation layer obtains and outputs an output matrix based on the received input matrix and the adjacency matrix, where the output matrix of each layer comprises the intermediate feature vectors of the entities: in the output matrix H^(l+1), the intermediate feature vector of any entity is obtained based on the vector corresponding to that entity in the input matrix H^(l) and the vectors, in the input matrix, corresponding to at least one entity that the adjacency matrix A characterizes as having an association relationship with that entity. The input matrix H^(1) received by the first graph convolutional layer is the embedded vector matrix of the knowledge graph, and the input matrix received by every other graph convolutional layer is the output matrix of the previous graph convolutional layer; for example, the input matrix of the second graph convolutional layer is the output matrix H^(2) = f(H^(1), A) of the first graph convolutional layer, and the input matrix of the third graph convolutional layer is the output matrix H^(3) = f(H^(2), A) of the second graph convolutional layer.
The feature vector of each entity in the knowledge graph input to the propagation layer is obtained based on the output matrix of the last graph convolutional layer in the propagation layer.
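The embodiment only specifies the propagation rule abstractly as H^(l+1) = f(H^(l), A). The sketch below assumes the standard graph-convolution form of f, with self-loops, symmetric degree normalization and a ReLU non-linearity; this choice of f is an assumption, not the patent's definition.

```python
# Sketch of one graph-convolution layer: H' = ReLU(D^{-1/2}(A+I)D^{-1/2} H W) (assumed form of f).
import numpy as np


def gcn_layer(H: np.ndarray, A: np.ndarray, W: np.ndarray) -> np.ndarray:
    A_hat = A + np.eye(A.shape[0])                  # add self-loops
    d = A_hat.sum(axis=1)                           # degrees are >= 1, so no division by zero
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt        # symmetric normalization
    return np.maximum(A_norm @ H @ W, 0.0)          # ReLU non-linearity (assumed)
```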
Extracting the feature vectors of entities with a graph convolutional network allows the feature vectors of the entities in a knowledge graph to be obtained quickly and in batches, which improves the processing efficiency of entity alignment; at the same time, the graph convolutional network can well fuse the information of an entity and of the related entities around it, which assists entity alignment.
Further, a residual network layer may be connected after each graph convolutional layer in the propagation layer. On this basis, every residual network in the propagation layer except the last one fuses the input matrix and the output matrix of the preceding graph convolutional layer to obtain the input matrix of the next graph convolutional layer, and the last residual network in the propagation layer fuses the input matrix and the output matrix of the last graph convolutional layer to obtain the feature vector of each entity in the knowledge graph.
Specifically, a residual network may directly add its corresponding input matrix and output matrix to obtain the fusion result, i.e. the vectors corresponding to the same entity in the input matrix and the output matrix are added. For example, if the input matrix is [u_1, u_2, u_3] and the output matrix is [u'_1, u'_2, u'_3], the fused matrix is [u_1+u'_1, u_2+u'_2, u_3+u'_3]. Alternatively, the residual network may weight the input matrix and the output matrix to obtain the fusion result; for example, if the input matrix is [u_1, u_2, u_3] and the output matrix is [u'_1, u'_2, u'_3], the fused matrix is [αu_1+βu'_1, αu_2+βu'_2, αu_3+βu'_3]. The weights α and β may be determined according to application requirements; in general, to ensure that the feature vector of each entity learned by the propagation layer retains more of the attribute information of the entity itself, α may be greater than β, for example α = 1 and β = 0.5.
Referring to fig. 7, which shows a propagation layer comprising two graph convolutional layers and two residual networks, suppose the embedded vector matrix H^(1) of a knowledge graph and the adjacency matrix A are given. The input of the first graph convolutional layer is X^(1) = H^(1) together with the adjacency matrix A; the first graph convolutional layer obtains the output matrix H^(2) = f(H^(1), A) based on H^(1) and A, and the first residual network adds the input matrix H^(1) and the output matrix H^(2) of the first graph convolutional layer to obtain the matrix X^(2), which is input into the second graph convolutional layer. The input of the second graph convolutional layer is X^(2) together with the adjacency matrix A; the second graph convolutional layer obtains the output matrix H^(3) = f(X^(2), A) based on X^(2) and A, and the second residual network adds the input matrix X^(2) and the output matrix H^(3) of the second graph convolutional layer to obtain the matrix X^(3), which comprises the feature vectors of all the entities in the knowledge graph.
By adding a residual network after each graph convolutional layer, each entity can learn the attribute information of its surrounding related entities after each graph convolutional layer while its own attribute information is not changed greatly, so that the feature vector of the entity both preserves the attribute features of the entity itself well and contains the attribute features of the other entities associated with it.
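The following sketch assembles the two-layer propagation layer of FIG. 7: each graph-convolution layer is followed by a residual fusion of its input and output, using the weighted fusion α = 1, β = 0.5 from the example above. The gcn_layer helper repeats the earlier assumed propagation rule, and the weight matrices are assumed to be square so that the residual addition is dimension-consistent.

```python
# Sketch of the FIG. 7 propagation layer: two GCN layers, each followed by a residual fusion.
import numpy as np


def gcn_layer(H: np.ndarray, A: np.ndarray, W: np.ndarray) -> np.ndarray:
    A_hat = A + np.eye(A.shape[0])
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)


def propagation_layer(H1: np.ndarray, A: np.ndarray, W1: np.ndarray, W2: np.ndarray,
                      alpha: float = 1.0, beta: float = 0.5) -> np.ndarray:
    H2 = gcn_layer(H1, A, W1)        # first graph-convolution layer
    X2 = alpha * H1 + beta * H2      # first residual fusion -> input of the second layer
    H3 = gcn_layer(X2, A, W2)        # second graph-convolution layer
    X3 = alpha * X2 + beta * H3      # last residual fusion -> entity feature vectors
    return X3
```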
In particular implementations, referring to fig. 8, the propagation layer may be trained by:
s801, respectively inputting an embedded vector matrix and an adjacent matrix corresponding to the two knowledge maps in the training sample into a propagation layer to obtain feature vectors of all entities in the two knowledge maps.
In a specific implementation, each training sample comprises two knowledge graphs and labeling information for the two knowledge graphs, where the labeling information indicates the same entities in the two knowledge graphs. The embedded vector matrix and the adjacency matrix corresponding to each knowledge graph can be determined according to the method provided by the embodiments of the application, which is not repeated here.
S802, determining the same entities in the two knowledge graphs according to the labeling information of the training sample, taking the feature vectors of two entities that are the same entity as a positive sample, and taking the feature vectors of two entities that are not the same entity as a negative sample.
In a specific implementation, the entities belonging to positive samples are removed from the two knowledge graphs of the training sample, one entity is then randomly extracted from each of the two knowledge graphs with the positive-sample entities removed, and the feature vectors of the two extracted entities are combined into a negative sample. Alternatively, assuming the two knowledge graphs of the training sample are KG1 and KG2, the entities belonging to positive samples in KG1 may be removed, one entity randomly extracted from the remaining entities of KG1 and one entity randomly extracted from KG2, and the feature vectors of the two extracted entities combined into a negative sample; similarly, the entities belonging to positive samples in KG2 may be removed, one entity randomly extracted from the remaining entities of KG2 and one entity randomly extracted from KG1, and the feature vectors of the two extracted entities combined into a negative sample.
In practical applications, several negative samples may be randomly extracted. Specifically, the number of negative samples is positively correlated with the number of positive samples, for example 2 to 5 times the number of positive samples; the specific multiple may be determined according to the actual application scenario and is not limited by the embodiments of the present application.
For example, a training sample includes knowledge graph KG1 and knowledge graph KG2, and the corresponding labeling information indicates that entity E11 in KG1 and entity E21 in KG2 are the same entity, and that entity E14 in KG1 and entity E22 in KG2 are the same entity. KG1 comprises 9 entities E11 to E19 in total, and KG2 comprises 8 entities E21 to E28 in total. According to the labeling information, the feature vector of E11 and the feature vector of E21 form one positive sample, and the feature vector of E14 and the feature vector of E22 form another positive sample. Then several negative samples are randomly extracted from KG1 and KG2 from which the positive samples have been removed; for example, the feature vector of E11 and the feature vector of E22 may form one negative sample, the feature vector of E11 and the feature vector of E23 another, the feature vector of E12 and the feature vector of E22 another, and so on.
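A sketch of the positive/negative sample construction in step S802, following the example above; the entity-id inputs, the 2× negative-sampling ratio and the function names are illustrative assumptions.

```python
# Build positive and negative sample pairs from two labeled knowledge graphs.
import random


def build_samples(feats_kg1: dict, feats_kg2: dict, labelled_pairs: list, neg_ratio: int = 2):
    positives = [(feats_kg1[e1], feats_kg2[e2]) for e1, e2 in labelled_pairs]
    aligned1 = {e1 for e1, _ in labelled_pairs}
    aligned2 = {e2 for _, e2 in labelled_pairs}
    rest1 = [e for e in feats_kg1 if e not in aligned1]   # KG1 entities outside positive samples
    rest2 = [e for e in feats_kg2 if e not in aligned2]   # KG2 entities outside positive samples
    negatives = []
    for _ in range(neg_ratio * len(positives)):
        e1, e2 = random.choice(rest1), random.choice(rest2)
        negatives.append((feats_kg1[e1], feats_kg2[e2]))
    return positives, negatives
```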
S803, based on the obtained positive and negative samples, a loss function value is obtained.
In practice, the loss function value may be calculated using the following loss function:
L = \sum_{(p,q) \in C} d(p,q) - \sum_{(p',q') \in C'} d(p',q')
where C is the set of positive samples, C' is the set of negative samples, (p, q) denotes a positive sample, (p', q') denotes a negative sample, d(p, q) denotes the distance value between the two feature vectors p and q in a positive sample, and d(p', q') denotes the distance value between the two feature vectors p' and q' in a negative sample.
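A sketch of the loss computation in step S803 under the reconstruction above: the sum of distances over positive samples minus the sum over negative samples, so that training pulls matched entities together and pushes unmatched ones apart. The Euclidean distance and the absence of a margin term are assumptions; a margin-based hinge variant is also common in practice.

```python
# Sketch of the alignment training loss (assumed form; the original formula is an image).
import numpy as np


def pair_distance(p: np.ndarray, q: np.ndarray) -> float:
    return float(np.linalg.norm(p - q))  # Euclidean distance between two feature vectors


def alignment_loss(positives, negatives) -> float:
    pos = sum(pair_distance(p, q) for p, q in positives)
    neg = sum(pair_distance(p, q) for p, q in negatives)
    return pos - neg
```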
S804, updating the weight parameters of each graph convolutional layer in the propagation layer based on the loss function value.
In a specific implementation, the weight parameters of the graph convolutional network can be updated with common algorithms such as gradient descent, so that the distance values between the two feature vectors in each positive sample become smaller and smaller while the distance values between the two feature vectors in each negative sample become larger and larger; the specific process is not repeated here.
S805, judging whether an ending condition is met, if yes, ending training, otherwise returning to step S801, and continuing training the propagation layer based on the training sample.
The end condition may be that the loss function value is smaller than a preset value, or that the number of iterations exceeds a preset number, and so on. Until the end condition is met, the iterative process of steps S801 to S805 is executed in a loop, and the graph convolutional network is trained through continuous iteration. When returning to step S801, training may continue based on the previously used training sample, or a new training sample may be used instead.
Referring to fig. 9, comparative test data of the entity alignment method provided in the embodiments of the present application and several commonly used existing entity alignment methods are provided. The existing entity alignment methods include AttrE, Multi-view, and a method that performs entity alignment with the N-gram model alone. Based on the entity alignment method provided by the embodiments of the application, several variants were tested, namely GCN+BOW, GCN+LSTM, GCN+N-gram and GCN+N-gram+Image. GCN+BOW extracts the feature vectors of entities with the GCN and extracts the text feature vectors of the entities with BOW; GCN+LSTM extracts the feature vectors of entities with the GCN and the text feature vectors with LSTM; GCN+N-gram extracts the feature vectors of entities with the GCN and the text feature vectors with the N-gram model. In these three variants the attribute information of an entity contains only text information, so the extracted text feature vector is used directly as the embedded vector of the entity. GCN+N-gram+Image extracts the text feature vectors of entities with the N-gram model, extracts the image feature vectors of entities with DenseNet, and extracts the feature vectors of entities with the GCN.
For each entity alignment method listed in fig. 9, multiple sets of test data were tested; each set of test data contains two knowledge graphs, with the same entities in the two knowledge graphs labeled. Taking the GCN+N-gram+Image method as an example, the test procedure for one set of data is as follows. First, the feature vectors of all entities in the knowledge graphs of the set of test data are obtained with the GCN+N-gram+Image method. Then, for an entity E1 in knowledge graph KG1 that is labeled as a same entity, the distance values between the feature vector of E1 and the feature vectors of all entities in the other knowledge graph KG2 are calculated. All entities in KG2 are then sorted by distance value from small to large, and the rank of entity E2 in KG2 is determined, where entity E1 in KG1 and entity E2 in KG2 are labeled in advance as the same entity. If entity E2 ranks in the top 1%, the number of correct results N_1 of the GCN+N-gram+Image method on hit@1 is increased by 1; if entity E2 ranks in the top 10%, the number of correct results N_10 on hit@10 is increased by 1; and if entity E2 ranks in the top 50%, the number of correct results N_50 on hit@50 is increased by 1. After M groups of test data have been tested in this way, the accuracy of the GCN+N-gram+Image method on hit@1 is N_1/M, the accuracy on hit@10 is N_10/M, and the accuracy on hit@50 is N_50/M. Referring to fig. 9, the accuracy of the GCN+N-gram+Image method on hit@1 is 84.60%, indicating that when entity alignment is performed with the GCN+N-gram+Image method, the probability that the same entity falls in the top 1% is 84.60%; the accuracy on hit@10 is 90.82%, indicating that the probability that the same entity falls in the top 10% is 90.82%; and the accuracy on hit@50 is 95.26%, indicating that the probability that the same entity falls in the top 50% is 95.26%.
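The following sketch reproduces the hit@k evaluation just described: for each labeled pair (E1 in KG1, E2 in KG2), all KG2 entities are ranked by their distance to E1 and a hit is counted when E2 falls in the top k percent. The percentile interpretation of hit@k follows the description above; the distance metric and function names are assumptions.

```python
# Sketch of percentile-based hit@k evaluation over labeled entity pairs.
import numpy as np


def hit_at_k(feats_kg1: dict, feats_kg2: dict, labelled_pairs: list, percents=(1, 10, 50)) -> dict:
    hits = {p: 0 for p in percents}
    kg2_ids = list(feats_kg2)
    for e1, e2 in labelled_pairs:
        dists = {e: float(np.linalg.norm(feats_kg1[e1] - feats_kg2[e])) for e in kg2_ids}
        ranked = sorted(kg2_ids, key=dists.get)   # sort KG2 entities from closest to farthest
        rank = ranked.index(e2) + 1
        for p in percents:
            if rank <= max(1, int(len(kg2_ids) * p / 100)):
                hits[p] += 1
    M = len(labelled_pairs)
    return {f"hit@{p}": hits[p] / M for p in percents}
```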
The data in fig. 9 show that GCN+N-gram improves considerably over using the N-gram model alone; that among GCN+BOW, GCN+LSTM and GCN+N-gram, GCN+N-gram performs best, indicating that the N-gram model helps improve the accuracy of text feature vector extraction; and that GCN+N-gram+Image, which adds image information on top of GCN+N-gram, further improves entity alignment accuracy, especially on hit@10 and hit@50.
The process of entity alignment for a game is described below using a knowledge graph in the game field as an example.
First, the game knowledge graph corresponding to each data source is obtained. The specific process is as follows: data about each game is crawled from data sources such as game libraries, game publishing platforms, Baidu Baike and Wikipedia; after the crawled data is cleaned, the attribute information of each game in the different data sources is obtained, and a corresponding game knowledge graph can be built from the data collected from each data source, with each game in a data source taken as an entity in the corresponding game knowledge graph. The adjacency matrix of the game knowledge graph corresponding to a data source is obtained from the association relationships between the games in that data source. Taking fig. 4 as an example, the data about game A crawled from game website B includes descriptive text about game A, such as the game name, game developer, game copyright holder and game introduction; the text information of the entity "game A" is obtained after cleaning this text. The data also includes pictures from game A, which are used as the picture information, so that the attribute information of game A in the data source of game website B is obtained. In this way, game knowledge graphs from different data sources can be obtained, and the game knowledge graph corresponding to each data source includes the attribute information of the entities and an adjacency matrix characterizing the association relationships between the entities.
Then, for each entity in any game knowledge graph, the feature vector of the entity is obtained based on the embedded vector of the entity and the embedded vectors of the entities having an association relationship with it in the game knowledge graph, where the embedded vector of an entity is obtained based on the attribute information of the entity in the game knowledge graph to which it belongs. The same entity in any two game knowledge graphs is then determined based on the feature vector of each entity. Specifically, the distance values between the feature vector of any entity E1 in game knowledge graph KG1 and the feature vectors of all entities in game knowledge graph KG2 may be calculated; if the smallest distance value is below the distance threshold, the two entities corresponding to that smallest distance value are determined to be the same entity, and otherwise it is determined that no entity in game knowledge graph KG2 is the same as entity E1 in game knowledge graph KG1. For more specific embodiments, reference may be made to the entity alignment method provided by the embodiments of the present application, which is not repeated here.
Finally, each group of same entities is fused into one entity, so that multiple game knowledge graphs are fused into one game knowledge graph. In practical applications, the attribute information of the same entity from different knowledge graphs is fused in the merged game knowledge graph, effectively integrating the information and making it convenient for users to query.
Referring to FIG. 10A, an interface of a game website after fusing multiple data sources is shown. After a user logs in to the game website and enters a search keyword, such as a game name, in the search box, the server of the game website searches for related games in the fused knowledge graph based on the search keyword entered by the user and feeds them back to the user. The user can also set search conditions, such as the game publishing platform, game platform and release period; the server of the game website then searches the fused knowledge graph for games meeting the search conditions set by the user and feeds them back to the user. The user clicks a game in the game search result field to obtain the attribute information of the game, which is derived from the fused game knowledge graph. Referring to fig. 10A, after clicking the icon corresponding to game A in the game search result field, the user can open the detail interface corresponding to game A, as shown in fig. 10B; on this detail interface the user can view the attribute information of game A, including the game name, game type, game developer, game introduction and the like, and can download the application program corresponding to game A by clicking the game download control.
Because the knowledge graph of the game website has undergone entity alignment processing for the games in each data source, i.e. the same game from different data sources has been fused into one entity, games from different sources are integrated. A user can therefore search the game website for games published through different websites or platforms, the same game from different sources yields only one search result, and the number of games fed back to the user in a game search is greatly reduced.
Referring to FIG. 10C, an interface of a game platform in an applet after fusing multiple data sources is shown. The applet back end can rank the games based on the attribute information of each game in the fused game knowledge graph to obtain various ranking lists; for example, ranking by information such as average daily usage, download volume and score generates lists such as a popularity list, a potential list and an anticipation list for users to view. Referring to fig. 10D, the user may enter a search keyword, such as a game name or game type, in the search box of the applet; the applet back end searches for related games in the fused knowledge graph based on the search keyword entered by the user and displays the search results in the applet. The applet back end operates in the same way as the back end of the game website shown in fig. 10A, and is not described again.
As shown in fig. 11, based on the same inventive concept as the entity alignment method described above, the embodiment of the present application further provides an entity alignment apparatus 110, which includes a feature vector obtaining module 1101 and an alignment module 1102.
The feature vector obtaining module 1101 is configured to obtain, for each entity in any one of the knowledge maps, a feature vector of the entity based on an embedded vector of the entity and an embedded vector of the entity having an association relationship with the entity in any one of the knowledge maps, where the embedded vector of the entity is obtained based on attribute information of the entity;
the alignment module 1102 is configured to determine the same entity in any two knowledge maps based on the feature vectors of the entities.
Optionally, the entity alignment device 110 further includes an embedded vector obtaining module 1103, configured to obtain an embedded vector of the entity by:
acquiring attribute information of an entity from a knowledge graph to which the entity belongs, wherein the attribute information comprises text information describing the attribute of the entity and picture information of the entity;
extracting text feature vectors from text information of an entity;
extracting image feature vectors from the picture information of the entity;
based on the text feature vector and the image feature vector, an embedded vector of the entity is obtained.
Optionally, the embedded vector acquisition module 1103 is specifically configured to:
converting each character in the text information of the entity into a corresponding initial vector;
based on the initial vector of each character, a text feature vector containing context information is obtained.
Optionally, the feature vector acquisition module 1101 is specifically configured to:
inputting the embedded vector matrix and the adjacency matrix of any knowledge graph into a trained propagation layer, wherein the propagation layer comprises at least one graph convolutional neural network layer, the embedded vector matrix of the knowledge graph comprises the embedded vectors of all entities in the knowledge graph, and the adjacency matrix of the knowledge graph is used to represent the association relations among the entities in the knowledge graph;
each graph convolutional layer obtains and outputs an output matrix based on the received input matrix and the adjacency matrix, wherein the input matrix received by the first graph convolutional layer is the embedded vector matrix, the input matrix received by each subsequent graph convolutional layer is the output matrix of the preceding layer, the output matrix of each graph convolutional layer comprises intermediate feature vectors of all entities, and the intermediate feature vector of any entity in an output matrix is obtained based on that entity's vector in the input matrix and the vectors, in the input matrix, of the entities that the adjacency matrix indicates are associated with that entity;
and obtaining the feature vector of each entity in the knowledge graph based on the output matrix of the final graph convolutional layer.
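The following sketch shows one graph convolutional layer of such a propagation layer, using the standard update in which each entity's new vector mixes its own vector with those of its neighbours according to the adjacency matrix; the row normalisation and the ReLU activation are assumptions, since the application only states that neighbour information is aggregated.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph convolutional layer: combine each entity's vector with its neighbours'."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: [N, in_dim] input matrix; adj: [N, N] adjacency matrix (self-loops assumed added).
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        a_hat = adj / deg                                 # row-normalised adjacency
        return torch.relu(a_hat @ self.weight(h))         # transform entity vectors, then average over neighbours
```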
Optionally, a residual network layer is connected after each graph convolutional layer.
Accordingly, the feature vector acquisition module 1101 is further configured to:
each residual network except the last one fuses the input matrix and the output matrix of the preceding graph convolutional layer to obtain the input matrix of the next graph convolutional layer;
obtaining the feature vector of each entity in the knowledge graph based on the output matrix of the last graph convolutional layer specifically comprises:
the last residual network fuses the input matrix and the output matrix of the last graph convolutional layer to obtain the feature vector of each entity in the knowledge graph.
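A sketch of this residual wiring, under the assumption that the fusion is element-wise addition (the application only says the input and output matrices are fused) and that each layer keeps the feature size unchanged so the two matrices can be added:

```python
import torch
import torch.nn as nn

class ResidualGCNPropagation(nn.Module):
    """Stack of graph convolutional layers, each followed by a residual fusion."""
    def __init__(self, layers: nn.ModuleList):
        super().__init__()
        self.layers = layers                      # e.g. several GCNLayer instances of equal width

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            h = h + layer(h, adj)                 # fuse the layer's input with its output
        return h                                  # feature vectors of all entities
```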
Optionally, the propagation layer is trained by:
inputting the embedded vector matrix and the adjacency matrix corresponding to each of the two knowledge graphs in a training sample into the propagation layer, to obtain the feature vectors of the entities in the two knowledge graphs;
determining the same entities in the two knowledge graphs according to the labeling information of the training sample, taking the feature vectors of two entities that are the same entity as a positive sample and the feature vectors of two entities that are not the same entity as a negative sample;
obtaining a loss function value based on the obtained positive and negative samples;
and updating the weight parameters of each graph convolutional layer of the propagation layer based on the loss function value.
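As an illustration of such a training objective, the sketch below uses a margin-based loss that pulls the feature vectors of positive pairs together and pushes those of negative pairs apart; the concrete loss form, the margin value and the optimiser are assumptions, since the application does not fix them.

```python
import torch
import torch.nn.functional as F

def alignment_loss(pos_a: torch.Tensor, pos_b: torch.Tensor,
                   neg_a: torch.Tensor, neg_b: torch.Tensor,
                   margin: float = 1.0) -> torch.Tensor:
    """pos_a/pos_b: feature vectors of matched entity pairs, [P, d];
    neg_a/neg_b: feature vectors of mismatched entity pairs, [Q, d]."""
    pos_dist = F.pairwise_distance(pos_a, pos_b)          # should become small
    neg_dist = F.pairwise_distance(neg_a, neg_b)          # should stay larger than the margin
    return pos_dist.mean() + F.relu(margin - neg_dist).mean()
```

A training step would then run both embedded vector matrices through the propagation layer, build positive and negative pairs from the labeled alignments, and update the graph convolutional weights with the gradient of this loss, for example via torch.optim.Adam.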
The entity alignment apparatus and the entity alignment method provided by the embodiments of the present application are based on the same inventive concept, can achieve the same beneficial effects, and are not described herein again.
Based on the same inventive concept as the above-mentioned entity alignment method, the embodiment of the present application further provides an electronic device, which may specifically be the processing device 101 shown in fig. 1. As shown in fig. 12, the electronic device 120 may include a processor 1201 and a memory 1202.
The processor 1201 may be a general-purpose processor such as a central processing unit (CPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in connection with the embodiments of the present application may be executed directly by a hardware processor, or by a combination of hardware and software modules in the processor.
The memory 1202, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory may include at least one type of storage medium, for example flash memory, hard disk, multimedia card, card memory, random access memory (RAM), static random access memory (SRAM), programmable read-only memory (PROM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), magnetic memory, magnetic disk, or optical disc. The memory may also be, but is not limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 1202 in this embodiment may also be a circuit or any other device capable of implementing a storage function, and is used to store program instructions and/or data.
An embodiment of the present application provides a computer-readable storage medium for the above-described electronic device, the storage medium storing computer program instructions that include a program for executing the above-described entity alignment method.
The computer storage medium described above can be any available medium or data storage device that can be accessed by a computer, including, but not limited to, magnetic storage (e.g., floppy disks, hard disks, magnetic tape, magneto-optical (MO) disks, etc.), optical storage (e.g., CD, DVD, BD, HVD, etc.), and semiconductor storage (e.g., ROM, EPROM, EEPROM, non-volatile memory (NAND flash), solid state disks (SSD)), etc.
The foregoing embodiments are only intended to describe the technical solution of the present application in detail. The descriptions of these embodiments are provided to help understand the method of the embodiments of the present application and should not be construed as limiting the embodiments of the present application. Variations or alternatives that readily occur to those skilled in the art are intended to fall within the scope of the embodiments of the present application.

Claims (10)

1. A method of entity alignment, comprising:
inputting an embedded vector matrix and an adjacency matrix of any knowledge graph into a trained propagation layer, wherein the propagation layer comprises at least one graph convolutional neural network layer, a residual network layer is connected after each graph convolutional layer, the embedded vector matrix of the knowledge graph comprises the embedded vectors of all entities in the knowledge graph, the adjacency matrix of the knowledge graph is used to represent the association relations among the entities in the knowledge graph, and the embedded vector of an entity is obtained based on the attribute information of the entity;
each graph convolutional layer obtains and outputs an output matrix based on the received input matrix and the adjacency matrix, wherein the input matrix received by the first graph convolutional layer is the embedded vector matrix, the input matrix received by each subsequent graph convolutional layer is obtained by the residual network connected after the preceding graph convolutional layer fusing the input matrix and the output matrix of that preceding layer, and the output matrix of each graph convolutional layer comprises intermediate feature vectors of all entities;
fusing, by the last residual network, the input matrix and the output matrix of the last graph convolutional layer to obtain the feature vector of each entity in the knowledge graph, wherein the feature vector of an entity contains not only the attribute information of the entity itself but also the attribute information of the entities associated with it;
and determining the same entities in any two knowledge graphs based on the feature vector of each entity.
2. The method of claim 1, wherein the embedded vector of the entity is obtained by:
acquiring attribute information of an entity from a knowledge graph to which the entity belongs, wherein the attribute information comprises text information describing the attribute of the entity and picture information of the entity;
extracting text feature vectors from text information of an entity;
extracting image feature vectors from the picture information of the entity;
and obtaining an embedded vector of the entity based on the text feature vector and the image feature vector.
3. The method according to claim 2, wherein the extracting text feature vectors from the text information of the entity specifically comprises:
converting each character in the text information of the entity into a corresponding initial vector;
based on the initial vector of each character, a text feature vector containing context information is obtained.
4. The method of claim 1, wherein the propagation layer is trained by:
inputting the embedded vector matrix and the adjacency matrix corresponding to each of the two knowledge graphs in a training sample into the propagation layer, to obtain the feature vectors of the entities in the two knowledge graphs;
determining the same entities in the two knowledge graphs according to the labeling information of the training sample, taking the feature vectors of two entities that are the same entity as a positive sample and the feature vectors of two entities that are not the same entity as a negative sample;
obtaining a loss function value based on the obtained positive and negative samples;
and updating the weight parameters of each graph convolutional layer of the propagation layer based on the loss function value.
5. An entity alignment apparatus, comprising:
a feature vector acquisition module, configured to input an embedded vector matrix and an adjacency matrix of any knowledge graph into a trained propagation layer, wherein the propagation layer comprises at least one graph convolutional neural network layer, a residual network layer is connected after each graph convolutional layer, the embedded vector matrix of the knowledge graph comprises the embedded vectors of all entities in the knowledge graph, the adjacency matrix of the knowledge graph is used to represent the association relations among the entities in the knowledge graph, and the embedded vector of an entity is obtained based on the attribute information of the entity;
wherein each graph convolutional layer obtains and outputs an output matrix based on the received input matrix and the adjacency matrix, the input matrix received by the first graph convolutional layer is the embedded vector matrix, the input matrix received by each subsequent graph convolutional layer is obtained by the residual network connected after the preceding graph convolutional layer fusing the input matrix and the output matrix of that preceding layer, and the output matrix of each graph convolutional layer comprises intermediate feature vectors of all entities;
the last residual network fuses the input matrix and the output matrix of the last graph convolutional layer to obtain the feature vector of each entity in the knowledge graph, wherein the feature vector of an entity contains not only the attribute information of the entity itself but also the attribute information of the entities associated with it;
and an alignment module, configured to determine the same entities in any two knowledge graphs based on the feature vector of each entity.
6. The apparatus of claim 5, further comprising an embedded vector acquisition module configured to obtain the embedded vector of an entity by:
acquiring attribute information of an entity from a knowledge graph to which the entity belongs, wherein the attribute information comprises text information describing the attribute of the entity and picture information of the entity;
extracting text feature vectors from text information of an entity;
extracting image feature vectors from the picture information of the entity;
and obtaining an embedded vector of the entity based on the text feature vector and the image feature vector.
7. The apparatus of claim 6, wherein the embedded vector acquisition module is specifically configured to:
converting each character in the text information of the entity into a corresponding initial vector;
based on the initial vector of each character, a text feature vector containing context information is obtained.
8. The apparatus of claim 5, wherein the propagation layer is trained by:
inputting the embedded vector matrix and the adjacency matrix corresponding to each of the two knowledge graphs in a training sample into the propagation layer, to obtain the feature vectors of the entities in the two knowledge graphs;
determining the same entities in the two knowledge graphs according to the labeling information of the training sample, taking the feature vectors of two entities that are the same entity as a positive sample and the feature vectors of two entities that are not the same entity as a negative sample;
obtaining a loss function value based on the obtained positive and negative samples;
and updating the weight parameters of each graph convolutional layer of the propagation layer based on the loss function value.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of any of claims 1 to 4 when the computer program is executed.
10. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the steps of the method of any of claims 1 to 4.
CN202010349062.9A 2020-04-28 2020-04-28 Entity alignment method, device, electronic equipment and storage medium Active CN111563192B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010349062.9A CN111563192B (en) 2020-04-28 2020-04-28 Entity alignment method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010349062.9A CN111563192B (en) 2020-04-28 2020-04-28 Entity alignment method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111563192A CN111563192A (en) 2020-08-21
CN111563192B true CN111563192B (en) 2023-05-30

Family

ID=72071737

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010349062.9A Active CN111563192B (en) 2020-04-28 2020-04-28 Entity alignment method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111563192B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111930964B (en) * 2020-09-17 2021-02-12 腾讯科技(深圳)有限公司 Content processing method, device, equipment and storage medium
CN112215500B (en) * 2020-10-15 2022-06-28 支付宝(杭州)信息技术有限公司 Account relation identification method and device
CN112199957B (en) * 2020-11-03 2023-12-08 中国人民解放军战略支援部队信息工程大学 Character entity alignment method and system based on attribute and relationship information joint embedding
CN112445876B (en) * 2020-11-25 2023-12-26 中国科学院自动化研究所 Entity alignment method and system for fusing structure, attribute and relationship information
CN112559727B (en) * 2020-12-15 2024-04-09 北京百度网讯科技有限公司 Method, apparatus, device, storage medium, and program for outputting information
CN113032580B (en) * 2021-03-29 2023-07-25 浙江星汉信息技术股份有限公司 Associated file recommendation method and system and electronic equipment
CN116610820B (en) * 2023-07-21 2023-10-20 智慧眼科技股份有限公司 Knowledge graph entity alignment method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017080220A1 (en) * 2015-11-12 2017-05-18 百度在线网络技术(北京)有限公司 Knowledge data processing method and apparatus
CN108416397A (en) * 2018-03-30 2018-08-17 华南理工大学 A kind of Image emotional semantic classification method based on ResNet-GCN networks
CN110502738A (en) * 2018-05-18 2019-11-26 阿里巴巴集团控股有限公司 Chinese name entity recognition method, device, equipment and inquiry system
CN110532353A (en) * 2019-08-27 2019-12-03 海南阿凡题科技有限公司 Text entities matching process, system, device based on deep learning

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108389614B (en) * 2018-03-02 2021-01-19 西安交通大学 Method for constructing medical image map based on image segmentation and convolutional neural network
CN108804544A (en) * 2018-05-17 2018-11-13 深圳市小蛙数据科技有限公司 Internet video display multi-source data fusion method and device
CN109726294A (en) * 2018-12-04 2019-05-07 北京奇艺世纪科技有限公司 A kind of App entity alignment schemes, device and electronic equipment
CN109783582B (en) * 2018-12-04 2023-08-15 平安科技(深圳)有限公司 Knowledge base alignment method, device, computer equipment and storage medium
CN109710923B (en) * 2018-12-06 2020-09-01 浙江大学 Cross-language entity matching method based on cross-media information
CN110472065B (en) * 2019-07-25 2022-03-25 电子科技大学 Cross-language knowledge graph entity alignment method based on GCN twin network
CN110941722B (en) * 2019-10-12 2022-07-01 中国人民解放军国防科技大学 Knowledge graph fusion method based on entity alignment
CN110955780B (en) * 2019-10-12 2022-10-14 中国人民解放军国防科技大学 Entity alignment method for knowledge graph
CN111046186A (en) * 2019-10-30 2020-04-21 平安科技(深圳)有限公司 Entity alignment method, device and equipment of knowledge graph and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017080220A1 (en) * 2015-11-12 2017-05-18 百度在线网络技术(北京)有限公司 Knowledge data processing method and apparatus
CN108416397A (en) * 2018-03-30 2018-08-17 华南理工大学 A kind of Image emotional semantic classification method based on ResNet-GCN networks
CN110502738A (en) * 2018-05-18 2019-11-26 阿里巴巴集团控股有限公司 Chinese name entity recognition method, device, equipment and inquiry system
CN110532353A (en) * 2019-08-27 2019-12-03 海南阿凡题科技有限公司 Text entities matching process, system, device based on deep learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Xin Mao et al. MRAEA: An Efficient and Robust Entity Alignment Approach for Cross-lingual Knowledge Graph. WSDM '20: Proceedings of the 13th International Conference on Web Search and Data Mining, 2020, 420-428. *
Meng Pengbo. A Survey of Entity Alignment Research Based on Graph Neural Networks. Modern Computer, No. 09, 37-40. *
Guan Saiping et al. Research Progress on Knowledge Reasoning for Knowledge Graphs. Journal of Software, Vol. 29, No. 10, 2966-2994. *

Also Published As

Publication number Publication date
CN111563192A (en) 2020-08-21

Similar Documents

Publication Publication Date Title
CN111563192B (en) Entity alignment method, device, electronic equipment and storage medium
CN109783817B (en) Text semantic similarity calculation model based on deep reinforcement learning
CN111931062B (en) Training method and related device of information recommendation model
CN110659723B (en) Data processing method and device based on artificial intelligence, medium and electronic equipment
CN112819023A (en) Sample set acquisition method and device, computer equipment and storage medium
CN111782826A (en) Knowledge graph information processing method, device, equipment and storage medium
CN112464100B (en) Information recommendation model training method, information recommendation method, device and equipment
CN111522886B (en) Information recommendation method, terminal and storage medium
CN113626612A (en) Prediction method and system based on knowledge graph reasoning
CN112380453A (en) Article recommendation method and device, storage medium and equipment
CN113011172A (en) Text processing method and device, computer equipment and storage medium
CN110765348B (en) Hot word recommendation method and device, electronic equipment and storage medium
CN115062779A (en) Event prediction method and device based on dynamic knowledge graph
CN112463914B (en) Entity linking method, device and storage medium for internet service
CN112015890B (en) Method and device for generating movie script abstract
CN111858962B (en) Data processing method, device and computer readable storage medium
CN111191059B (en) Image processing method, device, computer storage medium and electronic equipment
CN111368552B (en) Specific-field-oriented network user group division method and device
CN108921213B (en) Entity classification model training method and device
CN113239143B (en) Power transmission and transformation equipment fault processing method and system fusing power grid fault case base
CN115131058A (en) Account identification method, device, equipment and storage medium
CN114461813A (en) Data pushing method, system and storage medium based on knowledge graph
CN108805290B (en) Entity category determination method and device
CN111897943A (en) Session record searching method and device, electronic equipment and storage medium
CN112214683A (en) Hybrid recommendation model processing method, system and medium based on heterogeneous information network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (ref country code: HK; ref legal event code: DE; ref document number: 40027424)
GR01 Patent grant