CN116150405B - Heterogeneous data processing method for multiple scenes - Google Patents

Heterogeneous data processing method for multiple scenes

Info

Publication number
CN116150405B
CN116150405B (application CN202310416643.3A)
Authority
CN
China
Prior art keywords
entity
text
graph
knowledge
relationship
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310416643.3A
Other languages
Chinese (zh)
Other versions
CN116150405A (en)
Inventor
曹扬
李响
仲恺
谢红韬
汪榕
支婷
祁纲
程建润
周伟
黄铁淳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiji Computer Corp Ltd
CETC Big Data Research Institute Co Ltd
Original Assignee
Taiji Computer Corp Ltd
CETC Big Data Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiji Computer Corp Ltd, CETC Big Data Research Institute Co Ltd filed Critical Taiji Computer Corp Ltd
Priority to CN202310416643.3A priority Critical patent/CN116150405B/en
Publication of CN116150405A publication Critical patent/CN116150405A/en
Application granted granted Critical
Publication of CN116150405B publication Critical patent/CN116150405B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a heterogeneous data processing method for multiple scenes, which comprises the following steps: extracting graph structure information from the neighbor subgraphs of each text entity in a first knowledge graph and a second knowledge graph under different scenes to obtain structural topological features of the text entities; acquiring relation features of the entity relations connecting the text entities from the triples of the first knowledge graph and the second knowledge graph; fusing the structural topological feature of each text entity in the first and second knowledge graphs with the relation features of the entity relations connecting that entity to obtain a first entity representation of each text entity; constructing a first entity alignment probability matrix based on the similarity between the first entity representation of any text entity in the first knowledge graph and the first entity representation of any text entity in the second knowledge graph; and determining aligned entity pairs of the first and second knowledge graphs based on the first entity alignment probability matrix. The invention improves the accuracy of heterogeneous data fusion.

Description

Heterogeneous data processing method for multiple scenes
Technical Field
The invention relates to the technical field of text data processing, in particular to a heterogeneous data processing method for multiple scenes.
Background
With the arrival of the big data age, data scattered across different application scenes can be effectively fused to realize centralized management of data information and to improve the performance of tasks such as data analysis and data retrieval. In fields such as data recommendation and data retrieval, corresponding knowledge graphs are constructed for text data in specific domains or scenes (such as the film domain, the medical domain, etc.), and data fusion is realized by connecting the knowledge graphs of the individual scenes together.
Data from different application scenes exhibit heterogeneity and redundancy: the data differ in structure, expression, and so on, so the knowledge graphs generated for different scenes are strongly heterogeneous, and when the heterogeneous knowledge graphs of multiple scenes are fused, data redundancy and data conflicts easily arise. Therefore, to improve the fusion accuracy of multi-scene heterogeneous data, the text entities in different scenes' knowledge graphs that correspond to the same object can be aligned, thereby reducing data redundancy and data conflicts in the fused data. However, existing knowledge graphs are strongly heterogeneous: text entities may use different expressions for the same meaning or the same expression for different meanings, the network structures of the knowledge graphs differ, and even the languages may differ, so the accuracy of entity alignment is difficult to guarantee.
Disclosure of Invention
The invention provides a heterogeneous data processing method for multiple scenes, which is used for overcoming the defect in the prior art that entity alignment accuracy is poor due to the strong heterogeneity of knowledge graphs and the incompleteness of parts of the knowledge graphs.
The invention provides a heterogeneous data processing method of multiple scenes, which comprises the following steps:
based on the first graph neural network, obtaining structural topological features of the corresponding text entities from neighbor subgraphs of each text entity in a first knowledge graph and a second knowledge graph under different scenes; the first graph neural network is constructed based on a graph convolution neural network;
based on a second graph neural network, acquiring relation features of the entity relations connecting each text entity from the triples of the first knowledge graph and the second knowledge graph; the second graph neural network is constructed based on a TransE model; each triplet comprises two text entities and the entity relation connecting the two text entities;
fusing structural topological features of each text entity in the first knowledge graph and the second knowledge graph and relationship features of entity relationships connecting the corresponding text entities to obtain a first entity representation of each text entity;
constructing a first entity alignment probability matrix based on similarity between the first entity representation of any text entity in the first knowledge graph and the first entity representation of any text entity in the second knowledge graph;
determining an aligned entity pair of the first knowledge graph and the second knowledge graph based on the first entity alignment probability matrix, and fusing data of the first knowledge graph and the second knowledge graph based on the aligned entity pair;
the obtaining structural topological features of the corresponding text entities from neighbor subgraphs of each text entity in the first knowledge graph and the second knowledge graph under different scenes specifically comprises:
and a graph structure information extraction step: for any text entity, determining the association feature transmitted by any neighbor text entity to that text entity at the current network layer based on the graph entity feature of that neighbor text entity at the previous network layer and the graph relation feature, at the previous network layer, of the entity relation between that neighbor text entity and that text entity; fusing the association features transmitted by each neighbor text entity to that text entity at the current network layer to obtain the graph entity feature of that text entity at the current network layer; and updating to obtain the graph relation feature, at the current network layer, of the entity relation between any neighbor text entity and that text entity; the graph entity feature of any text entity at the first network layer and the graph relation feature of any entity relation at the first network layer are obtained by initialization;
an iteration step: repeating the graph structure information extraction step until the graph entity features of each text entity at the last network layer are obtained, and determining the structural topological feature of each text entity based on the graph entity features of that text entity at each network layer;
the obtaining the relationship features of the entity relationship connecting each text entity from each triplet of the first knowledge graph and the second knowledge graph specifically includes:
acquiring, based on the second graph neural network, second entity representations of the head and tail text entities and the initial relation feature of the entity relation in each triplet of the first knowledge graph and the second knowledge graph;
fusing initial relation features of entity relations connected with the same text entity to obtain relation features of the entity relations connected with the same text entity;
the second entity representation of any text entity is a vector output by the second graph neural network that represents the semantic information of that text entity; the sum of the second entity representation of the head text entity and the initial relation feature of the entity relation in any triplet is equal to the second entity representation of the tail text entity; and the second entity representations of the text entities satisfy the constraint that the distance between the second entity representations of matched text entities is smaller than a preset value.
According to the heterogeneous data processing method for multiple scenes provided by the invention, the determining of the association characteristic of any neighboring text entity transmitted to any text entity at the current network layer based on the graph entity characteristic of any neighboring text entity at the previous network layer and the graph relationship characteristic of the entity relationship between any neighboring text entity and any text entity at the previous network layer specifically comprises the following steps:
fusing graph entity characteristics of any neighbor text entity of any text entity at the upper network layer and graph relation characteristics of entity relations between any neighbor text entity and any text entity at the upper network layer to obtain fusion characteristics;
and weighting the fusion characteristic based on the direction of the entity relation between any neighbor text entity and any text entity and the weights corresponding to different directions to obtain the association characteristic of the text entity transmitted to any text entity at the current network layer by the any neighbor text entity.
According to the heterogeneous data processing method for multiple scenes provided by the invention, the updating obtains the graph relationship characteristic of the entity relationship between any neighbor text entity and any text entity at the current network layer, and the graph relationship characteristic comprises the following specific steps:
Based on a corresponding relation transformation matrix of various entity relations at a current network layer, carrying out feature transformation on graph relation features of the entity relation of any neighbor text entity and any text entity at the previous network layer to obtain graph relation features of the entity relation of any neighbor text entity and any text entity at the current network layer; the corresponding relation transformation matrix of the various entity relations in the current network layer is obtained through training.
According to the heterogeneous data processing method for multiple scenes provided by the invention, the structural topological feature of each text entity is determined based on the graph entity feature of each text entity at each network layer, and the method specifically comprises the following steps:
splicing the graph entity characteristics of any text entity in each network layer to obtain graph splicing entity characteristics of any text entity;
and carrying out linear transformation on the graph splicing entity characteristics of any text entity based on the linear transformation matrix to obtain the structural topological characteristics of any text entity.
According to the heterogeneous data processing method for multiple scenes provided by the invention, the determining the aligned entity pair of the first knowledge graph and the second knowledge graph based on the first entity alignment probability matrix specifically comprises the following steps:
after a second entity alignment probability matrix is constructed, fusing the first entity alignment probability matrix and the second entity alignment probability matrix to obtain an entity alignment probability fusion matrix, and determining the aligned entity pairs of the first knowledge graph and the second knowledge graph based on the entity alignment probability fusion matrix;
the second entity alignment probability matrix is constructed based on similarity between a second entity representation of any text entity in the first knowledge-graph and a second entity representation of any text entity in the second knowledge-graph.
According to the heterogeneous data processing method for multiple scenes, the first graph neural network and the second graph neural network are obtained by joint training based on the following steps:
acquiring a seed alignment entity pair of the first knowledge-graph and the second knowledge-graph;
based on a first graph neural network, obtaining sample structure topological features of corresponding seed text entities from neighbor subgraphs of various seed text entities in the seed alignment entity pair;
based on a second graph neural network, acquiring sample relation features of entity relations connecting the seed text entities from triplets containing the seed text entities;
Fusing the sample structure topological feature of the seed text entity and the sample relation feature of the entity relation of the corresponding seed text entity to obtain a sample first entity representation of the seed text entity;
based on differences between sample structural topological features of two seed text entities aligned in a seed alignment entity pair, differences between sample relational features, and differences between sample first entity representations, parameters of the first and second graph neural networks are updated.
According to the heterogeneous data processing method for multiple scenes provided by the invention, the parameters of the first graph neural network and the second graph neural network are updated based on the differences among the topological features of the sample structures, the differences among the sample relation features and the differences among the first entity representations of the two seed text entities aligned in the seed alignment entity pair, and the method specifically comprises the following steps:
determining a first penalty based on differences between sample structural topological features of two seed text entities aligned in each seed alignment entity pair;
determining a second penalty based on differences between sample relationship features of two seed text entities aligned in the respective seed alignment entity pair;
Determining a third penalty based on differences between the sample first entity representations of the two seed text entities aligned in the respective seed alignment entity pair;
determining a fourth loss based on differences between the graph splicing relation feature of each entity relation output by the first graph neural network and the initial relation feature of the corresponding entity relation output by the second graph neural network; the graph splicing relation feature of any entity relation is obtained by splicing the graph relation features of that entity relation at each network layer of the first graph neural network;
updating parameters of the first and second graph neural networks based on the first, second, third, and fourth losses.
According to the heterogeneous data processing method for multiple scenes provided by the invention, the determining the aligned entity pair of the first knowledge graph and the second knowledge graph based on the first entity alignment probability matrix specifically comprises the following steps:
and searching the first entity alignment probability matrix based on a greedy matching algorithm, and determining alignment entity pairs of the first knowledge-graph and the second knowledge-graph.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements a multi-scene heterogeneous data processing method according to any one of the above when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a multi-scenario heterogeneous data processing method as described in any one of the above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements a multi-scenario heterogeneous data processing method as described in any one of the above.
According to the multi-scene heterogeneous data processing method provided by the invention, the structural topological feature of a text entity is obtained by extracting graph structure information from its neighbor subgraph in the corresponding knowledge graph, using the information of other text entities closely related to it and the information of its entity relations to help construct its semantic information, which improves the accuracy of semantic extraction for the text entity. On the other hand, the relation features of the entity relations connecting a text entity can be extracted from the triples; by acquiring and summarizing the semantic information of all entity relations connecting the text entity, the semantic scope of the text entity can be deduced or narrowed, removing part of the noise introduced when constructing the text entity's semantics in the previous step. The structural topological feature of each text entity in the first and second knowledge graphs is fused with the relation features of the entity relations connecting that entity to obtain its first entity representation, yielding richer, more accurate and more complete semantic information for any text entity. This alleviates the inaccurate extraction of semantic information caused by the heterogeneity of the knowledge graphs, makes the aligned entity pairs obtained by entity alignment more accurate, improves the fusion of knowledge graph data, and effectively mitigates data redundancy and data conflicts during knowledge fusion.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a heterogeneous data processing method for multiple scenes;
FIG. 2 is a schematic flow chart of the method for extracting structural topological features;
FIG. 3 is a schematic flow chart of the associated feature extraction method provided by the invention;
FIG. 4 is a schematic flow chart of a method for extracting relational features provided by the invention;
FIG. 5 is a flow chart of the model training method provided by the invention;
fig. 6 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 1 is a flow chart of a heterogeneous data processing method for multiple scenes, as shown in fig. 1, where the method includes:
step 110, based on the first graph neural network, obtaining structural topological features of the corresponding text entities from neighbor subgraphs of each text entity in the first knowledge graph and the second knowledge graph under different scenes.
Specifically, the first knowledge graph and the second knowledge graph are knowledge graphs constructed under different application scenes, for example, knowledge graphs of the medical field constructed by different hospitals. The two knowledge graphs contain entities pointing to the same object, but the entity names of those entities and their associations with other entities may differ across the graphs, so heterogeneity exists between the two. To avoid data redundancy and data conflicts during data fusion, entity alignment needs to be performed on the two knowledge graphs. Any knowledge graph is graph-structured data containing text information, in which each node (i.e. text entity) corresponds to a text expressing a specific meaning in the field/application scene, such as the name of a person, place, or organization, or a proper noun, and each edge (i.e. entity relationship) between nodes corresponds to the relationship name of the entity relationship between the two connected nodes. Thus, the entity alignment operation on knowledge graphs is essentially a text matching operation that matches text entities expressing the same specific meaning across multiple knowledge graphs.
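As a minimal illustration of the data involved, a knowledge graph can be held as a set of triples, with the text entities as its nodes; the entity and relation names below are hypothetical examples, not data from the invention:

```python
# A knowledge graph as a set of (head entity, relation name, tail entity)
# triples; all names here are illustrative.
kg1 = {
    ("Einstein", "created", "relativity"),
    ("Einstein", "born_in", "Ulm"),
}

def entities(kg):
    # The text entities are the nodes of the graph: every head and tail.
    return {h for h, _, t in kg} | {t for h, _, t in kg}
```

Entity alignment then matches elements of `entities(kg1)` against those of a second graph built the same way.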
Compared with common text matching, entity alignment in knowledge graphs has the particularity that the expression forms of text entities differ, including heterogeneous phenomena such as different languages and different synonymous expressions, so it is difficult to perform entity alignment by considering only the semantic information of the text entities themselves (i.e. the semantic information of the entity names). Therefore, the structural information specific to the graph structure of the knowledge graph can be exploited: by acquiring the graph structure information of the neighbor subgraph of a text entity in the corresponding knowledge graph, the information of other closely related text entities and of that entity's relations can help construct the semantic information of the text entity itself, improving the accuracy of semantic extraction. The structural topological feature of each text entity may be obtained from its neighbor subgraphs in the first and second knowledge graphs based on a first graph neural network, e.g. a Graph Convolutional Neural Network (GCN). Here, the neighbor subgraph of any text entity is the subgraph formed by that text entity together with the other text entities that have a one-hop connection with it in the knowledge graph to which it belongs.
By extracting the graph structure information in the neighbor subgraph of a text entity, including the semantic information of other text entities and their association information with the text entity (including entity relationship type (i.e. entity relationship name), association direction (i.e. the direction of edges in the directed graph), etc.), a structural topological feature containing the graph structure information can be constructed and used as one representation of the text entity's semantic information. The structural topological feature is a d-dimensional vector (s1, s2, ..., sd).
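One layer of the neighbor-subgraph aggregation described above can be sketched as follows. The element-wise sum used to fuse an entity feature with a relation feature, and the mean used to aggregate messages, are illustrative assumptions, not the invention's exact operators:

```python
def propagate(entity_feats, relation_feats, edges):
    # One message-passing layer over a neighbor subgraph: each neighbor's
    # previous-layer entity feature is fused (element-wise sum) with the
    # feature of the relation linking it to the target entity, and the
    # resulting messages are averaged into the target's current-layer
    # feature. `edges` maps a target entity to its (neighbor, relation)
    # pairs; features are plain lists of floats.
    new_feats = {}
    for target, neighbours in edges.items():
        messages = [
            [e + r for e, r in zip(entity_feats[nbr], relation_feats[rel])]
            for nbr, rel in neighbours
        ]
        d = len(messages[0])
        new_feats[target] = [sum(m[i] for m in messages) / len(messages)
                             for i in range(d)]
    return new_feats
```

Stacking several such calls, with the relation features updated between layers, mirrors the layer-by-layer extraction step of the method.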
Step 120, based on the second graph neural network, obtaining the relationship features of the entity relationship connecting each text entity from each triplet of the first knowledge graph and the second knowledge graph.
Specifically, in addition to the graph structure information in the neighbor subgraph, the entity relations connecting a text entity can themselves also help determine the semantic information of that text entity to some extent. In particular, in a complex knowledge graph the same text entity may have multiple types of entity relations with many other text entities; by acquiring and summarizing the semantic information of these entity relations, the semantics of the text entity can be deduced, or its semantic scope can be narrowed, removing part of the noise introduced in the previous step when constructing the text entity's semantics. Thus, the relation features of the entity relations connecting each text entity may be obtained from the triples of the first and second knowledge graphs based on another graph neural network, such as the TransE (Translating Embedding) model. A triplet comprises a head text entity, a tail text entity and the relationship name of the entity relation between them, for example the triplet <Einstein, created, relativity>. The relation feature of the entity relations connecting any text entity is a d-dimensional vector (r1, r2, ..., rd) that contains the semantic information of all entity relations connecting that text entity; it can be obtained by separately extracting the semantic information of the relationship names of all entity relations connecting the text entity and then fusing (e.g. splicing) them.
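The TransE assumption h + r ≈ t can be sketched as a distance score, and the per-entity fusion of incident relation features as a simple aggregate. Both functions are simplified illustrations under assumed operators (L2 distance, element-wise mean), not the trained model:

```python
import math

def transe_score(head, relation, tail):
    # TransE models a triple (h, r, t) as h + r ≈ t; the score is the
    # L2 distance between h + r and t, so smaller means more plausible.
    return math.sqrt(sum((h + r - t) ** 2
                         for h, r, t in zip(head, relation, tail)))

def fuse_relation_features(relation_vectors):
    # Fuse the initial features of all relations incident to one text
    # entity; the element-wise mean here is a stand-in for the fusion
    # (e.g. splicing) the method describes.
    n, d = len(relation_vectors), len(relation_vectors[0])
    return [sum(v[i] for v in relation_vectors) / n for i in range(d)]
```

A triple that satisfies the translation exactly scores 0; training pushes true triples toward low scores and corrupted ones toward high scores.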
And step 130, fusing structural topological features of each text entity in the first knowledge graph and the second knowledge graph and relationship features of entity relationships connecting the corresponding text entities to obtain a first entity representation of each text entity.
Specifically, the results obtained in the two preceding steps, which construct the semantic information of the text entity from different angles, are fused to obtain the first entity representation of each text entity, yielding richer, more accurate and more complete semantic information for any text entity and alleviating the inaccurate semantic extraction caused by the heterogeneity of the knowledge graphs. When fusing the structural topological feature of a text entity with the relation feature of the entity relations connecting it, corresponding weights can be set for the two features and a weighted fusion performed to obtain the first entity representation (e1, e2, ..., ed) of the text entity, so as to emphasize the importance of one type of feature. For example, the weights may be set according to the completeness of the knowledge graph: if the knowledge graph is complete, the weight of the structural topological feature may be larger; otherwise, the weight of the relation feature may be larger.
Step 140, constructing a first entity alignment probability matrix based on the similarity between the first entity representation of any text entity in the first knowledge-graph and the first entity representation of any text entity in the second knowledge-graph.
Specifically, the similarity between the first entity representation of any text entity in the first knowledge-graph and the first entity representation of any text entity in the second knowledge-graph may be calculated, for example, by using a cosine similarity calculation manner, so as to construct and obtain the first entity alignment probability matrix. Wherein the rows and columns of the first entity alignment probability matrix respectively correspond to each text entity of the first knowledge-graph and each text entity of the second knowledge-graph, and any matrix element represents a similarity between the first entity representation of the text entity of the corresponding row and the first entity representation of the text entity of the corresponding column, the higher the similarity, the more likely it is that the corresponding two text entities are aligned entities to each other.
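Building the alignment probability matrix from pairwise cosine similarities, as described above, can be sketched as:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def alignment_matrix(entities_kg1, entities_kg2):
    # Rows index the first graph's entities, columns the second graph's;
    # each cell holds the similarity of the two first entity representations.
    return [[cosine(e1, e2) for e2 in entities_kg2] for e1 in entities_kg1]
```

Higher cells mark pairs that are more likely to be aligned entities; cosine similarity is one common choice and other similarity measures would fit the same scheme.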
Step 150, determining the aligned entity pairs of the first knowledge graph and the second knowledge graph based on the first entity alignment probability matrix, and fusing the data of the first knowledge graph and the second knowledge graph based on the aligned entity pairs.
Specifically, a greedy matching algorithm may be used to search the first entity alignment probability matrix and find all text entities that are aligned entities, forming a number of aligned entity pairs. The two text entities in any aligned entity pair come from the first knowledge graph and the second knowledge graph respectively and have the same meaning, for example the same disease, the same person, or the same university. Based on the aligned entity pairs, a text entity in the first knowledge graph can be merged with the text entity in the second knowledge graph that is its aligned entity, thereby realizing data fusion of the two knowledge graphs.
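A greedy search over the probability matrix might look like the sketch below; the threshold is an assumed cut-off for declaring a pair aligned, not a value specified by the patent:

```python
def greedy_match(prob, threshold=0.5):
    # Repeatedly take the highest-probability cell whose row and column
    # are both still unused; stop once the best remaining score falls
    # below the threshold. Returns (row, col) aligned entity pairs.
    pairs, used_rows, used_cols = [], set(), set()
    cells = [(prob[i][j], i, j)
             for i in range(len(prob)) for j in range(len(prob[0]))]
    for score, i, j in sorted(cells, reverse=True):
        if score < threshold:
            break
        if i not in used_rows and j not in used_cols:
            pairs.append((i, j))
            used_rows.add(i)
            used_cols.add(j)
    return pairs
```

Greedy matching is fast and one-to-one; an optimal assignment (e.g. the Hungarian algorithm) could be substituted at higher cost.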
According to the method provided by the embodiment of the invention, on one hand, the graph structure information of the neighbor subgraph of a text entity in its knowledge graph is extracted, and the semantic information of the text entity is constructed from the information of other text entities closely related to it and from its entity relationship information, yielding the structural topological feature of the text entity and improving the accuracy of text entity semantic extraction. On the other hand, the relationship features of the entity relationships connected to a text entity can be extracted from the triples; by acquiring and summarizing the semantic information of all entity relationships connected to the text entity, the semantic scope of the text entity can be inferred or narrowed, removing part of the noise introduced when constructing the text entity semantics in the previous step. Fusing the structural topological feature of each text entity in the first and second knowledge graphs with the relationship features of the entity relationships connected to it yields the first entity representation of each text entity, capturing richer, more accurate and more complete semantic information for any text entity. This alleviates the inaccurate semantic extraction caused by the heterogeneity of the knowledge graphs, makes the aligned entity pairs obtained by entity alignment more accurate, improves the fusion of knowledge graph data, and effectively resolves data redundancy and data conflict in the knowledge fusion process.
Based on the above embodiment, step 110 specifically includes:
A graph structure information extraction step: for any text entity, determining association characteristics of any neighbor text entity transmitted to the any text entity at the current network layer based on graph entity characteristics of any neighbor text entity at the upper network layer of the any text entity and graph relationship characteristics of entity relationship between the any neighbor text entity and the any text entity at the upper network layer, fusing the association characteristics of each neighbor text entity transmitted to the any text entity at the current network layer to obtain the graph entity characteristics of the any text entity at the current network layer, and updating to obtain the graph relationship characteristics of entity relationship between the any neighbor text entity and the any text entity at the current network layer; the graph entity characteristics of any text entity at the first network layer and the graph relationship characteristics of any entity relationship at the first network layer are obtained by initialization;
An iterative step: repeating the graph structure information extraction step until the graph entity characteristics of each text entity at the last network layer are obtained, and determining the structural topological characteristics of each text entity based on the graph entity characteristics of each text entity at each network layer.
Specifically, the first graph neural network is a multi-layer network structure, and each network layer performs a feature extraction operation on all text entities in a given knowledge graph (the first knowledge graph or the second knowledge graph). As shown in fig. 2, assuming that the first graph neural network has K network layers in total, for the first network layer the graph entity characteristics of any text entity at that layer and the graph relationship characteristics of any entity relationship at that layer are randomly initialized. The graph entity characteristics of any text entity at any layer contain the graph structure information of the text entity in its neighbor subgraph as extracted by that network layer, and the graph relationship characteristics of any entity relationship at any layer contain the semantic information of the relation name of the entity relationship as extracted by that network layer.
In the i-th network layer, for any text entity, the association feature that a neighbor text entity transfers to the text entity at the current network layer may be determined based on the previous-layer graph entity feature of that neighbor text entity (i.e., a text entity other than the text entity itself in its neighbor subgraph) and the previous-layer graph relationship feature of the entity relationship connecting the neighbor text entity and the text entity. Different neighbor text entities are connected to the text entity by different types of entity relationships, so the semantic information that different neighbor text entities provide to the text entity should differ. In order to improve the accuracy of text entity semantic extraction, the original feature extraction scheme of the graph neural network can be improved: the semantic information of entity relationships and of neighbor text entities is fused, and each entity relationship is used to reinforce the parts of a neighbor's semantic information most relevant to that relationship, so as to distinguish the contributions that different neighbor text entities make to the construction of the text entity's semantics. Therefore, the association feature obtained by fusing the previous-layer graph entity feature of a neighbor text entity with the previous-layer graph relationship feature of its connecting entity relationship contains the semantic information of that neighbor that is most relevant to the entity relationship between the neighbor and the text entity.
Specifically, the manner of fusing the graph entity feature of any neighbor text entity at the upper network layer with the graph relationship feature of the entity relationship between that neighbor text entity and the text entity at the upper network layer is shown in fig. 3. Determining the association feature that any neighbor text entity transmits to the text entity at the current network layer, based on the graph entity feature of the neighbor text entity at the upper network layer and the graph relationship feature of their entity relationship at the upper network layer, specifically includes:
step 310, merging graph entity characteristics of any neighbor text entity of any text entity at the upper network layer and graph relation characteristics of entity relations between any neighbor text entity and any text entity at the upper network layer to obtain merged characteristics;
step 320, weighting the fusion feature based on the direction of the entity relationship between the any neighboring text entity and the any text entity and the weights corresponding to different directions, so as to obtain the association feature of the any neighboring text entity transmitted to the any text entity at the current network layer.
In particular, for knowledge graphs constructed as directed graphs, the entity relationships between text entities are directional, and for entity relationships in different directions the connected text nodes may play different roles in locating the semantics of the other connected text entity. Therefore, when determining the association feature that any neighbor text entity transfers to the text entity at the current network layer, the graph entity feature of the neighbor text entity at the previous network layer and the graph relationship feature of their connecting entity relationship at the previous network layer can first be fused to obtain the fusion feature. For example, an attention mechanism can be used to fuse the neighbor's previous-layer graph entity feature with the previous-layer graph relationship feature of the corresponding entity relationship, so as to reinforce the semantic information of the neighbor that is most relevant to the entity relationship. The graph entity feature of the neighbor text entity at the previous network layer may be divided into a plurality of graph entity sub-features whose vector dimension is identical to that of the previous-layer graph relationship feature of the corresponding entity relationship, and a vector dot product is computed between each graph entity sub-feature and that graph relationship feature, obtaining the vector inner product corresponding to each graph entity sub-feature.
And inputting the vector inner products corresponding to the sub-features of each graph entity into a softmax function for calculation to obtain the attention weights corresponding to the sub-features of each graph entity, multiplying the sub-features of each graph entity by the attention weights corresponding to the sub-features of each graph entity, and then splicing the multiplied results to obtain the attention features of the graph entity features of the neighbor text entity on the upper network layer. The more relevant any graph entity sub-feature is in relation with the corresponding entity, the larger the attention weight is, and the larger the proportion of semantic information corresponding to the graph entity sub-feature in the whole attention feature is, so that the aim of strengthening the semantic information corresponding to the neighbor text entity and more relevant to the entity relation is fulfilled. And splicing the attention characteristic of the graph entity characteristic of the neighbor text entity at the upper network layer with the graph relationship characteristic of the corresponding entity relationship at the upper network layer to obtain a fusion characteristic.
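The sub-feature attention fusion just described can be sketched as follows. This is a simplified illustration; the function name and the use of plain numpy are assumptions.

```python
import numpy as np

def fuse_entity_with_relation(ent_feat: np.ndarray, rel_feat: np.ndarray) -> np.ndarray:
    """ent_feat: neighbor entity feature at the previous layer, length m*d;
    rel_feat: graph relationship feature at the previous layer, length d.
    Split ent_feat into m sub-features of dimension d, weight each by the
    softmax of its inner product with rel_feat, splice the weighted
    sub-features back together, then append rel_feat to give the fusion
    feature."""
    d = rel_feat.shape[0]
    subs = ent_feat.reshape(-1, d)             # m graph entity sub-features
    scores = subs @ rel_feat                   # vector inner products
    w = np.exp(scores - scores.max())
    w = w / w.sum()                            # attention weights via softmax
    attended = (subs * w[:, None]).reshape(-1) # weighted sub-features, spliced
    return np.concatenate([attended, rel_feat])
```

Sub-features more relevant to the entity relationship receive larger attention weights and therefore dominate the attention part of the fusion feature.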
Then, the fusion feature is weighted according to the direction of the entity relationship between the neighbor text entity and the text entity and the predetermined weights corresponding to different directions, to obtain the association feature that the neighbor text entity transfers to the text entity at the current network layer, thereby improving the semantic expression capability of the association feature. The weights corresponding to different directions may be set according to the application scene or field of the knowledge graph, or may be learned while training the first graph neural network, which is not particularly limited in the embodiment of the present invention.
The association features that each neighbor text entity of the text entity transfers to it at the current network layer are then fused to obtain the graph entity feature of the text entity at the current network layer. Here, the current network layer may fuse these association features using a linear transformation layer (e.g., with a tanh activation function). The above operation is repeated within the network layer until the current-layer graph entity features of all text entities of the knowledge graph input to the model are obtained. In addition, the graph relationship features of the entity relationships need to be updated at the current network layer to obtain their current-layer graph relationship features, so that the semantic expression capability of the graph relationship features becomes progressively stronger.
Each network layer of the first graph neural network iteratively executes the above steps until the graph entity features of each text entity output by the last network layer are obtained. The structural topological feature of each text entity is then obtained by fusing its graph entity features from all network layers.
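One extraction layer of this scheme can be sketched as below. This is a deliberately simplified illustration: the attention fusion is replaced by a plain sum, and the direction weights, function name and shapes are assumptions.

```python
import numpy as np

def extract_layer(ent_prev, rel_prev, triples, w_fwd=1.0, w_bwd=0.5):
    """One graph structure information extraction step. ent_prev: (n, d)
    graph entity features at layer k-1; rel_prev: (num_rels, d) graph
    relationship features at layer k-1; triples: (head, rel, tail) index
    triples. w_fwd / w_bwd reflect that a directed edge contributes
    differently to its two endpoints."""
    agg = np.zeros_like(ent_prev)
    for h, r, t in triples:
        agg[t] += w_fwd * (ent_prev[h] + rel_prev[r])  # association feature h -> t
        agg[h] += w_bwd * (ent_prev[t] + rel_prev[r])  # reverse direction, reweighted
    return np.tanh(agg)                                # fuse with a tanh activation
```

Running K such layers and keeping each layer's output yields the per-layer graph entity features that are later spliced into the structural topology feature.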
Based on any one of the above embodiments, the updating obtains a graph relationship feature of the entity relationship between the any neighboring text entity and the any text entity at the current network layer, specifically including:
based on a corresponding relation transformation matrix of various entity relations at a current network layer, carrying out feature transformation on graph relation features of the entity relation of any neighbor text entity and any text entity at the previous network layer to obtain graph relation features of the entity relation of any neighbor text entity and any text entity at the current network layer; the corresponding relation transformation matrix of the various entity relations in the current network layer is obtained through training.
Specifically, in the current network layer, the graph relationship characteristics of the entity relationship in the previous network layer can be updated based on the relationship transformation matrix of the layer, so as to obtain the graph relationship characteristics of the entity relationship in the current network layer. The relation transformation matrix is learned in the training process of the first graph neural network, the relation transformation matrix of each network layer obtained through network training can extract semantic information from low-level to high-level and abstract from entity relations, and graph relation characteristics of each entity relation obtained through transformation in each network layer can be guaranteed to be in the same vector space.
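The per-layer relation update can be sketched as a single shared linear map (names and shapes are illustrative assumptions):

```python
import numpy as np

def update_relations(rels_prev: np.ndarray, w_layer: np.ndarray) -> np.ndarray:
    """rels_prev: (num_rels, d_in) graph relationship features at the
    previous layer; w_layer: (d_out, d_in) relation transformation matrix
    learned for the current layer. Using one shared matrix per layer
    guarantees all transformed relation features live in the same vector
    space."""
    return rels_prev @ w_layer.T
```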
Based on any one of the foregoing embodiments, the determining structural topology features of each text entity based on the graph entity features of each text entity at each network layer specifically includes:
splicing the graph entity characteristics of any text entity in each network layer to obtain graph splicing entity characteristics of any text entity;
and carrying out linear transformation on the graph splicing entity characteristics of any text entity based on the linear transformation matrix to obtain the structural topological characteristics of any text entity.
Specifically, splicing the graph entity characteristics of any text entity in each network layer to obtain the graph splicing entity characteristics of the text entity. The graph entity features of the text entity at each network layer can be connected end to end, so that graph splicing entity features of the text entity are spliced. And then, carrying out linear transformation on the graph splicing entity characteristics of the text entity based on the linear transformation matrix to realize conversion of vector space, thereby obtaining the structural topological characteristics of the text entity. Here, matrix multiplication calculation may be performed on the graph splicing entity feature of the text entity and the linear transformation matrix, to obtain the structural topology feature of the text entity. The linear transformation matrix may also be learned during the training process of the first graph neural network, that is, the parameters of the first graph neural network are updated by using a neural network training algorithm (for example, a gradient descent algorithm), and the matrix values of the linear transformation matrix are also updated at the same time, and the training manner of the first graph neural network will be described in detail later.
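The splicing-plus-linear-transformation step can be sketched as follows (function name and shapes are assumptions for illustration):

```python
import numpy as np

def structural_topology_features(per_layer, w_lin: np.ndarray) -> np.ndarray:
    """per_layer: list of K arrays of shape (n, d) — the graph entity
    features of all n text entities at each of the K network layers.
    Splice them end to end into (n, K*d), then apply the learned linear
    transformation matrix w_lin of shape (K*d, d_out)."""
    stitched = np.concatenate(per_layer, axis=1)  # graph splicing entity features
    return stitched @ w_lin                        # change of vector space
```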
Based on any of the above embodiments, as shown in fig. 4, step 120 specifically includes:
step 121, obtaining a second entity representation of the head-tail text entity and an initial relationship feature of the entity relationship in each triplet of the first knowledge graph and the second knowledge graph based on a second graph neural network;
step 122, fusing the initial relationship features of the entity relationship connected with the same text entity to obtain the relationship features of the entity relationship connected with the same text entity;
wherein, for any triplet, the vector sum of the second entity representation of the head text entity and the initial relationship feature of the entity relationship is equal to the second entity representation of the tail text entity, and the second entity representations of the text entities satisfy the constraint that the distance between the second entity representations of matched text entities is smaller than a preset value.
Specifically, the second graph neural network is built based on a TransE model, and thereby has the capability of extracting entity embedding representations (entity embedding) and entity relationship embedding representations (relation embedding). The first knowledge graph and the second knowledge graph may be input into the second graph neural network, and the second entity representation of the head and tail text entities and the initial relationship feature of the entity relationship in each triplet of the two knowledge graphs are obtained based on the second graph neural network. The head text entity and the tail text entity are text entities in the corresponding knowledge graph; the second entity representation of any text entity is a d-dimensional vector (e1′, e2′, …, ed′) output by the second graph neural network after the first and second knowledge graphs are input into it, representing the semantic information of the text entity extracted by the second graph neural network. The initial relationship feature of the entity relationship connecting the head and tail text entities contains the semantic information of that entity relationship as extracted by the second graph neural network.
Here, the first graph neural network (constructed based on GCN) and the second graph neural network (constructed based on TransE) have different network structures, and graph neural networks with different structures extract semantics in different manners, so the semantic information of text entities and entity relationships can be extracted from different angles, realizing complementation between the models.
For the second graph neural network, since it is built based on the TransE model, a constraint is added during its training: the vector sum of the second entity representation of the head text entity and the initial relationship feature of the entity relationship in any sample triplet is equal to the second entity representation of the tail text entity, so that after training, the second entity representations of the head and tail text entities and the initial relationship features of the entity relationships in each triplet of the first and second knowledge graphs extracted by the second graph neural network also satisfy this constraint. In addition, a further constraint applies during the training of the second graph neural network: the distance between the second entity representations of seed text entities that are alignment entities of each other is smaller than a preset value. Therefore, after training, the second entity representations of the head and tail text entities in each triplet of the first and second knowledge graphs extracted by the second graph neural network also satisfy this constraint.
Under the guidance of the two constraints, the second entity representation of each text entity and the initial relation feature of the entity relation extracted by the second graph neural network have higher semantic expression capability and contain accurate semantic information, and then the relation feature of the entity relation connected with the same text entity is obtained based on the fusion result of the initial relation feature of the entity relation connected with the same text entity, so that the relation feature can be effectively complemented with the first graph neural network for subsequent entity alignment.
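The TransE-style constraint and the per-entity relation fusion can be sketched as below. The mean as fusion operator is an assumption for illustration; the patent only specifies that incident relation features are fused.

```python
import numpy as np

def transe_residual(head: np.ndarray, rel: np.ndarray, tail: np.ndarray) -> float:
    """TransE-style constraint: head + relation should approximate tail,
    so training drives this residual norm toward zero."""
    return float(np.linalg.norm(head + rel - tail))

def entity_relation_feature(incident_rel_feats: np.ndarray) -> np.ndarray:
    """Fuse the initial relationship features of all entity relationships
    incident to one text entity (rows of incident_rel_feats); a simple
    mean is assumed here as the fusion operator."""
    return incident_rel_feats.mean(axis=0)
```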
Based on any one of the foregoing embodiments, determining an aligned entity pair of the first knowledge-graph and the second knowledge-graph based on the first entity alignment probability matrix specifically includes:
after a second entity alignment probability matrix is constructed, fusing the first entity alignment probability matrix and the second entity alignment probability matrix to obtain an entity alignment probability fusion matrix, and determining alignment entity pairs of the first knowledge graph and the second knowledge graph based on the entity alignment probability fusion matrix;
the second entity alignment probability matrix is constructed based on similarity between a second entity representation of any text entity in the first knowledge-graph and a second entity representation of any text entity in the second knowledge-graph.
Specifically, in order to enhance the complementation between the first graph neural network and the second graph neural network, and considering that the second entity representation of each text entity extracted by the second graph neural network also has high semantic expression capability and contains accurate semantic information, the similarity between the second entity representation of any text entity in the first knowledge graph and the second entity representation of any text entity in the second knowledge graph can be calculated with a vector similarity measure (such as the vector dot product), thereby constructing the second entity alignment probability matrix, which complements the first entity alignment probability matrix and improves the accuracy of entity alignment. Similar to the first entity alignment probability matrix, the rows and columns of the second entity alignment probability matrix correspond respectively to each text entity of the first knowledge graph and each text entity of the second knowledge graph, and any matrix element represents the similarity between the second entity representations of the text entities of the corresponding row and column; the higher the similarity, the more likely the two text entities are aligned entities of each other.
After the second entity alignment probability matrix is constructed, the first entity alignment probability matrix and the second entity alignment probability matrix are fused to obtain the entity alignment probability fusion matrix. The two matrices can be added, or added with weights, where the weights used in weighted addition can be determined from the test performance of the two graph neural networks. A matching search is then carried out on the entity alignment probability fusion matrix to determine the aligned entity pairs of the first knowledge graph and the second knowledge graph.
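The weighted fusion of the two probability matrices reduces to a convex combination; a minimal sketch (alpha is an assumed tuning parameter):

```python
import numpy as np

def fuse_alignment_matrices(m1: np.ndarray, m2: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Weighted addition of the first and second entity alignment
    probability matrices; alpha may be tuned from the test performance of
    the two graph neural networks (alpha = 0.5 reduces to a plain sum up
    to scale)."""
    return alpha * m1 + (1.0 - alpha) * m2
```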
Based on any of the above embodiments, as shown in fig. 5, the first graph neural network and the second graph neural network are obtained by performing joint training based on the following steps:
step 510, obtaining a seed alignment entity pair of the first knowledge-graph and the second knowledge-graph;
step 520, based on the first graph neural network, obtaining a sample structure topological feature of a corresponding seed text entity from neighbor subgraphs of each seed text entity in the seed alignment entity pair;
step 530, based on the second graph neural network, obtaining a sample relationship feature of the entity relationship connecting the seed text entities from the triplets containing the seed text entities;
step 540, merging the sample structure topological feature of the seed text entity and the sample relation feature of the entity relation of the corresponding seed text entity to obtain a sample first entity representation of the seed text entity;
step 550, updating parameters of the first and second graph neural networks based on differences between the topological features of the sample structures, differences between the relational features of the samples, and differences between the first entity representations of the samples of the two seed text entities aligned in the pair of seed alignment entities.
Specifically, by jointly training the first graph neural network and the second graph neural network, performance of the two networks can be improved by using a multi-task learning mode. Wherein the two networks can be trained in a supervised learning manner. Thus, a pair of seed alignment entities of the first and second knowledge-maps may be first acquired prior to training to iteratively train the two networks. The seed alignment entity pair includes two seed text entities manually labeled as alignment entities each other, one from the first knowledge-graph and one from the second knowledge-graph.
Then, on one hand, the first knowledge graph and the second knowledge graph are respectively input into a first graph neural network, and based on the first graph neural network, sample structure topological features of corresponding seed text entities are obtained from neighbor subgraphs of various seed text entities in various seed alignment entity pairs. On the other hand, the first knowledge graph and the second knowledge graph are respectively input into a second graph neural network, and based on the second graph neural network, sample relation features of entity relations connecting the seed text entities are obtained from triples containing the seed text entities. And fusing the sample structure topological characteristic of the seed text entity and the sample relation characteristic of the entity relation connected with the corresponding seed text entity to obtain a sample first entity representation of the seed text entity. The acquisition manners of the structural topological feature, the relation feature and the first entity representation of the sample are the same as those of the structural topological feature, the relation feature and the first entity representation given in the above embodiment, and are not repeated here.
Based on the differences between the sample structural topological features, the differences between the sample relationship features and the differences between the sample first entity representations, the parameters of the first and second graph neural networks are updated, and the above steps are repeated until the two networks converge.
Based on any of the above embodiments, step 550 specifically includes:
determining a first penalty based on differences between sample structural topological features of two seed text entities aligned in each seed alignment entity pair;
determining a second penalty based on differences between sample relationship features of two seed text entities aligned in the respective seed alignment entity pair;
determining a third penalty based on differences between the sample first entity representations of the two seed text entities aligned in the respective seed alignment entity pair;
determining a fourth loss based on differences between graph stitching relationship features of each entity relationship output by the first graph neural network and initial relationship features of corresponding entity relationships output by the second graph neural network; the graph splicing relationship feature of any entity relationship is obtained by splicing graph relationship features of any entity relationship at each network layer of the first graph neural network;
Updating parameters of the first and second graph neural networks based on the first, second, third, and fourth losses.
Specifically, at each iteration of training, a first penalty is determined based on differences between sample structural topological features of two seed text entities that are alignment entities to each other. The first penalty may be used to constrain the sample structure topology features of two seed text entities that are aligned entities to each other so that they come closer together. The vector distance (such as manhattan distance or chebyshev distance) between the topological features of the sample structures of the two seed text entities aligned in the seed alignment entity pair can be calculated, and the average value of the vector distances of each seed alignment entity pair is calculated to obtain the first loss. A second penalty is determined based on differences between sample relational features of two seed text entities that are aligned entities to each other. The second penalty may be used to constrain the sample relationship features of two seed text entities that are aligned entities to each other to be progressively closer together. The vector distance (such as manhattan distance or chebyshev distance) between the sample relationship features of two seed text entities aligned in a seed alignment entity pair may be calculated, and an average of the vector distances of each seed alignment entity pair may be calculated to obtain the second loss. A third penalty is determined based on the difference between the sample first entity representations of the two seed text entities that are aligned entities to each other. A third penalty may be used to constrain the sample first entity representations of two seed text entities that are aligned entities to each other to be progressively closer together. 
Wherein a vector distance (e.g., manhattan distance or chebyshev distance) between the first entity representations of the samples of the two seed text entities aligned in the seed alignment entity pair may be calculated, and an average of the vector distances for each seed alignment entity pair may be taken to obtain a third loss.
In addition, in order to enhance the constraint between the first graph neural network and the second graph neural network, thereby improving the accuracy of the graph relationship features and the initial relationship features of the entity relationship in the first graph neural network and the second graph neural network, and determining the fourth loss based on the difference between the graph splicing relationship features of the entity relationship output by the first graph neural network and the initial relationship features of the corresponding entity relationship output by the second graph neural network. The graph splicing relationship feature of any entity relationship is obtained by splicing graph relationship features of the entity relationship in each network layer of the first graph neural network. And calculating vector distances (such as Manhattan distance or Chebyshev distance) between graph stitching relation features of any entity relation and initial relation features of corresponding entity relation output by the second graph neural network, and calculating an average value based on the vector distances corresponding to the entity relations to obtain fourth loss. The fourth loss can be used for constraining the graph splicing relation features and the initial relation features of the same entity relation to gradually approach each other, so that the two networks maintain consistency when extracting the semantic information of the same entity relation, and the capability of extracting the semantic information of the entity relation is improved. Subsequently, parameters of the first and second graph neural networks are updated based on the sum of the first, second, third, and fourth losses.
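The four losses described above are all averages of per-pair vector distances, so they can share one helper. A minimal sketch, assuming Manhattan distance by default and a plain sum of the four terms:

```python
import numpy as np

def mean_pair_distance(feats_a: np.ndarray, feats_b: np.ndarray,
                       metric: str = "manhattan") -> float:
    """Average vector distance over seed alignment pairs: row i of feats_a
    and row i of feats_b belong to the two entities of pair i."""
    diff = np.abs(feats_a - feats_b)
    per_pair = diff.sum(axis=1) if metric == "manhattan" else diff.max(axis=1)
    return float(per_pair.mean())

def total_loss(struct_a, struct_b, rel_a, rel_b, rep_a, rep_b, stitch, init):
    """Sum of the four losses: structural topology (first), relationship
    feature (second), first entity representation (third), and
    cross-network relation consistency (fourth)."""
    return (mean_pair_distance(struct_a, struct_b)
            + mean_pair_distance(rel_a, rel_b)
            + mean_pair_distance(rep_a, rep_b)
            + mean_pair_distance(stitch, init))
```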
Fig. 6 is a schematic structural diagram of an electronic device according to the present invention. As shown in Fig. 6, the electronic device may include: a processor 610, a memory 620, a communication interface (Communications Interface) 630, and a communication bus 640, wherein the processor 610, the memory 620, and the communication interface 630 communicate with each other via the communication bus 640. The processor 610 may invoke logic instructions in the memory 620 to perform a multi-scenario heterogeneous data processing method comprising: based on a first graph neural network, obtaining structural topological features of the corresponding text entities from neighbor subgraphs of each text entity in a first knowledge graph and a second knowledge graph under different scenes, the first graph neural network being constructed based on a graph convolutional neural network; based on a second graph neural network, acquiring relationship features of the entity relationships connecting the text entities from the triplets of the first knowledge graph and the second knowledge graph, the second graph neural network being constructed based on a TransE model, where each triplet comprises two text entities and an entity relationship connecting the two text entities; fusing the structural topological features of each text entity in the first knowledge graph and the second knowledge graph with the relationship features of the entity relationships connecting the corresponding text entity, to obtain a first entity representation of each text entity; constructing a first entity alignment probability matrix based on the similarity between the first entity representation of any text entity in the first knowledge graph and the first entity representation of any text entity in the second knowledge graph; determining aligned entity pairs of the first knowledge graph and the second knowledge graph based on the first entity alignment probability matrix, and fusing the data of the first knowledge graph and the second knowledge graph based on the aligned entity pairs. Obtaining the structural topological features of the corresponding text entities from the neighbor subgraphs of each text entity in the first knowledge graph and the second knowledge graph under different scenes specifically comprises a graph structure information extraction step: for any text entity, determining the association feature transferred by any neighbor text entity to that text entity at the current network layer, based on the graph entity feature of the neighbor text entity at the previous network layer and the graph relationship feature, at the previous network layer, of the entity relationship between the neighbor text entity and that text entity; fusing the association features transferred by each neighbor text entity to that text entity at the current network layer, to obtain the graph entity feature of that text entity at the current network layer; and updating to obtain the graph relationship feature, at the current network layer, of the entity relationship between the neighbor text entity and that text entity; the graph entity feature of any text entity at the first network layer and the graph relationship feature of any entity relationship at the first network layer are obtained by initialization; and an iterative step: repeating the graph structure information extraction step until the graph entity features of each text entity at the last network layer are obtained, and determining the structural topological features of each text entity based on the graph entity features of that text entity at each network layer. Acquiring the relationship features of the entity relationships connecting each text entity from the triplets of the first knowledge graph and the second knowledge graph specifically comprises: acquiring, based on the second graph neural network, second entity representations of the head and tail text entities and initial relationship features of the entity relationships in each triplet of the first knowledge graph and the second knowledge graph; and fusing the initial relationship features of the entity relationships connected to the same text entity, to obtain the relationship features of the entity relationships connected to that text entity. The second entity representation of any text entity is a vector, output by the second graph neural network, that represents the semantic information of that text entity; the vector sum of the second entity representation of the head text entity and the initial relationship feature of the entity relationship in any triplet is equal to the second entity representation of the tail text entity; and the second entity representations of the text entities satisfy the constraint that the distance between the second entity representations of matched text entities is smaller than a preset value.
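The graph structure information extraction step recapped above can be sketched as a single message-passing layer. The NumPy sketch below is an assumption-laden illustration: summing a neighbor's entity feature with the edge's relation feature as the association feature, mean-fusion of incoming messages, and a single shared relation transformation matrix are simplifications of the patented formulation.

```python
import numpy as np

def graph_layer(ent_feats, rel_feats, triples, w_rel):
    """One graph structure information extraction step (sketch).

    ent_feats: (num_entities, d) graph entity features at the previous layer
    rel_feats: (num_relations, d) graph relationship features at the previous layer
    triples:   list of (head, relation, tail) index triples
    w_rel:     (d, d) relation transformation matrix for the current layer
    """
    agg = np.zeros_like(ent_feats)
    count = np.zeros(ent_feats.shape[0])
    for h, r, t in triples:
        # Association feature a neighbor transfers along an edge: here simply
        # the sum of its entity feature and the edge's relation feature.
        agg[t] += ent_feats[h] + rel_feats[r]
        count[t] += 1
    count = np.maximum(count, 1)              # entities without neighbors keep zeros
    new_ent = agg / count[:, None]            # fuse incoming messages by averaging
    new_rel = rel_feats @ w_rel               # update relation features by transformation
    return new_ent, new_rel

rng = np.random.default_rng(1)
ent = rng.normal(size=(4, 3))                 # 4 text entities, dimension 3
rel = rng.normal(size=(2, 3))                 # 2 entity relations
triples = [(0, 0, 1), (2, 1, 1), (3, 0, 2)]   # toy knowledge-graph edges
w = np.eye(3)                                 # identity stands in for a trained matrix

# Two layers; the structural topological feature concatenates per-layer entity features.
e1, r1 = graph_layer(ent, rel, triples, w)
e2, r2 = graph_layer(e1, r1, triples, w)
structural_topology = np.concatenate([e1, e2], axis=1)   # shape (4, 6)
```

The description additionally applies a learned linear transformation to the concatenated features; that is omitted here to keep the sketch short.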
Further, the logic instructions in the memory 620 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product, comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the multi-scenario heterogeneous data processing method provided by the above methods, the method comprising: based on a first graph neural network, obtaining structural topological features of the corresponding text entities from neighbor subgraphs of each text entity in a first knowledge graph and a second knowledge graph under different scenes, the first graph neural network being constructed based on a graph convolutional neural network; based on a second graph neural network, acquiring relationship features of the entity relationships connecting the text entities from the triplets of the first knowledge graph and the second knowledge graph, the second graph neural network being constructed based on a TransE model, where each triplet comprises two text entities and an entity relationship connecting the two text entities; fusing the structural topological features of each text entity in the first knowledge graph and the second knowledge graph with the relationship features of the entity relationships connecting the corresponding text entity, to obtain a first entity representation of each text entity; constructing a first entity alignment probability matrix based on the similarity between the first entity representation of any text entity in the first knowledge graph and the first entity representation of any text entity in the second knowledge graph; determining aligned entity pairs of the first knowledge graph and the second knowledge graph based on the first entity alignment probability matrix, and fusing the data of the first knowledge graph and the second knowledge graph based on the aligned entity pairs. Obtaining the structural topological features of the corresponding text entities from the neighbor subgraphs of each text entity in the first knowledge graph and the second knowledge graph under different scenes specifically comprises a graph structure information extraction step: for any text entity, determining the association feature transferred by any neighbor text entity to that text entity at the current network layer, based on the graph entity feature of the neighbor text entity at the previous network layer and the graph relationship feature, at the previous network layer, of the entity relationship between the neighbor text entity and that text entity; fusing the association features transferred by each neighbor text entity to that text entity at the current network layer, to obtain the graph entity feature of that text entity at the current network layer; and updating to obtain the graph relationship feature, at the current network layer, of the entity relationship between the neighbor text entity and that text entity; the graph entity feature of any text entity at the first network layer and the graph relationship feature of any entity relationship at the first network layer are obtained by initialization; and an iterative step: repeating the graph structure information extraction step until the graph entity features of each text entity at the last network layer are obtained, and determining the structural topological features of each text entity based on the graph entity features of that text entity at each network layer. Acquiring the relationship features of the entity relationships connecting each text entity from the triplets of the first knowledge graph and the second knowledge graph specifically comprises: acquiring, based on the second graph neural network, second entity representations of the head and tail text entities and initial relationship features of the entity relationships in each triplet of the first knowledge graph and the second knowledge graph; and fusing the initial relationship features of the entity relationships connected to the same text entity, to obtain the relationship features of the entity relationships connected to that text entity. The second entity representation of any text entity is a vector, output by the second graph neural network, that represents the semantic information of that text entity; the vector sum of the second entity representation of the head text entity and the initial relationship feature of the entity relationship in any triplet is equal to the second entity representation of the tail text entity; and the second entity representations of the text entities satisfy the constraint that the distance between the second entity representations of matched text entities is smaller than a preset value.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, implements the multi-scenario heterogeneous data processing method provided above, the method comprising: based on a first graph neural network, obtaining structural topological features of the corresponding text entities from neighbor subgraphs of each text entity in a first knowledge graph and a second knowledge graph under different scenes, the first graph neural network being constructed based on a graph convolutional neural network; based on a second graph neural network, acquiring relationship features of the entity relationships connecting the text entities from the triplets of the first knowledge graph and the second knowledge graph, the second graph neural network being constructed based on a TransE model, where each triplet comprises two text entities and an entity relationship connecting the two text entities; fusing the structural topological features of each text entity in the first knowledge graph and the second knowledge graph with the relationship features of the entity relationships connecting the corresponding text entity, to obtain a first entity representation of each text entity; constructing a first entity alignment probability matrix based on the similarity between the first entity representation of any text entity in the first knowledge graph and the first entity representation of any text entity in the second knowledge graph; determining aligned entity pairs of the first knowledge graph and the second knowledge graph based on the first entity alignment probability matrix, and fusing the data of the first knowledge graph and the second knowledge graph based on the aligned entity pairs. Obtaining the structural topological features of the corresponding text entities from the neighbor subgraphs of each text entity in the first knowledge graph and the second knowledge graph under different scenes specifically comprises a graph structure information extraction step: for any text entity, determining the association feature transferred by any neighbor text entity to that text entity at the current network layer, based on the graph entity feature of the neighbor text entity at the previous network layer and the graph relationship feature, at the previous network layer, of the entity relationship between the neighbor text entity and that text entity; fusing the association features transferred by each neighbor text entity to that text entity at the current network layer, to obtain the graph entity feature of that text entity at the current network layer; and updating to obtain the graph relationship feature, at the current network layer, of the entity relationship between the neighbor text entity and that text entity; the graph entity feature of any text entity at the first network layer and the graph relationship feature of any entity relationship at the first network layer are obtained by initialization; and an iterative step: repeating the graph structure information extraction step until the graph entity features of each text entity at the last network layer are obtained, and determining the structural topological features of each text entity based on the graph entity features of that text entity at each network layer. Acquiring the relationship features of the entity relationships connecting each text entity from the triplets of the first knowledge graph and the second knowledge graph specifically comprises: acquiring, based on the second graph neural network, second entity representations of the head and tail text entities and initial relationship features of the entity relationships in each triplet of the first knowledge graph and the second knowledge graph; and fusing the initial relationship features of the entity relationships connected to the same text entity, to obtain the relationship features of the entity relationships connected to that text entity. The second entity representation of any text entity is a vector, output by the second graph neural network, that represents the semantic information of that text entity; the vector sum of the second entity representation of the head text entity and the initial relationship feature of the entity relationship in any triplet is equal to the second entity representation of the tail text entity; and the second entity representations of the text entities satisfy the constraint that the distance between the second entity representations of matched text entities is smaller than a preset value.
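The alignment step common to the recaps above, building an entity alignment probability matrix from representation similarities and reading aligned pairs out of it greedily, can be sketched as follows. The cosine similarity, the row-wise softmax normalization, and the greedy selection order are illustrative assumptions; the patent does not fix these details here.

```python
import numpy as np

def alignment_matrix(reps_a, reps_b):
    """Entity alignment probability matrix: cosine similarities, row-softmaxed."""
    a = reps_a / np.linalg.norm(reps_a, axis=1, keepdims=True)
    b = reps_b / np.linalg.norm(reps_b, axis=1, keepdims=True)
    sim = a @ b.T                                        # (n1, n2) similarities
    e = np.exp(sim - sim.max(axis=1, keepdims=True))     # stable softmax per row
    return e / e.sum(axis=1, keepdims=True)

def greedy_match(prob):
    """Greedy matching: repeatedly pick the highest remaining probability cell,
    then exclude that row (graph-1 entity) and column (graph-2 entity)."""
    prob = prob.copy()
    pairs = []
    for _ in range(min(prob.shape)):
        i, j = np.unravel_index(np.argmax(prob), prob.shape)
        pairs.append((int(i), int(j)))
        prob[i, :] = -np.inf
        prob[:, j] = -np.inf
    return pairs

# Toy first entity representations for two knowledge graphs.
reps1 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
reps2 = np.array([[0.9, 0.1], [1.0, 1.1], [0.1, 1.0]])
pairs = greedy_match(alignment_matrix(reps1, reps2))
```

When two alignment probability matrices are available (claim 5's first and second matrices), they would be fused, e.g. averaged, before `greedy_match` is applied.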
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or, of course, by means of hardware. Based on this understanding, the foregoing technical solution, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk, or an optical disk, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical scheme described in the foregoing embodiments may still be modified, or some of its technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A heterogeneous data processing method for multiple scenes, comprising:
based on a first graph neural network, obtaining structural topological features of the corresponding text entities from neighbor subgraphs of each text entity in a first knowledge graph and a second knowledge graph under different scenes; the first graph neural network is constructed based on a graph convolutional neural network;
based on a second graph neural network, acquiring relationship features of the entity relationships connecting the text entities from the triplets of the first knowledge graph and the second knowledge graph; the second graph neural network is constructed based on a TransE model; each triplet comprises two text entities and an entity relationship connecting the two text entities;
fusing the structural topological features of each text entity in the first knowledge graph and the second knowledge graph with the relationship features of the entity relationships connecting the corresponding text entity, to obtain a first entity representation of each text entity;
constructing a first entity alignment probability matrix based on the similarity between the first entity representation of any text entity in the first knowledge graph and the first entity representation of any text entity in the second knowledge graph;
determining aligned entity pairs of the first knowledge graph and the second knowledge graph based on the first entity alignment probability matrix, and fusing the data of the first knowledge graph and the second knowledge graph based on the aligned entity pairs;
wherein obtaining the structural topological features of the corresponding text entities from the neighbor subgraphs of each text entity in the first knowledge graph and the second knowledge graph under different scenes specifically comprises:
a graph structure information extraction step: for any text entity, determining the association feature transferred by any neighbor text entity to that text entity at the current network layer, based on the graph entity feature of the neighbor text entity at the previous network layer and the graph relationship feature, at the previous network layer, of the entity relationship between the neighbor text entity and that text entity; fusing the association features transferred by each neighbor text entity to that text entity at the current network layer, to obtain the graph entity feature of that text entity at the current network layer; and updating to obtain the graph relationship feature, at the current network layer, of the entity relationship between the neighbor text entity and that text entity; the graph entity feature of any text entity at the first network layer and the graph relationship feature of any entity relationship at the first network layer are obtained by initialization;
an iterative step: repeating the graph structure information extraction step until the graph entity features of each text entity at the last network layer are obtained, and determining the structural topological features of each text entity based on the graph entity features of that text entity at each network layer;
wherein obtaining the relationship features of the entity relationships connecting each text entity from the triplets of the first knowledge graph and the second knowledge graph specifically comprises:
acquiring, based on the second graph neural network, second entity representations of the head and tail text entities and initial relationship features of the entity relationships in each triplet of the first knowledge graph and the second knowledge graph;
fusing the initial relationship features of the entity relationships connected to the same text entity, to obtain the relationship features of the entity relationships connected to that text entity;
wherein the second entity representation of any text entity is a vector, output by the second graph neural network, that represents the semantic information of that text entity; the vector sum of the second entity representation of the head text entity and the initial relationship feature of the entity relationship in any triplet is equal to the second entity representation of the tail text entity; and the second entity representations of the text entities satisfy the constraint that the distance between the second entity representations of matched text entities is smaller than a preset value.
2. The multi-scenario heterogeneous data processing method according to claim 1, wherein determining the association feature transferred by any neighbor text entity to any text entity at the current network layer, based on the graph entity feature of the neighbor text entity at the previous network layer and the graph relationship feature, at the previous network layer, of the entity relationship between the neighbor text entity and the text entity, specifically comprises:
fusing the graph entity feature of the neighbor text entity at the previous network layer with the graph relationship feature, at the previous network layer, of the entity relationship between the neighbor text entity and the text entity, to obtain a fused feature;
weighting the fused feature based on the direction of the entity relationship between the neighbor text entity and the text entity, using the weights corresponding to different directions, to obtain the association feature transferred by the neighbor text entity to the text entity at the current network layer.
3. The multi-scenario heterogeneous data processing method according to claim 1, wherein updating to obtain the graph relationship feature, at the current network layer, of the entity relationship between any neighbor text entity and any text entity specifically comprises:
performing, based on the relation transformation matrix corresponding to each type of entity relationship at the current network layer, a feature transformation on the graph relationship feature of the entity relationship between the neighbor text entity and the text entity at the previous network layer, to obtain the graph relationship feature of that entity relationship at the current network layer; the relation transformation matrices corresponding to the various entity relationships at the current network layer are obtained through training.
4. The multi-scenario heterogeneous data processing method according to claim 1, wherein determining the structural topological features of each text entity based on the graph entity features of that text entity at each network layer specifically comprises:
stitching the graph entity features of any text entity at each network layer to obtain a graph stitching entity feature of that text entity;
performing a linear transformation on the graph stitching entity feature of that text entity, based on a linear transformation matrix, to obtain the structural topological feature of that text entity.
5. The multi-scenario heterogeneous data processing method according to claim 1, wherein determining the aligned entity pairs of the first knowledge graph and the second knowledge graph based on the first entity alignment probability matrix specifically comprises:
after constructing a second entity alignment probability matrix, fusing the first entity alignment probability matrix and the second entity alignment probability matrix to obtain an entity alignment probability fusion matrix, and determining the aligned entity pairs of the first knowledge graph and the second knowledge graph based on the entity alignment probability fusion matrix;
wherein the second entity alignment probability matrix is constructed based on the similarity between the second entity representation of any text entity in the first knowledge graph and the second entity representation of any text entity in the second knowledge graph.
6. The multi-scenario heterogeneous data processing method according to claim 1, wherein the first graph neural network and the second graph neural network are obtained by joint training based on the following steps:
acquiring a seed alignment entity pair of the first knowledge-graph and the second knowledge-graph;
based on the first graph neural network, obtaining sample structural topological features of the corresponding seed text entities from neighbor subgraphs of each seed text entity in the seed alignment entity pairs;
based on the second graph neural network, acquiring sample relationship features of the entity relationships connecting the seed text entities from the triplets containing the seed text entities;
fusing the sample structural topological feature of each seed text entity with the sample relationship features of the entity relationships of the corresponding seed text entity, to obtain a sample first entity representation of the seed text entity;
updating parameters of the first graph neural network and the second graph neural network based on differences between the sample structural topological features, differences between the sample relationship features, and differences between the sample first entity representations of the two seed text entities aligned in each seed alignment entity pair.
7. The multi-scenario heterogeneous data processing method according to claim 6, wherein updating the parameters of the first graph neural network and the second graph neural network based on the differences between the sample structural topological features, the differences between the sample relationship features, and the differences between the sample first entity representations of the two seed text entities aligned in each seed alignment entity pair specifically comprises:
determining a first loss based on the differences between the sample structural topological features of the two seed text entities aligned in each seed alignment entity pair;
determining a second loss based on the differences between the sample relationship features of the two seed text entities aligned in each seed alignment entity pair;
determining a third loss based on the differences between the sample first entity representations of the two seed text entities aligned in each seed alignment entity pair;
determining a fourth loss based on differences between the graph stitching relationship feature of each entity relationship output by the first graph neural network and the initial relationship feature of the corresponding entity relationship output by the second graph neural network; the graph stitching relationship feature of any entity relationship is obtained by stitching the graph relationship features of that entity relationship at each network layer of the first graph neural network;
updating the parameters of the first graph neural network and the second graph neural network based on the first loss, the second loss, the third loss, and the fourth loss.
8. The multi-scenario heterogeneous data processing method according to any one of claims 1 to 4, wherein determining the aligned entity pairs of the first knowledge graph and the second knowledge graph based on the first entity alignment probability matrix specifically comprises:
searching the first entity alignment probability matrix based on a greedy matching algorithm to determine the aligned entity pairs of the first knowledge graph and the second knowledge graph.
CN202310416643.3A 2023-04-19 2023-04-19 Heterogeneous data processing method for multiple scenes Active CN116150405B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310416643.3A CN116150405B (en) 2023-04-19 2023-04-19 Heterogeneous data processing method for multiple scenes


Publications (2)

Publication Number Publication Date
CN116150405A CN116150405A (en) 2023-05-23
CN116150405B (en) 2023-06-27

Family

ID=86354675



Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220261668A1 (en) * 2021-02-12 2022-08-18 Tempus Labs, Inc. Artificial intelligence engine for directed hypothesis generation and ranking

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112559765A (en) * 2020-12-11 2021-03-26 中电科大数据研究院有限公司 Multi-source heterogeneous database semantic integration method
CN112988917A (en) * 2021-03-31 2021-06-18 东南大学 Entity alignment method based on multiple entity contexts
CN113641826A (en) * 2021-06-29 2021-11-12 北京邮电大学 Entity alignment method, device and system for multi-source knowledge graph fusion
WO2023273182A1 (en) * 2021-06-29 2023-01-05 北京邮电大学 Multi-source knowledge graph fusion-oriented entity alignment method and apparatus, and system
CN113761221A (en) * 2021-06-30 2021-12-07 中国人民解放军32801部队 Knowledge graph entity alignment method based on graph neural network
CN114036307A (en) * 2021-09-17 2022-02-11 清华大学 Knowledge graph entity alignment method and device
CN115809340A (en) * 2022-08-29 2023-03-17 南京南瑞信息通信科技有限公司 Entity updating method and system of knowledge graph

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Collective entity alignment for knowledge fusion of power grid dispatching knowledge graphs; Linyao Yang et al.; IEEE/CAA Journal of Automatica Sinica; Vol. 9, No. 11; pp. 1990-2004 *
Neighborhood-aware attentional representation for multilingual knowledge graphs; Qiannan Zhu et al.; Proceedings of the 28th International Joint Conference on Artificial Intelligence, IJCAI 2019; pp. 1943-1949 *
Research on an entity alignment model based on bidirectional GCN and CVm; Wei Zhongcheng et al.; Application Research of Computers; Vol. 38, No. 9; pp. 2716-2720 *
Cross-lingual entity alignment fusing attribute embedding and relation attention; Su Zhehan et al.; Journal of Yunnan University (Natural Sciences Edition); pp. 1-10 *


Similar Documents

Publication Publication Date Title
CN111612103B (en) Image description generation method, system and medium combined with abstract semantic representation
WO2018207723A1 (en) Abstract generation device, abstract generation method, and computer program
CN112487168B (en) Semantic question-answering method and device of knowledge graph, computer equipment and storage medium
CN111737535B (en) Network characterization learning method based on element structure and graph neural network
CN107391512B (en) Method and device for predicting knowledge graph
CN112116090B (en) Neural network structure searching method and device, computer equipment and storage medium
CN111444320A (en) Text retrieval method and device, computer equipment and storage medium
CN112528035B (en) Knowledge graph reasoning method and device based on relational attention and computer equipment
US20120158636A1 (en) Efficient probabilistic reasoning over semantic data
CN113449858A (en) Processing method of neural network model and related equipment
CN111382868A (en) Neural network structure search method and neural network structure search device
Lin et al. Deep structured scene parsing by learning with image descriptions
CN116049459B (en) Cross-modal mutual retrieval method, device, server and storage medium
CN111046671A (en) Chinese named entity recognition method based on graph network and merged into dictionary
CN112052668A (en) Training method of address text recognition model, and address prediction method and device
CN114528898A (en) Scene graph modification based on natural language commands
CN113705196A (en) Chinese open information extraction method and device based on graph neural network
Ramnath et al. Scene graph based image retrieval--a case study on the CLEVR dataset
KR20210064817A (en) Method for Transfer Learning between Different Deep Learning Models
CN113779190A (en) Event cause and effect relationship identification method and device, electronic equipment and storage medium
CN112883736A (en) Medical entity relationship extraction method and device
KR20200117690A (en) Method and Apparatus for Completing Knowledge Graph Based on Convolutional Learning Using Multi-Hop Neighborhoods
CN113705402A (en) Video behavior prediction method, system, electronic device and storage medium
CN113761151A (en) Synonym mining method, synonym mining device, synonym question answering method, synonym question answering device, computer equipment and storage medium
CN116150405B (en) Heterogeneous data processing method for multiple scenes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant